TreeBest_Workflow_Example

Purpose

This page uses some example data to explore how TreeBest works and how it will fit into our prototype and workflows.

TreeBest is documented here: http://treesoft.sourceforge.net/treebest.shtml
It is distributed as C source code and runs from the command line, making it very convenient for integration into web-based software.

See this video for info about TreeBest: http://www.newton.ac.uk/cgi/smsflvplay?42-218

Note on 5 character limit to species labels

  • There is none that I can find. As long as the gene_myTaxon convention is used in the alignment file and there is a matching myTaxon in the species tree, it runs fine in my hands. For example:

Workflow

Species tree

The species tree is used as a guide to infer the best gene tree given a set of aligned DNA sequences. The tree below is the default tree included in the distribution; I use it for the example below. The tree is in NHX (extended Newick) format and must have labels for internal nodes.

((ORYSA*-4530.rice,ARATH*-3702)Magnoliophyta-3398,(SCHPO*-4896.S_pombe,YEAST*-4932)Ascomycota-4890,
 ((((((((((((HUMAN*-9606,PANTR*-9598.chimpanzee)Homo/Pan/Gorilla-207598, 
            MACMU*-9544.monkey)Catarrhini-9526, 
            OTOGA-*30611.galago)Primates-9443, 
         ((MOUSE*-10090,RAT*-10116)Murinae-39107,RABIT-9986)Glires-314147)Euarchontoglires-314146, 
        ((BOVIN*-9913.cow,PIG-*9823)Cetartiodactyla-91561, 
         (CANFA*-9615.dog,FELCA-*9685.cat)Carnivora-33554, 
         SORAR-*42254.shrew, 
         MYOLU-*59463.bat)Laurasiatheria-314145, 
        (ECHTE-9371.tenrec,LOXAF-9785.elephant)Afrotheria-311790, 
        DASNO-9361.armadillo)Eutheria-9347,MONDO*-13616.opossum)Theria-32525,
       ORNAN-*9258.platypus)Mammalia-40674,
      CHICK*-9031)Amniota-32524,
     XENTR*-8364.frog)Tetrapoda-32523,
    (BRARE*-7955.zebrafish, 
     ((TETNG*-99883.pufferfish,FUGRU*-31033.pufferfish)Tetraodontidae-31031,
      (GASAC*-69293.stickleback,ORYLA*-8090.ricefish)Smegmamorpha-129949)Percomorpha-32485)Clupeocephala-186625)Euteleostomi-117571,
   (CIOIN*-7719,CIOSA*-51511)Ciona-7718)Chordata-7711,
  (((DROME*-7227.fly,DROPS*-7237.fly)Sophophora-32341,
    (AEDAE*-7159.mosquito,ANOGA*-7165.mosquito)Culicidae-7157)Diptera-7147, 
   APIME-*7460.honeybee)Endopterygota-33392,
  SCHMA*-6183.fluke,
  (CAEEL*-6239.worm,CAEBR*-6238.worm,CAERE*-31234.worm)Caenorhabditis-6237)Bilateria-33213)Eukaryota-2759;
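
Because the species tree must have a label on every internal node, it may be worth checking a custom tree before handing it to TreeBest. The Python sketch below is one way to do that. It assumes plain Newick text without quoted labels, and the function name is made up for illustration.

```python
import re

def unlabeled_internal_nodes(newick):
    """Count closing parentheses that are not followed by a node label."""
    # An internal label must start right after ')'. If the next character
    # is ':', ',', ')', ';' or '[' (or the text ends), there is no label.
    return len(re.findall(r"\)\s*(?=[:,);\[]|$)", newick))

# Abridged piece of the default species tree: every internal node is labeled.
labeled = ("((HUMAN*-9606,PANTR*-9598.chimpanzee)Homo/Pan/Gorilla-207598,"
           "MACMU*-9544.monkey)Catarrhini-9526;")
# The same shape with the internal labels stripped.
bare = "((HUMAN*-9606,PANTR*-9598.chimpanzee),MACMU*-9544.monkey);"

print(unlabeled_internal_nodes(labeled))  # 0
print(unlabeled_internal_nodes(bare))     # 2
```

A non-zero count means the tree should be fixed before use.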

The tree, rendered in PhyloWidget, looks like this:

Input data (the aligned gene sequences)

  • Input data are expected to be AA-guided DNA alignments, i.e. coding sequences aligned codon by codon according to a protein alignment. TreeBest includes a tool to do this, and it can also be done with BioPerl.
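
To make "AA-guided" concrete, here is a minimal Python sketch of the idea: each residue in the aligned protein is replaced by its codon from the unaligned coding sequence, and each gap by three gap characters. This is not the TreeBest or BioPerl implementation; the function name is hypothetical, and the sketch assumes the CDS starts in frame and does not check that the codons actually translate to the protein.

```python
def backtranslate(aligned_protein, cds):
    """Thread an unaligned coding sequence onto an aligned protein sequence."""
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if len(cds) < 3 * n_residues:
        raise ValueError("CDS is shorter than the protein it should encode")
    # One codon per residue, in order; a trailing stop codon is ignored.
    codons = (cds[i:i + 3] for i in range(0, 3 * n_residues, 3))
    # Gaps in the protein alignment become codon-sized gaps in the DNA.
    return "".join("---" if aa == "-" else next(codons)
                   for aa in aligned_protein)

print(backtranslate("M-YR", "ATGTACAGA"))  # ATG---TACAGA
```

Applied to every sequence in a protein alignment, this yields a DNA alignment in which gaps always fall between codons, as in the excerpt below.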

Below is an excerpt from the example multiple sequence alignment file, TB_example1.mfa, which is in FASTA format. It is for a mitochondrial processing peptidase.

>01_CAEEL GENEID=ZC410.2.1 ORI_NAME=ZC410.2.1 SYMBOL=ZC410.2.1 SWCODE=CAEEL TAXID=6239 TAXNAME="Caenorhabditis elegans" "mitochondrial processing peptidase (4J839) [Caenorhabditis elegans]. [Source:RefSeq;Acc:NP_501576]"
------------------------------------------------------------
---------------------------------ATGTACAGAAGACTAGCCTCCGGCTTA
TATCAGACTTCCCAAAGA------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
---------------------------AGAATCGCACAAGTTCAGCCGAAATCAGTATTT
---GTCCCAGAAACAATTGTAACGACACTTCCGAATGGGTTCAGAGTTGCAACAGAA---
AACACAGGTGGATCTACTGCTACTATCGGAGTATTTATTGATGCCGGCAGTCGCTACGAG
AATGAGAAAAACAATGGAACAGCTCATTTTCTAGAGCATATGGCGTTCAAAGGA---ACG
CCTCGCCGGACTCGAATGGGATTGGAGCTTGAAGTTGAAAATATTGGAGCTCATCTAAAT
GCGTACACTTCTAGAGAA------------------------------------------
AGCACGACGTATTATGCTAAATGTTTTACAGAAAAGCTTGACCAATCCGTCGATATTCTT
TCGGATATTCTT---------CTGAACAGCAGTCTTGCCACTAAAGATATCGAAGCAGAA

I wanted to test whether a gene list can be used when not all species in the species tree are represented. Below is the operation used to remove a few species from the example file using an ad hoc Perl script (prune_alignment.pl). The species removed do not have duplicated genes; removing them highlights the gene duplications.

[smckay@halcott treebest]$ perl prune_alignment.pl TB_example1.mfa DROME ARATH SCHPO YEAST CAEBR FUGRU
I will delete DROME ARATH SCHPO YEAST CAEBR FUGRU 
01_CAEEL
02_CAEBR
02_CAEBR has been removed from the alignment 
03_ARATH
03_ARATH has been removed from the alignment 
04_SCHPO
04_SCHPO has been removed from the alignment 
05_YEAST
05_YEAST has been removed from the alignment 
06_BRARE
07_DROME
07_DROME has been removed from the alignment 
08_RAT
09_CHICK
10_RAT
11_MOUSE
12_HUMAN
13_RAT
14_BRARE
15_FUGRU
15_FUGRU has been removed from the alignment 
16_CHICK
17_MOUSE
18_HUMAN

The new alignment file has been written to TB_example1.mfa.truncated
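
For readers without the Perl script, the pruning step can be sketched in Python as below. This is not prune_alignment.pl, whose source is not shown on this page. It is a minimal re-implementation under the assumption that each sequence ID has the form number_SPECIES, as in the log above. It leaves the alignment columns untouched, so columns that become all-gap are not removed.

```python
def prune_alignment(fasta_text, species_to_remove):
    """Drop FASTA records whose ID ends in _<SPECIES> for an unwanted species."""
    kept = []
    keep = True
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""   # e.g. '02_CAEBR'
            species = seq_id.rsplit("_", 1)[-1]    # e.g. 'CAEBR'
            keep = species not in species_to_remove
        if keep:
            kept.append(line)
    return "".join(line + "\n" for line in kept)

demo = ">01_CAEEL desc\nATG---\n>02_CAEBR desc\nATGCCC\nGGG---\n>08_RAT\nATGTTT\n"
print(prune_alignment(demo, {"CAEBR", "DROME"}), end="")
```

In the demo, the 02_CAEBR record and both of its sequence lines are dropped, and the other two records pass through unchanged.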

Running TreeBest

We run TreeBest from the command line, using the 'best' command, the species guide tree (see above), and our pruned alignment file. The output is a bootstrapped NJ tree for the genes in NHX format. The NHX contains metadata that encodes branch lengths, bootstrap confidence levels, and flags on nodes where gene duplication is inferred.
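
The exact command line is not recorded on this page, so the Python sketch below only shows what a call might look like from a wrapper script. The -f (species tree) and -o (output tree) options are assumptions based on the usage text of 'treebest best' and should be checked against the installed version. The species tree file name is hypothetical, and out.tree.nhx is borrowed from the documentation quote further down.

```python
import subprocess

def best_command(alignment, species_tree, out_tree, exe="treebest"):
    """Build the argument list for a 'treebest best' run."""
    # ASSUMPTION: -f names the species tree and -o the output tree;
    # confirm with the usage message of your TreeBest build.
    return [exe, "best", "-f", species_tree, "-o", out_tree, alignment]

def run_best(alignment, species_tree, out_tree):
    """Run TreeBest; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(best_command(alignment, species_tree, out_tree), check=True)

cmd = best_command("TB_example1.mfa.truncated", "species_tree.nh", "out.tree.nhx")
print(" ".join(cmd))
```

Passing the arguments as a list rather than a shell string avoids quoting problems if this is ever driven by user-supplied file names in the web prototype.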

(((((12_HUMAN:0.061482[&&NHX:E=$-PANTR-MACMU:S=HUMAN],
(11_MOUSE:0.046273[&&NHX:S=MOUSE],
(08_RAT:0.118999[&&NHX:S=RAT],
(10_RAT:0.031631[&&NHX:S=RAT],
13_RAT:0.109057[&&NHX:S=RAT]
):0.021453[&&NHX:D=Y:SIS=100:S=RAT:B=79]
):0.014492[&&NHX:D=Y:SIS=100:S=RAT:B=44]
):0.042008[&&NHX:D=N:S=Murinae:B=53]
):0.133246[&&NHX:D=N:E=$-Laurasiatheria-MONDO:S=Euarchontoglires:B=79],
09_CHICK:0.110439[&&NHX:S=CHICK]
):0.058192[&&NHX:D=N:E=$-XENTR:S=Amniota:B=91],
06_BRARE:0.498863[&&NHX:E=$-Percomorpha:S=BRARE]
):0.059906[&&NHX:D=N:S=Euteleostomi:B=80],
(((18_HUMAN:0.08417[&&NHX:E=$-PANTR-MACMU:S=HUMAN],
17_MOUSE:0.077766[&&NHX:E=$-RAT:S=MOUSE]
):0.179518[&&NHX:D=N:E=$-Laurasiatheria-MONDO:S=Euarchontoglires:B=100],
16_CHICK:0.209658[&&NHX:S=CHICK]
):0.114438[&&NHX:D=N:E=$-XENTR:S=Amniota:B=100],
14_BRARE:0.332791[&&NHX:E=$-Percomorpha:S=BRARE]
):0.199862[&&NHX:D=N:S=Euteleostomi:B=98]
):0.173738[&&NHX:D=Y:SIS=80:E=$-Ciona:S=Euteleostomi:B=87],
01_CAEEL:0.751726[&&NHX:E=$-CAEBR-CAERE:S=CAEEL]
)[&&NHX:D=N:S=Bilateria:B=0];
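
The annotations in the output can be pulled out without a full tree parser. The Python sketch below is a rough illustration, not part of TreeBest: it collects the tag=value pairs from each [&&NHX:...] comment and counts the nodes flagged D=Y. The function names are made up, and the sketch assumes that tag values never contain ':' or ']', which holds for the output above.

```python
import re

NHX_COMMENT = re.compile(r"\[&&NHX:([^\]]*)\]")

def nhx_tags(nhx_text):
    """Return one dict of tag -> value per [&&NHX:...] comment, in file order."""
    nodes = []
    for body in NHX_COMMENT.findall(nhx_text):
        fields = (f.split("=", 1) for f in body.split(":") if "=" in f)
        nodes.append({tag: value for tag, value in fields})
    return nodes

def count_duplications(nhx_text):
    """Count nodes flagged as duplications (D=Y)."""
    return sum(1 for tags in nhx_tags(nhx_text) if tags.get("D") == "Y")

# A small subtree cut from the output above.
example = ("((10_RAT:0.031631[&&NHX:S=RAT],13_RAT:0.109057[&&NHX:S=RAT])"
           ":0.021453[&&NHX:D=Y:SIS=100:S=RAT:B=79],"
           "11_MOUSE:0.046273[&&NHX:S=MOUSE])[&&NHX:D=N:S=Murinae:B=53];")

print(count_duplications(example))  # 1
print(nhx_tags(example)[2])         # tags of the RAT duplication node
```

The same dictionaries give the species assignment (S) and bootstrap value (B) for each node, which is what a renderer would need to colour nodes.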

The tree, drawn in PhyloWidget, looks like this:


Take Home Messages

  • Red nodes are duplication events; blue nodes are speciation events. This information is stored as metadata in the NHX tree.
  • The species suffix of each gene name must match a leaf in the species tree, but the species tree need not be confined to the species in the alignment. This means we can use one inclusive reference species tree rather than a bunch of small ones for each gene family data set.
  • It will be fairly trivial to add support for running AA-guided transcript alignments, if need be.
  • The resulting gene tree is annotated to facilitate markup and rendering.
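
The naming requirement in the second point can be checked before a run. The Python sketch below is one possible check, with hypothetical function names. It assumes that leaf labels in the species tree start with the species code (for example HUMAN in HUMAN*-9606) and that sequence IDs end in _SPECIES. The species tree in the demo is an abridged toy version of the default tree, not the real topology.

```python
import re

def leaf_codes(species_newick):
    """Species codes of the leaves, e.g. 'HUMAN' from 'HUMAN*-9606'."""
    # Leaf labels follow '(' or ','; internal labels follow ')' and are skipped.
    return set(re.findall(r"[(,]\s*([A-Za-z0-9]+)", species_newick))

def unmatched_sequences(seq_ids, species_newick):
    """Sequence IDs whose species suffix is not a leaf of the species tree."""
    known = leaf_codes(species_newick)
    return [s for s in seq_ids if s.rsplit("_", 1)[-1] not in known]

# Abridged toy tree for illustration only.
species = ("((HUMAN*-9606,PANTR*-9598.chimpanzee)Homo/Pan/Gorilla-207598,"
           "(MOUSE*-10090,RAT*-10116)Murinae-39107)Euarchontoglires-314146;")

print(unmatched_sequences(["12_HUMAN", "08_RAT", "01_CAEEL"], species))
# ['01_CAEEL']
```

An empty list means every sequence can be mapped onto the species tree. Species present in the tree but absent from the alignment are not reported, since they do not need to be.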

Note on reconciliation

From the TreeBest documentation:
The resultant tree out.tree.nhx will be bootstrapped for 100 times, reconciled with the species tree and rooted by minimizing with the number of duplications and losses. Duplications and losses are also stored in the NHX format.

Note that treebest first determines the topology of resultant tree with a complex procedure, and then performs a hundred times of resampling with an improved neighbour-joining algorithm. Branch lengths are finally estimated with the standard ML method under the HKY model.