TreeBest_Workflow_Example
Purpose
The purpose of this page is to use some example data to explore how TreeBest works and how it will fit into our prototype and workflows.
TreeBest is documented here: http://treesoft.sourceforge.net/treebest.shtml
It is distributed as C source code and runs from the command line, making it very convenient for integration into web-based software.
See this video for info about TreeBest: http://www.newton.ac.uk/cgi/smsflvplay?42-218
Note on 5 character limit to species labels
- There is none that I can find. As long as the gene_myTaxon convention is used in the alignment file and there is a matching MyTaxon in the species tree, it runs fine in my hands. For example:
Workflow
Species tree
The species tree is used as a guide to infer the best gene tree given a set of aligned DNA sequences. The tree below is the default tree included in the distribution, and I use it for the example that follows. The tree is in NHX (extended Newick) format and must have labels for internal nodes.
((ORYSA*-4530.rice,ARATH*-3702)Magnoliophyta-3398,(SCHPO*-4896.S_pombe,YEAST*-4932)Ascomycota-4890, ((((((((((((HUMAN*-9606,PANTR*-9598.chimpanzee)Homo/Pan/Gorilla-207598, MACMU*-9544.monkey)Catarrhini-9526, OTOGA-*30611.galago)Primates-9443, ((MOUSE*-10090,RAT*-10116)Murinae-39107,RABIT-9986)Glires-314147)Euarchontoglires-314146, ((BOVIN*-9913.cow,PIG-*9823)Cetartiodactyla-91561, (CANFA*-9615.dog,FELCA-*9685.cat)Carnivora-33554, SORAR-*42254.shrew, MYOLU-*59463.bat)Laurasiatheria-314145, (ECHTE-9371.tenrec,LOXAF-9785.elephant)Afrotheria-311790, DASNO-9361.armadillo)Eutheria-9347,MONDO*-13616.opossum)Theria-32525, ORNAN-*9258.platypus)Mammalia-40674, CHICK*-9031)Amniota-32524, XENTR*-8364.frog)Tetrapoda-32523, (BRARE*-7955.zebrafish, ((TETNG*-99883.pufferfish,FUGRU*-31033.pufferfish)Tetraodontidae-31031, (GASAC*-69293.stickleback,ORYLA*-8090.ricefish)Smegmamorpha-129949)Percomorpha-32485)Clupeocephala-186625)Euteleostomi-117571, (CIOIN*-7719,CIOSA*-51511)Ciona-7718)Chordata-7711, (((DROME*-7227.fly,DROPS*-7237.fly)Sophophora-32341, (AEDAE*-7159.mosquito,ANOGA*-7165.mosquito)Culicidae-7157)Diptera-7147, APIME-*7460.honeybee)Endopterygota-33392, SCHMA*-6183.fluke, (CAEEL*-6239.worm,CAEBR*-6238.worm,CAERE*-31234.worm)Caenorhabditis-6237)Bilateria-33213)Eukaryota-2759;
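Since internal nodes must be labeled, a quick check before handing a custom species tree to TreeBest may save a failed run. The following is a minimal sketch in Python, not part of TreeBest; the function name `unlabeled_internal_nodes` is hypothetical, and it assumes labels never contain parentheses. The two test strings are a fragment of the default tree, with and without the `Homo/Pan/Gorilla-207598` label.

```python
import re

def unlabeled_internal_nodes(newick):
    """Count closing parentheses that are not followed by a node label."""
    text = re.sub(r"\s+", "", newick)  # whitespace is not significant here
    # A ')' followed directly by ',', ')', ';', ':' or '[' (or the end of
    # the string) closes a clade that carries no name.
    return len(re.findall(r"\)(?=[,);:\[]|$)", text))

# Fragment of the default species tree: every clade is labeled.
labeled = ("((HUMAN*-9606,PANTR*-9598.chimpanzee)Homo/Pan/Gorilla-207598,"
           "MACMU*-9544.monkey)Catarrhini-9526;")
# Same fragment with the inner label removed.
unlabeled = ("((HUMAN*-9606,PANTR*-9598.chimpanzee),"
             "MACMU*-9544.monkey)Catarrhini-9526;")

print(unlabeled_internal_nodes(labeled))    # 0
print(unlabeled_internal_nodes(unlabeled))  # 1
```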
The tree, rendered in PhyloWidget, looks like this:
Input data (the aligned gene sequences)
- Input data are expected to be AA-guided DNA alignments. TreeBest includes a tool for producing these; they can also be built with BioPerl.
Below is an excerpt from the example multiple sequence alignment file, TB_example1.mfa, which is in FASTA format. It is for a mitochondrial processing peptidase.
>01_CAEEL GENEID=ZC410.2.1 ORI_NAME=ZC410.2.1 SYMBOL=ZC410.2.1 SWCODE=CAEEL TAXID=6239 TAXNAME="Caenorhabditis elegans" "mitochondrial processing peptidase (4J839) [Caenorhabditis elegans]. [Source:RefSeq;Acc:NP_501576]"
------------------------------------------------------------
---------------------------------ATGTACAGAAGACTAGCCTCCGGCTTA
TATCAGACTTCCCAAAGA------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
---------------------------AGAATCGCACAAGTTCAGCCGAAATCAGTATTT
---GTCCCAGAAACAATTGTAACGACACTTCCGAATGGGTTCAGAGTTGCAACAGAA---
AACACAGGTGGATCTACTGCTACTATCGGAGTATTTATTGATGCCGGCAGTCGCTACGAG
AATGAGAAAAACAATGGAACAGCTCATTTTCTAGAGCATATGGCGTTCAAAGGA---ACG
CCTCGCCGGACTCGAATGGGATTGGAGCTTGAAGTTGAAAATATTGGAGCTCATCTAAAT
GCGTACACTTCTAGAGAA------------------------------------------
AGCACGACGTATTATGCTAAATGTTTTACAGAAAAGCTTGACCAATCCGTCGATATTCTT
TCGGATATTCTT---------CTGAACAGCAGTCTTGCCACTAAAGATATCGAAGCAGAA
I wanted to test whether a gene list can be used when not all species in the species tree are represented. Below is the operation used to remove a few species from the example file using an ad hoc Perl script (prune_alignment.pl). The species removed do not have duplicated genes; removing them makes the gene duplications in the remaining species easier to see.
[smckay@halcott treebest]$ perl prune_alignment.pl TB_example1.mfa DROME ARATH SCHPO YEAST CAEBR FUGRU
I will delete DROME ARATH SCHPO YEAST CAEBR FUGRU
01_CAEEL
02_CAEBR
02_CAEBR has been removed from the alignment
03_ARATH
03_ARATH has been removed from the alignment
04_SCHPO
04_SCHPO has been removed from the alignment
05_YEAST
05_YEAST has been removed from the alignment
06_BRARE
07_DROME
07_DROME has been removed from the alignment
08_RAT
09_CHICK
10_RAT
11_MOUSE
12_HUMAN
13_RAT
14_BRARE
15_FUGRU
15_FUGRU has been removed from the alignment
16_CHICK
17_MOUSE
18_HUMAN
The new alignment file has been written to TB_example1.mfa.truncated
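The Perl script itself is not shown on this page. As a rough illustration of what such a pruning step might look like, here is a minimal Python sketch; it is not the original prune_alignment.pl, and the function name `prune_alignment` is hypothetical. It assumes each record name is the first word of the FASTA header and ends in `_SPECIES`, as in the log above. The sequences in the toy input are made up.

```python
def prune_alignment(fasta_text, species_to_drop):
    """Return FASTA text without records whose name ends in _<species>."""
    drop = set(species_to_drop)
    kept = []
    keep_record = False  # lines before the first header are discarded
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # The record name is the first word of the header, e.g. "02_CAEBR".
            words = line[1:].split()
            name = words[0] if words else ""
            species = name.rsplit("_", 1)[-1]
            keep_record = species not in drop
        if keep_record:
            kept.append(line)
    return "".join(line + "\n" for line in kept)

# Toy input: record names follow the page's convention, sequences are invented.
example = (">01_CAEEL GENEID=ZC410.2.1\nATGTAC\n"
           ">02_CAEBR\nATGAAA\n"
           ">08_RAT\nATGCCC\n")
pruned = prune_alignment(example, {"CAEBR", "FUGRU"})
print(pruned, end="")
```

Only the `02_CAEBR` record is dropped here; species codes that do not occur in the input (FUGRU in this toy case) are simply ignored.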
Running TreeBest
We run TreeBest from the command line, using the 'best' command, the species guide tree (see above), and our pruned alignment file. The output is a bootstrapped NJ tree for the genes in NHX format. The NHX contains metadata that encodes branch lengths and bootstrap confidence levels, and flags nodes where gene duplication is inferred.
(((((12_HUMAN:0.061482[&&NHX:E=$-PANTR-MACMU:S=HUMAN], (11_MOUSE:0.046273[&&NHX:S=MOUSE], (08_RAT:0.118999[&&NHX:S=RAT], (10_RAT:0.031631[&&NHX:S=RAT], 13_RAT:0.109057[&&NHX:S=RAT] ):0.021453[&&NHX:D=Y:SIS=100:S=RAT:B=79] ):0.014492[&&NHX:D=Y:SIS=100:S=RAT:B=44] ):0.042008[&&NHX:D=N:S=Murinae:B=53] ):0.133246[&&NHX:D=N:E=$-Laurasiatheria-MONDO:S=Euarchontoglires:B=79], 09_CHICK:0.110439[&&NHX:S=CHICK] ):0.058192[&&NHX:D=N:E=$-XENTR:S=Amniota:B=91], 06_BRARE:0.498863[&&NHX:E=$-Percomorpha:S=BRARE] ):0.059906[&&NHX:D=N:S=Euteleostomi:B=80], (((18_HUMAN:0.08417[&&NHX:E=$-PANTR-MACMU:S=HUMAN], 17_MOUSE:0.077766[&&NHX:E=$-RAT:S=MOUSE] ):0.179518[&&NHX:D=N:E=$-Laurasiatheria-MONDO:S=Euarchontoglires:B=100], 16_CHICK:0.209658[&&NHX:S=CHICK] ):0.114438[&&NHX:D=N:E=$-XENTR:S=Amniota:B=100], 14_BRARE:0.332791[&&NHX:E=$-Percomorpha:S=BRARE] ):0.199862[&&NHX:D=N:S=Euteleostomi:B=98] ):0.173738[&&NHX:D=Y:SIS=80:E=$-Ciona:S=Euteleostomi:B=87], 01_CAEEL:0.751726[&&NHX:E=$-CAEBR-CAERE:S=CAEEL] )[&&NHX:D=N:S=Bilateria:B=0];
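The annotations in this output are easy to pull out programmatically. The sketch below, in Python, extracts each `[&&NHX:...]` block and lists the nodes tagged `D=Y`, together with their `S` (species or taxon) and `B` (bootstrap) values. The function names are hypothetical, and the parser assumes tag values contain no colons or closing brackets, which holds for the tree above. The test string is a fragment of that tree.

```python
import re

def nhx_annotations(nhx):
    """Yield one dict of key/value tags per [&&NHX:...] block, in file order."""
    for block in re.findall(r"\[&&NHX:([^\]]*)\]", nhx):
        yield dict(field.split("=", 1)
                   for field in block.split(":") if "=" in field)

def duplication_nodes(nhx):
    """Return (species, bootstrap) for every node tagged D=Y."""
    return [(tags.get("S"), int(tags["B"]) if "B" in tags else None)
            for tags in nhx_annotations(nhx)
            if tags.get("D") == "Y"]

# Fragment of the gene tree above: two rat leaves under one duplication node.
fragment = ("(10_RAT:0.031631[&&NHX:S=RAT],13_RAT:0.109057[&&NHX:S=RAT])"
            ":0.021453[&&NHX:D=Y:SIS=100:S=RAT:B=79];")
print(duplication_nodes(fragment))  # [('RAT', 79)]
```

Run on the full tree above, this should report three duplication nodes: two within RAT (B=79 and B=44) and one at Euteleostomi (B=87).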
The tree, drawn in PhyloWidget, looks like this:
Take Home Messages
- Red nodes are duplication events; blue nodes are speciation events. This information is included as metadata in the NHX tree (the D=Y and D=N tags).
- The species part of each gene name must match a species name in the species tree, but the species tree need not be confined to the species present in the alignment. This means we can use one inclusive reference species tree rather than a bunch of small ones for each gene family data set.
- It will be fairly trivial to add support for running AA-guided transcript alignments, if need be.
- The resulting gene tree is annotated to facilitate markup and rendering.
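The name-matching requirement in the list above can be checked before running TreeBest. The following Python sketch is one way to do it, under two assumptions: leaf labels in the species tree follow the `CODE-taxid` pattern of the default tree (with an optional `*`), and gene names end in `_CODE`. The function names are hypothetical. The test tree is a fragment of the default species tree.

```python
import re

def species_in_tree(species_tree):
    """Collect leaf species codes, e.g. MOUSE from 'MOUSE*-10090'."""
    # A leaf label follows '(' or ','; internal node labels follow ')',
    # so they are not picked up by this pattern.
    return set(re.findall(r"[(,]\s*([A-Za-z0-9]+)\*?-", species_tree))

def unmatched_genes(gene_names, species_tree):
    """Return gene names whose _SPECIES suffix is not a leaf of the tree."""
    known = species_in_tree(species_tree)
    return [g for g in gene_names if g.rsplit("_", 1)[-1] not in known]

# Fragment of the default species tree.
tree = "((MOUSE*-10090,RAT*-10116)Murinae-39107,RABIT-9986)Glires-314147;"

print(sorted(species_in_tree(tree)))  # ['MOUSE', 'RABIT', 'RAT']
print(unmatched_genes(["08_RAT", "11_MOUSE", "12_HUMAN"], tree))  # ['12_HUMAN']
```

An empty result means every gene in the alignment maps to a leaf of the species tree; extra species in the tree are not reported, since they are allowed.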
Note on reconciliation
From the TreeBest documentation:
The resultant tree out.tree.nhx will be bootstrapped for 100 times, reconciled with the species tree and rooted by minimizing with the number of duplications and losses. Duplications and losses are also stored in the NHX format.
Note that treebest first determines the topology of resultant tree with a complex procedure, and then performs a hundred times of resampling with an improved neighbour-joining algorithm. Branch lengths are finally estimated with the standard ML method under the HKY model.