Gene-species tree reconciliation, Release 1.0
Gene-Species Tree Reconciliation
Possible headings (to appear at top of every page): Search | About | How to use | Credits
Possible footer (to appear at bottom of every page): CC License | Report a bug or make a feature request | Other contact
Introduction
For a number of reasons, the phylogeny of a gene family is generally different in topology from the phylogeny of the corresponding species. There may be duplications and losses of genes. There may be lateral transfer between species, as frequently happens in bacteria. There may also be lineage sorting among alleles, which can result in particular genes having a pattern of relatedness that differs from the pattern among the species; this is particularly common among species groups that have undergone rapid radiations. Finally, there may simply be topological errors in the gene tree, the species tree, or both.
Gene-species tree reconciliation is used to study the evolution of gene families. It refers to a broad family of analytic techniques in which the discrepancies between gene and species phylogenies are explained by the action of one or more of these processes happening at particular timepoints or intervals on the respective phylogenies. Because these ancient events cannot be observed, they need to be inferred.
The Tree Reconciliation Tool within iPlant's Discovery Environment [how to name? PhyloReconciler?] is currently focused on inference of gene duplication events. This is mostly for analytic convenience. But it is also at least partly justified by the well-documented frequency of gene (and genome) duplication within plants: duplication events are thought to be the main cause of topological discordance in gene families that diverged deep in the history of green plants.
Inference of duplication (and loss) events is also useful for biologists studying gene function because it can be used to determine orthology and paralogy. Two sequences that last diverged at a duplication event are said to be paralogous, whereas two sequences that last diverged at a speciation event are orthologous. Received wisdom says that orthologs will tend to exhibit a greater conservation of function than paralogs. However, this is a rule with many exceptions. Furthermore, other types of evolutionary events may have occurred in these gene families and be reconstructed incorrectly, leading to incorrect assignment of orthology and paralogy.
A reconciliation can be thought of as a mapping between the nodes (and branches) of a gene family tree and the nodes (and branches) of the corresponding species tree. This mapping may be optimized based on parsimony criteria (e.g. minimizing the number of inferred duplications) or statistical ones (e.g. Maximum Likelihood or Bayesian approaches). One can determine from the mapping which gene tree nodes represent speciation events and which represent duplication events, and thus which genes are putatively orthologous or paralogous with each other.
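The mapping idea can be illustrated with a minimal sketch of the classic parsimony (lowest common ancestor) reconciliation. This is not the code used by the Tree Reconciliation Tool or by TreeBEST; the toy species tree, the gene names, and all function names below are hypothetical and chosen only for illustration. Each internal gene tree node is mapped to the lowest common ancestor of the species tree nodes its children map to, and is labeled a duplication when it maps to the same species tree node as one of its children.

```python
# Hypothetical toy species tree ((A,B),C), stored as child -> parent links.
SPECIES_PARENT = {"A": "AB", "B": "AB", "AB": "ABC", "C": "ABC", "ABC": None}


def ancestors(species):
    """Return the path from a species tree node up to the root, inclusive."""
    path = []
    while species is not None:
        path.append(species)
        species = SPECIES_PARENT[species]
    return path


def lca(s1, s2):
    """Lowest common ancestor of two species tree nodes."""
    seen = set(ancestors(s1))
    for node in ancestors(s2):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same tree")


def reconcile(gene_node, gene_to_species, events):
    """Map a gene tree node onto the species tree and record its event.

    A gene tree is a nested 2-tuple; a leaf is a gene name (str).
    Returns the species tree node that gene_node maps to.
    """
    if isinstance(gene_node, str):
        return gene_to_species[gene_node]
    left, right = gene_node
    m_left = reconcile(left, gene_to_species, events)
    m_right = reconcile(right, gene_to_species, events)
    m = lca(m_left, m_right)
    # Duplication if the node maps to the same place as one of its children.
    events[gene_node] = "duplication" if m in (m_left, m_right) else "speciation"
    return m


def leaves(gene_node):
    """List the gene names below a gene tree node."""
    if isinstance(gene_node, str):
        return [gene_node]
    return leaves(gene_node[0]) + leaves(gene_node[1])


def classify_pairs(gene_node, events, relation=None):
    """Label each gene pair by the event at the node where the pair last diverged."""
    if relation is None:
        relation = {}
    if isinstance(gene_node, str):
        return relation
    left, right = gene_node
    label = "orthologs" if events[gene_node] == "speciation" else "paralogs"
    for g1 in leaves(left):
        for g2 in leaves(right):
            relation[frozenset((g1, g2))] = label
    classify_pairs(left, events, relation)
    classify_pairs(right, events, relation)
    return relation


# Example: two copies of a gene in species A, one copy each in B and C.
gene_tree = (("a1", "b1"), ("a2", "c1"))
gene_to_species = {"a1": "A", "b1": "B", "a2": "A", "c1": "C"}
events = {}
root_mapping = reconcile(gene_tree, gene_to_species, events)
relation = classify_pairs(gene_tree, events)
```

In this toy example the root of the gene tree maps to the species tree root and is labeled a duplication, so a1 and a2 come out as paralogs, while a1 and b1 last diverged at a speciation node and come out as orthologs.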
The reconciliations in the current implementation assume a fixed species tree. Depending on the method used, the reconciliation is computed based on a fixed gene tree or, alternatively, a gene family alignment. The genes in the gene family must have been sampled from among the species in the species tree.
[Describe novelty of visualization].
Implementation
The Tree Reconciliation Tool in the Discovery Environment assumes a given, fixed species tree for green plants. Gene families are available from the Discovery Environment along with their pre-computed trees and reconciliations. Two databases are available at this point (see below). The gene trees and their reconciliations with the species tree were inferred from the gene family alignments using TreeBEST. They can be visualized interactively in the Discovery Environment.
TreeBEST combines a maximum likelihood approach for inferring the gene family tree (using an HKY model of molecular sequence evolution) and a maximum parsimony approach for inferring duplications and losses (to infer a minimum number of duplications and then a minimum number of losses). The likelihood of the gene sequences is penalized by the number of duplications, and TreeBEST outputs the gene family tree that has the largest penalized likelihood.
To measure the tree reconciliation quality, bootstrap values are assigned to edges in the gene family tree. These values measure the confidence in the reconstruction of each branch. They are obtained by re-sampling the sequence data, re-doing steps 2 and 3 many times, and then counting the percentage of times that the branch was recovered from the re-sampled data. Short sequences may lead to greater uncertainty in the gene tree and to lower bootstrap values. The bootstrap value of a branch in the gene family tree is displayed above the branch.
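The counting step of the bootstrap can be sketched as follows. This is an illustrative sketch only, not the procedure implemented in TreeBEST: it assumes rooted trees written as nested tuples, identifies each branch with the set of genes below it, and takes the replicate trees as given rather than re-inferring them. All function names and the example data are hypothetical.

```python
import random


def resample_columns(alignment, rng):
    """Draw alignment columns with replacement to build one bootstrap replicate.

    `alignment` maps gene name -> aligned sequence; all sequences share one length.
    """
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in cols) for name, seq in alignment.items()}


def clades(tree):
    """Return the clades (frozensets of leaf names) of a rooted nested-tuple tree."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(visit(child) for child in node))
        found.add(members)
        return members

    visit(tree)
    return found


def bootstrap_support(reference_tree, replicate_trees):
    """Percentage of replicate trees that recover each clade of the reference tree."""
    counts = {clade: 0 for clade in clades(reference_tree)}
    for replicate in replicate_trees:
        replicate_clades = clades(replicate)
        for clade in counts:
            if clade in replicate_clades:
                counts[clade] += 1
    n = len(replicate_trees)
    return {clade: 100.0 * count / n for clade, count in counts.items()}


# Example: a reference gene tree and four trees standing in for the trees
# that would be re-inferred from four re-sampled alignments.
reference = (("a", "b"), ("c", "d"))
replicates = [
    (("a", "b"), ("c", "d")),
    ((("a", "b"), "c"), "d"),
    (("a", "c"), ("b", "d")),
    (("a", "b"), ("c", "d")),
]
support = bootstrap_support(reference, replicates)

# One re-sampled alignment; in a real analysis the tree would be re-inferred from it.
replicate_alignment = resample_columns({"a": "ACGT", "b": "ACGA"}, random.Random(0))
```

Here the branch above genes a and b is recovered in three of the four replicates (75%), and the branch above c and d in two of the four (50%).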
Data sets
Sources
Tree reconciliation is available for gene families in a 6-plant whole-genome project (courtesy of John Bowers and Jim Leebens-Mack) and for gene families in the Thousand Plant Transcriptome project (oneKP).
This is the place to describe project-specific elements. At this point, we need to describe John Bowers' data and the 1kp data. Describe:
- how the species tree was obtained (usually from a set of orthologs),
- how the gene families were circumscribed (usually using all-vs-all BLAST followed by some HMM),
- how sequences were aligned within each family. Jim: was MUSCLE used for John Bowers' data? How about 1kp gene families?
I've heard in the past that MUSCLE was used to align sequences and PrimeTV was used to display the tree. Are either of these still in use? If so, we should mention them in the appropriate step.
I don't think PrimeTV is used.
I agree, the alignment method description would fit well in the description of the data. Also, I think you are right that PrimeTV is not used; I believe our iPlant-produced viewer is being used. I just want to be certain we have the details right, so thank you for the confirmation.
Analysis workflow (with references to other tools used)
Database and web application (with references to other tools used, availability of code)
How to Use
Search
Navigation
Interpretation
Persistent hyperlinks (RESTful links)
Download data and code
References
OneKP project: website
Burleigh, J. G., Bansal, M. S., Eulenstein, O., Hartmann, S., Wehe, A., & Vision, T. J. (2011). Genome-scale phylogenetics: inferring the plant tree of life from 18,896 gene trees. Systematic Biology, 60(2):117-125.
Jiao, Y., Wickett, N. J., Ayyampalayam, S., Chanderbali, A. S., Landherr, L., Ralph, P. E., Tomsho, L. P., et al. (2011). Ancestral polyploidy in seed plants and angiosperms. Nature, 473(7345):97-100.
Burleigh et al. and Jiao et al. are references to data sets we are considering for future releases. They might need to be commented out for now.
Contributors
The Tree Reconciliation Tool is a product of the iPlant Collaborative Tree of Life Project (iPToL).
Software Developers
Members of the iPToL Tree Reconciliation, Tree Visualization, and Data Integration working groups
Other contributors - including iPToL leads?
Institutions
Funding
National Science Foundation Plant Cyberinfrastructure Program (#DBI-0735191)