Needs for Web Service to Support Multiple Reconciliations

Overview

The basic need:

Currently the standalone tree reconciliation (TR) browser provides an interface for users to access gene trees reconciled to their host species trees. The browser is limited to providing results from a single software source for mapping the gene trees onto the species trees. For example, we can show the results from a set of TREEBEST reconciliations for all gene trees mapped to a species tree. However, the browser cannot currently be used to query or visualize multiple reconciliation results for an individual gene tree. Other results could include a 'canonical' result based on a synteny-informed reconstruction of gene duplication history, or the result from a different program such as Phyldog. Ideally, these queries would run within the scope of a single connection to the same database.

This is a limitation because we do not know a priori which program will provide the 'best' reconstruction of gene family history. We therefore want the browser to be able to show the results from different programs.

The MySQL database that holds the tree reconciliation information DOES allow for multiple reconciliations to be stored, although the way these data are stored may need to be optimized for quick retrieval of reconciled tree sets. (Here reconciled tree set refers to a group of gene trees reconciled to a species tree using the same software with the same parameter values).

General goal:

Give users access to results from multiple reconciliation approaches (or even different parameter values within the same software package) when querying the TR standalone database.

General computational process:

Given a set of plant genes for multiple species that have been classified into gene families (or gene clusters) we want to:

  1. Run multiple reconciliation pipelines so that each gene family is represented by a result for each pipeline
    (This is currently supported for TREEBEST and PRiME-GSR in code that Sheldon wrote. Jamie is working on supporting Phyldog, and we have an example of synteny-based reconstruction for a small set of species.)
  2. Store the results for these reconciliations within a single TR database
    (This is currently supported in the current TRDB schema, but may need to be slightly modified to increase speed of queries to fetch entire sets at once). 
  3. Allow the TR viewer to access these multiple reconciliations within TR database

Components requiring updates

The components that will require updating to support this are:

  1. Database schema
    1. Minor changes to database schema to better support multiple reconciliations
  2. Database loading scripts
    1. Additional script to load reconciliation meta-data into the database
    2. Changes to the scripts that load reconciliation results
  3. Web-API
    1. Modifications to various components to become aware of reconciliation sets
    2. Possible additions to support new types of queries that are reconciliation set aware
      1. Example 1: Generate a table that summarizes gene family results for all the reconciliation processes used
  4. TR-Standalone Viewer
    1. Modifications to support the option to see multiple reconciliations 
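
As a rough illustration of the summary table mentioned under the Web-API item (Example 1), the sketch below pivots per-reconciliation duplication counts into one row per gene family and one column per reconciliation set. The row layout, the family names, and the function name summarize_by_set are assumptions made for this example and are not part of the existing API.

```python
from collections import defaultdict

# Hypothetical rows: (gene_family_name, reconciliation_set_name, duplication_count).
rows = [
    ("family_0001", "TREEBEST_default_parameters", 4),
    ("family_0001", "PRIME_GSR_default_parameters", 3),
    ("family_0002", "TREEBEST_default_parameters", 1),
]


def summarize_by_set(rows):
    """Pivot per-reconciliation counts into one row per gene family."""
    set_names = sorted({set_name for _, set_name, _ in rows})
    table = defaultdict(dict)
    for family, set_name, dup_count in rows:
        table[family][set_name] = dup_count
    # None marks a family that has no result for a given reconciliation set.
    summary = {
        family: [counts.get(name) for name in set_names]
        for family, counts in sorted(table.items())
    }
    return set_names, summary


set_names, summary = summarize_by_set(rows)
print("family", *set_names, sep="\t")
for family, counts in summary.items():
    print(family, *("-" if c is None else c for c in counts), sep="\t")
```

A family that is missing from one pipeline shows up with an empty cell rather than being dropped, which makes gaps in pipeline coverage visible in the table.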

Additional background information:

The best place to start is the wiki describing the TR architecture in general which is available at

https://pods.iplantcollaborative.org/wiki/display/iptol/1.0+Architecture

This includes the 

A set of powerpoint slides documenting the TR database is available at

http://www.slideshare.net/j_estill/james-estill-ievobiofinalpowerpoint-8416065

and a pdf of the poster that Sheldon recently presented on the TR tools is attached to this wiki.

Existing Tools

Live demo version of the viewer:

The current demonstration of the Tree Reconciliation viewer is available at:

http://tr.iplantcollaborative.org/Tr_standalone.html

This connects to the database hosted at:

http://votan.iplantcollaborative.org

Code repositories:

The Perl code for the back end services to connect to the database and to populate a database are hosted on github at:

https://github.com/iPlantCollaborativeOpenSource/iplant-treerec

The Java code for the interface is also on GitHub at:

https://github.com/iPlantCollaborativeOpenSource/tr-standalone

A MySQL database dump of a working version of the database is currently on svn at iPlant.

(need link to get this data here)

Database Documentation:

The database schema that holds the data is described at:

Some general examples of how this database is used for topological queries are at:

https://pods.iplantcollaborative.org/wiki/display/coresw/Copy+of+SQL+Queries+-+Tree+Topological+Queries

Big Tree Viewer documentation:

The current Tree Viewer is based on the Big Tree Viewer, which is documented at:

https://pods.iplantcollaborative.org/wiki/display/iptol/Big+Tree+Viewer+Documentation

Web-API Documentation:

The Web-API is currently documented within the code repository as PerlDocs. Specifically, see the Perl modules at
https://github.com/iPlantCollaborativeOpenSource/iplant-treerec/tree/master/lib/IPlant/TreeRec

In general 

Changes Needed

Required Changes for TR Database Support of Multiple Reconciliations

The TR database schema does currently support the ability to store information for multiple reconciliations. These reconciliations are stored in the set of reconciliation tables described at:

https://pods.iplantcollaborative.org/wiki/display/iptol/Database+Schema#DatabaseSchema-ReconciliationTables

Attributes concerning an individual reconciliation of a single gene tree to a single species tree are stored in the table 'reconciliation_attribute'. These attributes can include details of how the reconciliation was performed, what software was used, the parameters used within the software and other details using a controlled vocabulary set of tables. These CV tables make use of a 'Tree Reconciliation Ontology' that Jamie has developed to support the annotation of reconciled trees in the database.

Current Limitation to TR DB support

Currently the reconciliation db allows for tagging individual reconciliations with attribute values. This allows for maximum flexibility, but could slow down queries that attempt to retrieve results from all gene trees reconciled using the same suite of parameter values. We may therefore need to modify the database to allow for the storage of a 'reconciliation_experimental_process_set' that refers to all trees resulting from a single pipeline process.

Adding reconciliation_set to schema

Jamie suggests adding the concept of a 'reconciliation_set' to the schema. This is a collection of gene tree reconciliations that share the same parameter values used for the reconciliation. Now instead of the parameters used for a single reconciliation being stored separately for each reconciliation, they can be stored a single time. This will facilitate queries that want to limit reconciliations to a single set of reconciliations. For example queries could take into account reconciliation set when retrieving counts of duplications or other values. I think adding this would allow for the least amount of changes to the API.

An example of the tables that would need to be added to the database is below. The reconciliation table would have an additional column that links to data on the reconciliation set.

[Mockup: proposed reconciliation_set tables; image not available]

The meta-data describing the experimental process that generated the reconciled trees will make use of the reconciled tree ontology as stored in the cvterm table.
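
Since the mockup is not available here, the following is a minimal sketch of what the added tables might look like. It uses SQLite through Python only so that it can be run as-is; the real database is MySQL. Every table and column name other than reconciliation, reconciliation_set and cvterm is an assumption, and the reconciliation table shown is a simplified stand-in for the existing one.

```python
import sqlite3

# SQLite stands in for MySQL here; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cvterm (
    cvterm_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL UNIQUE
);
CREATE TABLE reconciliation_set (
    reconciliation_set_id INTEGER PRIMARY KEY,
    name                  TEXT NOT NULL UNIQUE
);
-- Parameters are stored once per set instead of once per reconciliation.
CREATE TABLE reconciliation_set_attribute (
    reconciliation_set_id INTEGER NOT NULL
        REFERENCES reconciliation_set (reconciliation_set_id),
    cvterm_id             INTEGER NOT NULL REFERENCES cvterm (cvterm_id),
    value                 TEXT
);
-- Simplified stand-in for the existing table, plus the new linking column.
CREATE TABLE reconciliation (
    reconciliation_id     INTEGER PRIMARY KEY,
    gene_family_name      TEXT NOT NULL,
    reconciliation_set_id INTEGER
        REFERENCES reconciliation_set (reconciliation_set_id)
);
""")

conn.execute("INSERT INTO reconciliation_set VALUES (1, 'TREEBEST_default_parameters')")
conn.execute("INSERT INTO reconciliation_set VALUES (2, 'PHYLDOG_default_parameters')")
conn.executemany(
    "INSERT INTO reconciliation VALUES (?, ?, ?)",
    [(10, "family_0001", 1), (11, "family_0001", 2), (12, "family_0002", 1)],
)

# Fetch an entire reconciled tree set with a single join on the set name.
treebest_ids = [
    row[0]
    for row in conn.execute(
        """
        SELECT r.reconciliation_id
        FROM reconciliation AS r
        JOIN reconciliation_set AS rs
          ON rs.reconciliation_set_id = r.reconciliation_set_id
        WHERE rs.name = ?
        ORDER BY r.reconciliation_id
        """,
        ("TREEBEST_default_parameters",),
    )
]
print(treebest_ids)
```

The point of the design is that retrieving a whole set needs one join on reconciliation_set rather than one attribute lookup per parameter per reconciliation.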

Required Changes in Scripts that Populate the Database

The scripts that populate the database currently do not take into account the process used to generate reconciliations when loading the database. As an extension to the existing scripts, the data to populate the reconciliation_set and its attributes could be added as a step that comes before loading the reconciled trees. Loading these data would include the tag for the reconciliation set name. The later script that loads the reconciliation results could then take the reconciliation set name as a variable that is then used to populate the reconciliation set id in the reconciliation table. Scripts that will need to be modified or added are:

  • tr_create_database.pl
  • tr_import_reconciled_tree.pl
  • tr_populate_reconciliation_attributes.pl
  • tr_import_reconciliation_set.pl - or similarly named program will load the programs and parameters used to generate a set of reconciled trees. 
    • Initially this program will accept a simple tab delimited text file describing 'TAG' 'VALUE' pairs.
    • The tree ontology will need to already exist in the database
    • This program will need to determine the cv_term_id for the TAG term before loading the value into the database
    • It may make sense to first scan through the tag-value pairs to make sure that the tag terms already exist in the database and throw an error when they do not
    • Example usage as:
      tr_import_reconciliation_set.pl --infile treebest_description.txt --name 'TREEBEST_default_parameters' --dsn ....
      

Note that later extensions of the data loading scripts could take into account XML extensions that allow for the meta-data of how the tree was generated to be represented within the XML file itself. This would remove the need for the tr_import_reconciliation_set.pl script.
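
The loading logic described for tr_import_reconciliation_set.pl could look roughly like the sketch below. It is written in Python against SQLite purely for illustration (the actual script would be Perl against MySQL). The table layout, the function name import_reconciliation_set, and the tags software_name and parameter_set are assumptions. The sketch scans all tags before loading anything and raises an error if any tag is missing from cvterm, as suggested above.

```python
import csv
import io
import sqlite3


def import_reconciliation_set(conn, set_name, infile):
    """Load TAG/VALUE pairs for one reconciliation set; return the new set id."""
    pairs = []
    for row in csv.reader(infile, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"expected TAG<tab>VALUE, got {row!r}")
        pairs.append((row[0], row[1]))

    # Scan all tags first so that nothing is loaded if any term is unknown.
    known = dict(conn.execute("SELECT name, cvterm_id FROM cvterm"))
    missing = sorted({tag for tag, _ in pairs if tag not in known})
    if missing:
        raise ValueError("tags not found in cvterm: " + ", ".join(missing))

    with conn:  # commit on success, roll back on error
        cursor = conn.execute(
            "INSERT INTO reconciliation_set (name) VALUES (?)", (set_name,)
        )
        set_id = cursor.lastrowid
        conn.executemany(
            "INSERT INTO reconciliation_set_attribute VALUES (?, ?, ?)",
            [(set_id, known[tag], value) for tag, value in pairs],
        )
    return set_id


# Hypothetical schema and ontology terms, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cvterm (cvterm_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE reconciliation_set (
    reconciliation_set_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE reconciliation_set_attribute (
    reconciliation_set_id INTEGER NOT NULL, cvterm_id INTEGER NOT NULL, value TEXT);
INSERT INTO cvterm (name) VALUES ('software_name'), ('parameter_set');
""")

description = io.StringIO("software_name\tTREEBEST\nparameter_set\tdefault\n")
set_id = import_reconciliation_set(conn, "TREEBEST_default_parameters", description)
print(set_id)
```

Validating before inserting keeps the database free of half-loaded sets, so a failed import can simply be rerun after the ontology is corrected.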

Required API Changes

The following represents what Jamie sees as the necessary changes in the web API. Upon further analysis, these changes involve a fundamental change at the design level. Services were originally written under the assumption that one species tree would correspond to only one reconciliation experiment, so the species tree essentially served the same function as a reconciliation set. The changes identified by Jamie below are still correct, but it will be important to identify what other downstream effects these changes will have, and they will require extensive modifications to the UI.

This will be a good place to include discussion of the necessary API changes. The changes needed will depend in part on the changes requested for the Java interface to the database. Generally, code that knows the reconciliation_id should not need modification, but code that takes family_name as an input parameter to summarize or fetch information for a reconciliation will also need to consider the reconciliation set. These modifications will include changes to how the $dbh SQL code is handled, as well as modifications to the TreeRec modules below:

For the Perl modules at /lib/IPlant/TreeRec:

  • BlastArgs.pm
    no changes needed to support multiple reconciliations
  • BlastSearcher.pm
    no changes needed to support multiple reconciliations
  • DatabaseTreeLoader.pm
    • load_gene_tree
      This currently takes only gene_family_name into account, and would need to be modified to take reconciliation set or reconciliation parameters into account.
    • _count_node_dups
      This counts speciation events. It currently assumes that we want to count all speciation events on a species node for all gene trees held within the database. It would probably need to be modified to count only the speciation nodes for a specific set of reconciliations and not the global set of all reconciliation results.
    • _count_node_dups
      This counts duplication events on an edge in the species tree. It also currently assumes that we want to count all duplication events on an edge in the species tree; however, we may want to limit this to a subset of reconciliations that share the same parameter values.
  • DuplicationEventFinder.pm
  • FileRetriever.pm
    This retrieves flat files of reconciliation results. It is hard-coded to individual reconciliation processes, assuming TREEBEST, and may need to be updated. Perhaps a more elegant solution would be to store text strings of these data within the database instead of depending on text files stored on the server.
  • FileTreeLoader.pm
    • load_gene_tree
      This assumes a single gene tree exists for a single gene family. This would need to be updated to take into account the fact that multiple gene trees could exist for a single gene family that used multiple (different) reconciliation processes.
  • GeneFamilyInfo.pm
    • get_tree_counts
      This will need to be updated to return the values (i.e., duplication counts) summarized by the reconciliation process used, and not just count all duplications that exist within a family.
    • count_duplications
      This will need to be updated either to return the values (i.e., duplication counts) summarized by the reconciliation process used, rather than counting all duplications that exist within a family, or to return values for a single reconciliation set.
    • count_species
      Must take reconciliation set into account
  • GeneTreeEvents.pm
    • get_events
      Must be modified to take reconciliation_set_id into account
  • GoCloud.pm
    no changes needed to support multiple reconciliations
  • ProteinTreeNodeFinder.pm
    • for_species
      will need to take reconciliation_set_id into account
  • REST.pm
    no changes needed to support multiple reconciliations
  • ReconciliationLoader.pm
    no changes needed to support multiple reconciliations
  • ReconciliationResolver.pm
    • resolve
      May need to be updated to take reconciliation_set_id into account, depending on how the web interface makes use of this
  • SpeciesTreeEvents.pm
    • get_all_duplications
      Needs to be aware of reconciliation_set_id as input.
    • get_duplications
      Needs to be aware of reconciliation_set_id as input.
    • _get_all_duplications
      Needs to be aware of reconciliation_set_id as input.
    • _get_duplications
      Needs to be aware of reconciliation_set_id as input.
  • TreeDataFormatter.pm
    no changes needed to support multiple reconciliations
  • Utils.pm
    no changes needed to support multiple reconciliations
  • X.pm
    no changes needed to support multiple reconciliations
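
To make the kind of change listed above more concrete, here is a minimal sketch of a set-aware count_duplications. It is written in Python against SQLite for illustration only and does not reflect the actual Perl implementation in GeneFamilyInfo.pm. The duplication_event table, the simplified reconciliation table, and the optional reconciliation_set_id argument are all assumptions.

```python
import sqlite3


def count_duplications(conn, family_name, reconciliation_set_id=None):
    """Return {set name: duplication count} for one gene family.

    With reconciliation_set_id=None every set is summarized; otherwise only
    the requested set is counted.
    """
    rows = conn.execute(
        """
        SELECT rs.name, COUNT(*)
        FROM duplication_event AS d
        JOIN reconciliation AS r
          ON r.reconciliation_id = d.reconciliation_id
        JOIN reconciliation_set AS rs
          ON rs.reconciliation_set_id = r.reconciliation_set_id
        WHERE r.gene_family_name = ?
          AND (? IS NULL OR r.reconciliation_set_id = ?)
        GROUP BY rs.name
        """,
        (family_name, reconciliation_set_id, reconciliation_set_id),
    )
    return dict(rows)


# Hypothetical schema and data, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE reconciliation_set (
    reconciliation_set_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE reconciliation (
    reconciliation_id INTEGER PRIMARY KEY,
    gene_family_name TEXT NOT NULL,
    reconciliation_set_id INTEGER NOT NULL);
CREATE TABLE duplication_event (reconciliation_id INTEGER NOT NULL);
INSERT INTO reconciliation_set VALUES
    (1, 'TREEBEST_default_parameters'), (2, 'PHYLDOG_default_parameters');
INSERT INTO reconciliation VALUES (10, 'family_0001', 1), (11, 'family_0001', 2);
INSERT INTO duplication_event VALUES (10), (10), (11);
""")

all_sets = count_duplications(conn, "family_0001")
one_set = count_duplications(conn, "family_0001", reconciliation_set_id=2)
print(all_sets, one_set)
```

Making the set id optional lets existing callers keep working while set-aware callers narrow the result. Note that in this sketch a set with no duplications for the family is simply absent from the returned mapping.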

Required Changes in TR Viewer

These are possibilities to consider rather than firm requirements at this point.

Species Tree Entry Point

The species tree entry point is the first page that comes up when entering the TR standalone viewer. This currently looks like the following:

[Screenshot: species tree entry point; image not available]
Clicking on an individual triangle on an edge in the species tree lists the gene families that have at least one duplication event on that edge. For example, clicking on the triangle at the top of the page on the edge leading up to papaya brings up the following list:

[Screenshot: gene family list for the selected edge; image not available]
Looking closely at the list shows that it includes a count of the number of duplications that occur on that edge for each gene family.

[Screenshot: duplication counts in the gene family list; image not available]

Potential Modification of Species Tree Entry Point

This entry point could be modified to support different types of reconciliation by explicitly listing the reconciliation_set options along with the species tree. The name of the reconciliation set selected will determine the list of gene families and duplication counts.

For example:

[Mockup: species tree entry point with reconciliation_set options; image not available]

Current visualization of gene tree and species tree

Currently when the user selects a gene family from the list, they get a visualization of the gene tree next to the species tree.

[Screenshot: gene tree shown next to the species tree; image not available]
The duplications in the reconciled gene tree are indicated as red nodes and the speciation events are indicated as blue nodes. In general this view will need to be updated to show the name of the gene family being investigated at the top.

Potential Modifications of Gene Tree/Species Tree Side-by-side view

The modification of this view will generally take entry from the list of gene families as before. Only now the API will need to be aware that multiple reconciliations exist, and must select the correct one to view by default.

However, it will now be possible to allow the user to view reconciliations from different methods for the same gene family. This could be done with a drop-down list similar to what was illustrated for the species tree or, if only a limited number of reconciliation methods are to be used, a set of tabs would also allow the user to flip through the different reconciliation views.

Reconciliation Set as Pull Down Menu 
[Mockup: reconciliation set as pull-down menu; image not available]

Alternatively for a limited number of reconciliations, it could be possible to represent the different reconciliations as tabs.

Reconciliation Set as Tabs
[Mockup: reconciliation set as tabs; image not available]

TR Advanced Search Entry

The GUI also allows for advanced search by BLAST search or by text searches for Gene Name, Gene Ontology Term and Gene Family ID. Some of these searches jump directly to the gene tree that contains that gene_name or gene family ID. One required change would be to generate a list of reconciliation results that include that gene family id, along with information on how each reconciliation was generated. That would allow the user to pick the reconciliation result of interest. Alternatively, we could pick a reconciliation result to be shown by default, and allow the user to pick among reconciliation results once they are in the side-by-side viewer.

[Screenshot: advanced search entry; image not available]
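
One possible way to pick the reconciliation shown by default is sketched below: the API returns all reconciliations for the gene family, and the default is the first one whose set appears in a preference list. The record layout, the preference list, and the function name choose_default are assumptions for illustration; the actual policy would need to be agreed on.

```python
def choose_default(reconciliations, preferred_sets):
    """Pick the reconciliation to display first for a gene family.

    reconciliations: list of dicts with 'reconciliation_id' and 'set_name'.
    preferred_sets:  set names in order of preference.
    """
    by_set = {rec["set_name"]: rec for rec in reconciliations}
    for set_name in preferred_sets:
        if set_name in by_set:
            return by_set[set_name]
    # Fall back to the first available result, or None if there is none.
    return reconciliations[0] if reconciliations else None


# Hypothetical search result for one gene family.
results = [
    {"reconciliation_id": 11, "set_name": "PHYLDOG_default_parameters"},
    {"reconciliation_id": 10, "set_name": "TREEBEST_default_parameters"},
]
default = choose_default(results, ["TREEBEST_default_parameters"])
print(default["reconciliation_id"])
```

The remaining entries in the list can then populate the drop-down list or tabs in the side-by-side viewer.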

TR Supporting Data

The data supporting the reconciled tree is available in the upper right-hand corner. This data retrieval currently fetches a flat file from a location on the server. It will need to be modified to be aware of the reconciliation_set that is being used. I would suggest that, if we stick with flat files, we keep a consistent naming scheme that builds on the unique name used to identify the reconciliation set. The gene family information will be the same for each reconciliation_set, but the NHX files will be different.

[Screenshot: supporting data links; image not available]
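
If flat files are kept, a naming scheme built on the reconciliation set name could be as simple as the sketch below. The directory layout, the shared directory name, the file extensions, and the function name supporting_file_path are all assumptions. The only idea taken from the text above is that NHX files differ between reconciliation sets while gene family information does not.

```python
from pathlib import PurePosixPath

# Hypothetical convention: only NHX files differ between reconciliation sets.
SET_SPECIFIC_EXTENSIONS = {"nhx"}


def supporting_file_path(base_dir, family_name, extension, set_name):
    """Build the server path of a supporting data file for a gene family."""
    if "/" in set_name or set_name in {"", ".", ".."}:
        raise ValueError(f"unsafe reconciliation set name: {set_name!r}")
    base = PurePosixPath(base_dir)
    if extension in SET_SPECIFIC_EXTENSIONS:
        # One subdirectory per reconciliation set, named after the set.
        return base / set_name / f"{family_name}.{extension}"
    # Files shared by every reconciliation set live in one common directory.
    return base / "shared" / f"{family_name}.{extension}"


nhx_path = supporting_file_path(
    "/data/tr", "family_0001", "nhx", "TREEBEST_default_parameters"
)
fasta_path = supporting_file_path(
    "/data/tr", "family_0001", "fa", "TREEBEST_default_parameters"
)
print(nhx_path)
print(fasta_path)
```

Because the set name is required to be unique in the database, using it as the directory name avoids collisions between pipelines without any extra bookkeeping.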

Databases for Testing Updated Interface

Example databases to consider using for testing (Jim Leebens-Mack)

Bower's Dataset 

This database will be a good example for multiple types of reconciliations for a given set of gene families.

While this is a good example, Jim notes that recent data suggest that Populus goes with Malvids and not Fabids.

Jiao et al Genome Biology - Timing of Gamma paper

1kp Pilot dataset

This database will be a good example of large number of species and will help test the upper bounds of the visualization environment.

  •  expect availability in the next few months
  •  expecting paper submission by April or May 1st
  • 83 species

We will try to have a subset of the 1kp pilot data available soon.

  • SATe species trees ... and TREEBEST reconciliations
  • John Kerry in Jim's lab will look at starting this reconciliation process with TREEBEST
    • John will use either TREEBEST or Notung to reconcile gene trees and the species tree.
  • We will get an early look at this data set with 83 species to see how well the visualizer holds up
    • Very few gene trees will include all 83 species

Transposable element dataset

  • Jamie would like to have an example TE dataset. This would represent a different extreme than the 1kp pilot data: a very large 'gene tree' with a relatively modest species tree.
    • A good database for starting this is GyDB (http://gydb.org/index.php/Main_Page). This can be supplemented by ab initio annotation of some classes of TEs as well as published TE annotations from sequenced genomes.