TR_21DEC09


Action items

Agenda

plan and time-line for prototype development

DRAFT
Dec 7-31

  1. Sheldon and Andy investigate data sources and assemble test data sets for development purposes
  2. Sheldon and Andy will publish a more detailed development plan based on feedback from this meeting

    The following are comments/links describing possible gene family databases (Jim):
    * Phytozome - the description on the info page is incomplete, but clustering seems to be based on bitscores for within- and between-species/clade gene sets, i.e. orthology/paralogy estimation may be included in the process for circumscribing gene families.
    * PlantTribes - TribeMCL clustering (see info page). Alignments have been filtered based on column scores, and these can be obtained from Claude dePamphilis.
    * PLAZA - Also based on TribeMCL clustering (info page). Results of downstream analyses, including Notung, MAY be available.
    * EnsemblPlants - Not clear whether clustering has been done specifically for plant genes. Description for hclust_sg seems to focus on animal genes, but Ensembl team may have developed plant gene clusters in collaboration with Gramene.

Jan 1-15

  1. Sheldon: Work on back-end data, alignment service (if need be), treebest gene tree construction
  2. Andy: Assemble first data sets for prototype, gather more requirements from WG

Jan 16-31

  1. Sheldon: Work on tree reconciliation and reconciled tree view; queries
  2. Andy: queries; Rough out UI; general tree viewing

Feb 1-28

  1. Sheldon, Andy - iterate with WG, refine design, integrate XXX canned data sets.
  • Selection of data sets for the released prototype will be guided by the working group
  • Selection of testing data for early development will be more pragmatic

Prototype Thoughts and Assumptions

DRAFT

  • May mix and match sources for best coverage. Prototype will use worked examples (may be quite a few), no user-supplied data.
  • Full length CDS from ATG to stop codon?
  • AA-guided nucleotide alignments for Phyml/treebest; may need to do (or redo) the alignments.
  • Is there a divergence threshold beyond which we consider sequences too divergent for DNA-based ML? If so, is it measured as raw distance, Ks, or Ka?
  • Species trees require internal node labels for treebest (may need to edit).
  • treebest gives us gene trees. Want to reconcile downstream of treebest, use 'reconcile' and primetv to show reconciled trees graphically.
  • Want to be able to view: alignments, gene/species trees, reconciled trees
  • treebest package is compiled C binaries (good for integration with web app)
  • treefam database schema comes with a Perl API (good) and is tied to treebest; consider using a subset of the treefam schema for the prototype back-end to get the schema and API for free.
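
Two of the input-preparation points above (AA-guided nucleotide alignments, and species trees that need internal node labels) can be sketched as below. This is a minimal illustration, not part of treebest or any existing pipeline: the function names `thread_cds_onto_protein` and `label_internal_nodes` and the `N1`, `N2`, ... label scheme are assumptions made for the example. The first function assumes the CDS really encodes the aligned protein (it checks lengths only, not the translation); the second assumes a Newick string without whitespace, quoted labels, or bracketed comments.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def thread_cds_onto_protein(aligned_protein: str, cds: str, gap: str = "-") -> str:
    """Build one codon-alignment row by threading an unaligned CDS onto its aligned protein."""
    cds = cds.upper()
    n_residues = sum(1 for ch in aligned_protein if ch != gap)
    # A full-length CDS (ATG to stop) carries a final stop codon that the
    # protein row does not; drop it before threading.
    if len(cds) == 3 * (n_residues + 1) and cds[-3:] in STOP_CODONS:
        cds = cds[:-3]
    if len(cds) != 3 * n_residues:
        raise ValueError("CDS length does not match the number of aligned residues")
    codons = []
    pos = 0
    for ch in aligned_protein:
        if ch == gap:
            codons.append(gap * 3)  # one protein gap becomes three nucleotide gaps
        else:
            codons.append(cds[pos:pos + 3])
            pos += 3
    return "".join(codons)


def label_internal_nodes(newick: str, prefix: str = "N") -> str:
    """Add generated labels (hypothetical N1, N2, ... scheme) to unlabeled internal nodes."""
    out = []
    counter = 0
    for i, ch in enumerate(newick):
        out.append(ch)
        if ch == ")":
            nxt = newick[i + 1] if i + 1 < len(newick) else ";"
            # The label is missing when ')' is followed directly by a delimiter
            # or by the ':' that starts a branch length.
            if nxt in ",):;":
                counter += 1
                out.append(f"{prefix}{counter}")
    return "".join(out)
```

For example, `thread_cds_onto_protein("M-K", "ATGAAATAA")` gives `ATG---AAA`, and `label_internal_nodes("((A,B),(C,D));")` gives `((A,B)N1,(C,D)N2)N3;`. Existing internal labels are left untouched, so a partly labeled species tree only gets the missing ones filled in.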

Notes

Attendees: Michael Gonzales, Jim Leebens-Mack, Jerry Lu, Andrew Lenards

  • Discussion began with the user actions that Jim added for 3 use cases. Andy is curious how tasks are currently being done (example: What is the current workflow to reconcile a gene tree and species tree?). The user actions (or steps) supplied by Jim were indicated as helpful, but the current process of getting work done also needs to be understood so that its shortcomings are acknowledged and not repeated.
    • ACTION: Need working group to provide high-level workflow for work done in prototype.
    • ACTION: Elaborate on the user actions/steps discussion for use cases started here in the discussion area.
  • Sheldon & Andy talked on Friday and decided that a more detailed plan/time-line would not be published at this time.
  • Andy & Jim discussed potential data sets and which data sets would be included in the prototype. Jim would like the working group (Todd, Cecile, etc.) to discuss further what will be in the prototype. Some data sources lack transparency in how the gene family clusters were constructed (see the comments on gene family databases above). But Sheldon & Andy only need test data sets to begin assembling tools to do tasks, so the source of those data sets is not important for getting started.
  • Following on from the effort to find test data sets, Jim asked what exactly we are looking for. Andy explained that, in the simplest view, the test data sets he is looking for are inputs to TreeBeST. Jim asked how we would associate gene names back to species names. Andy had thought there were standard gene names and did not realize there would be a need to resolve between gene naming schemes or to determine a way to associate gene names back to species. Some lookup structure to resolve between varying gene naming schemes will be needed. Jim was not certain this would be a huge deal. Have other groups producing software tackled this problem? Andy asked more about potential standards. It seems the approach taken by Arabidopsis has been accepted as an ad hoc "standard" (or comes closest to a "standard"), but other genome sequencing projects, like Populus, do not adhere to this approach. Each genome project may have its own nomenclature. Jim mentioned that Populus used the name of the method/program used to infer the gene as part of the gene name. It seems the Populus project has discussed a re-release of the genome with gene naming in the Arabidopsis manner, but that has not happened (and may not in the near future).
  • Elaborating on the third use case discussed [below] (Identify points on the species tree where genes in a particular gene family, GO category, or specific metabolic pathway have diversified; Use Case #4):
    • A "user" would go into the system and ask, in Gene Ontology (GO) terms, where gene families are diversifying on the species trees.
      • So there would need to be a way to resolve the gene names involved with the gene naming scheme a user is familiar with (likely Arabidopsis)
      • In other words, picking a test data set from Phytozome means that the gene names in those data sets may not be the same as those used in other databases.
      • Gene Ontology does not provide unique identifiers for genes or gene names. It describes what gene products are, where they are, and what they do. Connecting gene names across genome projects has mainly been done by following the Arabidopsis approach. iPlant may find that it would be a service to the community to help encourage the formation of standards (or just following the Arabidopsis approach).
  • Jim wanted to ensure that Andy & Sheldon had all the info needed to use TreeBeST. Does Sheldon feel comfortable with the program? Are the inputs merely alignments & a species tree? Jim was concerned that the alignments may require separate phylogenetic analysis before being fed into TreeBeST. Andy suspects the Perl wrappers for the C binaries may be handling this, but is not sure since he has not ramped up on TreeBeST yet.
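
The lookup structure for gene naming schemes mentioned in the notes above could start as simply as the sketch below: explicit overrides first, then pattern rules per genome project. Everything here is illustrative rather than an agreed design. `SPECIES_RULES` and `species_for_gene` are hypothetical names; the Arabidopsis rule follows the AGI locus style (e.g. `AT1G01010`); the Populus prefixes are assumed examples of gene-model names that embed the prediction program, and would need to be checked against the actual release.

```python
import re
from typing import Dict, Optional

# Ordered (pattern, species) rules. The Populus prefixes are assumptions
# for illustration, not a verified list of gene-model naming prefixes.
SPECIES_RULES = [
    (re.compile(r"^AT[1-5CM]G\d{5}(\.\d+)?$", re.IGNORECASE), "Arabidopsis thaliana"),
    (re.compile(r"^(eugene3|fgenesh4|gw1|grail3|estExt)[._]"), "Populus trichocarpa"),
]


def species_for_gene(gene_id: str,
                     overrides: Optional[Dict[str, str]] = None) -> Optional[str]:
    """Return the species a gene identifier belongs to, or None if unresolved."""
    gene_id = gene_id.strip()
    # Explicit per-gene entries win over the pattern rules.
    if overrides and gene_id in overrides:
        return overrides[gene_id]
    for pattern, species in SPECIES_RULES:
        if pattern.match(gene_id):
            return species
    return None
```

A call such as `species_for_gene("AT1G01010")` resolves to Arabidopsis thaliana through the pattern rule, while identifiers from projects with no rule return `None` and can be collected for manual curation or added to the `overrides` table.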

Action items for January 4, 2010