TR_15MAR10
TR_WebEx - this meeting will include visuals, so WebEx Video will be utilized
AI: @Former user (Deleted) will present rough UI for tracking results/hits [done]
AI: @Former user (Deleted) to mock up an expert tree prune/selection w/ auto-complete to find taxa. Common taxa box should be incorporated. APWEB has an example of how to show users where their species/group of interest is placed on the tree. [incomplete]
Brief review (10 min.)
Discussion of UI Mock-flows
Pick-up w/ discussion of individual gene hyperlinks in Results tab/view
Pre-computation Workflow review
Regarding the creation of an iPlant Gene Catalog, what should be included beyond 1KP or 1KP Pilot Data?
[answer] (from Todd Vision)
NCBI. All NCBI RefSeq mRNAs for Viridiplantae include 273,424 sequences for 11 taxa. There are 960,134 mRNAs if one does not confine it to RefSeq, but even this excludes all unigenes. A merge of the unigene set and the mRNA set, with some processing to remove redundancies, would be close to ideal. Unigenes are not available for all the taxa with substantial EST collections (I notice Mimulus is absent, for instance), but NCBI might be persuaded to include them.
PlantGDB, which already calculates the cDNA-EST merge automatically and has more comprehensive taxonomic coverage. So this option might be preferred.
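A rough sketch of the "merge with some processing to remove redundancies" idea, in Python. This is an illustration only: the function name `merge_nonredundant` and the `{id: sequence}` input shape are assumptions, and it only drops sequences that are identical to, or fully contained in, a longer kept sequence. It does not check reverse complements or near-duplicates, which a real unigene/mRNA merge would need.

```python
def merge_nonredundant(*collections):
    """Merge {id: sequence} dicts, dropping contained or duplicate sequences.

    Hypothetical helper: quadratic in the number of sequences, so it is only
    suitable for small test sets, not a full Viridiplantae download.
    """
    records = [(sid, seq.upper())
               for coll in collections
               for sid, seq in coll.items()]
    # Longest first, so shorter fragments are compared against kept longer ones.
    # sorted() is stable, so earlier collections win ties.
    records = sorted(records, key=lambda r: len(r[1]), reverse=True)
    kept = []
    for sid, seq in records:
        if any(seq in other for _, other in kept):
            continue  # identical to, or a substring of, a sequence already kept
        kept.append((sid, seq))
    return dict(kept)


# Toy data (made-up identifiers and sequences).
mrna = {"m1": "ATGAAATTTGGG"}
unigene = {"u1": "AAATTT", "u2": "CCCCGGGG", "u3": "atgaaatttggg"}
merged = merge_nonredundant(mrna, unigene)
```

Passing the mRNA set first means that, on exact duplicates, the mRNA record is the one kept.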
It was suggested that if the entire subfamily was not selected (so particular members were checked in the checkbox interface, but not all) we might want to use some visual representation of this.
How do we represent subfamilies or families not checked by the user?
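One common way to show a partly selected subfamily is a tri-state checkbox (checked / unchecked / partial). The sketch below is hypothetical and not taken from the mockups: the `selection_state` function and the nested-dict tree layout are assumptions made for illustration.

```python
def selection_state(node, checked):
    """Return 'checked', 'unchecked', or 'partial' for a taxon node.

    node: dict with a 'name' and an optional 'children' list (assumed layout).
    checked: set of leaf names the user has ticked.
    """
    children = node.get("children", [])
    if not children:
        return "checked" if node["name"] in checked else "unchecked"
    states = {selection_state(child, checked) for child in children}
    if states == {"checked"}:
        return "checked"
    if states == {"unchecked"}:
        return "unchecked"
    return "partial"  # mixed children: draw a dash or shaded box


# Small example tree; only leaves can be ticked directly.
poaceae = {
    "name": "Poaceae",
    "children": [
        {"name": "Panicoideae",
         "children": [{"name": "Zea"}, {"name": "Sorghum"}]},
        {"name": "Pooideae",
         "children": [{"name": "Brachypodium"}, {"name": "Triticum"}]},
    ],
}
family_state = selection_state(poaceae, {"Zea"})
```

With only Zea ticked, both Panicoideae and Poaceae come out as 'partial', while Pooideae stays 'unchecked', which also gives one possible answer for how to display subfamilies the user did not check.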
Attendees: @Former user (Deleted), @Sharon Wei, @jleebensmack, @ane (Unlicensed), @tjv, @Zhenyuan Lu, @Former user (Deleted), @NicoleH
Meeting Overview
Discussion of what data will be needed to populate a gene catalog for use by prototype and beyond
Discussion of user interface functionality facilitated by simple mockups
Discussion of aspects of necessary 'Pre-computation Workflow' for creation of gene trees and reconciliations
this workflow supported Use Case #1 for Phase 1 of the Prototype
Detailed Discussion Notes
Brief review and discussion of postdoc and meeting w/ collaborator Jens by Cecile and Todd.
Rejoin discussion about the results user interface reference
GenBank Accession may not work because some of the data used in the gene catalog will not be present there.
Want to avoid using internal iPlant-specific identifiers here
What would a sensible "fall back" or pre-population value be? [open question]
discuss domain/business rules for how to populate this later...
Defer this discussion until it is decided what will be in the gene catalog
What will be used in the iPlant Gene Catalog beyond 1KP Transcriptome and/or 1KP Pilot Data?
Two options:
NCBI RefSeq & mRNA/cDNA, but only 11 taxa are represented (Mimulus is absent, for example). Merge data from UniGene with mRNA set.
PlantGDB - calculates cDNA/EST merge, provides their own unique IDs, updated every 4 months. Path of less hassle & higher quality data. From PlantGDB, you get unigene builds, but what are they doing w/ Next-Gen data? We may want to touch base w/ Volker Brendel - his group runs PlantGDB at Iowa State University.
ActionItem: Andrew needs to speak with Steve Goff to see if next-gen assembly is on the radar and what actions iPlant could take to aid those efforts.
Following on from this to discussion on Short Read Archive and NextGen data
Raw data goes into the Short Read Archive - folks have their own databases of assemblies; not aware if those are being submitted to GenBank or only exist in the SR Archive.
iPlant may be included in assembly of Next-Gen datasets (re iPG2P).
How many species are there that have RNASeq data?
Summary: group focused discussion on taking next-gen data and assembling it using an NGS pipeline
Submission of data w/ standard format key
NCBI's Short Read Archive is difficult to use, extremely cumbersome, and described as disorganized.
Discussion of mockup for Results Tracking reference
Todd was pleased with user interface, overall group liked the suggested approach
The Action Item for Andy to produce a mockup for the "expert prune/selection" interface remains incomplete
A common taxa box should be included.
APWEB was mentioned as an example of how to show users where group/species is on the tree, but Andrew was not able to find the example. Jim clarified the example here
Filter results:
if only interested in genes from one taxon (or group of taxa), use the 'star interface' so that those genes are starred
only show genes of interest
only show genes from one or more of the common taxa
Re 'star interface' - a user could star all rice genes, later (at same stage) indicate only interested in specific genes in limited number of rice species.
Re 'common taxa' box
should first start out being populated with model organisms (the most complete) and have an option (say an 'Add...' hyperlink) to select more organisms (the less complete, but of interest to user).
highlight & restrict options would be nice in 'common taxa' box
organize list by taxonomic groupings - collapsible list? Include model organisms above the additions to the common taxa box so users can quickly find model organisms.
The 'Add...' more selections should be adding to the collapsible list?
In some ways, this functions like a faceted search
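The filtering discussed above (starred genes, common taxa) can be sketched as a small faceted filter. This is only an illustration under assumptions: `GeneHit`, `filter_hits`, and the field names are hypothetical and do not come from the mockups.

```python
from dataclasses import dataclass


@dataclass
class GeneHit:
    gene_id: str           # hypothetical identifier, not an agreed ID scheme
    taxon: str
    starred: bool = False  # set when the user stars the gene


def filter_hits(hits, taxa=None, starred_only=False):
    """Keep hits that match every active facet; an unset facet does not filter."""
    result = []
    for hit in hits:
        if taxa is not None and hit.taxon not in taxa:
            continue
        if starred_only and not hit.starred:
            continue
        result.append(hit)
    return result


hits = [
    GeneHit("g1", "Oryza sativa", starred=True),
    GeneHit("g2", "Oryza sativa"),
    GeneHit("g3", "Arabidopsis thaliana", starred=True),
]
rice_starred = filter_hits(hits, taxa={"Oryza sativa"}, starred_only=True)
```

Each facet narrows the previous result, which is why the 'common taxa' box plus the 'star interface' behaves like a faceted search.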
Discussion of Pre-computation Workflow (for Phase 1 Prototype development)
TreeBeST needs Amino Acid guided alignments as part of the input. Are there scripts or tooling for this? Can we use ESTs to produce Amino Acid guided alignments?
is an AA-guided alignment going to be provided? Or will a data source for test datasets be available?
Scripts exist to force this; still need to figure out how best to estimate EST from unigenes.
BioPerl has an AA-to-DNA align that can take AA-align & output cDNA
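For reference, the idea behind an AA-to-DNA alignment conversion can be sketched in a few lines of Python. This is not the BioPerl routine itself: `thread_cdna` is a hypothetical name, and the sketch assumes the cDNA starts in frame at the first aligned residue, with no UTR and one codon per residue.

```python
def thread_cdna(aligned_protein, cdna, gap="-"):
    """Map an ungapped coding sequence onto one gapped protein alignment row.

    Each residue is replaced by its codon and each gap by three gap
    characters, so the codon alignment keeps the protein alignment's columns.
    """
    residues = len(aligned_protein.replace(gap, ""))
    if len(cdna) < 3 * residues:
        raise ValueError("cDNA is too short for the aligned protein")
    codons = []
    pos = 0
    for aa in aligned_protein:
        if aa == gap:
            codons.append(gap * 3)
        else:
            codons.append(cdna[pos:pos + 3])
            pos += 3
    return "".join(codons)


codon_row = thread_cdna("M-K", "ATGAAA")  # "ATG---AAA"
```

Unigene data with untranslated regions would first need the coding region located before a step like this can be applied.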
With unigene data that has untranslated regions, what is the best tool to translate?
Find the largest open reading frame
Users may want to know what the translation (the prediction) is - this intermediate form may be of interest and, therefore, may need to be stored.
ESTWise - search protein sequence, pull out hits and feed 3 together w/ EST or mRNA sequence and provides prediction of translation. Suggested by Todd, who has used it with very satisfactory results. Tool is part of EBI toolkit.
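A minimal sketch of the "find the largest open reading frame" option, assuming the standard genetic code, an ATG start, and a six-frame scan. The function names are hypothetical. ORFs that run off the end of the sequence without a stop codon are ignored here, which may be too strict for partial ESTs. This is not how ESTWise works; ESTWise uses a protein hit to guide the prediction.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def longest_orf(seq):
    """Return (strand, start, protein) for the longest ATG-to-stop ORF.

    start is 0-based on the strand that was read; returns None if no
    complete ORF exists. Codons with ambiguous bases translate to 'X'.
    """
    seq = seq.upper()
    best = None
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start, protein = None, []
            for i in range(frame, len(s) - 2, 3):
                aa = CODON_TABLE.get(s[i:i + 3], "X")
                if start is None:
                    if aa == "M":
                        start, protein = i, ["M"]
                elif aa == "*":
                    if best is None or len(protein) > len(best[2]):
                        best = (strand, start, "".join(protein))
                    start = None
                else:
                    protein.append(aa)
    return best


orf = longest_orf("CCATGAAATTTTAGCC")  # ("+", 2, "MKF")
```

Returning the protein string along with its position makes it straightforward to store the predicted translation if users want to see it.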
How should we be doing the clustering?
1KP Analysis - TribeMCL using any large genome sequences or full length cDNA sequences - cluster all using TribeMCL & use it as a framework for sorting these using BLAST (then augment w/ Hidden Markov Model). Todd & Jim discussed if TribeMCL was deprecated. It appeared that it has been deprecated or there were memory issues with very large datasets.
OrthoMCL - to achieve higher granularity - identify smaller gene families (or clusters).
all-by-all BLAST and cluster based on e-values - do w/ all large full length coding sequences. This becomes a scaffold - might want to be using some of iPlant computational resources (would need to be able to specify clearly to TACC what is desired/needed).
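A simple way to prototype the "all-by-all BLAST and cluster based on e-values" option is single-linkage clustering over BLAST tabular output. Note this is a stand-in, not the Markov clustering used by TribeMCL or OrthoMCL. The function name and the `1e-10` cutoff are assumptions, and the input is assumed to be the standard 12-column tabular format with the e-value in column 11.

```python
def cluster_blast_hits(lines, max_evalue=1e-10):
    """Single-linkage clusters from 12-column tabular BLAST hits.

    Two sequences share a cluster if a chain of hits with
    e-value <= max_evalue connects them (union-find).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip comments and malformed rows
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        root_q, root_s = find(query), find(subject)  # registers both ids
        if evalue <= max_evalue and root_q != root_s:
            parent[root_q] = root_s

    clusters = {}
    for seq in list(parent):
        clusters.setdefault(find(seq), set()).add(seq)
    # Largest clusters first, then alphabetical, for stable output.
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))


def _row(query, subject, evalue):
    # Helper for the toy example: pads the other columns with dummy values.
    return "\t".join([query, subject, "90.0", "100", "0", "0",
                      "1", "100", "1", "100", evalue, "200"])


toy_hits = [_row("a", "b", "1e-50"), _row("b", "c", "1e-20"),
            _row("d", "e", "0.5")]
families = cluster_blast_hits(toy_hits)  # [["a", "b", "c"], ["d"], ["e"]]
```

Single linkage tends to chain unrelated families together through multi-domain proteins, which is one reason MCL-based tools give finer-grained clusters.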