High-level Workflow for Prototype Development
As part of the discussion in the December 14, 2009 meeting of the Working Group, Andrew Lenards expressed a desire to understand more of the user workflow leading up to the tasks described in the use cases (see Use Case Prioritization). The use cases give us an idea of what must be implemented in deliverables, but we are still missing the initial context of the user and their goal(s).
The purpose of this page is to flesh out a high-level workflow for users interested in Tree Reconciliation.
Where do users start? Perhaps a "day in the life" of a scientist/researcher interested in Tree Reconciliation. Feel free to provide this from your perspective as a researcher in this area.
Could we get a sample scientific question? Would the problem statements presented by Todd here be a good starting point? Sample data points, like taxa/gene/gene family, would be helpful to include.
Feel free to include responses in this document, or in comments.
Example Workflow
Prior to the detailed walk-through done in working group meetings [1, 2], workflow flowcharts were developed by Nicole Hopkins as a way to help analysts working on requirements frame questions and understand what was happening.
Visual summary of workflow in flowchart form: Example Workflow Flowchart Summary
Background:
A researcher is interested in pentatricopeptide repeat (PPR) proteins, and wants to see where the duplications of this large plant family have occurred in the plant species tree. She is particularly interested in the phylogenetic age of the subfamily for which some members have been implicated in restoration of cytoplasmic male sterility (Plant Journal 34(4), 407).
1. She starts with GenBank record CAD61285.1 and BLASTs it against the iPlant gene catalog in order to find its homologs.
- How did our researcher conclude that CAD61285.1 was the record of interest? I assume that they know the relation of PPR/Rf to this record. But how do they resolve their knowledge of that gene family with this GenBank record? A search in GenBank or Entrez?
- What constitutes the "gene catalog"? Is it that a BLAST search can be performed against some database containing gene records/data? Is there more to it than that?
- The topic of databases and researchers' perspective on them has its own page, created here
- Will the Discovery Environment be expected to fetch sequence data (in fasta) with just a GenBank Accession ID (like CAD61285.1)? Or would a user have the data locally, and copy/paste into a BLAST search or upload from their system? {I know the Discovery Environment will need to interface w/ sources like GenBank, more curious if a researcher would be more likely to know the GenBank Accession ID or have the data locally on their hard disk}
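To make the "fetch by accession" option concrete, here is a minimal sketch of retrieving FASTA data from NCBI given only an accession such as CAD61285.1. It uses the public NCBI E-utilities `efetch` endpoint. The function names are hypothetical, and treating the record as a protein entry (`db="protein"`) is an assumption. Nothing here reflects how the Discovery Environment actually does, or will, retrieve sequences.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Public NCBI E-utilities endpoint for record retrieval.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession, db="protein"):
    """Build the efetch URL for one accession (hypothetical helper).

    db="protein" is an assumption about the record type.
    """
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def fetch_fasta(accession, db="protein"):
    """Download the FASTA text for an accession (requires network access)."""
    with urlopen(efetch_fasta_url(accession, db)) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # The accession from step 1; the resulting text could be pasted into,
    # or passed to, a BLAST search.
    print(fetch_fasta("CAD61285.1"))
```

Either input path (accession lookup or local upload) ends with the same FASTA text, so a BLAST front end could plausibly accept both.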
2. She discovers that it has multiple strong hits to gene family X
(see Phytome BLAST results for one example of how to organize BLAST results by family).
- A definition of X here would help in trying to recreate this workflow for existing tools (where possible). If we use Phytome with the CAD61285.1 fasta data, the strongest hit falls in Gene Family #139; that hit is TIGR:At1g64100
- Focusing on the approach Phytome uses to show results (grouped by gene family):
- What about the organization makes this a page to model after?
- Is it easier to see the groupings by gene family? Some other mechanism?
- Are there any features that did not make it into the current version that users have asked for? (sortable columns, export to csv/tsv, etc.)
- Would there need to be options for aggregating data? If so, what are they? Download data as one single archive?
- It's best not to look at the single best hit, but at the best ensemble of hits to a family. In the case of Phytome Gene Family #139, the InterPro annotations confirm that the proteins contain the PPR domain (IPR002885). The gene/protein family has 562 members from, collectively, most of the species in the database. question re this answer
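The "best ensemble of hits" idea can be sketched as grouping BLAST hits by gene family and ranking families on the whole group rather than on the top hit. This is only an illustration. The function name, the bit-score cutoff, the ranking key (number of strong hits, then summed bit score), and all identifiers in the example data are assumptions, not Phytome's actual method or results.

```python
from collections import defaultdict


def rank_families(hits, min_bits=50.0):
    """Group hits by family and rank families by their ensemble of hits.

    hits: iterable of (subject_id, family_id, bit_score) tuples.
    min_bits: hypothetical cutoff for what counts as a "strong" hit.
    """
    by_family = defaultdict(list)
    for subject_id, family_id, bits in hits:
        if bits >= min_bits:
            by_family[family_id].append((bits, subject_id))

    summary = []
    for family_id, members in by_family.items():
        members.sort(reverse=True)  # strongest hit first
        summary.append({
            "family": family_id,
            "n_hits": len(members),
            "total_bits": sum(bits for bits, _ in members),
            "best_hit": members[0][1],
        })
    # Rank by the ensemble (count, then summed score), not the single best hit.
    summary.sort(key=lambda row: (row["n_hits"], row["total_bits"]), reverse=True)
    return summary


# Made-up example data: family "F1" holds the single strongest hit,
# but family "F2" has the stronger ensemble.
example_hits = [
    ("geneA", "F1", 900.0),
    ("geneB", "F2", 300.0),
    ("geneC", "F2", 280.0),
    ("geneD", "F2", 260.0),
    ("geneE", "F3", 20.0),   # below the cutoff, ignored
]
ranked = rank_families(example_hits)
for row in ranked:
    print(row["family"], row["n_hits"], row["total_bits"], row["best_hit"])
```

A per-family summary like this would also map naturally onto the sortable columns and CSV/TSV export raised in the questions above.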
3. A menu of options is available for each of the gene families returned.
- What options would be available in this menu? 'Revisit reconcile tree' seems like a workable name for the option identified in the next step (#4, see below).
- Considering the prototype, is this functionality present? Would the list of options change? (considering the identified use cases)
- Are there other paths that lead to other analysis from here? Or is the menu more than a set of visualization options?
4. She chooses the one that allows her to see the 'fat tree' of the PPR Rf subfamily within a low-resolution species tree (restricted to major lineages).
- How does restricting to major lineages affect the 'fat tree' shown? Does restricting mean that "non-major" lineages are simply filtered out or not shown?
5. She examines the identifiers of genes at the tip of the fat tree against a list that she has compiled in order to select the appropriate depth of the subfamily.
5a. All the common ancestors (or the most recent common ancestor) of the selected genes are highlighted.
5b. This allows her to see at what point in the coarse-resolution species tree the subfamily of interest is restricted.
5c. She chooses to zoom the view on that branch of the species tree.
5d. She hides the rest of the gene family from view.
5e. This reveals the ancestral species in which the common ancestor of the family was present, and reveals the sequence of duplications, losses (and speciation nodes) that have since occurred in the subfamily.
- Would the above bullets under #5 all be considered functionality needed in the prototype?
- Regarding 5e., to verify, showing whole genome duplication in such visualizations is truly out of scope for the prototype, correct?
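Step 5a (highlighting the common ancestors of the selected genes) can be sketched as follows. The tree is represented as a child-to-parent dictionary, which is an assumption made for brevity. The function names and the node labels in the example are hypothetical and do not describe the prototype's data model.

```python
def path_to_root(node, parent):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def mrca(selected, parent):
    """Most recent common ancestor of the selected tips."""
    selected = list(selected)
    if not selected:
        raise ValueError("at least one tip must be selected")
    common = path_to_root(selected[0], parent)
    for tip in selected[1:]:
        ancestors = set(path_to_root(tip, parent))
        # Keep tip-to-root order so common[0] stays the most recent ancestor.
        common = [node for node in common if node in ancestors]
    return common[0]


def nodes_to_highlight(selected, parent):
    """All nodes on the paths from each selected tip up to the MRCA."""
    top = mrca(selected, parent)
    highlighted = set()
    for tip in selected:
        for node in path_to_root(tip, parent):
            highlighted.add(node)
            if node == top:
                break
    return highlighted


# Hypothetical gene tree: root -> (n1 -> g1, g2), (n2 -> (n3 -> g3, g4), g5)
gene_parent = {
    "g1": "n1", "g2": "n1", "n1": "root",
    "g3": "n3", "g4": "n3", "n3": "n2",
    "g5": "n2", "n2": "root",
}
print(mrca(["g3", "g5"], gene_parent))
print(sorted(nodes_to_highlight(["g3", "g5"], gene_parent)))
```

Everything outside the highlighted set is a candidate for hiding in step 5d, and the MRCA is the natural node to zoom to in step 5c.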
6. She might, for instance, be able to tell that 'fertility restoration genes' have undergone a number of duplications since the common ancestor of the subfamily and so are not orthologous. This would support the idea that there has been proliferation of the subfamily to chaperone different mitochondrial transcripts with some specificity.
- These are conclusions made outside of the prototype or Discovery Environment from the visualizations and data, correct?
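The reasoning in step 6 (genes separated by a duplication are not orthologous) relies on each gene-tree node being labelled as a duplication or a speciation. One common way to obtain such labels is LCA reconciliation: map each gene node to the most recent species-tree node containing all of its descendants' species, and call it a duplication if it maps to the same species node as one of its children. The sketch below assumes that method. The page does not say which algorithm the prototype uses, losses are not inferred here, and all names and trees are hypothetical.

```python
def lca(a, b, parent):
    """Lowest common ancestor of nodes a and b in a child->parent tree."""
    ancestors = set()
    node = a
    while True:
        ancestors.add(node)
        if node not in parent:
            break
        node = parent[node]
    node = b
    while node not in ancestors:
        node = parent[node]
    return node


def reconcile(gene_children, gene_root, gene_to_species, species_parent):
    """LCA-map every gene node to a species node and label internal nodes.

    Returns (mapping, events); events[node] is "duplication" or "speciation".
    """
    mapping, events = {}, {}

    def visit(node):
        kids = gene_children.get(node, [])
        if not kids:  # tip: mapped to the species it was sampled from
            mapping[node] = gene_to_species[node]
            return
        for kid in kids:
            visit(kid)
        species = mapping[kids[0]]
        for kid in kids[1:]:
            species = lca(species, mapping[kid], species_parent)
        mapping[node] = species
        # Duplication if the node maps to the same species node as a child.
        is_dup = any(mapping[kid] == species for kid in kids)
        events[node] = "duplication" if is_dup else "speciation"

    visit(gene_root)
    return mapping, events


# Hypothetical species tree ((A,B)AB,C)ABC and a gene tree with two copies
# in both A and B: r -> (x -> a1, b1), (y -> a2, b2)
species_parent = {"A": "AB", "B": "AB", "AB": "ABC", "C": "ABC"}
gene_children = {"r": ["x", "y"], "x": ["a1", "b1"], "y": ["a2", "b2"]}
gene_to_species = {"a1": "A", "b1": "B", "a2": "A", "b2": "B"}

mapping, events = reconcile(gene_children, "r", gene_to_species, species_parent)
print(events)
```

In this toy example a1 and b1 join at a speciation node, so they would be read as orthologs, while a1 and b2 join at the duplication node r and so would not.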
7. She could potentially then link to the trait evolution interface to map the mitochondrial transcripts associated with each member of the subfamily and attempt to reconstruct the shifts in specificity within the subfamily.
- Sheldon, should this point be mentioned to the Trait Evolution working group? Do we need to forward this on to Liya Wang (their ETA)?