Data Integration needs from existing projects
Core Software Team's Projects
/wiki/spaces/coresw/pages/241769200: A fundamental web application that exposes various data upload, retrieval, browsing and analytical services
/wiki/spaces/coresw/pages/241769284: An independent contrasts (PIC) service that can be run within the web application with user-supplied data
/wiki/spaces/coresw/pages/241769041: Collaboration between scientists, sharing and authentication capability for the web application
- Data that will be shared/exchanged for this release is limited to tree data (newick string and character matrix used to create that tree) and trait data associated with species on that tree. The idea is that someone could utilize a shared tree for their own analysis (perhaps taking the character matrix and using it with another inference program if they choose). Someone else could wish to add to their existing trait data table by using some of the measurements gathered by another group.
- Exchange file formats- We want to give users the option to save the data in the format of their choosing. At this time, Nexus and Phylip will be the supported formats. The goal is to allow the user to view/download/save the shared file in a format that is compatible with their future use. Trait data- currently, we are planning to allow upload of trait data in csv, tsv or nexus format. This has not been implemented yet (it's a separate user story).
- Minimum set of metadata- at this point, we are keeping the reference to where the tree comes from (if it is someone else's tree), the inference method used to get the tree and provenance information related to who the trait/tree data belongs to and the publication information (if available). Any provenance included in the uploaded file, whether it is an original tree or derived, will be preserved. Per our agreement with Sudha and Matt, this metadata will be kept in a raw form until they have had sufficient opportunity to do their profiling. Semantic parsing and further management of it will likely be added after that time, but better insight into those plans can only be gained by talking with them. We will be implementing the system and the model; they will be generating the model.
- Any ontology will/should be used?- I'm unsure of the scope of this question. Could you elaborate on the question? Perhaps this might be better discussed with developers?
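The planned csv/tsv trait-data upload mentioned in the list above could be read along the following lines. This is a minimal sketch, not the implemented service: the function name `read_trait_table`, the `species` column name, and the rule of picking the delimiter from the header line are all assumptions made for illustration.

```python
import csv
import io


def read_trait_table(text, species_column="species"):
    """Parse CSV or TSV trait data into {species: {trait: value}}.

    Assumption: the first line is a header and one column (named
    `species_column`, a hypothetical name) holds the species label.
    """
    # Guess the delimiter from the header line: a tab means TSV, otherwise CSV.
    lines = text.splitlines()
    header = lines[0] if lines else ""
    delimiter = "\t" if "\t" in header else ","

    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter, restval="")
    traits = {}
    for row in reader:
        species = row.pop(species_column).strip()
        # Skip surplus unnamed fields (key None) and trim whitespace.
        traits[species] = {
            name: value.strip()
            for name, value in row.items()
            if name is not None
        }
    return traits
```

Keeping every value as a string leaves type decisions (numeric vs. categorical traits) to later processing, which seems consistent with holding uploaded data in raw form for now.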
Current development in core software team
Tree Reconciliation's prototype - a gene tree species tree reconciliation service
- what types of data will be used as input, intermediate results, and output?
(a) a species tree, w/ or w/o polytomies, and (b) either a gene tree or a gene alignment (depending on which method is used).
- what exchange file formats will be used for each type of data?
Initially, these will be internal (not user-provided), so no requirement there. In a more mature interface, users will need to upload gene trees as Newick or gene alignments as Clustal/Phylip/Fasta. I would like to see the web service layer accept and produce NeXML to enable machine-to-machine communication.
For Nexus, we might as well adopt the Mesquite-style gene-to-species tree mapping system: the TaxaAssociation block. For NeXML, we should convene a discussion among interested parties (especially Rutger) and formally extend NeXML's capabilities to handle gene-to-species mapping. (Bill)
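Whatever file syntax ends up carrying it, the gene-to-species mapping discussed above amounts to a lookup from gene-tree leaf labels to species-tree leaf labels. The sketch below only illustrates that idea and a basic consistency check; it does not reproduce the TaxaAssociation block or any NeXML structure. The function names and the regular expression are assumptions, and the leaf extraction handles only simple unquoted Newick labels.

```python
import re

# Assumption: leaf labels are unquoted and follow "(" or ","; internal node
# labels (which follow ")") and branch lengths (after ":") are ignored.
LEAF_PATTERN = re.compile(r"[(,]\s*([^(),:;\s]+)")


def leaf_labels(newick):
    """Return the leaf labels of a simple Newick string, in order."""
    return LEAF_PATTERN.findall(newick)


def check_gene_to_species(gene_tree, species_tree, gene_to_species):
    """Report problems in a gene-to-species mapping.

    Returns (unmapped_genes, unknown_species):
    - genes in the gene tree that have no entry in the mapping
    - species named in the mapping that are absent from the species tree
    """
    species = set(leaf_labels(species_tree))
    unmapped = [g for g in leaf_labels(gene_tree) if g not in gene_to_species]
    unknown = sorted({s for s in gene_to_species.values() if s not in species})
    return unmapped, unknown
```

A reconciliation run would presumably refuse to start unless both returned lists are empty; the labels used in any such check (for example `geneA_h` mapped to `human`) are illustrative only.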
Could this be done over a semester by Jamie Estill? (Jim)
- what is the minimum set of metadata (e.g. provenance data) that should be kept? And how should the metadata be handled/stored?
For genes, we will want a standard menu of annotation data (Genbank/EMBL IDs, Pfam domains, GO annotations, possibly pathway designations) in order to facilitate search. That is the only external metadata I can think of, apart from the taxonomic names. As for provenance of user supplied files down the road, at a minimum we will want creator, date, title, and format.
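The minimum provenance named above (creator, date, title, format) could be captured in a small record like the one sketched here. The class name, field names, and the list of accepted formats (taken from the upload formats mentioned for gene trees and alignments) are assumptions for illustration, not a settled schema.

```python
from dataclasses import dataclass, asdict
from datetime import date

# Assumption: formats are limited to those mentioned for user uploads.
SUPPORTED_FORMATS = {"newick", "clustal", "phylip", "fasta"}


@dataclass
class UploadProvenance:
    """Minimal provenance for a user-supplied file (hypothetical schema)."""
    creator: str
    created: date
    title: str
    file_format: str

    def __post_init__(self):
        # Normalise the format name and reject anything unrecognised.
        self.file_format = self.file_format.lower()
        if self.file_format not in SUPPORTED_FORMATS:
            raise ValueError(f"unsupported format: {self.file_format}")

    def to_record(self):
        """Return a plain dict suitable for storing alongside the file."""
        record = asdict(self)
        record["created"] = self.created.isoformat()
        return record
```

Storing the record as plain key/value pairs keeps it independent of whatever metadata model is eventually adopted.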
I think we would want the relative positions of genes in the sequenced genomes. This information is captured in the gene names for many but not all genomes. (Jim)
- Any ontology will/should be used?
CDAO for the NeXML. Controlled vocabularies will be useful for species names (scientific/common) and gene/protein names/IDs, since mapping the taxon of the gene to the taxon in the species tree is fundamental.
- Use Cases Discussion