Phyloreference

Goal

The overall effort is to build a database that stores phyloreferences, i.e. statements that point to nodes on a tree. The tricky part is that a phyloreference cannot simply point to a particular node number: it must find the right node based on the pattern of names in the tree, so that it still works on different tree topologies and with different selections of leaf-node labels.

Use Cases and possible approaches

  1. a simple user interface to allow users to create new phyloreference entries
    I suppose a simple user table would be needed to track who can log in to create or delete entries. For our prototype, we don't need to worry about high security; a simple "create new account" form with an email address (as login name) and a password will do. When users log in, they see a list of their phyloreference entries; they can edit or delete existing ones, or create new ones.
  2. the user creates a phyloreference by:
    1. providing some sort of annotation to indicate what kind of metadata is being attached to the phyloreference (e.g., that a node is 2.2 million years old; that a node represents the higher taxon name "Primates"; that a paralogous gene duplication occurred at this node; that the PhyloCode name for this node is "Angiospermae", etc). For this, we should use any available ontologies so that it is understood what kind of annotation is being attributed to the node. CDAO would be one such ontology, though I don't know that it has entries for "age of node" or "higher taxon" etc. But perhaps we can invent a vocabulary for now and treat it as though these ontology terms exist (e.g. we could create a pop-up list for a set of predicates that, for now, we will invent).
    2. providing choices as to what method the phyloreference is using. See here for the syntax of a phyloreference if it is expressed in a PhyloWS query:
    3. providing specifiers to indicate the in-group and non-in-group criteria for resolving the phyloreference. For example, a phyloreference to a "Primates" clade might have the following in-group specifiers: Homo sapiens (human), Pan troglodytes (chimp), Macaca macaca (monkey), and Lemur; and the following non-in-group specifiers: Raccoon, Hedgehog, Shrew (etc). Based on these in-group and non-in-group sets of names, a node on a tree can be identified. The namespace of each specifier also needs to be indicated (e.g., in this case we're using the taxon name, but we could just as well be using an NCBI taxid, or other such identifier). For our purposes, it makes sense to use the taxon_variants that were in our previous name resolution prototype. In other words, we allow the user to type in a specifier and we quickly look up that name in the large dictionary of taxon_variant names that we have.
  3. Given a tree and given a category of metadata as indicated in step 2.1, annotations are attached to nodes in this tree. For example, let's say I have a tree and I want this tree to be decorated with paleontological dates: the phyloreference service examines the tree and finds all nodes that match any phyloreference entries in the database that have to do with paleontological time (e.g. the "2.2 million years" in step 2.1). The service returns the tree with the relevant nodes labeled with these annotations.
    To accomplish this, I guess we need a way to parse NeXML or NEXUS/Newick (e.g. using Bio::Phylo) and then, with the tree in some sort of data structure, check whether any nodes match the criteria of any phyloreferences. The output is then written as NeXML, with annotation statements included at each node that was matched to an existing phyloreference entry. The specifier name resolution can be accomplished with our taxon intel prototype.
    The result, then, is a database that stores phyloreferences, and a service wherein a tree can go trawling through this collection of phyloreferences and become decorated, in the right locations, wherever phyloreferences stick to the tree.
  4. A somewhat more extended use case is for this service to pick a set of phyloreferences and go trawling through a larger collection of trees (e.g. in a discovery-environment database of trees) and pick out all relevant nodes, i.e. using the phyloreferences as a special kind of query. In this regard, we'd want our phyloreference database to express the queries in PhyloWS syntax (as described here: http://evoio.org/wiki/Phyloreferencing_subgroup), and then have a PhyloWS service on top of a database of trees process the query and recover the trees and nodes that match.
  5. Michael Donoghue has his own particular use case for phyloreferences. He wants to use them to annotate a tree to indicate that a certain taxon is known to fall within a clade when it is unknown exactly where in that clade the taxon belongs. For example, suppose we have a well-resolved tree like so: ((A,B),((C,(F,D)),E)) and another taxon, G, which we know belongs somewhere in the clade (C,(F,D)), but we don't know whether the correct placement is (G,(C,(F,D))) or ((C,G),(F,D)) or (C,(G,(F,D))) or (C,((G,F),D)) or (C,(F,(G,D))). The "normal" way that software resolves this ambiguity is to collapse the entire clade, i.e. ((A,B),((G,C,F,D),E)), but then we lose the resolution within the clade. Michael wants a way to annotate a tree to say that taxon G belongs somewhere inside clade (C,(F,D)) without forcing this clade to collapse. He thinks that phyloreferencing is the solution, where the annotation says something like "taxon G is_in phyloreference (C+D)-(E, A)". This could be expressed using PhyloWS syntax and embedded in a NeXML file containing the ((A,B),((C,(F,D)),E)) tree.
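
To make the node-matching idea in use cases 2, 3 and 5 more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the planned implementation (which would more likely sit on top of Bio::Phylo and NeXML). Trees are modeled as nested tuples of leaf names; the `Phyloreference` class, the predicate names, and the functions `resolve` and `annotate` are all hypothetical. The matching rule is also an assumption: a phyloreference resolves to the smallest clade containing every in-group specifier that is present in the tree, and it is rejected if that clade contains any non-in-group specifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phyloreference:
    """Hypothetical record for one phyloreference entry."""
    predicate: str                     # invented annotation type, e.g. "node_age"
    value: str                         # annotation payload, e.g. "2.2 Mya"
    ingroup: frozenset                 # specifiers that must fall inside the clade
    outgroup: frozenset = frozenset()  # specifiers that must fall outside it


def leaves(node):
    """Return the set of leaf names under a node (a string or a nested tuple)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def to_newick(node):
    """Render a nested-tuple node as a Newick-style string (no branch lengths)."""
    if isinstance(node, str):
        return node
    return "(" + ",".join(to_newick(child) for child in node) + ")"


def resolve(tree, ref):
    """Return the node the phyloreference points to in this tree, or None.

    Assumption: in-group specifiers that are absent from the tree are ignored.
    """
    present = ref.ingroup & leaves(tree)
    if not present:
        return None
    node = tree
    while not isinstance(node, str):
        # Descend while a single child still contains every in-group leaf.
        narrower = [child for child in node if present <= leaves(child)]
        if not narrower:
            break
        node = narrower[0]
    if leaves(node) & ref.outgroup:
        return None  # the clade would swallow a non-in-group specifier
    return node


def annotate(tree, refs, predicate):
    """Map each matched clade (as Newick) to the values of the matching entries."""
    hits = {}
    for ref in refs:
        if ref.predicate != predicate:
            continue
        node = resolve(tree, ref)
        if node is not None:
            hits.setdefault(to_newick(node), []).append(ref.value)
    return hits


# The tree and the "(C+D)-(E, A)" phyloreference from use case 5.
donoghue_tree = (("A", "B"), (("C", ("F", "D")), "E"))
g_ref = Phyloreference("contains_taxon", "G",
                       frozenset({"C", "D"}), frozenset({"E", "A"}))

print(to_newick(resolve(donoghue_tree, g_ref)))      # (C,(F,D))
print(annotate(donoghue_tree, [g_ref], "contains_taxon"))
```

The same `g_ref` also resolves on a rearranged tree such as ((A,B),(((C,F),D),E)), where it picks out ((C,F),D); that is the behavior the Goal section asks for. It returns nothing on a tree like ((A,C),(D,E)), where no clade contains C and D without also containing A or E.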

Finally, the approach for matching a phyloreference with a node is greatly enhanced if we can take advantage of classifications so that the OTUs don't have to match up exactly. But more on that later.
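
As a placeholder for that later discussion, here is one way classification-aware matching might look. This is only a sketch: the `parent_of` dictionary is a toy stand-in for a real classification, the function names are hypothetical, and the rule that a specifier matches an OTU when either one is an ancestor of the other is an assumption, not a settled design.

```python
def lineage(name, parent_of):
    """Return the name followed by its ancestors, nearest first."""
    chain = [name]
    # Stop at the root, or if the toy classification accidentally loops.
    while chain[-1] in parent_of and parent_of[chain[-1]] not in chain:
        chain.append(parent_of[chain[-1]])
    return chain


def match_specifier(specifier, otus, parent_of):
    """Return the OTUs that equal the specifier or are related to it by ancestry."""
    spec_lineage = lineage(specifier, parent_of)
    return sorted(
        otu for otu in otus
        if specifier in lineage(otu, parent_of) or otu in spec_lineage
    )


# Toy classification (child -> parent), for illustration only.
parent_of = {
    "Homo sapiens": "Homo",
    "Pan troglodytes": "Pan",
    "Homo": "Hominidae",
    "Pan": "Hominidae",
    "Hominidae": "Primates",
    "Lemur": "Lemuridae",
    "Lemuridae": "Primates",
}

# A higher-taxon specifier matches the species-level OTUs it contains.
print(match_specifier("Hominidae",
                      ["Homo sapiens", "Pan troglodytes", "Lemur"], parent_of))
# A species-level specifier matches a genus-level OTU that contains it.
print(match_specifier("Homo sapiens", ["Homo", "Pan"], parent_of))
```

With something like this in place, the specifiers of a phyloreference could be expanded to whatever OTUs a given tree actually contains before the node lookup is attempted.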