Summary of deliverables:
A prototype of a web service behind a PhyloWS interface that accepts NeXML, extracts all the taxon labels, matches them against a dictionary, picks a canonical name and a list of authority IDs for each, and then repackages the NeXML (plus the added metadata) and returns it to the user.
From an email sent by Bill Piel:
I've been busy working on Perl scripts that consume NeXML or NEXUS, and then take taxon labels and query uBio, GTI, and EOL. I'm still "researching" what these services return (and unfortunately, each one is sub-par – EOL lacks NCBI exposure, uBio services are inconsistent, and GTI also has idiosyncrasies).
Regarding a GUI – I'm very much in favor of building on PhyloWidget. One issue to deal with is that I'd like to use standard metadata annotation as specified in NeXML as much as possible, but PhyloWidget's internal management of annotations is in the style of NHX. I think PhyloWidget needs to read and write (i.e. both GET and PUT) NeXML using PhyloWS, where we invent a way for NHX annotations to be translated automatically into NeXML metadata elements. So, this implies some significant work to be done on PhyloWidget – but that is somewhat higher-hanging fruit.
The lower-hanging fruit, IMO, is to build a taxonomic-improvement service behind a PhyloWS interface, and then demo it by developing a working use-case. For example, take two NeXML files whose taxon labels are not immediately combinable – e.g. one has "Felis_concolor" and the other has "Puma_concolor" (two names for the same species). Build a simple supertree service that takes the two trees in NeXML and outputs a single MRP matrix, after first sending each one to the taxon intel service that annotates the OTUs. Since both the Felis_concolor OTU and the Puma_concolor OTU get annotated with the same EOL id (or the same NCBI taxid, or the same ITIS id, or the same TROPICOS id, etc.), the supertree service can now fuse them together because it "knows" that the Puma_concolor OTU == the Felis_concolor OTU. It then spits back an MRP result. This demo serves as an example of purely "computer-to-computer" communication using PhyloWS + NeXML + annotation (that is hopefully in compliance with standards, e.g. CDAO).
That said, it would be sexier if we could demo this in combination with a modified PhyloWidget that can GET/PUT NeXML and display/edit metadata from NeXML in its NHX annotation system. That would make the demo more compelling, because people like to see interfaces. But... I'm guessing that the improvements to PhyloWidget are slightly higher-hanging fruit (is there time to do that too, before the Phoenix meeting?).
Anyway, the goal (IMO) is to have a collection of service-oriented cyberinfrastructure out there – taxonomic services, tree inference services, GUIs, and phylogeny database repositories – that all speak PhyloWS + NeXML + CDAO, and that the biologists (with a bit of instruction) can collectively use to assemble the plant tree of life.
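The Felis_concolor / Puma_concolor use-case from the email can be illustrated with a small Python sketch. It is not the planned service: the authority IDs are made-up placeholders, the function names `fuse_otus` and `mrp_matrix` are hypothetical, and trees are reduced to a set of taxa plus a list of clades rather than parsed from NeXML. The matrix uses the usual MRP coding (1 = in the clade, 0 = in the tree but outside the clade, ? = absent from the tree).

```python
def fuse_otus(annotations):
    """Group OTU labels that share at least one authority ID.

    annotations maps label -> set of authority IDs (placeholder values here).
    Returns a dict mapping every label to one representative label per group.
    """
    parent = {label: label for label in annotations}

    def find(label):
        # Follow parent links to the group's root, shortening the path as we go.
        while parent[label] != label:
            parent[label] = parent[parent[label]]
            label = parent[label]
        return label

    first_label_for_id = {}
    for label in sorted(annotations):
        for auth_id in annotations[label]:
            if auth_id not in first_label_for_id:
                first_label_for_id[auth_id] = label
                continue
            a, b = find(label), find(first_label_for_id[auth_id])
            if a != b:
                # Keep the alphabetically first label as the representative.
                keep, drop = sorted((a, b))
                parent[drop] = keep
    return {label: find(label) for label in annotations}


def mrp_matrix(trees):
    """Build an MRP matrix from (taxa, clades) pairs that use fused labels."""
    all_taxa = sorted(set().union(*(taxa for taxa, _ in trees)))
    matrix = {taxon: "" for taxon in all_taxa}
    for taxa, clades in trees:
        for clade in clades:
            for taxon in all_taxa:
                if taxon not in taxa:
                    matrix[taxon] += "?"
                elif taxon in clade:
                    matrix[taxon] += "1"
                else:
                    matrix[taxon] += "0"
    return matrix


# Placeholder annotations: the two puma labels share one (invented) ID.
annotations = {
    "Felis_concolor": {"id:A"},
    "Puma_concolor": {"id:A"},
    "Felis_catus": {"id:B"},
    "Panthera_leo": {"id:C"},
    "Canis_lupus": {"id:D"},
}
rep = fuse_otus(annotations)


def relabel(labels):
    """Replace each label by its group's representative."""
    return {rep[label] for label in labels}


tree1 = (relabel({"Felis_concolor", "Felis_catus", "Canis_lupus"}),
         [relabel({"Felis_concolor", "Felis_catus"})])
tree2 = (relabel({"Puma_concolor", "Panthera_leo", "Canis_lupus"}),
         [relabel({"Puma_concolor", "Panthera_leo"})])

for taxon, row in mrp_matrix([tree1, tree2]).items():
    print(taxon, row)
```

Because both puma labels collapse to one representative, the two trees share two taxa instead of one, which is what gives the supertree step something to work with.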
We will be borrowing a lot of ideas from Nadia Anwar's Ph.D. thesis and from Roger Hyam's discussion of taxonomy and the semantic web. Nadia Anwar proposes a data schema that looks like this:
Of these parts, I don't think we need the VERNACULAR table (vernacular names are too vague for our purposes), nor do we need the TREE and the NODES. But the simple set of NAME, SOURCE, ASSERTION, and SYNONYM would suit our purposes. One potential problem is that although uBio delivers a set of names within a lexical group, they are inconsistent in telling us what their "preferred" name is – i.e. what the senior synonym is from within a set of monotypic synonyms. The SYNONYM_NAME table here has a parent-child design where the "VALID_NAME_ID" is always known. That's not necessarily the case, i.e. we may know that species A is an objective synonym of species B, but we may not know which one to use as the "valid" name. Additionally, it's not clear to me from this ER diagram whether synonym assertions are typed to a particular name source – and I think we would want that. At any rate, more to think about here.
Here's one option:
The idea is that we fuse the synonym information with the assertion information. This allows us to identify each synonym assertion with a particular source, and it makes sense because typically an assertion from a source is accompanied by a synonym assertion. The URI of an assertion is the URL in the name_source table combined with the URN in the assertion table. The predicate in the assertion table explains the relationship between the name_id (as child/subject) and the object_id (as parent/object). Typically, the predicate will be "synonym" as in "name_id is_a_synonym_of object_id" where object_id is assumed to be the preferred name or the valid name. By having a predicate column, we allow for other hierarchical relationships, such as "Homo sapiens sapiens is_a_subspecies_of Homo sapiens."
In the name table, the name_text is the taxonomic name, while the alt_name_text is any other name that some sources provide – e.g. uBio has a "fullname" and a "shortname", NCBI has a "name" and a "unique name" (which disambiguates among homonyms).
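To make the table layout concrete, here is a minimal sketch using Python's built-in `sqlite3` module rather than the project's PostgreSQL database. Only the table names and the columns mentioned above (url, urn, name_id, object_id, predicate, name_text, alt_name_text) come from the description; the key columns, types, the `source_name` column, the example URL and the sample row are all assumptions for illustration.

```python
import sqlite3

# Hypothetical DDL: column types and key columns are guesses, not the real schema.
SCHEMA = """
CREATE TABLE name_source (
    source_id   INTEGER PRIMARY KEY,
    source_name TEXT NOT NULL,
    url         TEXT NOT NULL          -- base URL shared by the source's records
);
CREATE TABLE name (
    name_id       INTEGER PRIMARY KEY,
    name_text     TEXT NOT NULL,       -- the taxonomic name
    alt_name_text TEXT                 -- e.g. a "unique name" or "fullname"
);
CREATE TABLE assertion (
    assertion_id INTEGER PRIMARY KEY,
    source_id    INTEGER NOT NULL REFERENCES name_source(source_id),
    urn          TEXT NOT NULL,        -- record identifier within the source
    name_id      INTEGER NOT NULL REFERENCES name(name_id),   -- child/subject
    predicate    TEXT NOT NULL,        -- e.g. 'synonym', 'subspecies'
    object_id    INTEGER NOT NULL REFERENCES name(name_id)    -- parent/object
);
"""


def build_demo_db():
    """Create an in-memory database holding one made-up assertion."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO name_source VALUES (1, 'ExampleSource', 'http://example.org/taxon/')"
    )
    conn.executemany(
        "INSERT INTO name VALUES (?, ?, ?)",
        [(1, "Homo sapiens", None), (2, "Homo sapiens sapiens", None)],
    )
    conn.execute("INSERT INTO assertion VALUES (1, 1, '0001', 2, 'subspecies', 1)")
    return conn


def assertion_rows(conn):
    """Return (uri, subject name, predicate, object name) for every assertion."""
    return conn.execute(
        """
        SELECT s.url || a.urn, n.name_text, a.predicate, o.name_text
        FROM assertion a
        JOIN name_source s ON s.source_id = a.source_id
        JOIN name n ON n.name_id = a.name_id
        JOIN name o ON o.name_id = a.object_id
        ORDER BY a.assertion_id
        """
    ).fetchall()


for row in assertion_rows(build_demo_db()):
    print(row)
```

The `s.url || a.urn` expression is the URL-plus-URN concatenation described above; joining `name` twice resolves both ends of the predicate.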
Bill has populated the database with plant names from ITIS, NCBI, and USDA PLANTS. A PostgreSQL dump of the database is available here.
The record numbers are as follows:
- 344,145 distinct plant names in the name table
- 449,083 assertions made about these names from three sources: USDA PLANTS, NCBI, and ITIS
- 144,106 plant names are considered to be "valid" by at least one of the three sources
By comparison, there are about 95,000 plant names in NCBI that NCBI considers "valid" – so USDA and ITIS don't merely duplicate NCBI: the three sources overlap significantly, but each adds names to the total. However, this is still well short of the "400,000" plant species names. For that, we're hoping that we can get data from Tropicos, fingers crossed. Other possible sources include EOL, uBio, Catalogue of Life, IPNI, etc. But for now we can start playing with this data.
Here is a query that takes a name (in this case Acacia baileyana F. Muell.) and, in the event of a match, returns a list of assertions assembled from the three datasets:
Using the above query, we can run this simple example: searching on the name Cypripedium passerinum results in URIs for all three databases:
Slightly more complicated, a search on Cladium jamaicensis returns a match for NCBI, but not USDA or ITIS. However, because NCBI recognizes this as a misspelling of Cladium jamaicense, and because USDA and ITIS have records for the version with the correct spelling, we get URIs back for all three sources:
Along similar lines, a search on 'Acacia baileyana F. Muell.' is recognized only by USDA – however, USDA knows this to be a variant of Acacia baileyana, and once that connection is made, all three sources report URIs for it:
Below are two examples of search results for names that are no longer current: Alhagi camelorum and Eurybia commixta. Again, the three sources help each other out. So, even though NCBI does not recognize Eurybia commixta, ITIS reports four synonyms for it and USDA reports three synonyms. Of these synonyms, NCBI recognizes two of them, so can report back two URIs.
Clearly, assembling assertions from multiple sources has a synergistic effect. However, in general, the result errs on the side of lumping: all it takes is for one data source to map a name incorrectly to another name, and the list of metadata returned grows considerably.
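The lookup behaviour in these examples can be mimicked with a short in-memory Python sketch. This is not the SQL query Bill used. It assumes each assertion is a `(uri, subject, predicate, object)` tuple, that a name a source accepts is recorded as an assertion pointing at itself, and that all names linked by any assertion are treated as equivalent; the URIs and the function names `related_names` and `lookup_uris` are invented for illustration.

```python
from collections import deque


def related_names(start, assertions):
    """Return every name reachable from `start` by following assertions in either direction."""
    links = {}
    for _uri, subject, _predicate, obj in assertions:
        links.setdefault(subject, set()).add(obj)
        links.setdefault(obj, set()).add(subject)

    seen = {start} if start in links else set()
    queue = deque(seen)
    while queue:
        current = queue.popleft()
        for other in links[current] - seen:
            seen.add(other)
            queue.append(other)
    return seen


def lookup_uris(start, assertions):
    """Collect the URIs of all assertions whose subject is linked to `start`."""
    names = related_names(start, assertions)
    return sorted({uri for uri, subject, _p, _o in assertions if subject in names})


# Made-up records modelled on the Cladium example: only one source knows the
# misspelling, but it links to the spelling the other two sources recognize.
ASSERTIONS = [
    ("http://example.org/ncbi/1", "Cladium jamaicensis", "misspelling", "Cladium jamaicense"),
    ("http://example.org/usda/2", "Cladium jamaicense", "valid", "Cladium jamaicense"),
    ("http://example.org/itis/3", "Cladium jamaicense", "valid", "Cladium jamaicense"),
]

print(lookup_uris("Cladium jamaicensis", ASSERTIONS))
```

Following links in both directions is also what causes the lumping: a single wrong assertion connects two otherwise separate groups of names, and every URI from both groups comes back.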
Matching on Taxon Labels:
It's one thing to have a big dictionary of names in a database – it's quite another to take the labels off of tree leaves and match them against the database. Abbreviated genera and misspellings are a problem, and there is no good solution for them, although building a separate agrep dictionary might do the trick for misspellings. A bigger problem is with suffix codes, for example when a culture collection number or a GenBank accession number is tagged onto the end of a taxon label. Sometimes adding a suffix code is unavoidable because most phylogenetic programs require that all labels be unique – so when the same species appears two or more times in a tree, numbers must be added to make the labels distinguishable. Another common problem is when people use a period instead of an underscore to represent a space, e.g. "Homo.sapiens" instead of "Homo_sapiens". A final complication is when people fail to put a space between the species epithet and a suffix code, e.g. "Homo_sapiensAJ234234".
I (Bill) will recommend the following approach:
- Start by seeing if an exact match with the database dictionary produces a hit. If it does, go with that, otherwise...
- Substitute all periods with underscores; insert a separator after any lowercase letter that is followed by an uppercase letter; insert a separator after any lowercase letter that is followed by a number; consolidate any resulting double separators into a single one.
- See if you can extract a capitalized word followed by all lower case words – don't include any trailing words that are either too short (< 3 letters), or have any upper case letters, or have any numbers in them. Note that a hyphen is allowed.
Here is some example code:
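What follows is a minimal Python sketch of the three steps, not Bill's original script. It assumes that underscores in labels stand for spaces, that the dictionary is a plain set of space-separated names, and that the name ends at the first word that is too short or contains an uppercase letter or a digit. The function names are hypothetical.

```python
import re


def normalize_label(label):
    """Step 2: turn periods into separators and split off fused suffix codes."""
    text = label.replace(".", "_")
    # Insert a separator between a lowercase letter and a following
    # uppercase letter or digit, e.g. "sapiensAJ234234" -> "sapiens_AJ234234".
    text = re.sub(r"(?<=[a-z])(?=[A-Z0-9])", "_", text)
    # Collapse runs of underscores/whitespace into single spaces.
    return re.sub(r"[_\s]+", " ", text).strip()


def extract_name(text):
    """Step 3: keep a capitalized word plus the lowercase words that follow it."""
    words = text.split(" ")
    if not re.fullmatch(r"[A-Z][a-z-]+", words[0]):
        return None
    kept = [words[0]]
    for word in words[1:]:
        # Stop at the first word that is too short or has uppercase/digits.
        if len(word) < 3 or not re.fullmatch(r"[a-z-]+", word):
            break
        kept.append(word)
    return " ".join(kept)


def match_label(label, dictionary):
    """Return the dictionary name matching `label`, or None if there is no hit."""
    # Step 1: try the label as given (underscores read as spaces).
    exact = label.replace("_", " ")
    if exact in dictionary:
        return exact
    # Steps 2 and 3: clean the label up and try again.
    candidate = extract_name(normalize_label(label))
    return candidate if candidate in dictionary else None


dictionary = {"Homo sapiens", "Puma concolor"}
for label in ["Puma_concolor", "Homo.sapiens", "Homo_sapiensAJ234234", "H_sapiens"]:
    print(label, "->", match_label(label, dictionary))
```

Abbreviated genera such as "H_sapiens" and misspellings still fall through to None here, consistent with the caveat above that there is no good solution for them yet.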