Taxonomic Intelligence
Why does iPlant Need Taxonomic Intelligence?
- From Bill Piel
- Mapping heterogeneous labels onto canonical names with ID numbers of external authorities. It's only through taxonomic names that we can assemble a tree of all plants and then associate life sciences data with leaves on trees. We need a way to resolve among semantic heterogeneity – lexical variants, misspellings, abbreviations, synonyms, etc.
- Providing classifications to enhance queries. It's often not enough to rely on species names – many types of questions rely on higher classifications. Asking the question: "how often did C4 photosynthesis evolve in Monocotyledons" requires a mapping between "Monocotyledons" and all leaves on the tree of plants that belong to this group.
The Problem of Nomenclature and Taxonomy
The problem that we face with respect to Linnaean taxonomy is that this was designed long ago before computers were around – for the human brain it works fine; for computers, it's infuriating. There are several basic problems that we need to deal with, largely stemming from the fact that taxonomy encompasses two related but distinct issues: (1) nomenclature, which deals with tracking the names (i.e. identifiers) that are applied to taxa; and (2) taxonomic circumscription, which deals with the scope of the taxon – i.e. which organisms belong to it and which do not.
The taxonomic system, as codified by the ICZN, ICBN, and other such bodies, mostly regulates issue #1 – i.e. it sets out rules that judge which identifiers are available for use and which are not, and which get priority over another. Unfortunately, biologists are constantly changing taxonomy with respect to issue #2: the circumscription of species changes over time. For example, prior to my Ph.D. thesis, it was thought that Metepeira labyrinthea lived throughout North and South America. I determined that it only lives in New England south to Florida and west to Texas. What they thought lived in California is actually a separate species that ranges from California to Guatemala; and several more (and different) species live in Argentina and Chile. The identifier "Metepeira labyrinthea" has not changed, but its circumscription is greatly reduced to a group of animals with a much smaller geographic range (and sharing a narrower set of characters).
Computers can communicate species names (e.g. "Metepeira labyrinthea"); and computers can communicate species names with a citation that establishes when and who first created the identifier ("Metepeira labyrinthea (Hentz, 1847)"); and computers can communicate a chresonym, which refers to a published use of a name (e.g. "Metepeira labyrinthea Piel 2001" refers to Piel's use of the name in 2001). But for the most part, the circumscription cannot be communicated effectively. For example, if I look at this page, I can tell that my circumscription is being used, in part because my (fairly recent) chresonym is cited. But on this page, I see that someone has photographed a bunch of Californian spiders and called them Metepeira labyrinthea. That means he's using an older circumscription of this identifier, which I happen to think is wrong. But understand that he is not misidentifying the animals: he is labeling them correctly, but with an older usage of that identifier. The upshot is that a computer has no way of knowing whether one datapoint labeled "Metepeira labyrinthea" is from an organism that belongs to the same species as a datapoint for another organism labeled "Metepeira labyrinthea."
In summary, the issues are:
(1) The circumscription of a taxon can change while the identifier stays the same – i.e. one identifier can refer to different things and there's no easy way to communicate this. Metepeira labyrinthea, as I showed above, is a case in point.
(2) The identifier can change even though the taxon has not changed. For example, see the synonyms on this page: The species was first called "Epeira labyrinthea" in 1847, but that genus name was preoccupied and had to be changed, so it became "Araneus labyrintheus" in 1904 until it was determined that Araneus was too broadly defined, so other genera were created and various species were moved from Araneus to these other genera, Metepeira being one of them. Many of these changes are so-called "objective synonyms" – i.e. the identifiers have changed following the rules of nomenclature, and Epeira labyrinthea turned into Araneus labyrintheus without any change in the circumscription of the taxon. It's much easier for computers to figure out objective synonyms (e.g. if two different species names refer to the same taxon only because their genus or spelling has changed), as compared to subjective synonyms (two different species names refer to the same taxon because Piel in 2001 thought that their circumscriptions were, in fact, the same even if the original creators thought they applied to different sets of organisms).
(3) A plant, animal, and bacterial taxon can share the exact same name. This is because the names are governed by different bodies, thus causing homonyms. Within animals, for example, two different taxa cannot share the same name – by the rules of the ICZN, the older of the two names gets precedence. There is no rule that says that a plant name must give way for an identical animal name if the animal name was coined before the plant one. Thus, "Aotus" is both a valid plant genus (a Eudicot) and a valid animal genus (a monkey).
So how do we resolve all this?
A student of Rod Page, Nadia Anwar, wrote a PhD thesis on the topic: http://theses.gla.ac.uk/471/. It is pretty thorough and tries to build an infrastructure that resolves the meaning of names as best it can. I'd suggest everyone skim through this thesis.
With TreeBASE, we pursued a less rigorous solution because we knew that we could not commit so much time and effort into building a big taxonomic infrastructure. Our strategy was to rely on uBio for identifying and resolving name variants, lexical groups, and objective synonyms; and rely on NCBI for resolving among subjective synonyms. Since then, uBio has diminished in terms of its reliability and support. Another service, the Global Names Index, has cropped up, buoyed by support from GBIF and EOL. This is the first step in developing a Global Names Architecture. The trouble is, GNA is under development, and we can't just wait around until they have fully developed this.
In the case of plants, we should probably also take advantage of IPNI – a massive index of all plant names. IPNI is great as a source for all known names and citations – but it does not try to resolve which names are synonyms: i.e. there's no objective or subjective resolution. For example, search on "Helianthus filiformis" and you'll find a record for it, listing when it was first described, etc., but you might not know that, according to this source, Helianthus filiformis is the same species as Helianthus salicifolius.
I think the solution for iPlant is to build something that is similar to the solution that TreeBASE uses, but perhaps design it in anticipation of the GNI/GNA coming together, and perhaps use Nadia Anwar's approach of trying to fuse multiple sources of taxonomic information – e.g. IPNI, species2000, GNI, etc.
Here, I'll summarize the details of how TreeBASE deals with names:
(1) we take advantage of uBIO's large dictionary of names (and the name-recognition tools that they supply), and store this dictionary in a table called "taxon_variants." Each record stores the fullnamestring and the namebankid extracted from uBIO.
(2) for each lexical group of taxon_variants, we map it against a canonical list of taxa (let's call this table "taxa"). In other words, from among the set of taxon_variants that make up a lexical group, we select a single canonical name that best represents the group. e.g., a lexical group in the taxon_variants table might include: "Homo sapiens", "Homo sapiens L.", "Homo sapiens Linnaeus 1758", "H. sapiens", etc. From this list, pick "Homo sapiens" to be the designated name used in the taxa table – each of the taxon_variants maps to this record in the taxa table. How do we pick a designated or canonical name? That's a bit complicated. I would recommend the following: if the members of this group map to an ncbi_taxid (as evidenced from metadata available in the namebankObject), then we query ncbi to get ncbi's preferred name. If no ncbi_taxid can be found in the namebankObjects, then pick whichever name uBIO labels as "canonical" or "preferred name" (also usually reported in the set of namebankObjects).
(3) The table called "taxa" contains one record for each taxon. Its fields include the namestring and an ncbi_taxid (if available). Note that if the set of taxon_variants that make up a lexical group have more than one different ncbi_taxid, then this should alert us to the existence of a homonym (e.g., "Aotus" is both a monkey and a plant). In that case, we should duplicate the set of taxon_variants and have two different taxon records – one for the plant and one for the monkey. Someone has to manually go through sets of newly created homonyms and delete taxon_variants from each set. (We're obliged to do this because sadly uBIO does not report which lexical variants go with which ncbi_taxid).
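The homonym check described in (3) can be sketched roughly as follows. This is a Python illustration, not TreeBASE's own code; the function name, the dictionary layout, and the taxid values in the usage note are all assumptions made up for the example. The manual pruning of each duplicated set still has to be done by a person.

```python
def split_homonyms(variants):
    """Group one lexical group into taxon records, one per distinct ncbi_taxid.

    variants: list of dicts with keys 'namebankid' and 'ncbi_taxid'
    (ncbi_taxid may be None). This layout is hypothetical.
    """
    taxids = sorted({v["ncbi_taxid"] for v in variants
                     if v["ncbi_taxid"] is not None})
    all_ids = [v["namebankid"] for v in variants]

    if len(taxids) <= 1:
        # Zero or one taxid: no homonym, a single taxon record owns every variant.
        return [{"ncbi_taxid": taxids[0] if taxids else None,
                 "variant_ids": all_ids,
                 "needs_manual_pruning": False}]

    # Several taxids: duplicate the whole variant set under each taxon record
    # and flag it, because uBIO does not say which variant goes with which taxid.
    return [{"ncbi_taxid": taxid,
             "variant_ids": list(all_ids),
             "needs_manual_pruning": True}
            for taxid in taxids]
```

Called with a group whose variants carry two different (placeholder) taxids, say 111 and 222, this returns two records that each hold the full variant list and are flagged for manual review.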
The uBIO web services are described here. This is a special release that is best for our purposes: we need to use the "http://www.ubio.org/webservices/service_internal.php" services not the standard "http://www.ubio.org/webservices/service.php" ones.
Here's our database for handling names:
    CREATE TABLE taxa (
        taxon_id integer NOT NULL,
        namebankid integer,                  -- from uBIO
        namestring character varying(255),   -- the "preferred" name for the species
        taxid integer,                       -- from ncbi
        groupcode integer                    -- (not used yet - I'm thinking of a kingdom code or such)
    );

    CREATE TABLE taxon_variants (
        taxon_variant_id integer NOT NULL,
        taxon_id integer,
        namebankid integer,                      -- from uBIO
        namestring character varying(255),       -- uBIO's name variant (short style)
        fullnamestring character varying(255),   -- uBIO's name variant (long style)
        lexicalqualifier character varying(30)   -- uBIO's qualifier (e.g. "canonical form")
    );

    CREATE TABLE taxon_labels (
        taxon_label_id integer NOT NULL,
        taxon_variant_id integer,
        taxon_label character varying(255)   -- the label for the leaf node
    );
Here are our methods for handling names, using perl regular expressions and uBio web services:
(1) Parse a nexus file to import a new tree. Store the tree in the various node and edge tables, and store the taxon labels in the taxon_labels table. Take each new, unmapped taxon label and modify it a bit to help with the matching process:
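To show how the three tables chain together (leaf label to name variant to canonical taxon), here is a small Python sketch using the standard-library sqlite3 module with an in-memory database. The column types are simplified to SQLite's, the row IDs and the leaf label are invented for illustration, and the uBIO namebankid columns are left NULL rather than guessed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE taxa (
    taxon_id INTEGER PRIMARY KEY, namebankid INTEGER,
    namestring TEXT, taxid INTEGER, groupcode INTEGER);
CREATE TABLE taxon_variants (
    taxon_variant_id INTEGER PRIMARY KEY, taxon_id INTEGER, namebankid INTEGER,
    namestring TEXT, fullnamestring TEXT, lexicalqualifier TEXT);
CREATE TABLE taxon_labels (
    taxon_label_id INTEGER PRIMARY KEY, taxon_variant_id INTEGER,
    taxon_label TEXT);
""")

# Illustrative rows only: one taxon, one of its variants, one leaf label.
conn.execute("INSERT INTO taxa VALUES (1, NULL, 'Homo sapiens', 9606, NULL)")
conn.execute("INSERT INTO taxon_variants VALUES "
             "(10, 1, NULL, 'Homo sapiens', 'Homo sapiens L.', NULL)")
conn.execute("INSERT INTO taxon_labels VALUES (100, 10, 'Homo sapiens LJ34')")


def canonical_name(label):
    """Follow label -> variant -> taxon; return (namestring, taxid) or None."""
    return conn.execute(
        """SELECT t.namestring, t.taxid
             FROM taxon_labels l
             JOIN taxon_variants v ON v.taxon_variant_id = l.taxon_variant_id
             JOIN taxa t ON t.taxon_id = v.taxon_id
            WHERE l.taxon_label = ?""",
        (label,),
    ).fetchone()
```

With these rows, looking up the label "Homo sapiens LJ34" yields the canonical name "Homo sapiens" with its ncbi taxid, and an unmapped label yields None.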
    # remove any host names, crosses, etc.
    $taxon_label =~ s/^([\w\s-]+) ex\.? .+$/$1/;
    $taxon_label =~ s/^([\w\s-]+) fm\.? .+$/$1/;
    $taxon_label =~ s/^([\w\s-]+) x\.? .+$/$1/i;

    # remove comments or notations that simply indicate ambiguity
    # but have the effect of disrupting taxonFinder's function
    $taxon_label =~ s/ cf\.? / /;
    $taxon_label =~ s/ var\.? / /;
    $taxon_label =~ s/ nr\.? / /;
    $taxon_label =~ s/ aff\.? / /;

    # separate any cases of a lower case followed by an upper case
    # because taxonFinder will fail if there is no separation between
    # species and suffix code (e.g. "Homo sapiensLJ34")
    $taxon_label =~ s/([a-z])([A-Z])/$1 $2/g;

    # separate any cases of a letter followed by a number, again because
    # taxonFinder will fail if there is no separation between species and
    # suffix code (e.g. "Homo sapiens3453"); matching a single letter and a
    # single digit keeps the whole number together after the split
    $taxon_label =~ s/([A-Za-z])(\d)/$1 $2/g;

    # remove a period followed by a number or letter. This is because people
    # frequently append their suffix codes with periods, e.g. "Homo sapiens.2342"
    $taxon_label =~ s/(\w)\.(\w)/$1 $2/;

    # first try to capture a trinomial with a trailing suffix
    if ($taxon_label =~ m/^([A-Z][a-z-]+) ([a-z-]+) ([a-z-]+) (.+)$/) {
        $withoutsuffix = "$1 $2 $3";
    # but maybe you have a good trinomial without a trailing suffix
    } elsif ($taxon_label =~ m/^([A-Z][a-z-]+) ([a-z-]+) ([a-z-]+)$/) {
        $withoutsuffix = "$1 $2 $3";
    # if that does not work, capture a binomial with a trailing suffix
    } elsif ($taxon_label =~ m/^([A-Z][a-z-]+) ([a-z-]+) (.+)$/) {
        $withoutsuffix = "$1 $2";
    } else {
        $withoutsuffix = "$taxon_label";
    }
(2) Check whether the "$withoutsuffix" name exists as the fullnamestring in the taxon_variants table.
(2a) If yes, then suggest to the submitter that it map to this. If you get more than one hit and these map to two or more taxon records, then present it to the user as two different homonyms.
(2b) If no, then we need to use uBIO's taxonFinder -> go to (3)
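The decision in step (2) can be sketched as below. This is a Python illustration, not the TreeBASE implementation; the function name, the returned status strings, and the (fullnamestring, taxon_id) pair layout are assumptions chosen for the example.

```python
def lookup_variant(variants, name):
    """Decide what to do with a cleaned-up name, per steps (2a) and (2b).

    variants: iterable of (fullnamestring, taxon_id) pairs, standing in for
    rows of the taxon_variants table (hypothetical layout).
    Returns a (status, taxon_ids) tuple.
    """
    # Collect the distinct taxon records reached by an exact name match.
    taxon_ids = sorted({taxon_id for full, taxon_id in variants if full == name})

    if not taxon_ids:
        return ("use_taxonfinder", [])   # (2b): nothing known, go on to step (3)
    if len(taxon_ids) == 1:
        return ("suggest", taxon_ids)    # (2a): propose this mapping to the submitter
    return ("homonyms", taxon_ids)       # (2a): several taxa share the name
```

For instance, if the variant list holds "Aotus" twice, once for each of two taxon records, looking up "Aotus" returns the "homonyms" status with both taxon IDs so the submitter can choose between them.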
(3) Send "$withoutsuffix" to taxonFinder. Get the namebankID. Here is an example where I take the string "Tetrao afer AB 2342" and run it through taxonFinder. (note that given the name cleansing procedure in (1), the suffix codes (AB 2342) would normally be missing. taxonFinder still works when suffix codes are present, but sometimes it screws up – so better to remove them ahead of time):
www.ubio.org/webservices/service_internal.php?function=taxonFinder&includeLinks=1&freeText=Tetrao+afer+AB+2342&version=2.0
The result is a bit of XML containing the namebankID "2576335".
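As a minimal sketch of how the taxonFinder request in step (3) might be assembled, the Python below builds the same query string with the standard library. It only constructs the URL and sends nothing; the function name is made up for the example, and the parameter names are copied from the example request in step (3).

```python
from urllib.parse import urlencode

# Endpoint of the special "internal" uBIO release recommended in this page.
BASE_URL = "http://www.ubio.org/webservices/service_internal.php"


def taxonfinder_url(free_text):
    """Build a taxonFinder request URL for a cleaned-up taxon label."""
    params = [
        ("function", "taxonFinder"),
        ("includeLinks", "1"),
        ("freeText", free_text),   # urlencode turns spaces into '+'
        ("version", "2.0"),
    ]
    return BASE_URL + "?" + urlencode(params)
```

Passing "Tetrao afer AB 2342" reproduces the query string of the example request, with the spaces encoded as plus signs.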
(4) Using the namebankID, request the namebankObject. Between the taxonFinder results and the namebankObject results, gather up all the name variants and any ncbi taxonid (e.g. search in the XML for all items under "//results/lexicalGroups/value/value" and "//results/basionymGroup/value") and also gather up all the names that are considered the most standard – i.e. from "//results/canonicalForm" and "//results/nameString". Below is an example URL for namebankID 2576335. Note that the keyCode is a personal code issued by uBIO – you can sign up for one.
www.ubio.org/webservices/service_internal.php?function=namebank_object&namebankID=2576335&version=2.0&keyCode=2c6d5eccba2627906481774fdcb60669c2ebee72
From the result, we can pull out all known taxon variants:
    namebankid  namestring             fullnamestring                           lexicalqualifier
    2576335     Tetrao afer            Tetrao afer                              NULL
    275422      Tetrao afer            Tetrao afer PLS Müller 1776              unknown (Default)
    11817       Pternistis afer        Pternistis afer (Statius Muller) 1776    NULL
    12294       Francolinus afer       Francolinus afer (PLS Müller 1776)       NULL
    23417       Francolinus afer afer  Francolinus afer afer (PLS Müller 1776)  NULL
    274343      Pternistis afer afer   Pternistis afer afer (PLS Müller 1776)   NULL
    275422      Tetrao afer            Tetrao afer PLS Müller 1776              NULL
    1559020     Pternistes afer afer   Pternistes afer afer (PLS Müller 1776)   NULL
    1762020     Pternistes afer        Pternistes afer                          NULL
    2475119     Francolinus afer afer  Francolinus afer afer                    NULL
    2576309     Pternistis afer afer   Pternistis afer afer                     NULL
    3345669     Pternistes afer afer   Pternistes afer afer                     NULL
... so in this case there are 12 different name variants. Since I know that ncbi prefers the name "Pternistis afer", I'm going to match this against the <fullNameString> of all the variants, and in this way I discover that the best namebankid to use is 1762020. If I didn't have an ncbi preferred name, I would look for a <lexicalQualifier> that says "canonical form" (in this case none of them do – most have NULL – but that's unusual: most namebank objects provide this info). If there's no ncbi preferred name, and if there's no canonical form qualifier, then I'd pick the top level name, in this case 2576335.
(5) Check whether the taxon_variants table already has a record for any of the namebankIDs that you gathered from the different name variants. Also check whether any record in the taxa table already has one of the ncbi taxid numbers you gathered. If either check finds a match, add any new taxon variants to the taxon_variants table and unite them with the existing taxon record.
(6) For each new namebankID, add it to the taxon_variants table. For each set of taxon_variant records, create a single taxon record (unless there are two ncbi taxids – meaning two homonyms – in which case create two records and likewise two related sets of taxon_variants). The namestring for the taxon record should be the most "standard" you can get. The order of priority is something like: (1) if you have a ncbi taxid, use whatever name ncbi uses; (2) if not, use what namebankObject has called the "canonical form" under "lexical_qualifier" from the lexicalGroups record; (3) if not that, use any canonicalForm record; (4) if not that, use the nameString that appears at the highest level of the namebankObject.
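The priority order in (6) amounts to a simple fallback chain, sketched here in Python. The function signature and the way the namebankObject fields are passed in (as already-extracted plain values) are assumptions for illustration; fetching the ncbi name and parsing the uBIO XML are left out.

```python
def choose_namestring(ncbi_name, lexical_group, canonical_form, top_name):
    """Pick the most "standard" name for a taxon record.

    ncbi_name:      ncbi's preferred name, or None if there is no ncbi taxid
    lexical_group:  list of (fullnamestring, lexicalqualifier) pairs
    canonical_form: the canonicalForm value, or None if absent
    top_name:       the nameString at the top level of the namebankObject
    """
    # (1) ncbi's name wins whenever a taxid was found.
    if ncbi_name:
        return ncbi_name
    # (2) otherwise a variant explicitly qualified as the canonical form.
    for name, qualifier in lexical_group:
        if qualifier and qualifier.strip().lower() == "canonical form":
            return name
    # (3) otherwise any canonicalForm record.
    if canonical_form:
        return canonical_form
    # (4) last resort: the top-level nameString.
    return top_name
```

With no ncbi name, no qualified variant, and no canonicalForm, the function falls through to the top-level name, which for the Tetrao afer example would be "Tetrao afer".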