Taxonomy_5char_index

Purpose

Work flow for creating unique five-character indices for all NCBI taxa.

Downloaded NCBI taxonomy database dump from ftp.ncbi.nih.gov

The data we are interested in is the names table (names.dmp)

$ tar xzvf taxdump.tar.gz
$ head names.dmp
1	|	all	|		|	synonym	|
1	|	root	|		|	scientific name	|
2	|	Bacteria	|	Bacteria <prokaryote>	|	scientific name	|
2	|	Monera	|	Monera <Bacteria>	|	in-part	|
2	|	Procaryotae	|	Procaryotae <Bacteria>	|	in-part	|
2	|	Prokaryota	|	Prokaryota <Bacteria>	|	in-part	|
2	|	Prokaryotae	|	Prokaryotae <Bacteria>	|	in-part	|
2	|	bacteria	|	bacteria <blast2>	|	blast name	|
2	|	eubacteria	|		|	genbank common name	|
2	|	not Bacteria Haeckel 1894	|		|	synonym	|

Extract the rows with 'scientific name'

$ grep 'scientific name' names.dmp >scientific_names.dmp
$ head scientific_names.dmp 
1	|	root	|		|	scientific name	|
2	|	Bacteria	|	Bacteria <prokaryote>	|	scientific name	|
6	|	Azorhizobium	|		|	scientific name	|
7	|	Azorhizobium caulinodans	|		|	scientific name	|
9	|	Buchnera aphidicola	|		|	scientific name	|
10	|	Cellvibrio	|		|	scientific name	|
11	|	Cellvibrio gilvus	|		|	scientific name	|
13	|	Dictyoglomus	|		|	scientific name	|
14	|	Dictyoglomus thermophilum	|		|	scientific name	|
16	|	Methylophilus	|		|	scientific name	|

simplify the format to just have the taxon ID and taxon name

$ cut -f1,3 scientific_names.dmp >ncbi_taxa.dmp
$ head ncbi_taxa.dmp 
1	root
2	Bacteria
6	Azorhizobium
7	Azorhizobium caulinodans
9	Buchnera aphidicola
10	Cellvibrio
11	Cellvibrio gilvus
13	Dictyoglomus
14	Dictyoglomus thermophilum
16	Methylophilus

A Perl script to generate a unique 5_character ID for each taxon

#!/usr/bin/perl -w
# unique.pl -- generates a unique 5 character id
use strict;

my @chars  = (0..9,'A'..'Z');
my %seen;
while (<>) {
  chomp;
  next if /tax_id/;
  my $string = get_code();
  # make sure we have not seen it
  until (!$seen{$string}) {
    $string = get_code();
  }
  $seen{$string}++;
  print "$string\t$_\n";
}


sub get_code {
  my $string;
  for (0..4) {
    $string .= $chars[int(rand(35)+0.5)];
  }
  return $string;
}

Generate the keys

$ ./unique.pl ncbi_taxa.dmp >indexed_ncbi_taxa.txt
$ head indexed_ncbi_taxa.txt
B3LO9	1	root
5Y61O	2	Bacteria
UJPRA	6	Azorhizobium
9LXZ6	7	Azorhizobium caulinodans
56IDZ	9	Buchnera aphidicola
83JGG	10	Cellvibrio
IYSGZ	11	Cellvibrio gilvus
T6CIH	13	Dictyoglomus
ZEATN	14	Dictyoglomus thermophilum
ZE1A9	16	Methylophilus

Test to see if they are really unique

$ cut -f1 indexed_ncbi_taxa.txt |wc -l
586873
$ cut -f1 indexed_ncbi_taxa.txt | sort -u |wc -l
586873

Create a simple MySQL schema

CREATE TABLE `indexed_species` (
  `id`      varchar(5)   NOT NULL,
  `NCBI_id` INT          NOT NULL,
  `name`    varchar(100) NOT NULL,
  INDEX (`id`, `NCBI_id`, `name`)
)

Create and load the database

$ mysql -u root -e 'create database lookup'
$ mysql -u root lookup <schema.sql
$ mysql -u root lookup
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1397
Server version: 5.0.77 Source distribution

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> LOAD DATA INFILE '/home/smckay/big/treerecon/indexed_ncbi_taxa.txt' INTO TABLE indexed_species;
Query OK, 586873 rows affected, 3 warnings (3.74 sec)
Records: 586873  Deleted: 0  Skipped: 0  Warnings: 3

mysql> SELECT * from indexed_species LIMIT 10;
+-------+---------+----------------------------------------------+
| id    | NCBI_id | name                                         |
+-------+---------+----------------------------------------------+
| 000L9 |  175313 | Macrogalea candida                           | 
| 000QL |  519191 | Oxalis perdicaria                            | 
| 0011T |  376348 | Puccinia striiformis f. sp. hordei           | 
| 001IX |  122867 | Oxyops                                       | 
| 001MY |  651348 | Sparattanthelium uncigerum                   | 
| 001Q5 |  466344 | Influenza A virus (A/Denmark/127/2004(H3N2)) | 
| 001V4 |  179390 | Apocryptophagus sp. GW841B                   | 
| 001YI |  235266 | Acidianus filamentous virus 1                | 
| 0022Q |  510154 | Pseudomonas sp. YRR04                        | 
| 00295 |  576724 | Glypturus sp. RR-2008                        | 
+-------+---------+----------------------------------------------+
10 rows in set (0.00 sec)

mysql>