Taxonomy_5char_index
Purpose
Work flow for creating unique five-character indices for all NCBI taxa.
Downloaded NCBI taxonomy database dump from ftp.ncbi.nih.gov
The data we are interested in is the names table (names.dmp)
$ tar xzvf taxdump.tar.gz $ head names.dmp 1 | all | | synonym | 1 | root | | scientific name | 2 | Bacteria | Bacteria <prokaryote> | scientific name | 2 | Monera | Monera <Bacteria> | in-part | 2 | Procaryotae | Procaryotae <Bacteria> | in-part | 2 | Prokaryota | Prokaryota <Bacteria> | in-part | 2 | Prokaryotae | Prokaryotae <Bacteria> | in-part | 2 | bacteria | bacteria <blast2> | blast name | 2 | eubacteria | | genbank common name | 2 | not Bacteria Haeckel 1894 | | synonym |
Extract the rows with 'scientific name'
$ grep 'scientific name' names.dmp >scientific_names.dmp $ head scientific_names.dmp 1 | root | | scientific name | 2 | Bacteria | Bacteria <prokaryote> | scientific name | 6 | Azorhizobium | | scientific name | 7 | Azorhizobium caulinodans | | scientific name | 9 | Buchnera aphidicola | | scientific name | 10 | Cellvibrio | | scientific name | 11 | Cellvibrio gilvus | | scientific name | 13 | Dictyoglomus | | scientific name | 14 | Dictyoglomus thermophilum | | scientific name | 16 | Methylophilus | | scientific name |
simplify the format to just have the taxon ID and taxon name
$ cut -f1,3 scientific_names.dmp >ncbi_taxa.dmp $ head ncbi_taxa.dmp 1 root 2 Bacteria 6 Azorhizobium 7 Azorhizobium caulinodans 9 Buchnera aphidicola 10 Cellvibrio 11 Cellvibrio gilvus 13 Dictyoglomus 14 Dictyoglomus thermophilum 16 Methylophilus
A Perl script to generate a unique 5_character ID for each taxon
#!/usr/bin/perl -w # unique.pl -- generates a unique 5 character id use strict; my @chars = (0..9,'A'..'Z'); my %seen; while (<>) { chomp; next if /tax_id/; my $string = get_code(); # make sure we have not seen it until (!$seen{$string}) { $string = get_code(); } $seen{$string}++; print "$string\t$_\n"; } sub get_code { my $string; for (0..4) { $string .= $chars[int(rand(35)+0.5)]; } return $string; }
Generate the keys
$ ./unique.pl ncbi_taxa.dmp >indexed_ncbi_taxa.txt $ head indexed_ncbi_taxa.txt B3LO9 1 root 5Y61O 2 Bacteria UJPRA 6 Azorhizobium 9LXZ6 7 Azorhizobium caulinodans 56IDZ 9 Buchnera aphidicola 83JGG 10 Cellvibrio IYSGZ 11 Cellvibrio gilvus T6CIH 13 Dictyoglomus ZEATN 14 Dictyoglomus thermophilum ZE1A9 16 Methylophilus
Test to see if they are really unique
$ cut -f1 indexed_ncbi_taxa.txt |wc -l 586873 $ cut -f1 indexed_ncbi_taxa.txt | sort -u |wc -l 586873
Create a simple MySQL schema
CREATE TABLE `indexed_species` ( `id` varchar(5) NOT NULL, `NCBI_id` INT NOT NULL, `name` varchar(100) NOT NULL, INDEX (`id`, `NCBI_id`, `name`) )
Create and load the database
$ mysql -u root -e 'create database lookup' $ mysql -u root lookup <schema.sql $ mysql -u root lookup Reading table information for completion of table and column names You can turn off this feature to get a quicker startup with -A Welcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is 1397 Server version: 5.0.77 Source distribution Type 'help;' or '\h' for help. Type '\c' to clear the buffer. mysql> LOAD DATA INFILE '/home/smckay/big/treerecon/indexed_ncbi_taxa.txt' INTO TABLE indexed_species; Query OK, 586873 rows affected, 3 warnings (3.74 sec) Records: 586873 Deleted: 0 Skipped: 0 Warnings: 3 mysql> SELECT * from indexed_species LIMIT 10; +-------+---------+----------------------------------------------+ | id | NCBI_id | name | +-------+---------+----------------------------------------------+ | 000L9 | 175313 | Macrogalea candida | | 000QL | 519191 | Oxalis perdicaria | | 0011T | 376348 | Puccinia striiformis f. sp. hordei | | 001IX | 122867 | Oxyops | | 001MY | 651348 | Sparattanthelium uncigerum | | 001Q5 | 466344 | Influenza A virus (A/Denmark/127/2004(H3N2)) | | 001V4 | 179390 | Apocryptophagus sp. GW841B | | 001YI | 235266 | Acidianus filamentous virus 1 | | 0022Q | 510154 | Pseudomonas sp. YRR04 | | 00295 | 576724 | Glypturus sp. RR-2008 | +-------+---------+----------------------------------------------+ 10 rows in set (0.00 sec) mysql>