1KP file-naming conventions


In most instances we sequenced only one tissue sample per species. For the few species where we sequenced more than one tissue, a “combined sample” was created by pooling all of the read-pairs for that species. The same was done when, for whatever reason, we sequenced the same species and tissue more than once. The resulting data sets are given names like ZCUA-Flaveria_trinervia-2_samples_combined, indicating, in this instance, a pool of the two samples named HRVY-Flaveria_trinervia-mature_leaf and RLCS-Flaveria_trinervia-juvenile_leaf.
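The names quoted above appear to follow a CODE-Species-label pattern, which can be split apart programmatically. The following is a minimal sketch in Python, not part of the 1KP tooling: the function name parse_sample_name and the SampleName type are hypothetical, and the sketch assumes that neither the four-letter code nor the species name contains a hyphen.

```python
import re
from typing import NamedTuple, Optional


class SampleName(NamedTuple):
    code: str                      # four-letter sample code, e.g. "ZCUA"
    species: str                   # e.g. "Flaveria_trinervia"
    label: str                     # tissue name or "N_samples_combined"
    combined_count: Optional[int]  # N for a combined sample, else None


# Matches labels such as "2_samples_combined" (pattern inferred from the example).
_COMBINED = re.compile(r"^(\d+)_samples_combined$")


def parse_sample_name(name: str) -> SampleName:
    """Split a name like 'HRVY-Flaveria_trinervia-mature_leaf' into its parts.

    Assumption: the code and the species name contain no hyphens, so the
    first two hyphens are the field separators and the rest is the label.
    """
    parts = name.split("-", 2)
    if len(parts) != 3:
        raise ValueError(f"unexpected sample name: {name!r}")
    code, species, label = parts
    match = _COMBINED.match(label)
    count = int(match.group(1)) if match else None
    return SampleName(code, species, label, count)


if __name__ == "__main__":
    print(parse_sample_name("ZCUA-Flaveria_trinervia-2_samples_combined"))
    print(parse_sample_name("HRVY-Flaveria_trinervia-mature_leaf"))
```

A combined sample can then be recognised by checking whether combined_count is set, rather than by matching on the directory name by hand.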

Altogether, we sequenced 1345 samples (from 1174 species); including the 111 combined directories, there are 1456 assemblies. Because we do not correct the species names until publication, we maintain a separate website to track all known naming problems, as well as possible contamination issues.

A handful of samples are considered "failures" because too few scaffolds were produced upon assembly. These are retained in our data directories, in case someone should find them useful, but the paucity of scaffolds is noted on the website.


All assembled scaffolds longer than 300 bp were queried against all NCBI RefSeq plant sequences (Release 54, July 2012) using BLASTX. The best-matching protein-coding genes were used to generate GeneWise translations using a modified TransPipes pipeline (described here). Inferred amino acid sequences can be found in the assembly directory for each taxon (CODE-SOAPdenovo-Trans-translated.tar.bz2). Translated assembly names match those of the nucleotide assemblies from which they are derived.


Unfortunately, the pipeline relabels the entries in the FASTA file as 0, 1, 2, 3, ... Because the original SOAPdenovo-Trans assembly output is numbered 2000001, 2000002, ... (i.e. starting from one), the obvious conversion between the two sets of numbers is off by one. For example, scaffold-DFYF-2008707-Ilex_sp. corresponds to entry 8706, not 8707, in this tar file.
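The off-by-one conversion can be written out explicitly. The sketch below is illustrative only: the function names are hypothetical, the header layout scaffold-CODE-NUMBER-Species is inferred from the single example above, and it assumes every assembly is numbered from 2000001 as described.

```python
# First scaffold number in the original SOAPdenovo-Trans output, as stated above.
FIRST_SCAFFOLD_NUMBER = 2000001


def translated_index(scaffold_id: str) -> int:
    """Return the 0-based label used in the translated file for a scaffold ID.

    Assumption: IDs look like 'scaffold-DFYF-2008707-Ilex_sp.', with the
    seven-digit number as the third hyphen-separated field.
    """
    fields = scaffold_id.split("-", 3)
    if len(fields) < 3 or fields[0] != "scaffold":
        raise ValueError(f"unexpected scaffold ID: {scaffold_id!r}")
    return int(fields[2]) - FIRST_SCAFFOLD_NUMBER


def scaffold_number(index: int) -> int:
    """Inverse mapping: 0-based translated label to seven-digit scaffold number."""
    return index + FIRST_SCAFFOLD_NUMBER


if __name__ == "__main__":
    print(translated_index("scaffold-DFYF-2008707-Ilex_sp."))  # 8706
    print(scaffold_number(8706))                                # 2008707
```

Subtracting 2000001 rather than 2000000 is the whole correction: the first original scaffold, 2000001, maps to label 0.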

This issue does not affect the orthogroup analysis, which labels sequences using the original seven-digit numbers.