Access to OneKP data set

1KP file-naming conventions

Every sample is assigned a unique four-letter code that never changes. If we have to resequence because of a failed experiment, the new sample is given a new code. To aid identification, some (but not all) of the file names and file contents are supplemented with a species name. However, species names are not universally recognized, and of course we make mistakes too. Hence the names can change, even up to the last minute before publication.

To avoid having to make repeated changes to the file names and file contents, we generally keep the names initially assigned. Our current species names are at and the way to index them is to use the unique four-letter codes because they never change. For example, the directory WHHY-Boerhavia_cf.spiderwort-mature_leaf has the code WHHY and the species is currently (2012-11-30) listed as Boerhavia coccinea. However, the species name assigned when the sample was first acquired (Boerhavia cf. spiderwort) is used throughout the directory, in the file names and in the file contents.

In most instances we sequenced only one tissue sample per species. But for those few species where we sequenced more than one tissue, a “combined sample” is created by pooling all of the read-pairs for that species. The same was done when, for whatever reason, we happened to sequence the same species and tissue more than once. The resultant data sets are given names like ZCUA-Flaveria_trinervia-2_samples_combined, indicating, in this instance, a pool of the two samples named HRVY-Flaveria_trinervia-mature_leaf and RLCS-Flaveria_trinervia-juvenile_leaf.
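The naming convention described above can be unpacked programmatically. The following is a minimal sketch, not part of the 1KP tooling: the function name is hypothetical, and it assumes that a directory name always has the form CODE-species-description with hyphens used only as separators. A species name that itself contains a hyphen would need special handling.

```python
def parse_sample_directory(name):
    """Split a sample directory name into its code, species, and description.

    Assumption (not confirmed by the source): the name is always
    CODE-species-description, and hyphens occur only as separators.
    """
    code, species, description = name.split("-", 2)
    if len(code) != 4:
        raise ValueError(f"expected a four-letter code, got {code!r}")
    return {
        "code": code,                # stable key for indexing
        "species": species,          # name as initially assigned, may be outdated
        "description": description,  # tissue, or e.g. "2_samples_combined"
        "combined": description.endswith("_samples_combined"),
    }


single = parse_sample_directory("WHHY-Boerhavia_cf.spiderwort-mature_leaf")
pooled = parse_sample_directory("ZCUA-Flaveria_trinervia-2_samples_combined")
```

Because the species name embedded in the directory may be outdated, only the "code" field should be used as a key when joining against other tables.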

Altogether, we sequenced 1345 samples (from 1174 species), and if we include the 111 combined directories, there are 1456 assemblies. Given that we do not correct the species names until publication, we are maintaining another website to track all known naming problems, as well as possible contamination issues.

Sample failures

A handful of samples are considered "failures" because too few scaffolds are produced upon assembly. These are retained in our data directories, in case someone should find them useful, but the paucity of scaffolds is noted on the website.

A typical sample directory contains the solexa-reads and assembly folders described below. Additional files containing intermediate results might also be present in the sample directory.

Unassembled RNAseq read-pairs

All of the sequencing was done at BGI-Shenzhen. Read-pairs that fail a minimum quality threshold are discarded. The remaining read-pairs are divided across two FASTQ files and stored under the solexa-reads folder. These files use the Phred+64 convention to represent quality scores. See for details. Note that the above-mentioned "combined samples" will not have a solexa-reads folder.
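Many current tools expect Phred+33 quality strings, so the Phred+64 encoding mentioned above is a common source of confusion. The sketch below shows how a Phred+64 quality string maps to numeric scores and how it could be re-encoded as Phred+33. The function names are hypothetical and the code is illustrative only.

```python
def decode_phred64(quality):
    """Convert a Phred+64 quality string to a list of integer scores."""
    scores = [ord(ch) - 64 for ch in quality]
    if any(q < 0 for q in scores):
        # Characters below '@' cannot occur in Phred+64 data.
        raise ValueError("quality string is not Phred+64 encoded")
    return scores


def phred64_to_phred33(quality):
    """Re-encode a Phred+64 quality string using the Phred+33 offset."""
    return "".join(chr(q + 33) for q in decode_phred64(quality))


scores = decode_phred64("hhB@")
converted = phred64_to_phred33("hhB@")
```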


Given the nature of the tissue samples provided for the RNAseq experiments, it is impossible to avoid contamination by non-plant genes from bacteria, fungi, etc. It is not at all unusual for these genes to be sampled at a sufficiently high depth to be assembled by SOAPdenovo-Trans into scaffolds. We would be skeptical of any claims that these scaffolds represent lateral gene transfers. A more serious confounding issue would be cross-contamination from other plant genes. This is difficult to assess because most of our species have never been subject to large-scale sequencing. Hence any searches against the available plant sequences can only return non-exact matches to “related” species.

SOAPdenovo-Trans assembly

In 2012, we ran an assembly using the newly developed SOAPdenovo-Trans and GapCloser software from BGI-Shenzhen. The assembled sequences are located in the assembly folder. The scaffold names (e.g. scaffold-AALA-2079325-Meliosma_cuneifolia) embed the sample’s four-letter code, a scaffold number, and a description of the source material, in this case just the species name. Scaffold numbers are unique to an assembly. In particular, they begin at 2000000 for the current assemblies. Lower numbers are used for older assemblies and, should we do them, higher numbers would be issued for newer assemblies.
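A scaffold name of the form described above can be split into its parts as follows. This is a sketch under the assumption that the first three hyphens are always separators. The function name is hypothetical.

```python
def parse_scaffold_name(name):
    """Split 'scaffold-CODE-NUMBER-description' into its components."""
    prefix, code, number, description = name.split("-", 3)
    if prefix != "scaffold":
        raise ValueError(f"not a scaffold name: {name!r}")
    return code, int(number), description


code, number, description = parse_scaffold_name(
    "scaffold-AALA-2079325-Meliosma_cuneifolia"
)
# Numbers from 2000000 upward identify the current assemblies.
is_current_assembly = number >= 2000000
```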

Major configuration parameters used for SOAPdenovo-Trans:

  • k-mer size used for de Bruijn graphs
  • minimum read coverage for contigs
  • minimum contig length for scaffolds
  • run the internal gap filling step for SOAPdenovo-Trans
  • maximum number of putative alternative splice forms
  • average insert size used in sequencing library

Default values were used for any parameter not listed above.

Errors in the assembly process often produce short low-coverage scaffolds. Since the read-pairs are generated from inserts of nominal size 200 bp, output scaffolds near this size (i.e. less than 300 bp) are likely to be artefacts and have been excluded from downstream analyses.

Assembly statistics per scaffold

Each line has a scaffold name, an approximate mean read depth, the number of reads in the scaffold, the number of bases in these reads, the total length of the scaffold, and the numbers of A, T, C, G, and N bases. Information about the numbers of reads and their bases is obtained from the readOnScaf file. Due to the specifics of how SOAPdenovo-Trans works, these values are only approximate and are probably a little on the low side.

For example, “scaffold-KEFD-2000876-Encalypta_streptocarpa 6.7 27 2235 335 108 114 51 43 19” is a scaffold assembled from 27 reads with a mean read depth of 6.7 (i.e. 2235/335); it has 19 undefined (gap) bases and a GC-content of 30%, i.e. (51+43)/(108+114+51+43).
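The arithmetic in this example can be reproduced with a short script. The sketch below assumes whitespace-separated fields in the order listed above. The function and field names are hypothetical.

```python
def parse_scaffold_stats(line):
    """Parse one line of the per-scaffold statistics into a dictionary."""
    fields = line.split()
    name, depth = fields[0], float(fields[1])
    n_reads, read_bases, length, a, t, c, g, n = (int(x) for x in fields[2:10])
    return {
        "name": name,
        "depth": depth,            # as reported in the file
        "reads": n_reads,
        "read_bases": read_bases,
        "length": length,
        "gaps": n,                 # undefined (N) bases
        # GC-content is computed over defined bases only.
        "gc": (c + g) / (a + t + c + g),
        # Depth recomputed from the read bases and scaffold length.
        "depth_check": read_bases / length,
        # The five base counts should add up to the scaffold length.
        "length_check": a + t + c + g + n,
    }


stats = parse_scaffold_stats(
    "scaffold-KEFD-2000876-Encalypta_streptocarpa 6.7 27 2235 335 108 114 51 43 19"
)
```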

BLASTX to NCBI nr proteins

All assembled scaffolds were searched against the NCBI’s nr peptide sequence database (non-redundant GenBank CDS translations+RefSeq Proteins+PDB+SwissProt+PIR+PRF, Release 54, July 2012) using BLASTX. The output was filtered at a maximum E-value of 1E-10 and only the top 5 sequence matches were retained in the SOAPdenovo-Trans-assembly.fa.bz2_blastx file under the assembly folder.

Translation to protein sequences

All assembled scaffolds longer than 300 bp were queried against all NCBI RefSeq plant sequences (Release 54, July 2012) using BLASTX. The best matching protein coding genes were used to generate GeneWise translations using a modified TransPipes pipeline (described here). Inferred amino acid sequences can be found in the assembly directory for each taxon (CODE-SOAPdenovo-Trans-translated.tar.bz2). Translated assembly names match the nucleotide assemblies from which they are derived.

Unfortunately, the pipeline relabels entries in the fasta using 0, 1, 2, 3... Because the original SOAPdenovo-Trans assembly output is numbered 2000001, 2000002,... (i.e. from one), the obvious conversion between the two sets of numbers is in error by one. For example, scaffold-DFYF-2008707-Ilex_sp. corresponds to 8706, not 8707, in this tar file.

This issue is not repeated for the orthogroup analysis, which labels sequences using the original seven digit numbers.
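The off-by-one relabelling in the translated files can be handled with a small conversion helper. This sketch relies only on the numbering described above (scaffolds numbered from 2000001, translated entries from 0). The function names are hypothetical.

```python
FIRST_SCAFFOLD_NUMBER = 2000001  # first scaffold in the current assemblies


def translated_index(scaffold_name):
    """Return the 0-based label used in the translated tar file."""
    number = int(scaffold_name.split("-", 3)[2])
    return number - FIRST_SCAFFOLD_NUMBER


def scaffold_number(index):
    """Return the seven-digit scaffold number for a translated label."""
    return index + FIRST_SCAFFOLD_NUMBER


index = translated_index("scaffold-DFYF-2008707-Ilex_sp.")
```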

Gene clusters by orthoMCL

The lab of Claude dePamphilis developed a gene family circumscription pipeline employing OrthoMCL software to produce a framework for operationally defined gene families.   OrthoMCL clustering of gene models from 22 annotated land plant genomes (represented in the tree below) resulted in circumscription of 53,136 gene clusters which we are treating as hypothetical gene families. Sequences were aligned for each cluster and HMM profiles were estimated. The resultant profiles were used to sort the translated amino acid sequences into gene families.

HMM profiles were used to query the inferred protein sequences for each taxon using hmmsearch (part of the HMMER suite). Bit-scores for matches with E-values better than 1E-10 were retained. A cumulative probability distribution for these bit-scores was assessed to identify one or more HMMs accounting for 95% of the distribution. Most transcripts sorted into a single gene family for which the HMM match had a probability of 95% or greater, but some transcripts sorted to two or more families when bit-score probabilities were required from multiple HMMs in order to reach a 95% confidence level that the assembly was sorted to a correct gene family (i.e. orthoMCL cluster).
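The 95% rule can be illustrated with a toy calculation. The text does not say how bit-scores were converted to probabilities, so the sketch below makes an explicit assumption: each retained match is weighted in proportion to 2 raised to its bit-score, and families are added in decreasing order of weight until the cumulative share reaches 95%. This is one plausible reading, not the pipeline's actual implementation, and the family identifiers are hypothetical.

```python
def select_gene_families(bit_scores, confidence=0.95):
    """Pick the smallest set of top-scoring families reaching `confidence`.

    bit_scores maps a family identifier to the hmmsearch bit-score of the
    match (already filtered by E-value). ASSUMPTION: relative probability
    is proportional to 2 ** bit_score; the source does not specify this.
    """
    if not bit_scores:
        return []
    top = max(bit_scores.values())
    # Subtract the maximum before exponentiating to avoid overflow.
    weights = {fam: 2.0 ** (s - top) for fam, s in bit_scores.items()}
    total = sum(weights.values())
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    selected, cumulative = [], 0.0
    for family, weight in ranked:
        selected.append(family)
        cumulative += weight / total
        if cumulative >= confidence:
            break
    return selected


clear_case = select_gene_families({"OG1": 200.0, "OG2": 100.0})
ambiguous_case = select_gene_families({"OG1": 150.0, "OG2": 150.0})
```

Under this assumption, a transcript with one dominant match sorts into a single family, while two equally good matches each contribute 50%, so both families are needed to reach the threshold.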

OrthoMCL can split traditionally defined gene families into multiple clusters, so we suggest that gene experts submit a diverse set of exemplars. Gene family experts should critically assess these files and provide feedback on their completeness and/or the erroneous inclusion of sequences.

Obtaining orthogroups

A web API is available at


GET /login

Obtain an authorization token to query the database


  • username (required): Username
  • password (required): Password

Important: the API uses digest access authentication.

Example (using cURL)

curl -X GET --digest -sku "username:password" "

GET | POST /orthogroups

Obtain the sequences for all the members of an orthogroup, given the ID of one of the genes in the group from these 22 sequenced species.


  • accession (required, string): One or more valid gene identifiers from one of these 22 genomes, separated by whitespace (e.g. PACid:18158545, AT2G43210.1).
  • token (required, string): Authentication token.
  • format (optional): The format to be returned: faa: amino acid sequence in fasta format, fna: nucleotide sequence in fasta format, zip: zipped amino acid and nucleotide sequences, json: JavaScript Object Notation object. Defaults to faa for a single query identifier and to json for multiple query identifiers.
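As an illustration of how these parameters fit together, the sketch below builds a GET request URL for /orthogroups using only the Python standard library. The base URL is a placeholder, since the real address is not given here, and passing the parameters in the query string is an assumption about how the service accepts them. No network request is made.

```python
from urllib.parse import urlencode

ALLOWED_FORMATS = ("faa", "fna", "zip", "json")


def build_orthogroups_url(base_url, accessions, token, fmt=None):
    """Build a GET URL for /orthogroups (parameter placement is assumed)."""
    if not accessions:
        raise ValueError("at least one accession is required")
    # Multiple identifiers are separated by whitespace, as documented.
    params = {"accession": " ".join(accessions), "token": token}
    if fmt is not None:
        if fmt not in ALLOWED_FORMATS:
            raise ValueError(f"unknown format: {fmt!r}")
        params["format"] = fmt
    return base_url.rstrip("/") + "/orthogroups?" + urlencode(params)


# "https://example.org/api" is a placeholder, not the real service address.
url = build_orthogroups_url(
    "https://example.org/api",
    ["PACid:18158545", "AT2G43210.1"],
    token="TOKEN",
    fmt="json",
)
```

Omitting fmt leaves the choice to the server, which defaults to faa for a single identifier and to json for several.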

Graphical user interfaces are also provided for single and multiple queries. Visit

Website for BLAST searches

The entire 1KP data set is available for BLAST searches courtesy of the China National GeneBank at a password-protected website. Go to the website and click on "view available sequences". Users will be able to search either all of the samples, or a phylogenetically defined subset of samples, with the caveat that the categorizations are subject to change after the phylogenomics analysis for the capstone is complete. Note that by default the website shows only the public data. For access to the complete dataset, you must log in.