Sample source and purity
Verification of Source Species for an Assembly
As a general principle, one can distinguish between reports that i) indicate a sample contains the expected species, ii) indicate it does not contain that species, and iii) provide no information on this issue.
In particular, many samples in the recent 18S rRNA analysis did not pass the source validation test. However, that does not indicate that these samples are wrong, merely that the 18S analysis does not provide an answer. Accordingly, these are type iii results, and a different analysis can validate the sample without conflict.
As more reports accumulate for each sample, some of them will inevitably conflict. From past experience, some of these reports will not be definite, but may be statements like, "sample <XXXX> is odd, maybe it is not a <SPECIES>." Therefore, each report will have to be classified as to whether it is a robust result or not.
Summary of Multiple Reports
Should there be multiple conflicting reports, analyses judged to be more definite/robust will be preferred. Reports that are specific to that sample are also preferred. The assumption is that if someone has specifically investigated one sample, that is likely a more accurate result than a general project-wide analysis.
Samples with significantly conflicting status reports or which remain unvalidated will be flagged for detailed follow-up.
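The preference rules above could be sketched roughly as follows. This is an illustration only, not a procedure the project has defined: the four-column input (sample, verdict, robust 1/0, sample-specific 1/0), the function name summarize_reports, and the choice to rank robustness ahead of sample-specificity are all assumptions. Type iii reports (no information) are simply left out of the input.

```shell
# Hypothetical sketch: pick the most trusted verdict per sample.
# Input (tab-separated): sample  verdict(yes|no)  robust(1|0)  specific(1|0)
summarize_reports() {
  awk -F'\t' '
    {
      rank = 2 * $3 + $4                     # assumed ordering: robust first, then specific
      if (!($1 in best) || rank > best[$1]) {
        best[$1] = rank; seen[$1] = $2       # a more trusted report replaces earlier ones
      } else if (rank == best[$1] && seen[$1] != $2) {
        seen[$1] = "conflict"                # equally trusted reports disagree
      }
    }
    END { for (s in best) print s "\t" seen[s] }' | sort
}
```

Samples reported as "conflict" or "no" here would be the ones to flag for detailed follow-up.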
Worrisome Contamination
Significant contamination (other plant material) will be assessed using the same principles. However, the degree to which an analysis is considered definitive may differ between source verification and contamination assessment.
Combined Assemblies
A report for a source sample used in a combined assembly may or may not apply to the combined assembly. It will depend on whether the report is positive or negative. Similarly, a report on the combined assembly may or may not also apply to the source materials.
18S rRNA Analysis
To confirm sample source and purity, 1KP assemblies were compared by blastn to a reference set of 18S rRNA sequences derived from the SILVA SSU database (http://www.arb-silva.de/). Only SILVA entries with a clear 18S rRNA annotation in the NCBI nt database were used.
Nuclear 18S sequences were preferred because more reference sequences are available, ensuring a dense sampling across the Viridiplantae. Alignments to chloroplast and mitochondrial SSU sequences were detected by searching for the patterns chloroplast*, plastid, and mitochondri*, and were subsequently ignored.
Short and low-identity alignments are not reliable for determining taxonomic sources as they may be taxonomically ambiguous, aligning well with sequences from distantly related species. Hence alignments shorter than 300 bp or with E-values above 10e-9 were also ignored.
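The filtering rules above can be sketched as a small awk filter over tabular blastn output. This is an illustration rather than the pipeline actually used: it assumes the 13-column layout produced by -outfmt "6 std stitle" (alignment length in column 4, E-value in column 11, subject title in column 13), and the function name filter_hits is hypothetical.

```shell
# Keep alignments that are long enough, significant enough, and not organellar.
# Assumed input: tab-separated blastn output, 12 standard columns plus stitle.
filter_hits() {
  awk -F'\t' -v min_len=300 -v max_evalue=10e-9 '
    {
      title = tolower($13)
      # organelle patterns from the text: chloroplast*, plastid, mitochondri*
      if (title ~ /chloroplast|plastid|mitochondri/) next
      if ($4 + 0 < min_len) next             # alignment shorter than 300 bp
      if ($11 + 0 > max_evalue + 0) next     # E-value above the cutoff
      print
    }'
}
```

It would be used as a pipe stage, for example: cat hits.tsv | filter_hits > kept.tsv (file names are placeholders).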
Thank you to Shaungxiu Wu (CAS Key Laboratory of Genome Sciences and Information, Beijing) and her group who have done this work.
Categories of Validation
Many 1KP samples contain non-plant sequences, especially from bacterial, fungal, or insect sources. This kind of "contamination" is not a problem for most analyses and is described in our summaries as "harmless". It is reported when scaffolds are present for which the best alignments were to non-plant sequences.
A sample source is validated if the best alignments for all of the ribosomal scaffolds are to sequences from species within the same taxonomic family as the sample source. If the best alignment for one of the scaffolds matches the expected source at either the genus or species level, then this more precise validation is also noted.
Lastly, "worrisome" contamination was reported when a scaffold's best alignment was to a plant species (Viridiplantae, Glaucocystophyceae, Rhodophyceae, Cryptophyceae, Haptophyta, or Stramenopiles) outside of the expected source family. This status does not mean that a problem has occurred, only that more attention is warranted. Final status will be assigned after a manual inspection of the assemblies by plant experts within the 1KP consortium, which is ongoing and not yet complete.
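The per-scaffold part of these categories can be sketched as below. This is a simplified illustration under stated assumptions: the four-column input (scaffold, expected family, family of the best hit, and a yes/no flag for whether the best hit falls in one of the plant groups listed above) is hypothetical, as are the function name label_scaffolds and the label "in-family". How scaffold labels are combined into a sample-level status is not covered here.

```shell
# Hypothetical sketch: label each ribosomal scaffold by its best alignment.
# Input (tab-separated): scaffold  expected_family  hit_family  hit_is_plant(yes|no)
label_scaffolds() {
  awk -F'\t' '
    {
      if ($4 != "yes")   label = "harmless"    # best hit is to a non-plant sequence
      else if ($3 == $2) label = "in-family"   # supports validation of the source
      else               label = "worrisome"   # plant hit outside the expected family
      print $1 "\t" label
    }'
}
```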
Limitations of Method
Our analysis relies on ribosomal small sub-unit material being assembled from each sample. Because a significant fraction of a cell's RNA is ribosomal, this is likely to be a sensitive detector of contamination. However, if the contamination is from a closely related species, the sequences will co-assemble. Experimentally, we have found that this can happen when ribosome sequences differ by 2% or less. Such contamination will not be reported by our methodologies.
Comparison with Other Results - 1. Barkman
Todd Barkman has constructed trees with SABATH methyltransferase sequences and then manually decided whether samples are taxonomically misplaced. When results from his efforts are compared with the 18S rRNA taxonomic validation, they agree for 94% of samples.
Barkman's Classification | 18S Validated | Not Validated | No 18S Result |
---|---|---|---|
Taxonomically Good | 831 | 35 | 18 |
Problems/Questionable | 18 | 12 | 1 |
No Data | 376 | 25 | 11 |
His detailed report, with an assessment for each assembly, is available in 1kp-Barkman.xlsx. The category codes are explained on the second sheet of the workbook. The above table groups categories 1-3 and 4-5. Also available is a spreadsheet listing samples that failed either source validation: Sample Source Issues.xlsx.
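One way to arrive at the 94% figure from the table is shown below. The source does not state the denominator, so this assumes agreement is measured only over samples for which both methods give a definite answer: the two methods agree when Barkman's "Taxonomically Good" coincides with "18S Validated" (831) or "Problems/Questionable" coincides with "Not Validated" (12). The function name agreement_pct is hypothetical.

```shell
# Arguments: good&validated  good&not-validated  problem&validated  problem&not-validated
agreement_pct() {
  awk -v gv="$1" -v gn="$2" -v pv="$3" -v pn="$4" \
    'BEGIN { printf "%.1f\n", 100 * (gv + pn) / (gv + gn + pv + pn) }'
}

agreement_pct 831 35 18 12   # (831 + 12) / 896, prints 94.1
```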
Comparison with Other Results - 2. Mirarab
A number of samples have noticeably odd locations in the capstone test MAFFT tree produced by Siavash Mirarab. These are:
LVNW | Basal Eudicots | Cocculus laurifolius |
WPYJ | Magnoliids | Frankenia laevis |
DYFF | Core Eudicots/Asterids | Pycnanthemum tenuifolium |
XMQO | Basal Eudicots | Gunnera manicata |
JLLY | Core Eudicots/Rosids | Melaleuca quinquenervia |
CYVA | Basal Eudicots | Cimicifuga racemosa |
QJXB | Core Eudicots/Rosids | Wikstroemia indica |
FWBF | Core Eudicots | Alangium chinense |
FONV | Core Eudicots/Rosids | Greyia sutherlandii |
NPND | Basalmost angiosperms | Ceratophyllum demersum |
ULGV | Core Eudicots/Asterids | Morinda citrifolia |
JBGU | Core Eudicots | Amaranthus palmeri |
YMES | Monocots/Commelinids | Typhonium blumei |
JBLI | Eusporangiate Monilophytes | Bolbitis repanda |
FITN | Liverworts | Treubia lacunosa |
NIJU | Core Eudicots/Rosids | Heteropyxis natalensis |
UZNH | Core Eudicots/Asterids | Curtisia dentata |
IQJU | Hornworts | Anthoceros formosae |
FANS | Hornworts | Leiosporoceros dussii |
Comparison with Other Results - 3. Human Genome
Each of the datasets was mapped to a human genome reference (available at https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.29 ) using Bowtie 2 (version 2.2.4). Then the number of read-pairs that cleanly aligned was counted.
This provides a count of human-like reads in the library. For most samples these reads are a small fraction of the total. However, a few cases have much larger counts, suggesting that substantial contamination with human material may have occurred. A spreadsheet with details is here.
This technique is not intended to be perfect, but provides a rapid estimate. For RNA contamination the result will be an under-count, as reads that span introns in the genomic sequence will not align with the genome and will not be counted. Similarly, read-ends that do not align in the expected paired-end fashion are not counted.
Example of the commands used:
# align reads to the genome reference - output temporary file (AALA.sam)
bowtie2 --phred64 --no-unal -x GCF_000001405.29_GRCh38.p3_genomic \
-1 AALA-read_1.fq -2 AALA-read_2.fq -S AALA.sam
# print first read of properly-mapped (flag 64+2) read-pairs and count (lines)
samtools view -f 66 AALA.sam | wc -l
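The -f 66 filter keeps records whose SAM FLAG has both the 2 bit (read mapped in a proper pair) and the 64 bit (first read in the pair) set. As an illustration only, the same test can be written in plain awk on a SAM stream; count_proper_first is a hypothetical helper name, and header lines beginning with @ are skipped.

```shell
# Count SAM records that are properly paired AND the first read of the pair,
# i.e. the records selected by "samtools view -f 66".
count_proper_first() {
  awk -F'\t' '
    /^@/ { next }                      # skip SAM header lines
    {
      flag   = $2 + 0
      proper = int(flag / 2) % 2       # bit 2: mapped in a proper pair
      first  = int(flag / 64) % 2      # bit 64: first read in the pair
      if (proper && first) n++
    }
    END { print n + 0 }'
}
```

For example, a record with FLAG 99 (1+2+32+64) is counted, while its mate with FLAG 147 (1+2+16+128) is not, so each properly mapped pair contributes exactly one line.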
SUMMARY OF RESULTS
Here are the latest results, BEFORE manual inspection by our plant experts. Detailed analysis reports are available: 1328_statistics_final.xls and 1328_blast_info_2.xls.
Unconfirmed Source (No Worrisome Contamination)
IRBN | Scapania nemorosa | YZGX | Cyrilla racemiflora | TFDQ | Monoclea gottschei |
HTDC | Tamarix chinensis | AEXY | Blasia sp. | TMAJ | Neckera douglasii |
QGLJ | Cocos nucifera | UHJR | Citrus x paradisi | QAIR | Opuntia sp. |
No Family-level Reference Sequence in Database
AZBL | Petiveria alliacea | OQHZ | Quillaja saponaria | BNTL | Souroubea exauriculata |
YQEC | Woodsia ilvensis | PNZO | Culcita macrocarpa | EWXK | Thyrsopteris elegans |
CWLL | Schlegelia parasitica | YJJY | Woodsia scopulina | OQON | Entocladia endozoica |
EGNB | Scourfieldia sp. | GAKQ | Schlegelia parasitica | PQED | Gloeochaete wittrockiana |
OCZL | Homalosorus pycnocarpos | COBX | Polypremum procumbens | HTFH | Onoclea sensibilis |
TAVP | Calliergon cordifolium | OFTV | Barbilophozia barbata | AJAU | Helicodictyon planctonicum |
VHIJ | Blastophysa cf. rhizopus | VJDZ | Botryococcus sudeticus | POOW | Glaucocystis cf. nostochinearum |
YGAT | Phyllanthus sp. | HYZL | Akania lucens |
No SSU rRNA Sequence Found in Assembly
SHEZ | Dianthus caryophyllus | WWSS | Taxus baccata | JVBR | Aloe vera |
ZYAX | Taxus cuspidata | PAWA | Aristolochia elegans | XSZI | Peperomia fraseri |
HSXO | Ancistrocladus tectorius |
Unconfirmed Source & Worrisome Contamination
XXHP | Cystopteris fragilis | TJES | Spergularia media | HEGQ | Gymnocarpium dryopteris |
RXEN | Polycarpaea repens | XONJ | Camptotheca acuminata | DCCI | Calceolaria pinifolia |
EZXQ | Cleome violacea | YOWV | Cystopteris protrusa | RICC | Cystopteris reevesiana |
NIJU | Heteropyxis natalensis | QZZU | Pyrenacantha malvifolia | LVNW | Cocculus laurifolius |
EDXZ | Schlegelia violacea | MBQU | Cleome gynandra | KUXM | Selaginella selaginoides |
IQJU | Anthoceros formosae | QSKP | Polanisia trachysperma | ZYCD | Selaginella acanthonota |
UZNH | Curtisia dentata | HNDZ | Cystopteris utahensis | JDQB | Neocallitropsis pancheri |
RTTY | Salvadora sp. | RNBN | Mollugo cerviana | FITN | Treubia lacunosa |
UPZX | Cleome viscosa | SKNL | Saponaria officinalis | PKMO | Cistus inflatus |
OLES | Schiedea membranacea | GIWN | Sarcobatus vermiculatus | RUUB | Physena madagascariensis |
CWZU | Betula pendula | ZFGK | Selaginella kraussiana | FAKD | Nelumbo sp. |
CTYH | Basella alba | JBGU | Amaranthus palmeri | OTAN | Deutzia scabra |
FANS | Leiosporoceros dussii | VWIP | Carya glabra | HUSX | Roridula gorgonias |
PVGM | Oncotheca balansae | FIDQ | Undaria pinnatifida | KVAY | Tribulus eichlerianus |
QJXB | Wikstroemia indica | JPDJ | Symplocus tinctoria | ZLOA | Cleome gynandra |
NLOM | Pediastrum duplex | LWDA | Alnus serrulata | YXNR | Triodia aff. bynoei |
VBMM | Claopodium rostratum | LKKX | Talinum sp. | EBWI | Ochromonas sp. |
ULGV | Morinda citrifolia | TJLC | Nothofagus obliqua | HELY | Cleome violacea |
QTJY | Euptelea pleiosperma | LHLE | Cystopteris fragilis | OKEF | Hibbertia grossulariifolia |
OBTI | Peganum harmala |
Worrisome Contamination (Source Confirmed)
MLPX | Papaver setigerum | IAYV | Rhodomonas sp. | LJPN | Gracilaria blodgettii |
LJQF | Draba ossetica | PWKQ | Gracilaria sp. | VKVG | Synura sp. |
GTSV | Draba hispida | AZZW | Chlorokybus atmophyticus | LSKK | Orchidantha maxillarioides |
OJCW | Maesa lanceolata | UKUC | Dunaliella salina | JGGD | Sargassum muticum |
BXBF | Draba sachalinensis | NBYP | Mesotaenium kramstae | VZWX | Ceramium kondoi |
ZZEI | Phylloglossum drummondii | RFAD | Pavlova lutheri | RTMU | Calypogeia fissa |
ULKT | Lycopodiella appressa | BZSH | Golenkinia longispicula | LXRN | Prymnesium parvum |
FOYQ | Microspora cf. tumidula | WZFE | Ascarina rubricaulis | XAXW | Neosiphonia japonica |
MWAN | Chlorella minutissima | IEHF | Dumontia simplex | XKWQ | Pediastrum duplex |
QHVS | Ophioglossum vulgatum | WEJN | Mazzaella japonica | VNAL | Gracilaria vermiculophylla |
BAKF | Cryptomonas curvata | RKGT | Eschscholzia californica | UGPM | Chondrus crispus |
IIFB | Oenothera gaura | FZQN | Silene latifolia | IFCJ | Canella winterana |
IOVS | Pseudotsuga wilsoniana | KRUQ | Porella navicularis | HVBQ | Tetraphis pellucida |
LACT | Oenothera gaura | PTBJ | Plantago virginica | NMAK | Pavlova lutheri |
BAJW | Isochrysis sp. | YBQN | Odontoschisma prostratum |
Status Changed After Manual Review
LETF | Planophila laetevirens | culture collection no longer uses original species identification (P. terrestris) |
GJIY | Pseudoneochloris marina | species change to match culture collection |
EEJO | Ettlia oleoabundans | species change to match culture collection |
WDCW | Mesotaenium endlicherianum | 18S rRNA chimeric with human |
MFYC | Oocystaceae species | 18S rRNA indicates change (was Nannochloris atomus) |
CYVA | Apiales species | multiple assembly confirmation (was Cimicifuga racemosa) |
PZIF | Scenedesmus dimorphus | "contaminant" assembly found to be in genus by manual blastn search of nr |
Manually Confirmed Problems
NLOM | Pediastrum duplex | fungal sequence present |
GDUD | Chloromonas reticulata | brown algae 18S rRNA present |
WGMD | Zygnema sp. | brown algae 18S rRNA present |
OVHR | Chlamydomonas bilatus | brown algae 18S rRNA present |