Scythe-0.991 using DE

The DE Quick Start tutorial provides an introduction to basic DE functionality and navigation.

Please work through the tutorial and add your comments at the bottom of this page, or send comments by email to upendra@cyverse.org. Thank you.

Rationale and background: 

Scythe is a 3'-end adapter contaminant trimmer. It uses a naive Bayesian approach to classify contaminant substrings in sequence reads, and it considers quality information, which can make it robust at picking out 3'-end adapters, which often include poor-quality bases.

Most next generation sequencing reads have deteriorating quality towards the 3'-end. It's common for a quality-based trimmer to be employed before mapping, assemblies, and analysis to remove these poor quality bases. However, quality-based trimming could remove bases that are helpful in identifying (and removing) 3'-end adapter contaminants. Thus, it is recommended you run Scythe before quality-based trimming, as part of a read quality control pipeline.

The Bayesian approach Scythe uses compares two likelihood models: the probability of seeing the observed matches in a sequence given contamination, and the probability given no contamination. Given that the read is contaminated, the probability of seeing a certain number of matches and mismatches is a function of the quality of the sequence. Given that the read is not contaminated (and is thus assumed to be random sequence), the probability of seeing a certain number of matches and mismatches is determined by chance. The posterior is calculated across both of these likelihood models, and the class (contaminated or not contaminated) with the maximum posterior probability is selected.
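The comparison described above can be sketched in a few lines of Python. This is a simplified illustration, not Scythe's actual source: the function names (`error_prob`, `classify`) are hypothetical, the adapter is assumed to be already aligned base-for-base against the 3' end of the read, and the random-sequence model assumes each base matches by chance with probability 1/4.

```python
from math import prod


def error_prob(q):
    """Convert a PHRED quality score into the probability of a wrong base call."""
    return 10 ** (-q / 10)


def classify(adapter, read_tail, quals, prior=0.3):
    """Return (is_contaminated, posterior) for one aligned adapter/read window.

    adapter, read_tail: equal-length strings; quals: PHRED scores of read_tail.
    prior: prior probability of contamination (0.3 mirrors the default above).
    """
    # Contaminated model: a match means the base was called correctly,
    # a mismatch means a sequencing error occurred at that position.
    like_contam = prod(
        (1 - error_prob(q)) if a == r else error_prob(q)
        for a, r, q in zip(adapter, read_tail, quals)
    )
    # Random model (assumption): matches occur by chance with probability 1/4.
    like_random = prod(
        0.25 if a == r else 0.75 for a, r in zip(adapter, read_tail)
    )
    weighted_contam = prior * like_contam
    posterior = weighted_contam / (weighted_contam + (1 - prior) * like_random)
    # With two classes, the maximum-posterior class is the one above 0.5.
    return posterior > 0.5, posterior


hit, p_hit = classify("AGATCGGAAG", "AGATCGGAAG", [30] * 10)
miss, p_miss = classify("AGATCGGAAG", "TCTAGCCTTC", [30] * 10)
print(hit, miss)
```

A perfect ten-base match at high quality is classified as contaminated, while a window with no matching bases is not. Lower quality scores raise the error probability, so mismatches in poor-quality regions count less strongly against contamination.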

Scythe and all supporting documentation Copyright (c) Vince Buffalo, 2011-2014

Pre-Requisites

  1. A CyVerse account. (Register for a CyVerse account here - user.cyverse.org)

  2. Input/Outputs 

    1. Adapter file in fasta format

    2. Sequence file in fastq format
    3. Output file name for trimmed sequences
  3. Optional arguments/parameters
    1. Prior (default: 0.300)
    2. Quality type (illumina, solexa, or sanger; default: sanger)
    3. Matches file (default: no output) Ex: matches.txt
    4. Tag (Add a tag to the header indicating Scythe cut a sequence (default: off))
    5. Smallest contaminant to consider (default: 5)
    6. Minimum keep (Filter sequences less than or equal to this length (default: 35))

Information on quality schemes:

phred: PHRED quality scores (e.g. from Roche 454). ASCII with no offset, range: [4, 60].
sanger: Sanger qualities are PHRED ASCII qualities with an offset of 33, range: [0, 93]. From NCBI SRA, or Illumina pipeline 1.8+.
solexa: Solexa (also very early Illumina - pipeline < 1.3). ASCII offset of 64, range: [-5, 62]. Uses a different quality-to-probabilities conversion than other schemes.
illumina: Illumina output from pipeline versions between 1.3 and 1.7. ASCII offset of 64, range: [0, 62].
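To make the offsets concrete, here is a minimal Python sketch that decodes a FASTQ quality string into per-base error probabilities for the three quality types the app accepts. The names `OFFSETS` and `error_probs` are hypothetical and this is not Scythe's own code. It assumes the standard conversions: PHRED-style schemes use p = 10^(-Q/10), while Solexa scores are defined on the odds, giving p = 1 / (1 + 10^(Q/10)).

```python
# ASCII offsets for the quality types selectable in the app.
OFFSETS = {"sanger": 33, "solexa": 64, "illumina": 64}


def error_probs(qual_string, scheme="sanger"):
    """Decode a FASTQ quality string into per-base error probabilities."""
    offset = OFFSETS[scheme]
    probs = []
    for ch in qual_string:
        q = ord(ch) - offset  # numeric quality score
        if scheme == "solexa":
            # Solexa: Q = -10 * log10(p / (1 - p))
            probs.append(1 / (1 + 10 ** (q / 10)))
        else:
            # PHRED (sanger, illumina): Q = -10 * log10(p)
            probs.append(10 ** (-q / 10))
    return probs


sanger_p = error_probs("I", "sanger")      # 'I' is ASCII 73, so Q = 40
illumina_p = error_probs("h", "illumina")  # 'h' is ASCII 104, so Q = 40
solexa_p = error_probs("@", "solexa")      # '@' is ASCII 64, so Q = 0
```

The same character therefore means different things under different schemes, which is why choosing the wrong quality type can distort the contamination likelihood.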

Test/sample data:


The test data for Scythe-0.991 are provided here - /iplant/home/shared/iplantcollaborative/example_data/Scythe

Use the following inputs/outputs and parameters for testing Scythe-0.991:

  1. Input/Outputs 

    1. Adapter file - illumina_adapters.fa

    2. Sequence file - sabreN_1.fq
    3. Output file - sabreN_1_trimmed.fq
  2. Optional arguments/parameters
    1. Prior - 0.300 (default)
    2. Quality type - sanger
    3. Matches file - matches.txt
    4. Smallest contaminant to consider - 5 (default)
    5. Minimum keep - 35 (default) 
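For readers who want to reproduce the same test outside the DE, the settings above map roughly onto a standalone Scythe command line. The Python sketch below only assembles and prints that command; it does not run it. The helper name `scythe_command` is hypothetical, and the short options (`-a`, `-p`, `-q`, `-n`, `-M`, `-o`, `-m`) are taken from the Scythe README as best recalled, so please confirm them with `scythe --help` for your installed version.

```python
import shlex


def scythe_command(adapters, reads, output, prior=0.3, quality_type="sanger",
                   matches_file=None, min_match=5, min_keep=35):
    """Build an argument list for a standalone scythe run (flags assumed)."""
    cmd = [
        "scythe",
        "-a", adapters,        # adapter file in FASTA format
        "-p", str(prior),      # prior contamination rate
        "-q", quality_type,    # illumina, solexa, or sanger
        "-n", str(min_match),  # smallest contaminant to consider
        "-M", str(min_keep),   # drop reads at or below this length
        "-o", output,          # trimmed sequences file
    ]
    if matches_file:
        cmd += ["-m", matches_file]  # optional matches report
    cmd.append(reads)                # FASTQ input goes last
    return cmd


cmd = scythe_command(
    "illumina_adapters.fa",
    "sabreN_1.fq",
    "sabreN_1_trimmed.fq",
    matches_file="matches.txt",
)
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps file names with spaces intact if the list is later passed to `subprocess.run`.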

Output Reports:

After the run completes successfully, expect a FASTQ file as output. For the test case, the output file in the example_data directory is named sabreN_1_trimmed.fq, and it is accompanied by matches.txt and a logs folder.

 

More information about Scythe-0.991 can be found at https://github.com/vsbuffalo/scythe