Quality Control for High Throughput Sequence Data (Workflow Tutorial)

The DE Quick Start tutorial provides an introduction to basic DE functionality and navigation.

Please work through the tutorial and add your comments at the bottom of this page, or send comments by email to xwang@cshl.edu. Thank you.


Introduction and Overview

Goal

When you get your sequences back from a sequencing facility, it is important to assess the quality of the data before you spend time analyzing it (garbage in, garbage out). The goal of this tutorial is to get you thinking about the quality of raw sequencing data and to show how an adapter trimmer and a quality-based trimmer can remove contaminants and poor-quality bases before mapping, assembly, and other analyses. The program we will use to assess the quality of a set of sequencing reads is called FastQC. It checks whether the sequences in a FASTQ file exhibit any unusual properties, either low sequence quality or other notable features (e.g. sequence duplication levels and overrepresented sequences). After the quality assessment, Scythe is used as part of the read quality control pipeline to identify and remove 3'-end adapter contaminants. Then the quality-based trimmer Sickle is applied to trim poor-quality bases using sliding windows with quality and length thresholds.

 

Rationale and Background

FastQC

FastQC: a quality control tool for high throughput sequence data.

Andrews S. (2010). 

Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc

 

FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.

The main functions of FastQC are

  • Import of data from BAM, SAM or FastQ files (any variant)
  • Providing a quick overview to tell you in which areas there may be problems
  • Summary graphs and tables to quickly assess your data
  • Export of results to an HTML based permanent report
  • Offline operation to allow automated generation of reports without running the interactive application


Scythe 

Scythe: A Bayesian adapter trimmer [Software].

Joshi NA, Fass JN. (2011).

Available at https://github.com/najoshi/scythe.

 

Scythe uses a Naive Bayesian approach to classify contaminant substrings in sequence reads. It considers quality information, which can make it robust in picking out 3'-end adapters, which often include poor quality bases.

Most next generation sequencing reads have deteriorating quality towards the 3'-end. It's common for a quality-based trimmer to be employed before mapping, assemblies, and analysis to remove these poor quality bases. However, quality-based trimming could remove bases that are helpful in identifying (and removing) 3'-end adapter contaminants. Thus, it is recommended you run Scythe before quality-based trimming, as part of a read quality control pipeline.

 

Sickle 

Sickle: A sliding-window, adaptive, quality-based trimming tool for FastQ files (Version 1.33) [Software].

Joshi NA, Fass JN. (2011).

Available at https://github.com/najoshi/sickle.

 

Sickle uses sliding windows along with quality and length thresholds to determine where quality is low enough to trim the 3'-end of a read and where quality becomes high enough to place the 5'-end cut. It also discards reads based on the length threshold. Sickle takes the quality values of a read and slides a window across them whose length is 0.1 times the length of the read. If this length is less than 1, the window is set equal to the length of the read. Otherwise, the window slides along the quality values until the average quality in the window rises above the threshold, at which point the algorithm determines where within the window the rise occurs and cuts the read and quality strings there for the 5'-end cut. Then, when the average quality in the window drops below the threshold, the algorithm determines where in the window the drop occurs and cuts both the read and quality strings there for the 3'-end cut. If the remaining sequence is shorter than the minimum length threshold, the read is discarded entirely (or replaced with an "N" record). 5'-end trimming can be disabled.
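
The windowing logic described above can be sketched in Python. This is a simplified illustration written from the description in this section, not Sickle's actual source code: the function name, the handling of scores exactly equal to the threshold, and the rule for locating the cut inside a window (first base at or above the threshold for the 5'-end, first base below it for the 3'-end) are assumptions.

```python
def sliding_window_trim(quals, qual_threshold=20, min_length=20, trim_5prime=True):
    """Return (start, end) of the bases to keep, or None if the read is discarded.

    quals is a list of per-base quality scores (already decoded to integers).
    Simplified sketch of sliding-window trimming; not Sickle's real code.
    """
    n = len(quals)
    if n == 0:
        return None

    # Window length is 0.1 x read length; fall back to the whole read if < 1.
    window = int(0.1 * n)
    if window < 1:
        window = n

    start = None if trim_5prime else 0  # 5'-end cut (None = not found yet)
    end = n                             # 3'-end cut (exclusive)

    for i in range(0, n - window + 1):
        win = quals[i:i + window]
        avg = sum(win) / window
        if start is None:
            if avg >= qual_threshold:
                # Quality has risen: cut at the first good base in the window.
                start = i + next(j for j, q in enumerate(win) if q >= qual_threshold)
        elif avg < qual_threshold:
            # Quality has dropped: cut at the first bad base in the window.
            end = i + next(j for j, q in enumerate(win) if q < qual_threshold)
            break

    # Discard reads that never reach the threshold or end up too short.
    if start is None or end - start < max(min_length, 1):
        return None
    return start, end


# Example: 5 poor bases, 30 good bases, 5 poor bases.
example_quals = [2] * 5 + [40] * 30 + [2] * 5
print(sliding_window_trim(example_quals))  # (5, 35)
```

With the default thresholds of 20, the example keeps bases 5 through 34 and drops the low-quality ends. A read made up entirely of low-quality bases returns None, which corresponds to the read being discarded.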

 

Pre-Requisites

  1. An iPlant account. (Register for an iPlant account at user.iplantcollaborative.org.)
  2. The DE Quick Start tutorial provides an introduction to basic DE functionality and navigation.

Test Data

This tutorial uses the sequencing data stored at /iplant/home/xiaofei_iplant/Sorghum_chr8/chr8_test/G3_P_K4me3_chr8.

Workflow

The tutorial will take users through the following operations:

Operation 1: Checking read quality with FastQC

 

  1. Mandatory arguments 
    1. Inputs: FASTQ files to be assessed.
    2. outFolder: Create all output files in this specified output directory.

  2. Optional arguments
    1. Adapters: Specifies a non-default file containing the list of adapter sequences to be searched against the library. The file must contain sets of named adapters in the form name[tab]sequence. Lines prefixed with a hash will be ignored.
    2. Extract: If set, the zipped output file will be uncompressed in the same directory after it has been created.

  3. Test output
    1. /iplant/home/xiaofei_iplant/Sorghum_chr8/chr8_test/FastQC_0.11.5_Apr7_Test8/test
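
To check a custom adapter file before supplying it through the Adapters option, the name[tab]sequence format described above can be parsed with a few lines of Python. This is a minimal sketch: the function name is hypothetical, the adapter names and sequences in the example are placeholders rather than real adapters, and tolerating more than one tab between name and sequence is an assumption.

```python
def parse_adapter_list(text):
    """Parse name[tab]sequence lines into a {name: sequence} dictionary.

    Blank lines and lines prefixed with a hash are ignored.
    """
    adapters = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        # Split on tabs and drop empty fields left by repeated tabs.
        fields = [f.strip() for f in stripped.split("\t") if f.strip()]
        if len(fields) != 2:
            raise ValueError(
                "line %d: expected name[tab]sequence, got %r" % (line_number, line)
            )
        name, sequence = fields
        adapters[name] = sequence.upper()
    return adapters


# Placeholder content; the names and sequences are illustrative only.
example_text = (
    "# my custom adapters\n"
    "example_adapter_1\tACGTACGTACGT\n"
    "\n"
    "example_adapter_2\t\tttgcaattgcaa\n"
)
for name, sequence in parse_adapter_list(example_text).items():
    print(name, sequence)
```

A line with no tab between name and sequence raises a ValueError, which makes formatting mistakes visible before the file is used in an analysis.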

Operation 2: Removing 3'-end adapter contaminants using Scythe


  1. Mandatory arguments 
    1. Inputs: Directory of FASTQ files to be trimmed.
    2. Adapters: adapter file in FASTA format.
    3. Quality Type: quality type, either illumina, solexa, or sanger (default: illumina).
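
Because the Adapters argument expects a FASTA file, it can help to generate that file from a simple name-to-sequence mapping. The following is a minimal sketch; the function name, the 60-character line width, and the adapter names and sequences are illustrative assumptions, not values taken from this tutorial.

```python
def adapters_to_fasta(adapters, line_width=60):
    """Format a {name: sequence} mapping as FASTA text.

    Each record is a '>' header line followed by the sequence,
    wrapped at line_width characters.
    """
    lines = []
    for name, sequence in adapters.items():
        sequence = sequence.upper()
        invalid = set(sequence) - set("ACGTN")
        if invalid:
            raise ValueError("adapter %r contains non-ACGTN characters" % name)
        lines.append(">" + name)
        for i in range(0, len(sequence), line_width):
            lines.append(sequence[i:i + line_width])
    return "".join(line + "\n" for line in lines)


# Placeholder adapters; replace with the adapters used for your library.
example_adapters = {
    "example_adapter_1": "ACGTACGTACGT",
    "example_adapter_2": "ttgcaattgcaa",
}
print(adapters_to_fasta(example_adapters), end="")
```

The resulting text can be saved to a file and uploaded to the Data Store for use as the adapter file.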

 

Operation 3: Trimming 3'-end of reads using Sickle

 

  1. Mandatory arguments 
    1. Inputs: READ1, Input fastq file for single-end reads or forward fastq file for paired-end data.
    2. Quality Type: Type of quality values (solexa (CASAVA < 1.3), illumina (CASAVA 1.3 to 1.7), sanger (which is CASAVA >= 1.8)).

  2. Optional arguments
    1. Inputs: READ2, Input paired-end reverse fastq file.
    2. Minimum Length: Threshold to keep a read based on length after trimming, default 20.
    3. Quality Threshold: Threshold for trimming based on average quality in a window, default 20.

  3. Test output:
    1. /iplant/home/xiaofei_iplant/Sorghum_chr8/chr8_test/Sickle_1.33_May10_test3-2017-05-10-16-06-40.6/ss_output
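
The Quality Type setting matters because each type encodes per-base scores as ASCII characters in a different way. The sketch below decodes a FASTQ quality string under the commonly documented offsets (33 for sanger, 64 for illumina and solexa). The function name is hypothetical, and for solexa the returned values are raw Solexa-scaled scores, not Phred scores; no conversion between the two scales is attempted here.

```python
# ASCII offsets for the three quality types named in this tutorial.
QUALITY_OFFSETS = {
    "sanger": 33,    # CASAVA >= 1.8
    "illumina": 64,  # CASAVA 1.3 to 1.7
    "solexa": 64,    # CASAVA < 1.3 (Solexa-scaled, not Phred)
}


def decode_quality(quality_string, quality_type="sanger"):
    """Convert a FASTQ quality string into a list of integer scores."""
    try:
        offset = QUALITY_OFFSETS[quality_type]
    except KeyError:
        raise ValueError("unknown quality type: %r" % quality_type)
    return [ord(ch) - offset for ch in quality_string]


# The same score of 40 is written as 'I' in sanger and 'h' in illumina encoding.
print(decode_quality("I", "sanger"))     # [40]
print(decode_quality("h", "illumina"))   # [40]
print(decode_quality("!5?", "sanger"))   # [0, 20, 30]
```

Choosing the wrong type shifts every score by 31, so a Quality Threshold of 20 would trim either far too much or far too little. Negative decoded values are a sign that the wrong quality type was selected.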


