Evaluate and Pre-Process Sequencing Reads (Workflow Tutorial)

Evaluate and Pre-Process Sequencing Reads (Workflow Tutorial)

 

 

The iPlant App Store is currently being restructured, and apps are being moved to an HPC environment. During this transition, users may occasionally be unable to locate or use apps that are listed in our tutorials. In many cases, these apps can be located by searching them using the search bar at the top of the Apps window in the DE. To increase the chance for search success, try not searching the entire app name and version number but only the portion that refers to the app's function or origin (e.g. 'SOAPdenovo' instead of 'SOAPdenovo-Trans 1.01'). In critical cases, please report your concern to the iPlant Ask forum or to support@iplantcollaborative.org. Thank you for your patience.

Introduction and overview

Author: Roger A. Barthelson (rogerab at email.arizona.edu) / iPlant Collaborative, BIO5 Institute, University of Arizona

Goal

The goal of this tutorial is to become familiar with a commonly used procedure to clean and filter Illumina reads using tools (apps) in the iPlant Discovery Environment (DE).

The clean-up workflow tutorial will take users through the following operations:

  • Evaluate the quality of reads in a set of sequence files (app: FastQC 0.10.1 (multi-file))

  • Remove adapter sequences (app: Scythe-adapter-trimming)

  • Filter the sequences by their quality (app: Sickle-quality-based-trimming)

  • Re-evaluate the cleaned reads (app: FastQC 0.10.1 (multi-file))

  • Evaluate the cleaned reads using a different method (app: Prinseq-Graph-noPCA evaluate reads)

Rationale and Background

Adapters and Barcodes Next-generation sequence data often arrive in raw form and may require some pre-processing prior to utilizing them in analyses. Frequently, sequences are labeled with short "barcode" sequences to distinguish independent sequence samples and/or group sequences according to the experiment they are associated with. The reads may also contain adapter or primer sequences that were used for a specific sequencing protocol. Barcode, adapter and/or primer sequences need to be removed prior to analyzing sequence reads. This may have been done automatically by the sequencing facility prior to handing the data off. Alternatively, the iPlant DE offers apps to pre-process sequences (e.g., Sabre for de-multiplexing reads, and Scythe for removing primer/adapter sequences). Consult your sequencing service to determine whether your sequences contain any primers/adapters/barcodes. If so, request a list of the respective sequences. Errors in de-convoluting barcodes can completely invalidate an experiment by affecting any or all stages of the data analysis.

Quality It is critically important to ensure that sequence data is of high quality. Each sequence run will generate a certain amount of low-quality reads depending on factors that include how mRNA and cDNAs were prepared prior to sequencing, the reagents used in the sequencing prep, and the specific run conditions on that specific day. The DE applications Sickle-quality-based-trimming, Pre-Process Reads, FASTX quality filter, and FastQC 0.10.0 are designed to filter out any low-quality reads and to identify biases and quality issues that may impact the analysis. The transcript and RNA-Seq data used in this tutorial were treated accordingly.

Additional Resources

  1. Visit this page for a quick overview over additional read pre-processing apps and tools. (Page may require login with an iPlant account.)

Prerequisites

  1. An iPlant account. (Register for an iPlant account at user.iplantcollaborative.org.)

  2. The DE Quick Start tutorial provides an introduction to basic DE functionality and navigation.

Test data

This tutorial uses Rhodobacter sphaeroides Illumina sequencing data that are stored in the Data Store in the iPlant Discovery Environment. They were downloaded from the GAGE project, which, in turn, retrieved them from the NCBI Sequence Read Archive (SRA) to test genome assembly methods. For the tutorial users will be directed to these data in the iPlant Data Store following this path: Community Data > iplant_training > read_cleanup > sampledata.

Results (Selected Screen Captures for Each Section)

Input:
FastQC:
Scythe:
Sickle:
FastQC:
Prinseq_Graph:

Workflow

Operation 1: Evaluate read quality (app: FastQC 0.10.1 (multi-file))

The FastQC 0.10.1 (multi-file) app is used to evaluate the quality and content of sequence reads. (Basic documentation: https://pods.iplantcollaborative.org/wiki/display/DEapps/FastQC)

  1. Log into the Discovery Environment.

  2. Open FastQC 0.10.1 (multi-file) (Apps > Public Apps > NGS > QC and Processing > FastQC > FastQC 0.10.1 (multi-file)).

    1. Name your analysis.

  3. Click on "Select input data;" enter your sequence data file into Select input data.

    1. To analyze the sample data, enter all 4 sequence read files (.fastq) provided in Community Data > iplant_training > read_cleanup > sampledata.

  4. Launch the analysis.

  5. Once the analysis is completed (approx. 10 min. with the sample data), click on "Analysis," and then click on the analysis name to open the output folder.

  6. Examine the results:

    1. Within the output job file, there should be one directory for each of the original read files entered into the app. For the 4 test data files there should be four directories, each with 3 files and 2 folders that contain the information to build a web page listing comprehensive information on the sequence read files.

    2. Most of the evaluation information is provided in graphical formats in "Images" sub-directories. They hold .png files that depict the graphs produced during the analyses and that can be viewed directly in the DE by clicking on the file names.

    3. For a comprehensive comparison of the output data, download (using Simple download) the zipped output file to the desktop. (If the download takes to long or you only wish to examine specific files you can open them in the "Images" directories in the DE.)

    4. Examine the results for each of the four sequence read sets analyzed by opening the webpages (fastqc.report.html files) for each of them.

    5. Also examine the .png files in the Images directories.

    6. Notice that the data quality differs from read set to read set. Some sets have better scores at the longer lengths ("Per base quality"). Some sets may lose a significant portion of the total reads by using a high cutoff score ("per-sequence-quality"). The quality of a sequence varies with the position of the assessed units in the sequence ("Per base sequence content, Kmer content").

  7. Adapters: Following your analysis, determine whether the read sets are likely to show indication for being contaminated with adapter sequences. (For arguments sake, we will treat the sequence as if it is contaminated with adapters, and treat it in Operation #2 with Scythe to remove these adapters. However, see the Further Considerations section on the bottom for further deliberations on how to interpret the FastQC analysis results.)

  8. Read Quality: Following your analysis, estimate a reasonable cutoff setting for the quality scores for each sequence read file. (You will be using 12 and 15 when treating the read sets with Sickle in Operation #3, below. However, see the Further Considerations section on the bottom for further deliberations on how to interpret the FastQC analysis results.)

Operation 2: Remove adapter sequences (app: Scythe-0.991)

The Scythe-adapter-trimming app identifies adapter or primer sequences in your reads and removes them, using a fasta file of expected adapter/primer sequences. Ideally these expected sequences should be a comprehensive list of the primers and adapters used in preparing the sequencing library. If you aren’t sure, then a comprehensive file of a lot of different adapters and primers that might be used could be appropriate, but the settings should be adjusted to reduce the chance of randomly cutting real, organism-specific sequence. For the tutorial we use a large file of Illumina primer and adapter sequences, representing the most commonly expected contaminants from preparation of Illumina sequencing libraries. (Basic documentation: https://pods.iplantcollaborative.org/wiki/display/DEapps/Scythe-adapter-trimming)

  1. Open the Scythe-0.991 app 

    1. Name your analysis.

  2. Click on the "Settings" tab.

    1. Enter the appropriate fasta-formatted “Adapter file.” (For the sample data enter "illumina_adapters.fa" at Community Data > iplant_training > read_cleanup > sampledata > illumina_adapters.fa.)

    2. As “Input file” enter a fastq-formatted sequence file. (Use one of the sample sequence files at Community Data > iplant_training > read_cleanup > sampledata.)

    3. Enter a unique name for the "Output file."

  3. Click on the "Options" tab.

    1. As "Quality format" enter "sanger." (Click on the "i" button to see the choices available. Acceptable formats for fastq files are "solexa," "illumina," or "sanger." The default setting for the app is “illumina,” but even Illumina sequences are predominantly presented in sanger format.)

    2. Enter an appropriate "Minimum match" cut-off. (For the sample data enter "10." With sequences of your own you may have to experiment to identify the optimal setting.)

    3. Enter a name for the "matches file" if you wish to monitor the matches Scythe will find. (Retaining matches provides a good way to judge the level of primer contamination of the sequences.)

    4. Enter an appropriate value for "Prior," smaller values establish stronger matching requirements. (For the sample data enter “0.005.” For your own data you may want to experiment to identify the optimal setting.)

  4. Click "Launch Analysis."

  5. Repeat the Scythe trimming procedure with the other 3 .fq sequence files.

  6. Once an analysis is completed (approx. 5 min. with the sample data), click on "Analysis," and then click on the analysis name to open the output folder.

  7. Examine the results:

    1. The file size of the matches files are with 2% - 2.5% of the .fq output files at a normal level and don't indicate major contamination. This confirms the FastQC results that suggested some contamination, but not necessarily a major issue.

Operation 3: Filter sequences by quality (app: Sickle-quality-based-trimming)

The Sickle-quality-based-trimming app is used to trim and filter FastQ files based on quality values. It is best used with sequences that have already been trimmed of adapters, primer sequences, or other extraneous sequences. Sickle works with both, paired or single reads. However, for paired-end sequence files, Sickle (as other quality-based trimmers) may break pairing for some reads. Sickle automatically monitors this and moves orphaned reads into a file of single reads. Also, if the sequence file is to be used for assembly with an application that uses read correction (e.g. AllpathsLG) or kmer analysis, results may be better without quality-based trimming or with-less stringent trimming. (Basic documentation: https://pods.iplantcollaborative.org/wiki/display/DEapps/Sickle-quality-based-trimming)

Unable to render {include} The included page could not be found.