Cleaning up your reads with the HTProcess Pipeline

A guide to using the first two steps of the HTProcess pipeline. The HTProcess Pipeline is intended to process and analyze multiple files at once while preserving key metadata, so that later analyses can be set up more automatically. Used properly, it lets the user run analyses with less effort and keeps the metadata with the files as they are modified and created. The entire pipeline in its various forms can be used for RNASeq or for genome sequence assembly, but the first 2 steps offer anyone working with sequencing data a comprehensive way to prepare their reads for further analysis. See also a webinar on this topic: iPlant Webinar- Next Generation Sequencing.

Step-by-step guide to clean sequences

HTProcess-prepare_directories_and_run_fastqc. This first step is a Discovery Environment (DE) workflow. A workflow in this context acts like an independent app in the DE, but actually consists of 2 or more apps that are linked together to run in a single step; the typical user does not need to worry about this distinction. The basic information on trying this app is found here: HTProcess-prepare_directories-and-run_fastqc. This app is important because it gathers up a library of related reads and records information about them to help in their downstream analysis. Much of the critical data about the reads is recorded in the manifest file (manifest_file.txt), which follows the reads through most or all of the workflow. All the boxes need to be filled in, each with a single name or number without spaces (remember this runs in Linux and Linux does not like spaces!). If you do not have paired reads, then record 0 for pair spacing and standard deviation.
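The "no spaces" rule above can be checked before submitting the job. The following is a minimal sketch, not part of the HTProcess apps: the function names are hypothetical, and the field names are simply borrowed from the sample manifest on this page.

```python
import re


def validate_field(name, value):
    """Return an error message for a bad value, or None if it looks fine."""
    text = str(value)
    if text == "":
        return f"{name}: value is empty"
    if re.search(r"\s", text):
        return f"{name}: value contains whitespace: {text!r}"
    return None


def validate_library(fields):
    """Check every field and return a list of problems (empty if none)."""
    problems = []
    for name, value in fields.items():
        message = validate_field(name, value)
        if message is not None:
            problems.append(message)
    return problems


# Hypothetical entries for the app's boxes; unpaired libraries would use 0
# for pair_spacing and pair_sd, as described above.
example = {
    "Library_name": "Belgica_control",
    "Library_num": 2,
    "condition": "control",
    "pair_spacing": 300,
    "pair_sd": 35,
}
for problem in validate_library(example):
    print(problem)
```

An empty result means every box holds a single token; a name such as "Belgica control" would be reported because of the space.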

Because sequencing reads in a single library may include paired and unpaired reads, this app inputs files in separate batches: one for each of the 2 members of the pairs, and one for any unpaired reads. It uses the names of the input files to match up the paired files and records the file names to identify them as paired or unpaired. The user must also enter a number of parameters that describe the library, and this information is also recorded in the manifest file. During the run, the application FastQC is run on each read file to determine its quality in detail. Some of the information from FastQC is also recorded directly in the manifest file. Typically FastQC will produce a report on the quality of a read file, and it will produce a series of graphs that present the quality information in a more meaningful way. For ease of use and to make the presentation of the data for a whole library more useful, a single report with all the graphs for all the reads within a library is created by HTProcess-prepare_directories_and_run_fastqc. That file is the fastqc_summary.html file. It is constructed in such a way that the user should need only to click on it to open a new tab in their browser with the report. A sample fastqc_summary.html file is here: fastqc_summary.html. It is strongly recommended to read through the FastQC results and apply the information they contain to the next step, which will be trimming with HTProcess_trimmomatic. More information on FastQC can be found here: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/. This link includes examples of typical good and bad results. Another helpful guide: http://dwheelerau.com/tag/fastqc/.
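To illustrate what matching paired files by name can look like, here is a small sketch. The exact rule the app uses is not documented on this page; the sketch assumes that mates differ only by a trailing _1 or _2 before the .fastq extension, as in the SRA-derived file names used in the examples here. The function name is hypothetical.

```python
import re

# Assumed naming rule: "<stem>_1.fastq" and "<stem>_2.fastq" are mates.
MATE_PATTERN = re.compile(r"^(?P<stem>.+)_(?P<mate>[12])\.fastq$")


def match_pairs(filenames):
    """Split file names into (pairs, unpaired) using the assumed _1/_2 rule."""
    mates = {}      # stem -> {"1": file name, "2": file name}
    unpaired = []
    for name in filenames:
        m = MATE_PATTERN.match(name)
        if m is None:
            unpaired.append(name)
        else:
            mates.setdefault(m.group("stem"), {})[m.group("mate")] = name
    pairs = []
    for stem in sorted(mates):
        found = mates[stem]
        if "1" in found and "2" in found:
            pairs.append((found["1"], found["2"]))
        else:
            # Only one mate is present, so treat it as an unpaired read file.
            unpaired.extend(found.values())
    return pairs, sorted(unpaired)


pairs, unpaired = match_pairs([
    "SRR566981.sra_1.fastq",
    "SRR566981.sra_2.fastq",
    "SRR567164.sra_1.fastq",
])
print(pairs)
print(unpaired)
```

Running a check like this on your file names before uploading can reveal a missing mate early, before the app records the library in the manifest.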

In addition to the fastqc_summary.html file, HTProcess-prepare_directories_and_run_fastqc outputs a folder that contains the processed reads and the manifest file, and a log file. The folder is named HTProcess_Reads. This folder is what the user must drag and drop into the input of the next step in the HTProcess Pipeline, HTProcess_trimmomatic. It is always a good idea to take a brief look at the contents of the HTProcess_Reads directory to confirm that all the read files are there, and to check the manifest file to make sure all the files are recorded there also. A sample manifest file:
HTProcess1_Reads
......................................................
Library_name=Belgica_control
Library_num=2
condition=control
pair_spacing=300
pair_sd=35
pair_type=fragment
pairing=paired-only
encoding_SRR566981.sra_1.fastq=1.5
max_len_SRR566981.sra_1.fastq=78
encoding_SRR567164.sra_1.fastq=1.5
max_len_SRR567164.sra_1.fastq=76
encoding_SRR567165.sra_1.fastq=1.5
max_len_SRR567165.sra_1.fastq=78
encoding_SRR566981.sra_2.fastq=1.5
max_len_SRR566981.sra_2.fastq=78
encoding_SRR567164.sra_2.fastq=1.5
max_len_SRR567164.sra_2.fastq=76
encoding_SRR567165.sra_2.fastq=1.5
max_len_SRR567165.sra_2.fastq=78
library_max=78
Paired Reads
!PPP SRR566981.sra_1.fastq,SRR566981.sra_2.fastq
!PPP SRR567164.sra_1.fastq,SRR567164.sra_2.fastq
!PPP SRR567165.sra_1.fastq,SRR567165.sra_2.fastq
Reads1
!XXX SRR566981.sra_1.fastq
!XXX SRR567164.sra_1.fastq
!XXX SRR567165.sra_1.fastq
Reads2
!YYY SRR566981.sra_2.fastq
!YYY SRR567164.sra_2.fastq
!YYY SRR567165.sra_2.fastq
......................................................
!!!TRIM SETTINGS!!!
......................................................
!PairTrim 1 SRR566981.sra_1.fastq,SRR566981.sra_2.fastq
!PairTrim 1 SRR567164.sra_1.fastq,SRR567164.sra_2.fastq
!PairTrim 1 SRR567165.sra_1.fastq,SRR567165.sra_2.fastq
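
The check recommended above (every read file present in the directory and recorded in the manifest) can be scripted once the manifest is downloaded. The following is a minimal sketch based only on the layout visible in the sample manifest: key=value lines, tagged lines starting with a single "!", and dotted separator lines. The function names are hypothetical and the parser is not the one used by the pipeline.

```python
def parse_manifest(text):
    """Parse manifest text into (settings, tagged_entries)."""
    settings = {}
    entries = []   # list of (tag, [remaining fields])
    for raw in text.splitlines():
        line = raw.strip()
        if not line or set(line) == {"."}:
            continue                      # blank line or dotted separator
        if line.startswith("!!!"):
            continue                      # section banner such as !!!TRIM SETTINGS!!!
        if line.startswith("!"):
            parts = line[1:].split()
            if parts:
                entries.append((parts[0], parts[1:]))
        elif "=" in line:
            key, value = line.split("=", 1)
            settings[key] = value
        # Any other line (for example "Paired Reads") is a plain label; skip it.
    return settings, entries


def listed_files(entries, tags=("PPP", "XXX", "YYY")):
    """Collect the file names recorded under the given tags."""
    files = set()
    for tag, fields in entries:
        if tag in tags and fields:
            files.update(fields[-1].split(","))
    return files


def missing_files(manifest_text, present_names):
    """Return manifest-listed files that are absent from present_names."""
    _, entries = parse_manifest(manifest_text)
    return sorted(listed_files(entries) - set(present_names))
```

With a local copy of the directory, something like missing_files(open("HTProcess_Reads/manifest_file.txt").read(), os.listdir("HTProcess_Reads")) should return an empty list when nothing is missing; the path shown is an assumption about where the downloaded copy lives.
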
HTProcess_trimmomatic. This step takes the HTProcess_Reads directory created by HTProcess-prepare_directories_and_run_fastqc and runs a comprehensive read trimming program, Trimmomatic, on each read file. It uses the files listed in the manifest file as a guide. In the example manifest file given above, there are only paired files, and the files to be trimmed are listed with the tag !PairTrim. This tag is what guides HTProcess_trimmomatic to the right files. The number 1 that follows !PairTrim tells HTProcess_trimmomatic to use program 1 for trimming. The user can take control of the trimming process by using up to 2 different programs for trimming. Typically most of the reads in a library will be similar enough in character that they may be trimmed effectively with the same settings, but if the user sees one or more files that they would like to process differently, they may do this by opening the manifest file in the DE and changing the 1 to a 2 to turn on the second program for those reads. Be sure to save the changes in the manifest file. The 2 different trimming programs are accessed in 2 separate panels of parameters in the app. The second panel often is not used, because only a single set of trimming settings is needed.
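
The manifest edit described above is normally done by hand in the DE. For a library with many files, the same change could be made on a downloaded copy with a short script. This is a sketch under the assumption that every trim line has the form shown in the sample manifest (tag, program number, comma-separated file pair); the function name is hypothetical.

```python
def set_trim_program(manifest_text, filename, program):
    """Return manifest text with the program number changed on the
    !PairTrim line that lists the given file."""
    if program not in (1, 2):
        raise ValueError("program must be 1 or 2")
    out = []
    for line in manifest_text.splitlines():
        parts = line.split()
        is_target = (
            len(parts) == 3
            and parts[0] == "!PairTrim"
            and filename in parts[2].split(",")
        )
        if is_target:
            line = f"{parts[0]} {program} {parts[2]}"
        out.append(line)
    # Write the result back with a trailing newline.
    return "\n".join(out) + "\n"


before = (
    "!PairTrim 1 SRR566981.sra_1.fastq,SRR566981.sra_2.fastq\n"
    "!PairTrim 1 SRR567164.sra_1.fastq,SRR567164.sra_2.fastq\n"
)
print(set_trim_program(before, "SRR567164.sra_1.fastq", 2))
```

Only the line that lists the named file is changed, so the rest of the library continues to use program 1.
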

Trimmomatic has a lot of different settings, so before doing your first run, it's a good idea to go over what they are. Basic info on running HTProcess_trimmomatic: HTProcess_trimmomatic_0.33. Manual and other information for the application Trimmomatic: http://www.usadellab.org/cms/?page=trimmomatic. This application includes tools for removing fixed numbers of bases from the front or back of a read, a tool for recognizing and trimming adapter sequences, quality thresholds for chopping off the beginning or end of a read, a sliding window quality trimmer, an information-maximizing quality trimmer, and a setting for the minimum length for a read after trimming. The user can use one or many of the trimmers at once.
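
For orientation, the trimmers listed above correspond to named steps in standalone Trimmomatic: HEADCROP and CROP (fixed-position cuts), ILLUMINACLIP (adapters), LEADING and TRAILING (quality thresholds at the ends), SLIDINGWINDOW, MAXINFO, and MINLEN. The sketch below builds a paired-end command line of the kind shown in the Trimmomatic manual. It is not how the HTProcess_trimmomatic app invokes the program internally; the output file names and the helper function are hypothetical, and the step values are the manual's example settings rather than recommendations for your data.

```python
def trimmomatic_pe_command(read1, read2, steps, jar="trimmomatic-0.33.jar"):
    """Build an argument list for a standalone paired-end Trimmomatic run."""
    # Paired-end mode writes 4 outputs: surviving pairs and orphaned singles
    # for each input file. These output names are placeholders.
    outputs = [
        f"paired_{read1}", f"unpaired_{read1}",
        f"paired_{read2}", f"unpaired_{read2}",
    ]
    return ["java", "-jar", jar, "PE", read1, read2, *outputs, *steps]


# Steps are applied in the order given.
steps = [
    "ILLUMINACLIP:TruSeq3-PE.fa:2:30:10",  # adapter file and clipping thresholds
    "LEADING:3",                           # trim low-quality bases from the start
    "TRAILING:3",                          # trim low-quality bases from the end
    "SLIDINGWINDOW:4:15",                  # window size 4, required mean quality 15
    "MINLEN:36",                           # drop reads shorter than 36 after trimming
]

command = trimmomatic_pe_command(
    "SRR566981.sra_1.fastq", "SRR566981.sra_2.fastq", steps
)
print(" ".join(command))
```

Seeing the steps as an ordered list may help when filling in the app's panels: each panel field maps onto one of these step settings, and leaving a trimmer off simply omits its step.
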

Once the trimming is complete, HTProcess_trimmomatic runs FastQC and produces a new fastqc_summary.html, so that the user can determine if the trimming accomplished what it needed to accomplish. If not, rerun the job with changes that may do a better job of trimming. The main output for HTProcess_trimmomatic is the HTProcess_Reads_T1 directory, which contains all the trimmed reads and an updated manifest file. An example manifest file after trimming:
HTProcess1_Reads
......................................................
Library_name=Belgica_control
Library_num=2
condition=control
pair_spacing=300
pair_sd=35
pair_type=fragment
pairing=paired-only
encoding_SRR566981.sra_1.fastq=1.5
max_len_SRR566981.sra_1.fastq=78
encoding_SRR567164.sra_1.fastq=1.5
max_len_SRR567164.sra_1.fastq=76
encoding_SRR567165.sra_1.fastq=1.5
max_len_SRR567165.sra_1.fastq=78
encoding_SRR566981.sra_2.fastq=1.5
max_len_SRR566981.sra_2.fastq=78
encoding_SRR567164.sra_2.fastq=1.5
max_len_SRR567164.sra_2.fastq=76
encoding_SRR567165.sra_2.fastq=1.5
max_len_SRR567165.sra_2.fastq=78
library_max=78
Paired Reads
!PPP SRR566981.sra_1.fastq,SRR566981.sra_2.fastq
!PPP SRR567164.sra_1.fastq,SRR567164.sra_2.fastq
!PPP SRR567165.sra_1.fastq,SRR567165.sra_2.fastq
Reads1
!XXX SRR566981.sra_1.fastq
!XXX SRR567164.sra_1.fastq
!XXX SRR567165.sra_1.fastq
Reads2
!YYY SRR566981.sra_2.fastq
!YYY SRR567164.sra_2.fastq
!YYY SRR567165.sra_2.fastq
......................................................
!!!TRIM SETTINGS!!!
......................................................
!PairTrim 1 SRR566981.sra_1.fastq,SRR566981.sra_2.fastq
!PairTrim 1 SRR567164.sra_1.fastq,SRR567164.sra_2.fastq
!PairTrim 1 SRR567165.sra_1.fastq,SRR567165.sra_2.fastq
......................................................
!!!TRIMMED READS!!!
......................................................
!TRIMMED_Pr TrmPr1_SRR566981.sra_1.fastq,TrmPr2_SRR566981.sra_2.fastq
!TRIMMED_Pr TrmPr1_SRR567164.sra_1.fastq,TrmPr2_SRR567164.sra_2.fastq
!TRIMMED_Pr TrmPr1_SRR567165.sra_1.fastq,TrmPr2_SRR567165.sra_2.fastq
!TRIMMED_S TrmS_Belgica_control.fastq
......................................................
......................................................
......................................................
......................................................
!!!TRIMMED ORPHAN AND INDIVIDUAL SINGLES!!!
......................................................
Not used for normal analysis with a completely uniform library
......................................................
!TRIMMED_OS TrmSos_SRR566981.sra_1.fastq
!TRIMMED_OS TrmSos_SRR567164.sra_1.fastq
!TRIMMED_OS TrmSos_SRR567165.sra_1.fastq

In this example, note that there are entries for trimmed single reads, tagged with !TRIMMED_OS, even though the library started with only paired read files. These single read files result from quality trimming, which may eliminate one of the reads in a pair, leaving its mate as a single read. It is a good idea to check closely the results for these entries in the fastqc_summary.html file. Often if a read's mate is of low quality, then it, too, may be marginal. Here is the fastqc_summary file for this set of reads: fastqc_summary.html. In this example, as is often the case, these files contain relatively few reads, but these reads are not of terrible quality, so the user may choose to retain them. If you do find that you don't like the quality of a read file, it can be eliminated by editing the manifest file and removing the entire line for that file, starting with !TRIMMED_OS. The file itself may also be deleted, but for much of the HTProcess Pipeline, if the read file is not entered in the manifest file, it will not be used for further processing. Removing the file from the HTProcess_Reads_T1 directory without also removing its line from the manifest file will cause errors in downstream processing.
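
Removing an unwanted entry can also be done on a downloaded copy of the manifest with a few lines of code. The sketch below assumes the two-field line format shown in the example manifest (tag, then file name); the function name is hypothetical, and editing the file by hand in the DE works just as well.

```python
def drop_trimmed_entry(manifest_text, filename, tag="!TRIMMED_OS"):
    """Return (new_text, removed_count) with the tagged line for
    filename removed from the manifest text."""
    kept = []
    removed = 0
    for line in manifest_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == tag and parts[1] == filename:
            removed += 1          # skip this line entirely
            continue
        kept.append(line)
    return "\n".join(kept) + "\n", removed


manifest = (
    "!TRIMMED_S TrmS_Belgica_control.fastq\n"
    "!TRIMMED_OS TrmSos_SRR566981.sra_1.fastq\n"
    "!TRIMMED_OS TrmSos_SRR567164.sra_1.fastq\n"
)
new_text, removed = drop_trimmed_entry(manifest, "TrmSos_SRR567164.sra_1.fastq")
print(removed)
print(new_text)
```

Checking the returned count guards against a mistyped file name: a count of 0 means nothing was removed and the manifest is unchanged.
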

Once the user is satisfied with the trimming results for the sequencing files, they are ready for some of the more interesting analysis steps, whether it be mapping for RNASeq analysis or use for transcriptome or whole genome assembly. The results for these later analytical steps will benefit greatly from your taking the time to clean up the reads properly.
