Jetstream Workflow That Creates Gene Expression Matrices (GEMs) from SRA/FASTQ NGS Files
Rationale and background
Pegasus-GEM is a Pegasus workflow that uses Jetstream resources to produce a Gene Expression Matrix (GEM) from DNA sequence files in FASTQ format. It is adapted from the OSG-GEM workflow, which currently runs on the Open Science Grid (OSG).
William L. Poehlman, Mats Rynge, Chris Branton, D. Balamurugan and Frank A. Feltus. OSG-GEM: Gene Expression Matrix Construction Using the Open Science Grid. Bioinformatics and Biology Insights 2016:10 133–141 doi: 10.4137/BBI.S38193.
Introduction
This workflow processes either paired-end or single-end FASTQ files to produce a matrix of normalized RNA molecule counts (FPKM). Pegasus-GEM can also download input directly from NCBI SRA for processing. An indexed reference genome and gene model annotation files must be obtained before configuring and running the workflow. The following tasks are directed by the Pegasus workflow manager:
...
Part 3: Set up a Pegasus-GEM run using the Terminal window
Step 1: Get oriented. Create a workflow-specific SSH key. This key is used for some of the data staging steps of the workflow.

    # Create a directory .ssh if you don't have that already
    $ mkdir -p ~/.ssh
    $ ssh-keygen -t rsa -b 2048 -f ~/.ssh/workflow
    (Just hit enter when asked for passphrase)
    Generating public/private rsa key pair.
    Enter passphrase (empty for no passphrase):
    Enter same passphrase again:
    Your identification has been saved in /home/upendra/.ssh/workflow.
    Your public key has been saved in /home/upendra/.ssh/workflow.pub.
    The key fingerprint is:
    e4:63:31:61:46:ef:52:0e:dd:04:70:bc:a8:46:0a:51 upendra@js-17-50.jetstream-cloud.org
    The key's randomart image is:
    +--[ RSA 2048]----+
    | .E .*oo.. |
    | . o =.o |
    | . =.+.. |
    | . .o.B. |
    | . o .S o |
    | . o. o |
    | . |
    | |
    | |
    +-----------------+
    $ cat ~/.ssh/workflow.pub >>~/.ssh/authorized_keys

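Re-running the last command above appends the same public key a second time. A minimal sketch of an idempotent variant follows. The function name `authorize_key` is illustrative and not part of the workflow, and the `chmod 600` step reflects common OpenSSH practice rather than anything stated in this guide.

```shell
# Hypothetical helper (not part of Pegasus-GEM): append a public key to an
# authorized_keys file only if that exact line is not already present.
authorize_key() {
    pub="$1"     # path to the public key, e.g. ~/.ssh/workflow.pub
    auth="$2"    # path to the authorized_keys file
    touch "$auth"
    # -q: quiet, -x: match whole lines only, -F: treat the key as a fixed string
    if ! grep -qxF -- "$(cat "$pub")" "$auth"; then
        cat "$pub" >> "$auth"
    fi
    # Restrict the file to its owner (assumption: sshd may reject looser modes)
    chmod 600 "$auth"
}
```

It could be called as `authorize_key ~/.ssh/workflow.pub ~/.ssh/authorized_keys`. Calling it repeatedly leaves a single copy of the key in the file.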
Step 2: Example Workflow Setup. The workflow cloned from GitHub contains an example config file as well as example input files from chromosome 21 of Gencode Release 24 of the GRCh38 build of the human reference genome. Two small FASTQ files containing 200,000 sequences from NCBI dataset SRR1825962 are in the Test_data directory of the workflow.

    $ git clone https://github.com/feltus/OSG-GEM.git
    $ cd OSG-GEM
    # To run the test workflow, the user must copy the osg-gem.conf.example file like this
    $ cp osg-gem.conf.example osg-gem.conf
    # Take a look at the config file first
    $ cat osg-gem.conf
    [reference]
    reference_prefix = chr21-GRCh38

    [inputs]
    #
    # List the inputs to process. Each line can either be a pair
    # of forward and reverse files, separated by a space:
    #
    #   input1 = forward.fastq.gz reverse.fastq.gz
    #
    # or a single SRR number. Example:
    #
    #   input2 = DRR046893
    #
    # or a single fastq file for single end reads. Example:
    #   input3 = SRR4343300.fastq.gz
    input1 = ./Test_data/TEST_1.fastq.gz ./Test_data/TEST_2.fastq.gz
    #input1 = ./Test_data/SRR4343300.fastq.gz
    #input2 = SRR4343300
    #input2 = DRR046893

    [config]
    # Memory available to the jobs. This should be roughly 2X the
    # size of the reference genome, rounded up whole GB
    memory = 4 GB
    # Reads are single end
    single = False
    # Reads are paired end
    paired = True
    # process using TopHat2
    tophat2 = False
    # process using Hisat2
    hisat2 = True
    # process using STAR
    star = False
    # process using Cufflinks
    cufflinks = False
    # process using StringTie
    stringtie = True

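Before submitting, it can help to confirm that the read-type and aligner switches in the config are consistent. The sketch below is not part of OSG-GEM: `conf_get` and `count_enabled` are hypothetical helpers, the parser ignores section headers, and it assumes that exactly one of `single`/`paired` and exactly one of `tophat2`/`hisat2`/`star` should be `True`. The example config suggests this, but this guide does not state it.

```shell
# Hypothetical helper: print the value of the first uncommented
# "key = value" line in an INI-style file (section headers are ignored).
conf_get() {
    awk -F '=' -v key="$1" '
        /^[[:space:]]*#/ { next }                # skip comment lines
        {
            k = $1
            gsub(/^[ \t]+|[ \t]+$/, "", k)       # trim the key
            if (k == key) {
                v = substr($0, index($0, "=") + 1)
                gsub(/^[ \t]+|[ \t]+$/, "", v)   # trim the value
                print v
                exit
            }
        }' "$2"
}

# Hypothetical helper: count how many of the given keys are set to True.
count_enabled() {
    file="$1"
    shift
    n=0
    for key in "$@"; do
        if [ "$(conf_get "$key" "$file")" = "True" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}
```

For example, `count_enabled osg-gem.conf single paired` and `count_enabled osg-gem.conf tophat2 hisat2 star` would each be expected to print `1` under the assumption above.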
...

    $ ./submit
    # To check the status of your run. For example, run
    $ pegasus-status -l /mydata/OSG-GEM/runs/osg-gem-1498085153/workflow/osg-gem-1498085153
    (no matching jobs found in Condor Q)
    UNRDY READY   PRE  IN_Q  POST  DONE  FAIL %DONE STATE   DAGNAME
        0     0     0     0     0    26     0 100.0 Success 00/00/level-2/level-2.dag
        0     0     0     0     0     6     0 100.0 Success *gem-0.dag
        0     0     0     0     0    32     0 100.0         TOTALS (32 jobs)
    Summary: 2 DAGs total (Success:2)
    # Final output file is located in
    $ ls -lh runs/osg-gem-1498085153/outputs/merged_GEM.tab
    -rw-r--r-- 1 upendra upendra 68K Jun 21 17:50 runs/osg-gem-1498085153/outputs/merged_GEM.tab

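A quick sanity check on the merged matrix is to count its rows and columns. This sketch assumes `merged_GEM.tab` is tab-delimited, as the extension suggests. The exact header and column layout are not described in this guide, and `gem_shape` is an illustrative name.

```shell
# Hypothetical helper: print "<rows> <columns>" for a tab-delimited file,
# taking the column count from the first line only.
gem_shape() {
    awk -F '\t' 'NR == 1 { cols = NF } END { print NR + 0, cols + 0 }' "$1"
}
```

It could be run as `gem_shape runs/osg-gem-1498085153/outputs/merged_GEM.tab`. A row count near zero would suggest the merge step produced an empty matrix.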
Monitoring Workflow
Pegasus provides a set of commands to monitor workflow progress. The path to the workflow files, as well as the commands to monitor the workflow, is printed to the screen when the workflow is submitted.
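If the submit output has scrolled away, the workflow directory can be reconstructed from the layout seen in the `pegasus-status` example earlier (`runs/<run-id>/workflow/<run-id>`). The following sketch locates the newest run. `latest_run_dir` is a hypothetical helper, and it assumes run IDs of the form `osg-gem-<timestamp>`, so that a lexical sort puts the newest run last.

```shell
# Hypothetical helper: print the workflow directory of the newest run found
# under a runs directory (defaults to ./runs). Returns 1 if no run exists.
latest_run_dir() {
    runs="${1:-runs}"
    # Run IDs end in a timestamp, so the last entry after sorting is the newest
    id=$(ls -1 "$runs" 2>/dev/null | grep '^osg-gem-' | sort | tail -n 1)
    [ -n "$id" ] || return 1
    echo "$runs/$id/workflow/$id"
}
```

From the OSG-GEM directory, `pegasus-status -l "$(latest_run_dir)"` would then report on the most recent submission.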
Upon completion, output is transferred to the outputs directory at the base of the run directory, for example runs/osg-gem-1498085153/outputs/merged_GEM.tab.
Info |
---|
From here, the user may follow our documentation to modify the software options as well as point to their own input datasets. Note that there are no test reference genome indices available for STAR, because they are too large to upload to GitHub. |
Pre-Workflow User Input
The user must provide indexed reference genome files as well as gene model annotation information before submitting the workflow. The user must select a reference prefix ($REF_PREFIX) that will be recognized by Pegasus as well as by Hisat2 or TopHat2. In addition, information about splice sites or a reference transcriptome must be provided to guide accurate mapping of split input files. Once the user has downloaded a reference genome FASTA file and gene annotation in GTF/GFF3 format, the following commands can be used to produce the necessary input files, using GRCh38 as an example $REF_PREFIX for Gencode Release 24 of the human reference genome:
...