Jetstream Workflow That Creates Gene Expression Matrices (GEMs) from SRA/FASTQ NGS Files

Warning

This is an old version of the Pegasus-GEM Jetstream workflow tutorial. Please use the latest version instead: Pegasus-GEM Jetstream Workflow Tutorial - 2.02

Rationale and background

Pegasus-GEM is a Pegasus workflow that uses Jetstream resources to produce a Gene Expression Matrix (GEM) from DNA sequence files in FASTQ format. It is adapted from the OSG-GEM workflow, which currently runs on the Open Science Grid (OSG):

William L. Poehlman, Mats Rynge, Chris Branton, D. Balamurugan and Frank A. Feltus. OSG-GEM: Gene Expression Matrix Construction Using the Open Science Grid. Bioinformatics and Biology Insights 2016:10 133–141 doi: 10.4137/BBI.S38193.

Introduction

This workflow processes either paired-end or single-end FASTQ files to produce a matrix of normalized RNA molecule counts (FPKM). Pegasus-GEM can also download its inputs directly from NCBI SRA. An indexed reference genome and gene model annotation files must be obtained before configuring and running the workflow. The following tasks are directed by the Pegasus workflow manager:

...

Part 3: Set up a Pegasus-GEM run using the Terminal window

Step 1: Get oriented. A workflow-specific SSH key must be created. This key is used for some of the data staging steps of the workflow.

Code Block
# Create the .ssh directory if you don't already have one
$ mkdir -p ~/.ssh
# Generate the key pair (just press Enter when asked for a passphrase)
$ ssh-keygen -t rsa -b 2048 -f ~/.ssh/workflow

Generating public/private rsa key pair.
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/upendra/.ssh/workflow.
Your public key has been saved in /home/upendra/.ssh/workflow.pub.
The key fingerprint is:
e4:63:31:61:46:ef:52:0e:dd:04:70:bc:a8:46:0a:51 upendra@js-17-50.jetstream-cloud.org
The key's randomart image is:
+--[ RSA 2048]----+
|   .E  .*oo..    |
|  .    o =.o     |
|   .    =.+..    |
|  .   .o.B.      |
|   . o .S o      |
|    . o. o       |
|     .           |
|                 |
|                 |
+-----------------+
$ cat ~/.ssh/workflow.pub >>~/.ssh/authorized_keys
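
The key setup above can also be scripted so that it is safe to run more than once. The following is a minimal sketch, not part of Pegasus-GEM: the helper name setup_workflow_key and its optional directory argument are assumptions made for illustration. It uses an empty passphrase (-N ""), matching the interactive steps above, and appends the public key to authorized_keys only if it is not already there.

```shell
# Sketch only: setup_workflow_key is a hypothetical helper, not shipped with the workflow.
# Usage: setup_workflow_key [ssh_dir]   (ssh_dir defaults to ~/.ssh)
setup_workflow_key() {
    local ssh_dir="${1:-$HOME/.ssh}"
    local key="$ssh_dir/workflow"

    mkdir -p "$ssh_dir"
    chmod 700 "$ssh_dir"

    # Generate the key pair only if it does not exist yet (-N "" = empty passphrase).
    if [ ! -f "$key" ]; then
        ssh-keygen -q -t rsa -b 2048 -N "" -f "$key" || return 1
    fi

    # Authorize the public key once; skip if the exact line is already present.
    touch "$ssh_dir/authorized_keys"
    if ! grep -qxF -f "$key.pub" "$ssh_dir/authorized_keys"; then
        cat "$key.pub" >> "$ssh_dir/authorized_keys"
    fi
    chmod 600 "$ssh_dir/authorized_keys"
}
```

Calling setup_workflow_key with no argument operates on ~/.ssh. Running it a second time leaves the existing key and the authorized_keys file unchanged.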

Step 2: Example workflow setup. The workflow cloned from GitHub contains an example config file as well as example input files from chromosome 21 of Gencode Release 24 of the GRCh38 build of the human reference genome. Two small FASTQ files containing 200,000 sequences from NCBI dataset SRR1825962 are located in the Test_data directory of the workflow.

Code Block
$ git clone https://github.com/feltus/OSG-GEM.git

$ cd OSG-GEM

# To run the test workflow, first copy the example config file (osg-gem.conf.example):
$ cp osg-gem.conf.example osg-gem.conf

# Take a look at the config file first
$ cat osg-gem.conf
[reference]
reference_prefix = chr21-GRCh38

[inputs]
#
# List the inputs to process. Each line can either be a pair
# of forward and reverse files, separated by a space:
#
#    input1 = forward.fastq.gz reverse.fastq.gz
#
# or a single SRR number. Example:
#
#    input2 = DRR046893
#
# or a single fastq file for single end reads.  Example:
#    input3 = SRR4343300.fastq.gz

input1 = ./Test_data/TEST_1.fastq.gz ./Test_data/TEST_2.fastq.gz
#input1 = ./Test_data/SRR4343300.fastq.gz
#input2 = SRR4343300
#input2 = DRR046893

[config]

# Memory available to the jobs. This should be roughly 2X the
# size of the reference genome, rounded up whole GB
memory = 4 GB

# Reads are single end
single = False

# Reads are paired end
paired = True

# process using TopHat2
tophat2 = False

# process using Hisat2
hisat2 = True

# process using STAR
star = False

# process using Cufflinks
cufflinks = False

# process using StringTie
stringtie = True

...
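
Because the [config] section shown above is a set of True/False switches, a quick consistency check before submitting can catch simple mistakes. The sketch below is illustrative only: conf_get and check_conf are hypothetical helper names, and the two rules it enforces (single and paired must differ, and exactly one of tophat2, hisat2 and star is True) are assumptions inferred from the example config, not documented requirements of the workflow.

```shell
# Sketch only: conf_get and check_conf are hypothetical helpers.

# conf_get FILE KEY: print the value of the first "KEY = value" line, ignoring comments.
conf_get() {
    awk -F'=' -v key="$2" '
        /^[[:space:]]*#/ { next }                      # skip comment lines
        {
            k = $1
            gsub(/^[ \t]+|[ \t]+$/, "", k)             # trim the key
            if (k == key) {
                v = substr($0, index($0, "=") + 1)
                gsub(/^[ \t]+|[ \t]+$/, "", v)         # trim the value
                print v
                exit
            }
        }' "$1"
}

# check_conf FILE: return non-zero if the assumed rules are violated.
check_conf() {
    local conf="$1" status=0 n=0 tool

    # Assumption: reads are either single end or paired end, never both or neither.
    if [ "$(conf_get "$conf" single)" = "$(conf_get "$conf" paired)" ]; then
        echo "single and paired should not have the same value" >&2
        status=1
    fi

    # Assumption: exactly one aligner is switched on.
    for tool in tophat2 hisat2 star; do
        if [ "$(conf_get "$conf" "$tool")" = "True" ]; then
            n=$((n + 1))
        fi
    done
    if [ "$n" -ne 1 ]; then
        echo "expected exactly one aligner set to True, found $n" >&2
        status=1
    fi

    return "$status"
}
```

For example, check_conf osg-gem.conf prints nothing and returns 0 for the example config shown above, where only hisat2 is True and paired is True.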

Code Block
$ ./submit

# To check the status of your run, use pegasus-status. For example:
$ pegasus-status -l /mydata/OSG-GEM/runs/osg-gem-1498085153/workflow/osg-gem-1498085153 
(no matching jobs found in Condor Q)
UNRDY READY   PRE  IN_Q  POST  DONE  FAIL %DONE STATE   DAGNAME                                 
    0     0     0     0     0    26     0 100.0 Success 00/00/level-2/level-2.dag               
    0     0     0     0     0     6     0 100.0 Success *gem-0.dag                              
    0     0     0     0     0    32     0 100.0         TOTALS (32 jobs)                        
Summary: 2 DAGs total (Success:2)

# The final output file is located under the run's outputs directory
$ ls -lh runs/osg-gem-1498085153/outputs/merged_GEM.tab
-rw-r--r-- 1 upendra upendra 68K Jun 21 17:50 runs/osg-gem-1498085153/outputs/merged_GEM.tab 
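
Once the run has finished, the size of the merged matrix can be checked quickly from the shell. This is a rough sketch that assumes merged_GEM.tab is tab-delimited with one row per line, which the .tab extension suggests but which is not confirmed here. The helper name gem_shape is hypothetical.

```shell
# Sketch only: gem_shape is a hypothetical helper.
# Prints the number of rows and the widest column count of a tab-delimited file.
gem_shape() {
    awk -F'\t' '
        NF > cols { cols = NF }                        # track the widest row
        END { printf "%d rows, %d columns\n", NR, cols }' "$1"
}
```

It could be called as gem_shape runs/osg-gem-1498085153/outputs/merged_GEM.tab, using the output path from the example above.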

Monitoring Workflow

Pegasus provides a set of commands to monitor workflow progress. The path to the workflow files, as well as the commands to monitor the workflow, is printed to the screen when the workflow is submitted. For example:

Info
2016.05.26 23:31:03.859 CDT:   Your workflow has been started and is running in
2016.05.26 23:31:03.869 CDT:     /stash2/user/username/workflows/osg-gem-x/workflow/osg-gem-x
2016.05.26 23:31:03.880 CDT:   *** To monitor the workflow you can run ***
2016.05.26 23:31:03.891 CDT:     pegasus-status -l /stash2/user/username/workflows/osg-gem-x/workflow/osg-gem-x
2016.05.26 23:31:03.901 CDT:   *** To remove your workflow run ***
2016.05.26 23:31:03.912 CDT:     pegasus-remove /stash2/user/username/workflows/osg-gem-x/workflow/osg-gem-x
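
Instead of re-running pegasus-status by hand, the TOTALS row of its output can be polled in a loop. The sketch below assumes the column layout matches the status table shown earlier in this tutorial (FAIL in column 7, %DONE in column 8, the word TOTALS in column 9); other Pegasus versions may print a different layout. The helper names totals_field and wait_for_workflow are hypothetical.

```shell
# Sketch only: totals_field and wait_for_workflow are hypothetical helpers.

# totals_field N: read "pegasus-status -l" output on stdin and print column N of the TOTALS row.
totals_field() {
    awk -v n="$1" '$9 == "TOTALS" { print $n }'
}

# wait_for_workflow RUN_DIR [INTERVAL_SECONDS]
# Returns 0 when %DONE reaches 100.0, 1 if any job failed, 2 if the output cannot be parsed.
wait_for_workflow() {
    local run_dir="$1" interval="${2:-60}" out pct failed
    while :; do
        out="$(pegasus-status -l "$run_dir" 2>/dev/null)"
        pct="$(printf '%s\n' "$out" | totals_field 8)"
        failed="$(printf '%s\n' "$out" | totals_field 7)"

        if [ -z "$pct" ]; then
            echo "could not find a TOTALS row in the pegasus-status output" >&2
            return 2
        fi
        echo "done: ${pct}%  failed jobs: ${failed}"

        if [ "$failed" != "0" ]; then
            return 1
        fi
        if [ "$pct" = "100.0" ]; then
            return 0
        fi
        sleep "$interval"
    done
}
```

The run directory to pass as RUN_DIR is the path printed after "To monitor the workflow you can run" when the workflow is submitted.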

Output will be transferred to the base of this directory upon completion. For example:

 

Info

From here, the user may follow our documentation to modify the software options and to point to their own input datasets. Note that no test reference genome indices are available for STAR, because they are too large to upload to GitHub.

Pre-Workflow User Input

The user must provide indexed reference genome files as well as gene model annotation information before submitting the workflow. The user must select a reference prefix ($REF_PREFIX) that will be recognized by Pegasus as well as by Hisat2 or TopHat2. In addition, information about splice sites or a reference transcriptome must be provided to guide accurate mapping of split input files. Once the user has downloaded a reference genome FASTA file and gene annotation in GTF/GFF3 format, the following commands can be used to produce the necessary input files, using GRCh38 as an example $REF_PREFIX for Gencode Release 24 of the human reference genome:

...