Jetstream Workflow That Creates Gene Expression Matrices (GEMs) from SRA/FASTQ NGS Files
Rationale and background
Pegasus-GEM is a Pegasus workflow that uses Jetstream resources to produce a Gene Expression Matrix (GEM) from DNA sequence files in FASTQ format. It is adapted from the OSG-GEM workflow, which currently runs on the Open Science Grid (OSG).
William L. Poehlman, Mats Rynge, Chris Branton, D. Balamurugan and Frank A. Feltus. OSG-GEM: Gene Expression Matrix Construction Using the Open Science Grid. Bioinformatics and Biology Insights 2016:10 133–141 doi: 10.4137/BBI.S38193.
Introduction
This workflow processes either paired-end or single-end FASTQ files to produce a matrix of normalized RNA molecule counts (FPKM). Pegasus-GEM also supports downloading inputs directly from NCBI SRA for processing. An indexed reference genome and gene model annotation files must be obtained before configuring and running the workflow. The following tasks are directed by the Pegasus workflow manager:
...
6.1 Add ssh keys on your computer
```
$ ssh-add
# Hit enter when asked for passphrase
Enter passphrase for /Users/upendra_35/.ssh/id_rsa:
Enter passphrase for /Users/upendra_35/.ssh/id_dsa:
# Make sure that the ssh key has been added
$ ssh-add -L
# This will show the public key that you have added
```
6.2 ssh key forwarding to the MASTER
```
$ ssh -A <xsede username>@<MASTER ip.address>
# For example
$ ssh -A upendra@149.165.169.158
# Do not use this username and ip address as they will not work for you!!!!
```
Part 3: Set up a Pegasus-GEM 2.0 run using the Terminal window
Step 1. Get oriented. A workflow-specific ssh key has to be created. This key is used for some of the data staging steps of the workflow.
```
$ mkdir -p ~/.ssh
$ ssh-keygen -t rsa -b 2048 -f ~/.ssh/workflow
# Hit enter when asked for passphrase
$ cat ~/.ssh/workflow.pub >> ~/.ssh/authorized_keys
```
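Because the last command in Step 1 appends with `>>`, repeating the step leaves duplicate entries in `authorized_keys`. The following is a minimal sketch of an idempotent alternative. The function name `authorize_key` is hypothetical and not part of Pegasus-GEM, and the `chmod 600` is a conventional precaution rather than something the workflow documentation requires.

```shell
# authorize_key PUBKEY_FILE AUTHORIZED_KEYS_FILE
# Hypothetical helper (not part of Pegasus-GEM): appends the public key
# only if the exact line is not already present in the target file.
authorize_key() {
    local pub="$1"
    local auth="$2"
    # -F: fixed string, -x: whole-line match, -q: quiet
    if ! grep -qxF -- "$(cat "$pub")" "$auth" 2>/dev/null; then
        cat "$pub" >> "$auth"
    fi
    # Keep the file readable and writable by the owner only
    chmod 600 "$auth"
}
```

For the key created above, the call would be `authorize_key ~/.ssh/workflow.pub ~/.ssh/authorized_keys`.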
Step 2. Copy the example data into your working directory (~/vol_b in this example), change its ownership to your user, and navigate into it.
```
$ cd ~/vol_b
$ sudo cp -r /opt/Pegasus-GEM_example_data .
$ sudo chown -hR ${USER} Pegasus-GEM_example_data
$ sudo chgrp -hR ${USER} Pegasus-GEM_example_data
$ cd Pegasus-GEM_example_data
```
```
$ ls
dax.xml       LICENSE       osg-hosts     README.md  sites.xml.template  submit      Test_data  useful_files
email-notify  osg-gem.conf  pegasus.conf  reference  staging             task-files  tools      worker-launch.yml
```
List the contents in the Test_data folder
```
$ ls Test_data
SRR4343300.fastq.gz  TEST_1.fastq.gz  TEST_2.fastq.gz
```
The example data contains a config file (osg-gem.conf) as well as input files from the 21st chromosome of Gencode Release 24 of the GRCh38 build of the human reference genome. Two small FASTQ files containing 200,000 sequences from NCBI dataset SRR1825962 lie within the Test_data directory of the example data folder.
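To confirm that the test inputs copied intact, the number of records in each file can be compared with the 200,000 sequences mentioned above. This is a minimal sketch, assuming the files use the standard four-line FASTQ layout (header, sequence, `+`, qualities). The function name `count_fastq_reads` is hypothetical.

```shell
# count_fastq_reads FILE.fastq.gz
# Hypothetical helper: prints the number of records in a gzipped FASTQ file,
# assuming every record occupies exactly four lines.
count_fastq_reads() {
    local lines
    lines=$(gzip -dc "$1" | wc -l)
    echo $(( lines / 4 ))
}
```

For example, `count_fastq_reads Test_data/TEST_1.fastq.gz` prints the record count for the first test file. A FASTQ file that wraps sequences over several lines would be miscounted by this approach.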
Step 3. Setting up the VM and password on the MASTER
...
```
$ ./submit
Adding reference files ...
Adding reference files ...
chr21-GRCh38.1.bt2
chr21-GRCh38.1.ht2
chr21-GRCh38.2.bt2
chr21-GRCh38.2.ht2
chr21-GRCh38.3.bt2
chr21-GRCh38.3.ht2
chr21-GRCh38.4.bt2
chr21-GRCh38.4.ht2
chr21-GRCh38.5.ht2
chr21-GRCh38.6.ht2
chr21-GRCh38.7.ht2
chr21-GRCh38.8.ht2
chr21-GRCh38.Splice_Sites.txt
chr21-GRCh38.fa
chr21-GRCh38.rev.1.bt2
chr21-GRCh38.rev.2.bt2
chr21-GRCh38.transcriptome_data.tar.gz
Adding gff3 file ...
chr21-gencode.v24.annotation.gff3
Found input input1: ./Test_data/TEST_1.fastq.gz ./Test_data/TEST_2.fastq.gz
An 'Output's directory will be created within the base of the workflow directory.
This directory, /home/upendra/Pegasus-GEM_example_data/runs/osg-gem-1510267501/outputs
will have a 'merged_GEM.tab' file, an expression vector for each individual file, and all standard output files from trimmomatic/hisat2 jobs.
2017.11.09 16:45:09.835 CST:
2017.11.09 16:45:09.841 CST: -----------------------------------------------------------------------
2017.11.09 16:45:09.846 CST: File for submitting this DAG to HTCondor : gem-0.dag.condor.sub
2017.11.09 16:45:09.851 CST: Log of DAGMan debugging messages : gem-0.dag.dagman.out
2017.11.09 16:45:09.857 CST: Log of HTCondor library output : gem-0.dag.lib.out
2017.11.09 16:45:09.862 CST: Log of HTCondor library error messages : gem-0.dag.lib.err
2017.11.09 16:45:09.867 CST: Log of the life of condor_dagman itself : gem-0.dag.dagman.log
2017.11.09 16:45:09.873 CST:
2017.11.09 16:45:09.878 CST: -no_submit given, not submitting DAG to HTCondor. You can do this with:
2017.11.09 16:45:09.888 CST: -----------------------------------------------------------------------
2017.11.09 16:45:28.857 CST: Created Pegasus database in: sqlite:////home/upendra/.pegasus/workflow.db
2017.11.09 16:45:28.863 CST: Your database is compatible with Pegasus version: 4.8.0
2017.11.09 16:45:29.296 CST: Submitting to condor gem-0.dag.condor.sub
2017.11.09 16:45:33.157 CST: Submitting job(s).
2017.11.09 16:45:33.163 CST: 1 job(s) submitted to cluster 70.
2017.11.09 16:45:33.168 CST:
2017.11.09 16:45:33.173 CST: Your workflow has been started and is running in the base directory:
2017.11.09 16:45:33.179 CST:
2017.11.09 16:45:33.184 CST: /home/upendra/Pegasus-GEM_example_data/runs/osg-gem-1510267501/workflow/osg-gem-1510267501
2017.11.09 16:45:33.189 CST:
2017.11.09 16:45:33.195 CST: *** To monitor the workflow you can run ***
2017.11.09 16:45:33.200 CST:
2017.11.09 16:45:33.205 CST: pegasus-status -l /home/upendra/Pegasus-GEM_example_data/runs/osg-gem-1510267501/workflow/osg-gem-1510267501
2017.11.09 16:45:33.211 CST:
2017.11.09 16:45:33.216 CST: *** To remove your workflow run ***
2017.11.09 16:45:33.221 CST:
2017.11.09 16:45:33.227 CST: pegasus-remove /home/upendra/Pegasus-GEM_example_data/runs/osg-gem-1510267501/workflow/osg-gem-1510267501
2017.11.09 16:45:33.232 CST:
2017.11.09 16:45:34.318 CST: Time taken to execute is 28.401 seconds
```
Monitoring Workflow: Pegasus provides a set of commands to monitor workflow progress. The path to the workflow files and the commands for monitoring the workflow are printed to the screen when the workflow is submitted. For example:
```
$ pegasus-status -l /home/upendra/Pegasus-GEM_example_data/runs/osg-gem-1510267501/workflow/osg-gem-1510267501
STAT IN_STATE JOB
UNRDY READY PRE IN_Q POST DONE FAIL %DONE STATE DAGNAME
```
...
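To follow a long run without retyping the command, the status check can be repeated at a fixed interval. The following is a minimal sketch, not a Pegasus feature: `poll_status` is a hypothetical helper, and `STATUS_CMD` is an assumed override (defaulting to `pegasus-status`) that lets the loop be dry-run with a stand-in command such as `echo`.

```shell
# poll_status RUN_DIR [INTERVAL_SECONDS] [MAX_POLLS]
# Hypothetical helper: reruns the status command at a fixed interval.
# STATUS_CMD is an assumed override for dry runs; it defaults to pegasus-status.
poll_status() {
    local run_dir="$1"
    local interval="${2:-60}"
    local max_polls="${3:-10}"
    local i=1
    while [ "$i" -le "$max_polls" ]; do
        # Timestamp each poll so successive reports can be told apart
        date
        "${STATUS_CMD:-pegasus-status}" -l "$run_dir"
        # Sleep between polls, but not after the last one
        if [ "$i" -lt "$max_polls" ]; then
            sleep "$interval"
        fi
        i=$((i + 1))
    done
}
```

With the run directory printed at submission, a call such as `poll_status <workflow directory> 300 12` would report once every five minutes for about an hour.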
Output is transferred to the base of this directory upon completion. For example, this message, printed when the workflow is submitted, tells you where the outputs are:
"An 'Output's directory will be created within the base of the workflow directory.
This directory, /home/upendra/Pegasus-GEM_example_data/runs/osg-gem-1510267501/outputs"
```
$ ls -lh /home/upendra/Pegasus-GEM_example_data/runs/osg-gem-1510267501/outputs
total 3.7M
drwxr-xr-x 2 upendra upendra    6 Nov  9 17:54 ballgown-data
-rw-r--r-- 1 upendra upendra 131K Nov  9 17:53 input1-e2t.ctab
-rw-r--r-- 1 upendra upendra 432K Nov  9 17:53 input1-e_data.ctab
-rw-r--r-- 1 upendra upendra 108K Nov  9 17:53 input1-i2t.ctab
-rw-r--r-- 1 upendra upendra 163K Nov  9 17:53 input1-i_data.ctab
-rw-r--r-- 1 upendra upendra 239K Nov  9 17:53 input1-t_data.ctab
-rw-r--r-- 1 upendra upendra  68K Nov  9 17:54 merged_GEM.tab
-rw-r--r-- 1 upendra upendra  136 Nov  9 17:54 QC_Report.tab
-rw-r--r-- 1 upendra upendra 2.6M Nov  9 17:53 sampleinput1.gtf
```
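A quick sanity check on the merged matrix is to look at its dimensions before opening it in an analysis tool. This is a minimal sketch, assuming `merged_GEM.tab` is tab-delimited with one record per line (suggested by the `.tab` extension, not confirmed here). The function name `gem_shape` is hypothetical, and the row count includes any header line.

```shell
# gem_shape FILE
# Hypothetical helper: prints the number of lines and the number of
# tab-separated fields on the first line of a delimited text file.
gem_shape() {
    awk -F '\t' 'NR == 1 { cols = NF } END { printf "rows=%d cols=%d\n", NR, cols }' "$1"
}
```

Running `gem_shape merged_GEM.tab` inside the outputs directory gives a one-line summary. With a single input such as `input1`, a small column count is expected.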
Info: From here, the user may follow our documentation to modify the software options as well as point to their own input datasets. Note that there are no test reference genome indices available for STAR, because they are too large to upload to GitHub.
Pre-Workflow User Input
The user must provide indexed reference genome files as well as gene model annotation information before submitting the workflow. The user must select a reference prefix ($REF_PREFIX) that will be recognized by Pegasus as well as by Hisat2 or Tophat2. In addition, information about splice sites or a reference transcriptome must be provided to guide accurate mapping of split input files. Once the user has downloaded a reference genome FASTA file and gene annotation in GTF/GFF3 format, the following commands can be used to produce the necessary input files, using GRCh38 as an example $REF_PREFIX for Gencode Release 24 of the human reference genome:
...