Pegasus-GEM Jetstream Workflow Tutorial - 2.2
Jetstream Workflow That Creates Gene Expression Matrices (GEMs) from SRA/FASTQ NGS Files
Rationale and background
Pegasus-GEM is a Pegasus workflow that utilizes Jetstream resources to produce a Gene Expression Matrix (GEM) from DNA sequence files in FASTQ format. It is adapted from the OSG-GEM workflow that currently runs on the Open Science Grid (OSG):
William L. Poehlman, Mats Rynge, Chris Branton, D. Balamurugan and Frank A. Feltus. OSG-GEM: Gene Expression Matrix Construction Using the Open Science Grid. Bioinformatics and Biology Insights 2016:10 133–141 doi: 10.4137/BBI.S38193.
Introduction
This workflow processes either paired-end or single-end FASTQ files to produce a matrix of normalized RNA molecule counts (FPKM). Pegasus-GEM also supports downloading input directly from NCBI SRA for processing. An indexed reference genome along with gene model annotation files must be obtained prior to configuring and running the workflow. The following tasks are directed by the Pegasus workflow manager (a minimal sketch of these stages appears after the list):
- Splitting input FASTQ files into files containing 20,000 sequences each
- Trimming raw sequences with Trimmomatic
- Aligning reads to the reference genome using Hisat2, Tophat2, or STAR
- Merging BAM alignment files into a single sorted BAM file using Samtools merge
- Quantifying RNA transcript levels using StringTie or Cufflinks
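For orientation, the sketch below shows roughly what these stages look like as manual shell commands for a single paired-end chunk. This is an illustrative approximation only, with hypothetical file names and simplified options, not the workflow's actual invocations:

# Hypothetical walk-through of the pipeline stages for one chunk (illustrative only)
# 1. Split the input FASTQ files into 20,000-sequence chunks (4 lines per sequence)
$ zcat sample_1.fastq.gz | split -l 80000 - chunk_1_
$ zcat sample_2.fastq.gz | split -l 80000 - chunk_2_
# 2. Trim one chunk with Trimmomatic in paired-end mode
$ java -jar trimmomatic.jar PE chunk_1_aa chunk_2_aa trim_1P.fq trim_1U.fq trim_2P.fq trim_2U.fq SLIDINGWINDOW:4:20 MINLEN:36
# 3. Align the trimmed reads with Hisat2 and sort the alignment with Samtools
$ hisat2 -x chr21-GRCh38 -1 trim_1P.fq -2 trim_2P.fq | samtools sort -o chunk_aa.bam -
# 4. Merge the per-chunk alignments into a single sorted BAM file
$ samtools merge merged.bam chunk_*.bam
# 5. Quantify transcript levels with StringTie
$ stringtie merged.bam -G annotation.gff3 -e -o sample.gtf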
It is suggested that the user become familiar with the documentation for the software packages listed above (Trimmomatic, Hisat2, Tophat2, STAR, Samtools, StringTie, and Cufflinks).
This tutorial will take users through the steps of:
- Running Pegasus-GEM on the Jetstream cloud
- Running Pegasus-GEM on example data
Considerations
Sounds great, what do I need to get started?
- XSEDE account
- XSEDE allocation. Users can request a startup XSEDE allocation; startup allocations are relatively easy to apply for and obtain. Please contact help@xsede.org for further information.
- Getting started with Jetstream
- Your data (or you can run example data)
What kind of data do I need?
- Mandatory requirements
- FASTQ files, either single-end or paired-end. You can also use SRR IDs
- Select the type of reads: single or paired
- Select the type of mapper (Tophat2, Hisat2, or STAR)
- Select the type of assembler (Cufflinks or Stringtie)
What kind of resources will I need for my project?
- Enough SUs (Service Units) to run your computation
- One MASTER and several WORKER instances for running your computation
- Enough storage space on the Pegasus-GEM Jetstream instance for both input and output files. Since most of the images have limited disk space, it is recommended to mount an external volume to the running Pegasus-GEM MASTER instance.
Part 1: Connect to an instance of a Pegasus-GEM 2.0 Jetstream Image
Step 1. Go to https://use.jetstream-cloud.org/application and log in with your XSEDE credentials.
Step 2. Click "Create New Project" in the Projects tab at the top, then enter the name of the project and a brief description
Step 3. Launch an instance from the selected image and name it MASTER
After the project has been created, open it, click the "New" button, select the "Pegasus-GEM" image, and then click "Launch Instance". In the next window (Basic Info),
- name the instance "MASTER" (don't worry if you forget to name the instance at this point, as you can always modify the name of the instance later)
- leave base image version as it is
- leave the project name as it is or change to a different project if needed
- select "Jetstream - Indiana University or Jetstream - TACC" as Provider (for this tutorial we will chose Jetstream - TACC) and click 'Continue'. Your choice of provider will depend on the resources you have available (SUs) and the needs of your instance
- select "m1.medium" as Instance size and click "Continue".
Check the Projected Resource Usage to confirm that you have enough resources to run the instance. If you need more SUs or CPUs to run instances, contact the Jetstream team at help@xsede.org. Review the details of the instance you are about to launch and click "Launch Instance".
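As a rough back-of-the-envelope SU estimate (assuming SUs are charged per vCPU-hour and that m1.medium provides 6 vCPUs; verify both against the current Jetstream documentation):

# instances x vCPUs per instance x hours of wall time
$ echo $((3 * 6 * 24))   # a 3-instance m1.medium pool running for 24 hours => 432 SUs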
As the instance is launched behind the scenes, you will get an update as it goes through each step.
Status updates during instance launch (for both MASTER and WORKER) include Build-requesting launch, Build-networking, Build-spawning, Active-networking, and Active-deploying. Depending on the load on Jetstream, it can take anywhere from 2-5 minutes for an instance to become active. You can force a status check using the refresh button on the Instance launch page or the refresh button in your browser. Once the instance becomes active, a virtual machine with the IP address provided will become available for you to connect to. This virtual machine will have all the necessary components to run Pegasus-GEM, plus test files to run a demo.
Step 4. Launch a WORKER instance from the "Pegasus-GEM" image
Step 5. Create and attach the volume to the MASTER instance
Since the m1.medium instance size (60GB disk space) selected for running the MASTER instance of Pegasus-GEM may not be sufficient for most GEM runs, it is recommended to run on an external volume. The following steps cover creating the volume and attaching it to the MASTER; once attached, Jetstream mounts the volume on the instance automatically (e.g., at /vol_b, which is used later in this tutorial).
5.1: Create a volume
Click the "New" button in the project and select "Create Volume". Enter the name of the volume, volume size (GB) needed and the provider (TACC or Indiana) and finally click "Create Volume"
5.2: Attach the created volume to the MASTER instance
Click the checkbox beside the volume (1), then click the Attach button (2), next choose the MASTER instance (3), and finally click the "ATTACH VOLUME TO INSTANCE" button (4)
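Once the volume is attached, you can verify from the MASTER that it is visible and mounted (this assumes Jetstream auto-mounts attached volumes, typically at /vol_b; the mount point may differ on your instance):

$ lsblk                # list block devices; the attached volume should appear, e.g. as sdb
$ df -h | grep vol     # confirm the mount point, e.g. /vol_b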
Step 6. ssh into the MASTER
6.1 Add ssh keys on your computer
$ ssh-add
# Hit enter when asked for passphrase
Enter passphrase for /Users/upendra_35/.ssh/id_rsa:
Enter passphrase for /Users/upendra_35/.ssh/id_dsa:
# Make sure that the ssh key has been added
$ ssh-add -L
# This will show the public key that you have added
6.2 ssh key forwarding to the MASTER
$ ssh -A <xsede username>@<MASTER ip.address>
# For example
$ ssh -A upendra@149.165.169.158
# Do not use this username and ip address as they will not work for you!!!!
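Once logged in to the MASTER, you can optionally confirm that agent forwarding worked; if it did, this prints the same public key you added on your local machine:

$ ssh-add -L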
Part 3: Set up a Pegasus-GEM 2.0 run using the Terminal window
Step 1. Get oriented. A workflow-specific SSH key has to be created. This key is used for some of the data staging steps of the workflow.
$ mkdir -p ~/.ssh
$ ssh-keygen -t rsa -b 2048 -f ~/.ssh/workflow
# Hit enter when asked for passphrase
$ cat ~/.ssh/workflow.pub >> ~/.ssh/authorized_keys
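As an optional sanity check (the exact staging mechanics are internal to the workflow), you can confirm the new key authenticates against the MASTER itself:

$ ssh -i ~/.ssh/workflow -o IdentitiesOnly=yes ${USER}@localhost hostname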
Step 2. Copy the example data onto the mounted volume (/vol_b), change the permissions, and navigate to that directory.
$ cd /vol_b
$ sudo cp -r /opt/Pegasus-GEM_example_data .
$ sudo chown -hR ${USER} Pegasus-GEM_example_data
$ sudo chgrp -hR ${USER} Pegasus-GEM_example_data
$ cd Pegasus-GEM_example_data
$ ls
dax.xml  LICENSE  osg-hosts  README.md  sites.xml.template  submit  Test_data  useful_files
email-notify  osg-gem.conf  pegasus.conf  reference  staging  task-files  tools  worker-launch.yml
List the contents of the Test_data folder:
$ ls Test_data
SRR4343300.fastq.gz  TEST_1.fastq.gz  TEST_2.fastq.gz
The example data contains a config file (osg-gem.conf) as well as input files from the 21st chromosome of Gencode Release 24 of the GRCh38 build of the human reference genome. Two small FASTQ files containing 200,000 sequences from NCBI dataset SRR1825962 lie within the Test_data directory of the example data folder.
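As a quick sanity check of the test inputs, you can count the sequences in one of the FASTQ files (each FASTQ record spans four lines); the result should match the sequence count described above:

$ echo $(($(zcat Test_data/TEST_1.fastq.gz | wc -l) / 4))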
Step 3. Set up the VM and password on the MASTER
$ sudo sh /opt/setup_vm.sh <MASTER ip.address> <password>
# For example
$ sudo sh /opt/setup_vm.sh 149.165.169.158 "pegasus_123"
Step 4. Add the WORKER instances to the condor pool
4.1 Create an Ansible configuration file that turns off host key checking
$ nano ~/.ansible.cfg
[defaults]
host_key_checking = False
4.2 Edit the osg-hosts file and populate it with the IP addresses of the WORKERS
$ echo "149.165.168.49" >> osg-hosts # This ip address of the WORKER is specific to my account. This will not work for you $ echo "149.165.169.142" >> osg-hosts # This ip address of the WORKER is specific to my account. This will not work for you
$ cat osg-hosts
[workers]
149.165.168.49
149.165.169.142
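Before running the playbook in the next step, you can optionally verify that Ansible can reach all of the WORKERS listed in the inventory:

$ ansible workers -i osg-hosts -u ${USER} -m ping    # each WORKER should reply with "pong"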
4.3 Run the Ansible playbook (worker-launch.yml) to add the WORKERS to the condor pool
$ ansible-playbook -u ${USER} -i osg-hosts -e "hostname=149.165.169.158 pwd=pegasus_123" worker-launch.yml

PLAY [workers] ************************************************************************************************************************************************

TASK [Gathering Facts] ****************************************************************************************************************************************
ok: [149.165.169.142]
ok: [149.165.168.49]

TASK [Execute the script] *************************************************************************************************************************************
[WARNING]: Consider using 'become', 'become_method', and 'become_user' rather than running sudo
changed: [149.165.169.142]
changed: [149.165.168.49]

PLAY RECAP ****************************************************************************************************************************************************
149.165.168.49  : ok=2  changed=1  unreachable=0  failed=0
149.165.169.142 : ok=2  changed=1  unreachable=0  failed=0
$ cat worker-launch.yml
---
- hosts: workers
  environment:
    PATH: "{{ ansible_env.PATH }}:/home/${USER}/bin:/home/${USER}/.local/bin"
  vars:
    master: "{{ hostname }}"
    password: "{{ pwd }}"
  tasks:
    - name: Execute the script
      shell: sudo sh /opt/setup_vm.sh "{{ master }}" "{{ password }}"
4.4 Run the condor_status command to see how many VMs are present in the condor pool
$ condor_status
Name                                  OpSys  Arch    State      Activity  LoadAv  Mem    ActvtyTime
slot1@js-168-49.jetstream-cloud.org   LINUX  X86_64  Unclaimed  Idle      0.350   15886  0+00:00:03
slot1@js-169-142.jetstream-cloud.org  LINUX  X86_64  Unclaimed  Idle      0.100   15886  0+00:00:03
slot1@js-169-158.jetstream-cloud.org  LINUX  X86_64  Unclaimed  Idle      0.150   15886  0+00:29:51

              Machines  Owner  Claimed  Unclaimed  Matched  Preempting  Drain
X86_64/LINUX         3      0        0          3        0           0      0
       Total         3      0        0          3        0           0      0
Step 5. Run Pegasus-GEM
5.1 Take a look at the config file first
$ cat osg-gem.conf
[reference]
reference_prefix = chr21-GRCh38

[inputs]
#
# List the inputs to process. Each line can either be a pair
# of forward and reverse files, separated by a space:
#
# input1 = forward.fastq.gz reverse.fastq.gz
#
# or a single SRR number. Example:
#
# input2 = DRR046893
#
# or a single fastq file for single end reads. Example:
#
# input3 = SRR4343300.fastq.gz

input1 = ./Test_data/TEST_1.fastq.gz ./Test_data/TEST_2.fastq.gz
#input1 = ./Test_data/SRR4343300.fastq.gz
#input2 = SRR4343300
#input2 = DRR046893

[config]
# Memory available to the jobs. This should be roughly 2X the
# size of the reference genome, rounded up whole GB
memory = 4 GB

# Reads are single end
single = False
# Reads are paired end
paired = True

# process using TopHat2
tophat2 = False
# process using Hisat2
hisat2 = True
# process using STAR
star = False

# process using Cufflinks
cufflinks = False
# process using StringTie
stringtie = True
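As an example of adapting the config, to process the bundled single-end SRA accession instead of the paired-end test files you would flip the corresponding [inputs] and [config] entries (a sketch of the edited lines only; the rest of the file stays the same). For this tutorial, keep the defaults shown above.

$ nano osg-gem.conf
[inputs]
input1 = SRR4343300

[config]
single = True
paired = False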
5.2 Run the default test data. The workflow, configured to run Hisat2 and StringTie, can then be launched by running:
$ ./submit
Adding reference files ...
chr21-GRCh38.1.bt2
chr21-GRCh38.1.ht2
chr21-GRCh38.2.bt2
chr21-GRCh38.2.ht2
chr21-GRCh38.3.bt2
chr21-GRCh38.3.ht2
chr21-GRCh38.4.bt2
chr21-GRCh38.4.ht2
chr21-GRCh38.5.ht2
chr21-GRCh38.6.ht2
chr21-GRCh38.7.ht2
chr21-GRCh38.8.ht2
chr21-GRCh38.Splice_Sites.txt
chr21-GRCh38.fa
chr21-GRCh38.rev.1.bt2
chr21-GRCh38.rev.2.bt2
chr21-GRCh38.transcriptome_data.tar.gz
Adding gff3 file ...
chr21-gencode.v24.annotation.gff3
Found input input1: ./Test_data/TEST_1.fastq.gz ./Test_data/TEST_2.fastq.gz
An 'Outputs' directory will be created within the base of the workflow directory.
This directory, /home/upendra/Pegasus-GEM_example_data/runs/osg-gem-1510267501/outputs
will have a 'merged_GEM.tab' file, an expression vector for each individual file,
and all standard output files from trimmomatic/hisat2 jobs.
2017.11.09 16:45:09.835 CST:
2017.11.09 16:45:09.841 CST: -----------------------------------------------------------------------
2017.11.09 16:45:09.846 CST: File for submitting this DAG to HTCondor : gem-0.dag.condor.sub
2017.11.09 16:45:09.851 CST: Log of DAGMan debugging messages : gem-0.dag.dagman.out
2017.11.09 16:45:09.857 CST: Log of HTCondor library output : gem-0.dag.lib.out
2017.11.09 16:45:09.862 CST: Log of HTCondor library error messages : gem-0.dag.lib.err
2017.11.09 16:45:09.867 CST: Log of the life of condor_dagman itself : gem-0.dag.dagman.log
2017.11.09 16:45:09.873 CST:
2017.11.09 16:45:09.878 CST: -no_submit given, not submitting DAG to HTCondor. You can do this with:
2017.11.09 16:45:09.888 CST: -----------------------------------------------------------------------
2017.11.09 16:45:28.857 CST: Created Pegasus database in: sqlite:////home/upendra/.pegasus/workflow.db
2017.11.09 16:45:28.863 CST: Your database is compatible with Pegasus version: 4.8.0
2017.11.09 16:45:29.296 CST: Submitting to condor gem-0.dag.condor.sub
2017.11.09 16:45:33.157 CST: Submitting job(s).
2017.11.09 16:45:33.163 CST: 1 job(s) submitted to cluster 70.
2017.11.09 16:45:33.168 CST:
2017.11.09 16:45:33.173 CST: Your workflow has been started and is running in the base directory:
2017.11.09 16:45:33.179 CST:
2017.11.09 16:45:33.184 CST: /home/upendra/Pegasus-GEM_example_data/runs/osg-gem-1510267501/workflow/osg-gem-1510267501
2017.11.09 16:45:33.189 CST:
2017.11.09 16:45:33.195 CST: *** To monitor the workflow you can run ***
2017.11.09 16:45:33.200 CST:
2017.11.09 16:45:33.205 CST: pegasus-status -l /home/upendra/Pegasus-GEM_example_data/runs/osg-gem-1510267501/workflow/osg-gem-1510267501
2017.11.09 16:45:33.211 CST:
2017.11.09 16:45:33.216 CST: *** To remove your workflow run ***
2017.11.09 16:45:33.221 CST:
2017.11.09 16:45:33.227 CST: pegasus-remove /home/upendra/Pegasus-GEM_example_data/runs/osg-gem-1510267501/workflow/osg-gem-1510267501
2017.11.09 16:45:33.232 CST:
2017.11.09 16:45:34.318 CST: Time taken to execute is 28.401 seconds
Monitoring Workflow: Pegasus provides a set of commands to monitor workflow progress. The path to the workflow files as well as commands to monitor the workflow will print to screen upon submitting the workflow. For example:
$ pegasus-status -l /home/upendra/Pegasus-GEM_example_data/runs/osg-gem-1510267501/workflow/osg-gem-1510267501
STAT IN_STATE JOB
Run 04:21 gem-0 ( /home/upendra/Pegasus-GEM_example_data/runs/osg-gem-1510267501/workflow/osg-gem-1510267501 )
Run 03:05   subdax_level-2_ID0000004
Run 01:58   hisat2_ID0000003
Run 01:58   hisat2_ID0000007
Run 01:58   hisat2_ID0000006
Summary: 5 Condor jobs total (R:5)
UNRDY READY PRE IN_Q POST DONE FAIL %DONE STATE DAGNAME
6 0 0 3 0 12 0 57.1 Running 00/00/level-2/level-2.dag
2 0 0 1 0 4 0 57.1 Running *gem-0.dag
8 0 0 4 0 16 0 57.1 TOTALS (28 jobs)
Summary: 2 DAGs total (Running:2)
Make sure the run has finished by running the same command:
$ pegasus-status -l /home/upendra/Pegasus-GEM_example_data/runs/osg-gem-1510267501/workflow/osg-gem-1510267501
(no matching jobs found in Condor Q)
UNRDY READY   PRE  IN_Q  POST  DONE  FAIL %DONE STATE   DAGNAME
    0     0     0     0     0    21     0 100.0 Success 00/00/level-2/level-2.dag
    0     0     0     0     0     7     0 100.0 Success *gem-0.dag
    0     0     0     0     0    28     0 100.0 TOTALS (28 jobs)
Summary: 2 DAGs total (Success:2)
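If pegasus-status ever reports failed jobs instead, the pegasus-analyzer command (part of the same Pegasus toolset) summarizes which jobs failed and why, pointing you at the relevant stdout/stderr files:

$ pegasus-analyzer /home/upendra/Pegasus-GEM_example_data/runs/osg-gem-1510267501/workflow/osg-gem-1510267501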
You can look at the statistics of the run using the pegasus-statistics command:
$ pegasus-statistics /home/upendra/Pegasus-GEM_example_data/runs/osg-gem-1510267501/workflow/osg-gem-1510267501
------------------------------------------------------------------------------
Type           Succeeded  Failed  Incomplete  Total  Retries  Total+Retries
Tasks                 18       0           0     18        1             19
Jobs                  27       0           0     27        1             28
Sub-Workflows          1       0           0      1        0              1
------------------------------------------------------------------------------
Workflow wall time                                       : 9 mins, 8 secs
Cumulative job wall time                                 : 7 mins, 18 secs
Cumulative job wall time as seen from submit side        : 22 mins, 37 secs
Cumulative job badput wall time                          : 7 secs
Cumulative job badput wall time as seen from submit side : 56 secs
Output will be transferred to the base of the run directory upon completion. For example, this message from the submit output tells you where the outputs are:
"An 'Output's directory will be created within the base of the workflow directory.
This directory, /home/upendra/Pegasus-GEM_example_data/runs/osg-gem-1510267501/outputs
$ ls -lh /home/upendra/Pegasus-GEM_example_data/runs/osg-gem-1510267501/outputs
total 3.7M
drwxr-xr-x 2 upendra upendra    6 Nov  9 17:54 ballgown-data
-rw-r--r-- 1 upendra upendra 131K Nov  9 17:53 input1-e2t.ctab
-rw-r--r-- 1 upendra upendra 432K Nov  9 17:53 input1-e_data.ctab
-rw-r--r-- 1 upendra upendra 108K Nov  9 17:53 input1-i2t.ctab
-rw-r--r-- 1 upendra upendra 163K Nov  9 17:53 input1-i_data.ctab
-rw-r--r-- 1 upendra upendra 239K Nov  9 17:53 input1-t_data.ctab
-rw-r--r-- 1 upendra upendra  68K Nov  9 17:54 merged_GEM.tab
-rw-r--r-- 1 upendra upendra  136 Nov  9 17:54 QC_Report.tab
-rw-r--r-- 1 upendra upendra 2.6M Nov  9 17:53 sampleinput1.gtf
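You can take a quick look at the resulting matrix; per the workflow description above, merged_GEM.tab is a tab-delimited table of normalized expression values (FPKM), with one row per gene and a column of values per input:

$ head -n 5 /home/upendra/Pegasus-GEM_example_data/runs/osg-gem-1510267501/outputs/merged_GEM.tab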
From here, the user may follow our documentation to modify the software options as well as point to their own input datasets. Note that there are no test reference genome indices available for STAR, because they are too large to upload to GitHub.
Pre-Workflow User Input
The user must provide indexed reference genome files as well as gene model annotation information prior to submitting the workflow. The user must select a reference prefix ($REF_PREFIX) that will be recognized by Pegasus as well as by Hisat2 or Tophat2. In addition, information about splice sites or a reference transcriptome must be provided in order to guide accurate mapping of the split input files. Once the user has downloaded a reference genome FASTA file and gene annotation in GTF/GFF3 format, the following commands can be used to produce the necessary input files, using GRCh38 as an example $REF_PREFIX for Gencode Release 24 of the human reference genome:
Hisat2:
Index the reference genome
$ cd reference
$ hisat2-build -f GRCh38.fa GRCh38
Generate a tab-delimited list of splice sites, using the gene model GTF file as input (requires Python's collections.defaultdict module):
$ python hisat2_extract_splice_sites.py GRCh38-gencode.v24.annotation.gtf > GRCh38.Splice_Sites.txt
Tophat2:
Index the reference genome
$ cd reference
$ bowtie2-build -f GRCh38.fa GRCh38
Generate and Index Reference Transcriptome
$ tophat2 -G GRCh38.gencode.v24.annotation.gff3 --transcriptome-index=transcriptome_data/GRCh38 GRCh38
$ tar czf GRCh38.transcriptome_data.tar.gz transcriptome_data/
STAR:
$ cd reference/star_index
$ STAR-2.5.2b/bin/Linux_x86_64_static/STAR --runMode genomeGenerate --runThreadN 4 --genomeDir ./ --genomeFastaFiles ../chr21-GRCh38.fa --sjdbGTFfile ../chr21-gencode.v24.annotation.gff3
Note
Please contact upendra@cyverse.org if you need help determining the optimal amount of resources for running Pegasus-GEM.