Monday Aug 13

Opening Session (9:00-10:30)

  1. Sponsors and programmatic announcements 9:00-9:20
  2. Overview of workshop. Present common resources (wiki, schedule, links) 9:20-9:30
  3. Instructor and participant round-robin introductions 9:30-10:00 (30m)
  4. Introductory material 10:00-10:30
    1. Cyberinfrastructure basics
    2. Access patterns for bioinformatic analysis
    3. NGS file and data types
  5. Coffee Break 10:30-10:40 (10m)

Introductory Exercises (10:40-12:00)

The following are designed to get you familiar with the three main computing platforms you will be using in this week's workshop. We will start with the iPlant Discovery Environment as a bioinformatics workbench, transition to iPlant Atmosphere which supports interactive on-demand computing, and end with the OSU High Performance Computing Cluster "Pistol Pete" for command-line operations.

iPlant Discovery Environment (10:40-11:10)

The Discovery Environment (DE) is one of the ways users can interact with iPlant cyberinfrastructure. Rather than managing computing resource details, or learning new software for every type of analysis, the DE allows you to handle all aspects of your bioinformatics workflow (e.g., data management, analysis, sharing large datasets, etc.) in one space.

Log into the iPlant DE at http://de.iplantcollaborative.org/de using your iPlant username and password (also known as credentials). Follow along with the instructor as we learn about the Data, Analyses, and Apps windows.

Import data set SRX008324 from the NCBI SRA

SRX008324 is a metagenomic sample of the phyllosphere microbiota of soybean leaves, sequenced using a 454 machine.

  1. Open a new browser window and go to http://www.ncbi.nlm.nih.gov/sra/SRX008324
    1. Right-click on the link '558 Mb' and copy the link
      1. This copies the FTP link for the SRA file to your clipboard. Depending on your browser, the menu item may be labeled "Copy link address" or "Copy link location"
      2. ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX008/SRX008324
  2. In the Discovery Environment, find NCBI SRA Import in the Apps catalog. Click on it to select it, then click again to open an instance of this App.
  3. Paste the URL from your clipboard into the text field of the App (use the key combination ctrl-v to paste) and click 'Launch Analysis'
  4. Follow the progress of your import task in the Analyses window. When it's complete, move on to the next steps.
  5. Convert the SRA files from SRX008324 to SFF format
    1. Search the Apps catalog for 'NCBI SRA Toolkit sff-dump' and click on it to launch it.
    2. Click on the file selection dialog and navigate through the menu (click on the "+" symbols) to the 'analyses' directory of your home folder, where you will find a folder named after the NCBI SRA Import job you just ran. Inside here is a folder 'SRA-Import' and inside it are two additional folders, one for each sample SRR023845 and SRR023846. Navigate to the inside of the SRR023845 folder/sample, and select (highlight) the SRR023845.sra file.
    3. Do not change any options and click 'Launch Analysis'. Name your job "sra_SRR023845".
    4. Repeat the previous three steps for the other sample, naming your job "sra_SRR023846". You now have two jobs running in the iPlant system at once.
  6. As your jobs complete, you will be notified by email, or you can follow their progress in the Analyses window. The converted *.sff files will show up there.
  7. Again in the Discovery Environment, click the 'Apps' icon and search for the SFF2Fastq tool. Use SFF2Fastq to convert the *.sff files (created by 'NCBI SRA Toolkit sff-dump') to FASTQ files. Name your jobs "fastq_SRR023845" and "fastq_SRR023846", respectively. The converted *.fastq files will again show up in the Analyses window.

Congratulations! You have used the iPlant Discovery Environment to download a large data archive from NCBI and convert it to two important NGS formats (*.sff and *.fastq).

iPlant Atmosphere (11:10-11:30)

Atmosphere, the iPlant Collaborative's cloud infrastructure service platform, addresses the plant sciences research community's growing need for highly configurable, cloud-enabled computational resources.

  1. Using your browser, log into the iPlant Atmosphere control panel using your iPlant credentials
    1. https://atmo-beta.iplantcollaborative.org/login/
  2. Follow along with the instructor as we learn about launching instances, terminating instances, the difference between 'Apps' and hand-launched VMs, accessing your new system via VNC, and more
  3. Launch an instance of the 'Entangled Genomes' machine by clicking on it, so we can log into it later. Don't worry about the VNC Viewer or SSHing into these systems yet. We'll walk you through this by demonstration.

Links explaining some cloud-based systems (for your reference)

  1. http://aws.amazon.com/ec2/
  2. http://en.wikipedia.org/wiki/Virtual_machine

OSU HPC 'Pistol Pete' (11:30-12:00)

The High Performance Computing Center (HPCC) provides supercomputing services and computational science expertise that enables faculty and students to conduct a wide range of focused research, development, and test activities. The center puts advanced technology in the hands of the academic population more quickly, less expensively, and with greater certainty of success.

  1. Workshop tutorial details: http://hpcwiki.it.okstate.edu/index.php/Workshop_HPC_tutorial
    1. Logging in and changing your password (screen share: https://join.me/335-209-508)
    2. Copying a file
    3. Using the interactive queue
    4. Submitting jobs
    5. Editing a file with nano
  2. Use wget to copy some class data from a remote URL to your present UNIX directory
    wget "http://data.iplantcollaborative.org/quickshare/1ef94adab33e8c0a/SRR041654.fasta.bz2"
  3. Now, check the size of the file you downloaded; it should be 3.2 megabytes
    ls -alth SRR041654.fasta.bz2
    -rw------- 1 vaughn G-802821 3.2M Aug  9 10:50 SRR041654.fasta.bz2
    

    Later, if you want information about the "ls" command, you may type "ls --help" at the command prompt (do not use the quotes)

  4. Now, uncompress the data file you just downloaded using bunzip2 and check out the size.
    bunzip2 SRR041654.fasta.bz2
    ls -alth SRR041654.fasta
    -rw------- 1 vaughn G-802821 15M Aug  9 10:50 SRR041654.fasta
    

Note that uncompressed, this sequence read file is 15 megabytes (almost 5x larger). This is why you should get cozy with compression!
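
For example, you could re-compress the file yourself and compare the two sizes (bzip2's -k flag keeps the original file):
    bzip2 -k SRR041654.fasta
    ls -alth SRR041654.fasta SRR041654.fasta.bz2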

Lunch (12:00-1:00)

Synchronizing multiple analysis platforms via iPlant Data Store (1:00-1:30)

iPlant offers a large, persistent, universally accessible file system called the iPlant Data Store. In addition to simple storage, it offers powerful tools for high-speed data transfer, querying, backup, and synchronization. We will demonstrate a simple use case for the iPlant Data Store in the context of the OSU HPC system 'Pistol Pete'. You will need to move data to and from iPlant several times during the workshop using the skills you learn here.

  1. Log into Pistol Pete via SSH
  2. Just this time: Create a new directory on Pistol Pete and create a file inside it.
    mkdir ~/iplant-home
    touch ~/iplant-home/README.txt
    echo 'Help! My genome is entangled!' >> ~/iplant-home/README.txt
    
  3. At the command prompt type:
    module load bio_apps
    This puts the icommands binaries in your path on Pistol Pete 
  4. Initialize icommands by typing "iinit" at the command prompt (again, do not use the quotes).
  5. You will now be asked to input your iPlant credentials. You have just connected to iPlant iRODS and the Pistol Pete HPC systems simultaneously!
  6. List your iPlant iRODS home directory using "ils"
  7. List your Pistol Pete HPC home directory using "ls iplant-home" and note that they are not the same
  8. Put a file into the iPlant Data Store using iput
    iput -frPVT SRR041654.fasta
  9. Get a file from the iPlant Data Store using iget, storing it in your local iplant-home directory on Pistol Pete
    iget -frPVT /iplant/home/shared/osu-entangled-genomes/SRR041654.fasta iplant-home
  10. Synchronize your HPC and iPlant Data Store directories using irsync.
    irsync -r i:/iplant/home/<IPLANTUSERNAME> /home/<PISTOLPETEUSERNAME>/iplant-home/
  11. List your HPC iplant-home directory and note the changes (it has gained a few files)
  12. Open a new browser window (or tab) to log into the iPlant Discovery Environment and note the new files present there. Read the README.txt file.
  13. Homework: Check out irm, imv, imkdir, and other commands in the iRODS documentation (a short sketch follows the links below).

Links

  1. Background Slides for iPlant Data Store (PDF 3.7 MB)
  2. http://www.iplantcollaborative.org/discover/data-store
  3. https://pods.iplantcollaborative.org/wiki/display/start/Storing+Your+Data+with+iPlant+and+Accessing+that+Data
  4. https://www.irods.org/index.php/Documentation
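
As a head start on the homework, here is a rough sketch of a few of those commands (the collection name 'workshop-backup' is just an example; see the iRODS documentation for details):
    imkdir workshop-backup                               # create a new collection (directory) in your iRODS home
    imv SRR041654.fasta workshop-backup/SRR041654.fasta  # move a file within the Data Store
    ils workshop-backup                                   # list the contents of the new collection
    imv workshop-backup/SRR041654.fasta SRR041654.fasta  # move it back
    irm -r workshop-backup                                # remove the (now empty) collection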

Quality analysis of an NGS data set (1:30-2:00)

We will use a tool called FastQC to generate a report on the quality, uniqueness, etc. of the NGS reads from the previously imported next gen sequence data files.

  1. Log into the iPlant DE at http://de.iplantcollaborative.org/de
  2. Find the FastQC application and launch a job to analyse each of the FASTQ files from the previous DE exercise.
  3. Discuss significance of the FastQC report files and images
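
For reference, FastQC can also be run from the command line on systems where it is installed. A minimal sketch (whether the bio_apps module on Pistol Pete provides the fastqc command is an assumption):
    module load bio_apps                      # assumption: fastqc may be provided by this module
    fastqc SRR023845.fastq SRR023846.fastq    # writes an HTML report (plus a .zip archive) per input file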

Links

  1. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Processing, Trimming, and so on of your NGS data set (2:00-2:45)

This could be done in the iPlant DE, but we are going to show you how to do it from the command line. The applications have the same names and work the same way, but the UI is simply different.  

  1. Log into Pistol Pete and copy FASTQ files from shared directory
    cp /share/apps/bioworkshop2012/SRR023845.fastq .
    cp /share/apps/bioworkshop2012/SRR023846.fastq .
    
  1. Start an interactive job session and load the path variables
  2. qsub -I -q workshop -l walltime=1:00:00
    module load bio_apps
    cd iplant-home
  3. Trim the sequences to 300 bp, based on the FastQC reports
    fastx_trimmer -i SRR023845.fastq -o SRR023845_trim.fastq -f 1 -l 300 -Q33
  4. Filter reads with average scores < Q20
    fastq_quality_filter -i SRR023845_trim.fastq -o SRR023845_trim_qual.fastq -q 20 -p 50 -Q33 -v
  5. Do both tasks in a single step using UNIX pipes
    fastx_trimmer -i SRR023845.fastq -f 1 -l 300 -Q33 | fastq_quality_filter -o SRR023845_trim_qual.fastq -q 20 -p 50 -Q33 -v
    
    Quality cut-off: 20
    Minimum percentage: 50
    Input: 566540 reads.
    Output: 552136 reads.
    discarded 14404 (2%) low-quality reads.
    fastx_trimmer -i SRR023846.fastq -f 1 -l 300 -Q33 | fastq_quality_filter -o SRR023846_trim_qual.fastq -q 20 -p 50 -Q33 -v
    
    Quality cut-off: 20
    Minimum percentage: 50
    Input: 543285 reads.
    Output: 535510 reads.
    discarded 7775 (1%) low-quality reads.
  6. Convert to FASTA
    fastq_to_fasta -i SRR023845_trim_qual.fastq -o SRR023845_trim_qual.fasta -Q33
    fastq_to_fasta -i SRR023846_trim_qual.fastq -o SRR023846_trim_qual.fasta -Q33
  7. Concatenate the two FASTQ files together. This creates one large file from two (or more) files. 
    cat SRR023846_trim_qual.fastq SRR023845_trim_qual.fastq > SRX008324.fastq
  8. Your turn: Concatenate the two FASTA files together
  9. Your turn: Upload the concatenated FASTQ and FASTA files into your iPlant Data Store using iput
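
If you get stuck on the last two steps, here is one possible solution (mirroring the cat and iput commands used earlier):
    cat SRR023846_trim_qual.fasta SRR023845_trim_qual.fasta > SRX008324.fasta
    iput -frPVT SRX008324.fastq
    iput -frPVT SRX008324.fasta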

Links

  1. http://www.december.com/unix/tutor/pipesfilters.html
  2. http://hannonlab.cshl.edu/fastx_toolkit/commandline.html

Optional: take a moment to look inside the files you just created. You could:

a. Download the .fasta files to your desktop and view them in BioEdit (checking the number of lines per file and making sure ALL lines were transferred).
b. Peek at the beginning and end of each file from the command line, for example:
    head SRR023845_trim_qual.fasta
    tail SRR023846_trim_qual.fasta
    head SRX008324.fasta
    tail SRX008324.fasta

Break (2:45-3:00)

Automating a workflow using scripting (3:00-4:15)

To run a pre-determined sequence of events (which computers are pretty good at), you can simply write them into a specially-formatted text file and tell the computer to run it. The following examples, which we will do interactively, will illustrate the basics. In this case, we will use the UNIX shell called 'bash' (but there are several others). Later you can check them out online and decide which one you will use in the future. 

The simplest possible example

Let's start by simply recapitulating the actions from the command line for one of your FASTQ files.

  1. Open up a text editor by typing 'nano -w' and type in the text from the box below named 'script01.0'
    1. GNU nano home page
    2. Nano editor tutorials
    3. OSU HPC wiki page on nano: http://hpcwiki.it.okstate.edu/index.php/Workshop_HPC_tutorial#Editing_a_file_using_nano
  2. Save to a file called script01.sh.
  3. Make the script executable with the following command
    chmod a+x script01.sh
    
  4. Now, execute the script as follows:
    ./script01.sh
    
  5. It will result in a new file in your directory, 'SRR023845.scripted.fastq' (a quick check is sketched after the box below).
script01.0
#!/bin/bash

fastx_trimmer -i SRR023845.fastq -f 1 -l 300 -Q33 | \
fastq_quality_filter -o SRR023845.scripted.fastq -q 20 -p 50 -Q33 -v
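
To confirm the script worked, you can check the size and line count of the new file (each FASTQ record is four lines, so the count should be a multiple of four):
    ls -lh SRR023845.scripted.fastq
    wc -l SRR023845.scripted.fastq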

Adding some flexibility

In the first example, the filename is hard-coded into the script. This is not very useful. So, let's use a variable and parameter to add some flexibility.

  1. Edit your file script01.sh so that it looks like the code in box 'script01.1'. We will now introduce a variable, "$INFILE", into the script. Variables add flexibility because they allow us to pass any parameter we want to the same script instructions.
    script01.1
    #!/bin/bash
    
    INFILE=$1 #<== Assign the first parameter passed to the script to INFILE
    
    fastx_trimmer -i ${INFILE} -f 1 -l 300 -Q33 | \
    fastq_quality_filter -o "trimmed_filtered_${INFILE}" -q 20 -p 50 -Q33 -v
    
    Now, how do we run this new script? Note that I am now passing a parameter (in this case: "SRR023845.fastq") to the script!
    ./script01.sh SRR023845.fastq
  2. The parameter is passed to the variable. The result will be a file called 'trimmed_filtered_SRR023845.fastq'. Nifty, huh?
  3. Questions (one possible answer is sketched below): How would you change script01.1 to
    1. Make the output filename user-specified?
    2. Accept a different minimum quality score?
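
One possible answer, sketched below; the extra positional parameters ($2 and $3) and their names are illustrative:
script01.1 (possible answer)
#!/bin/bash

INFILE=$1    #<== input FASTQ file (first parameter)
OUTFILE=$2   #<== user-specified output filename (second parameter)
MINQUAL=$3   #<== minimum quality score (third parameter)

fastx_trimmer -i ${INFILE} -f 1 -l 300 -Q33 | \
fastq_quality_filter -o ${OUTFILE} -q ${MINQUAL} -p 50 -Q33 -v

You would then run it as, for example: ./script01.sh SRR023845.fastq SRR023845_clean.fastq 25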

Iteration using a loop

Let's say you have a folder full of files you need to process. How can we modify our script to iterate over all of them?

  1. As shown below, copy the following to your working directory, and decompress it. The result will be a directory 'sequences' full of files
    wget "http://data.iplantcollaborative.org/quickshare/77629afae36145be/data.bz2"
    tar -xjvf data.bz2
    
  2. Edit script01.sh to match script01.2, save it as script01.2.sh, and make it executable (chmod a+x script01.2.sh).
  3. Run the script ./script01.2.sh and note the progress reporting.
  4. When it's done, list the sequences directory (ls sequences) and note the new files (a one-line check is sketched after the box below)
script01.2
#!/bin/bash

for FILE in sequences/*.fq.*
do
    echo "Working on $FILE"
    fastx_trimmer -i ${FILE} -f 1 -l 300 -Q33 | \
    fastq_quality_filter -o "${FILE}.trimmed_filtered.fq" -q 20 -p 50 -Q33 -v
done
echo "Finished!"

Submitting a scripted workflow as a batch computing job

Running in interactive mode is useful, but ultimately you will want to be able to fire off tasks on lots of computers. Luckily, this is pretty easy! If you have a script that runs a set of steps, here's an example of how to make it batch-computing aware.

  1. Create a new file called 'batch.sh' and edit it to contain the text in the batch01.0 box
  2. Now, submit it to the job queue on Pistol Pete, which is managed by the software packages Torque and Maui
    qsub batch.sh
  3. Now, watch the status of your job using qstat (a sketch follows the batch01.0 box below). How long did it take to run?
batch01.0
#!/bin/bash


#PBS -V
#PBS -N jobname
#PBS -q workshop 
#PBS -l walltime=1:00:00
#PBS -l nodes=1:ppn=1
#PBS -j oe

source /etc/profile.d/env-modules.sh
module load bio_apps

cd $PBS_O_WORKDIR
# Note: instead of the two commands below, you could also call an existing script here, e.g. ./script01.sh SRR023845.fastq

fastx_trimmer -i SRR023845.fastq -f 1 -l 300 -Q33 | \
fastq_quality_filter -o SRR023845.scripted_pbs.fastq -q 20 -p 50 -Q33 -v
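
A couple of ways to watch the job from step 3 (a sketch; output formatting may differ slightly on Pistol Pete):
    qstat -u $USER        # list your jobs and their states (Q = queued, R = running, C = complete)
    qstat -f <job_id>     # full details for a single job, including the resources it used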

Links

  1. http://en.wikipedia.org/wiki/Unix_shell
  2. http://tldp.org/HOWTO/Bash-Prog-Intro-HOWTO.html

Automating a workflow using the Discovery Environment (4:15-5:00)

  1. Log into the iPlant Discovery Environment
  2. Follow along with the instructor

Links

  1. Building Automated Workflows in the iPlant Discovery Environment

Wrap-up 5:00-