UARK RNA-Seq Tutorial Copy

UARK RNA-Seq Tutorial Copy

RNA-Seq using the Discovery Environment

Rationale and background:

RNA-seq refers to whole transcriptome shotgun sequencing of cDNA, generally using an ultra-high-throughput ("next-generation") sequencing technology. The value of RNA-seq is the ability generate deep-coverage information about a sample's RNA. This can be used for a variety of purposes such as transcriptome assembly, gene discovery/annotation, and detecting differential transcript abundances between tissues, developmental stages, genetic backgrounds, and environmental conditions. In this example we will compare gene transcript abundance between wild type and mutant Arabidopisis strains.

LONG HYPOCOTYL 5 (HY5) is a basic leucine zipper transcription factor (TF) that functions downstream of multiple families of photoreceptors. Mutations in the HY5 gene cause a myriad of aberrant phenotypes in Arabidopsis, including elongated hypocotyl, reduced accumulation of pigments, halted chloroplast development in greening hypocotyls, altered root morphology and defective hormonal and stimulus responses. HY5 thus acts as an integrator that links various gene networks to coordinate plant development. 
We will use RNAseq to compare expression levels for genes between WT and hy5- samples and to identify new transcripts or isoforms. In this tutorial, we will use data stored at the NCBI Sequence Read Archive to simulate us having done the lab work ourselves.

  1. Align the data to the Arabidopsis genome using TopHat

  2. Identify differentially-expressed genes using CuffDiff

  3. Use tools in the Discovery Environment to explore the differential gene expression data.

If you do not have an account, please see one of the on-site iPlant staff for a temporary account.

Specific Objectives

By the end of this module, you should

  1. Be more familiar with the DE user interface

  2. Understand the starting data for RNA-seq analysis

  3. Be able to align short sequence reads with a reference genome in the DE

  4. Be able to analyze differential gene expression in the DE and Atmosphere

Note on Staged Data:

Several of the methods in this tutorial can take 1 to 2 hrs to complete on a full-sized data set. So that you can complete the tutorial in the allotted time, we have pre-staged input and output files in the 'Community Data' folder for each step. You can start your analyses then skip to the next step using pre-staged data. Also, we have filtered the original read data (downloaded from NCBI's Sequence Read Archive) to generate fastq files containing pseudo-replicates derived from a random subset of each library.

Original data from NCBI Sequence Read Archive study SRP003951 (all are derived from Illumina singe-end reads):

  • SRR70570 WT RNA-seq replicate 1

  • SRR70571 WT RNA-seq replicate 2

  • SRR70572 hy5 RNA-seq replicate 1

  • SRR70573 hy5 RNA-seq replicate 2

The Staged Fastq Data can be found in the

'Community Data'->iplant_training->intro_rna-seq->01_input_data

Section 1: Align RNA-seq reads using Tophat

Tophat is a specialized alignment software for RNA-seq reads that is aware of splice junctions when aligning to a reference assembly.  It can be found by clicking on the Apps icon. The most recent App for Tophat will accept multiple fastq files.

a) Navigate to the Apps window by clicking on the icon. Look for TopHat2-SE (TopHat2 for Illumina paired-end reads) in the 'Public Applications' category under NGS->transcriptome_profiling->Tuxedo RNA-Seq 2. Click on the name of the App to open it.

b) In the Input Data section, you can click 'Add' at the top right of the FASTQ File(s) box to navigate to the folder containing the FASTQ files:

'Community Data'->iplant_training->intro_rna-seq->01_input_data

Note: For multiple input files, it is convenient to open a separate data window, navigate to the above folder and drag all of files into the input data box. Each of the input files will be processed independently. For convenience, a batch of FASTQ files can be analyzed together but these files can also be processed concurrently in independent TopHat runs.

c) In the Ref. Genome section, Use the drop-down menu to select "Arabidopsis thaliana (Ensembl 14)" for the reference genome. This is the TAIR10 release. 

d) In the Ref. Annotations section, use the drop-down menu to select "Arabidopsis thaliana (Ensembl 14)" for the reference gene annotations.

e) Start the analysis by clicking the Launch Analysis button, naming the analysis 'tophat' in the dialog box. 
Reasonable default options are provided for the analysis settings.

f) Tophat will require some time to complete its work on each sample. Click on the Analyses icon to view the status of your submitted analysis.

g) When the analysis is complete, navigate to the results under your 'analyses' folder. The principle outputs from TopHat are BAM files, each set of alignments is in its own folder, corresponding to the name of the original Fastq file. This is the most time-consuming step of this training module. If your analysis has not completed in time, you can skip ahead by using the pre-computed results:

'Community Data'->iplant_training->intro_rna-seq->02_tophat

In the tophat_out folder, you should see these folders:

Section 2: Assemble transcripts with Cufflinks

Cufflinks assembles RNA-Seq alignments into a parsimonious set of transcripts, then estimates the relative abundances of these transcripts based on how many reads support each one, taking into account biases in library preparation protocols. A detailed manual can be found at http://cole-trapnell-lab.github.io/cufflinks/manual/. We will be using a reference annotation file for '''Arabidopsis thaliana''' to perform Reference annotation based assembly (RABT), which facilitates discovery of known and novel transcripts. For further discussion, see http://cole-trapnell-lab.github.io/cufflinks/papers/

a) Click on the Apps icon and find Cufflinks2. Open it.

b) In the 'Select Input Data' section, open a separate data window for the 'bam' folder from tophat_out (above) and select and drag all four bam files into the input box. For convenience, a batch of TopHat bam files can be analyzed together but these files can also be processed concurrently in independent Cufflinks runs.

c) In the Reference Sequence section, select Arabidopsis thaliana from the pull-down menu.

d) Run Cufflinks. When it is complete, you will see the following outputs in cufflinks_out

The outputs of cufflinks are GTF and FPKM files. For example, look in one of the four output folders:

Examine the GTF file, transcripts.gtf. This file contains annotated transcripts assembled by cufflinks, using the annotated transcripts selected from the pulldown menu in the Cufflinks app as a guide.

1 Cufflinks transcript 3631 5899 1 + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000"; 1 Cufflinks exon 3631 3913 1 + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000"; 1 Cufflinks exon 3996 4276 1 + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "2"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000"; 1 Cufflinks exon 4486 4605 1 + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "3"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000"; 1 Cufflinks exon 4706 5095 1 + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "4"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000"; 1 Cufflinks exon 5174 5326 1 + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "5"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000"; 1 Cufflinks exon 5439 5899 1 + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "6"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";

Examine the FPKM file, genes.fpkm_tracking. This files gives the normalized abundance of each transcript, expressed in fragments per kilobase of exon per million fragments mapped (FPKM) and their upper and lower 95% confidence interval.

tracking_id class_code nearest_ref_id gene_id gene_short_name tss_id locus length coverage FPKM FPKM_conf_lo FPKM_conf_hi FPKM_status AT1G01010 - - AT1G01010 ANAC001 - 1:3630-5899 - - 0 0 0 OK AT1G01020 - - AT1G01020 ARV1 - 1:5927-8737 - - 6.22574 1.51952 10.9319 OK AT1G01030 - - AT1G01030 NGA3 - 1:11648-13714 - - 0.921783 0 2.22538 OK AT1G01040 - - AT1G01040 DCL1 - 1:23145-31227 - - 0.129701 0 0.389104 OK AT1G01050 - - AT1G01050 AtPPa1 - 1:31169-33153 - - 6.08883 1.11732 11.0603 OK AT1G01046 - - AT1G01046 MIR838A - 1:28499-28706 - - 0 0 0 OK AT1G01070 - - AT1G01070 AT1G01070 - 1:38751-40944 - - 1.92728 0 4.75919 OK AT1G01073 - - AT1G01073 AT1G01073 - 1:44676-44787 - - 0 0 0 OK

Section 3: Merge all Cufflinks transcripts into a single transcriptome annotation file using Cuffmerge

Cuffmerge merges together several Cufflinks assemblies. It handles also handles running Cuffcompare. The main purpose of this application is to make it easier to make an assembly GTF file suitable for use with Cuffdiff. A merged, empirical annotation file will be more accurate than using the standard reference annotation, as the expression of rare or novel genes and alternative splicing isoforms seen in this experiment will be better reflected in the empirical transcriptome assemblies. For an informative discussion on this topic, see http://seqanswers.com/forums/showthread.php?t=16422.

a) Open the Cuffmerge2 app. Under 'Input Data', browse to the results of the cufflinks analyses (cufflinks_out). Select the four GTF files in cufflinks_out/gtf as input for Cuffmerge2.

b) Under Reference Data, select the Arabidopsis genome annotation and genome sequence using the pull-down menus.

c) Run the App. When it is complete, you should see the following outputs under cuffmerge_out:

Section 4: Compare expression analysis using CuffDiff

Cuffdiff is a program that uses the Cufflinks transcript quantification engine to calculate gene and transcript expression levels in more than one condition and test them for significant differences. Comprehensive documentation of this tool is given through the Nature Protocols paper linked at http://cole-trapnell-lab.github.io/cufflinks/cuffdiff/index.html.

a) Open a separate data window for the bam files generated by tophat (above).

b) In Apps select CuffDiff2

c) Enter the name of your first Sample (WT).

d) Add BAM files all replicates from this sample, either from your tophat output directory or the pre-staged data. Ignore the BAM index (.bai) files. Cuffdiff merges the replicate data and uses comparison between replicates as part of calculating the test statistic.

d) Repeat steps (b) and (c) for the second sample (hy5)

e) Next, open the Reference Data section and add a custom reference annotation file using the "merged_with_ref_ids.gtf" file from the cuffmerge output folder.

f) Launch the analysis

g) Examine the output files.

The folder cuffdiff_out contains the output files from cuffdiff.