UARK RNA-Seq Tutorial Copy

UARK RNA-Seq Tutorial Copy

RNA-Seq using the Discovery Environment

Rationale and background:

RNA-seq refers to whole transcriptome shotgun sequencing of cDNA, generally using an ultra-high-throughput ("next-generation") sequencing technology. The value of RNA-seq is the ability generate deep-coverage information about a sample's RNA. This can be used for a variety of purposes such as transcriptome assembly, gene discovery/annotation, and detecting differential transcript abundances between tissues, developmental stages, genetic backgrounds, and environmental conditions. In this example we will compare gene transcript abundance between wild type and mutant Arabidopisis strains.

LONG HYPOCOTYL 5 (HY5) is a basic leucine zipper transcription factor (TF) that functions downstream of multiple families of photoreceptors. Mutations in the HY5 gene cause a myriad of aberrant phenotypes in Arabidopsis, including elongated hypocotyl, reduced accumulation of pigments, halted chloroplast development in greening hypocotyls, altered root morphology and defective hormonal and stimulus responses. HY5 thus acts as an integrator that links various gene networks to coordinate plant development. 
We will use RNAseq to compare expression levels for genes between WT and hy5- samples and to identify new transcripts or isoforms. In this tutorial, we will use data stored at the NCBI Sequence Read Archive to simulate us having done the lab work ourselves.

  1. Align the data to the Arabidopsis genome using TopHat
  2. Identify differentially-expressed genes using CuffDiff
  3. Use tools in the Discovery Environment to explore the differential gene expression data.

If you do not have an account, please see one of the on-site iPlant staff for a temporary account.

Specific Objectives

By the end of this module, you should

  1. Be more familiar with the DE user interface
  2. Understand the starting data for RNA-seq analysis
  3. Be able to align short sequence reads with a reference genome in the DE
  4. Be able to analyze differential gene expression in the DE and Atmosphere

Note on Staged Data:

Several of the methods in this tutorial can take 1 to 2 hrs to complete on a full-sized data set. So that you can complete the tutorial in the allotted time, we have pre-staged input and output files in the 'Community Data' folder for each step. You can start your analyses then skip to the next step using pre-staged data. Also, we have filtered the original read data (downloaded from NCBI's Sequence Read Archive) to generate fastq files containing pseudo-replicates derived from a random subset of each library.

Original data from NCBI Sequence Read Archive study SRP003951 (all are derived from Illumina singe-end reads):

  • SRR70570 WT RNA-seq replicate 1
  • SRR70571 WT RNA-seq replicate 2
  • SRR70572 hy5 RNA-seq replicate 1
  • SRR70573 hy5 RNA-seq replicate 2

The Staged Fastq Data can be found in the

'Community Data'->iplant_training->intro_rna-seq->01_input_data

Section 1: Align RNA-seq reads using Tophat

Tophat is a specialized alignment software for RNA-seq reads that is aware of splice junctions when aligning to a reference assembly.  It can be found by clicking on the Apps icon. The most recent App for Tophat will accept multiple fastq files.

a) Navigate to the Apps window by clicking on the icon. Look for TopHat2-SE (TopHat2 for Illumina paired-end reads) in the 'Public Applications' category under NGS->transcriptome_profiling->Tuxedo RNA-Seq 2. Click on the name of the App to open it.

b) In the Input Data section, you can click 'Add' at the top right of the FASTQ File(s) box to navigate to the folder containing the FASTQ files:

'Community Data'->iplant_training->intro_rna-seq->01_input_data

Note: For multiple input files, it is convenient to open a separate data window, navigate to the above folder and drag all of files into the input data box. Each of the input files will be processed independently. For convenience, a batch of FASTQ files can be analyzed together but these files can also be processed concurrently in independent TopHat runs.

c) In the Ref. Genome section, Use the drop-down menu to select "Arabidopsis thaliana (Ensembl 14)" for the reference genome. This is the TAIR10 release. 

d) In the Ref. Annotations section, use the drop-down menu to select "Arabidopsis thaliana (Ensembl 14)" for the reference gene annotations.

e) Start the analysis by clicking the Launch Analysis button, naming the analysis 'tophat' in the dialog box. 
Reasonable default options are provided for the analysis settings.

f) Tophat will require some time to complete its work on each sample. Click on the Analyses icon to view the status of your submitted analysis.

g) When the analysis is complete, navigate to the results under your 'analyses' folder. The principle outputs from TopHat are BAM files, each set of alignments is in its own folder, corresponding to the name of the original Fastq file. This is the most time-consuming step of this training module. If your analysis has not completed in time, you can skip ahead by using the pre-computed results:

'Community Data'->iplant_training->intro_rna-seq->02_tophat

In the tophat_out folder, you should see these folders:

Cufflinks assembles RNA-Seq alignments into a parsimonious set of transcripts, then estimates the relative abundances of these transcripts based on how many reads support each one, taking into account biases in library preparation protocols. A detailed manual can be found at http://cole-trapnell-lab.github.io/cufflinks/manual/. We will be using a reference annotation file for '''Arabidopsis thaliana''' to perform Reference annotation based assembly (RABT), which facilitates discovery of known and novel transcripts. For further discussion, see http://cole-trapnell-lab.github.io/cufflinks/papers/

a) Click on the Apps icon and find Cufflinks2. Open it.

b) In the 'Select Input Data' section, open a separate data window for the 'bam' folder from tophat_out (above) and select and drag all four bam files into the input box. For convenience, a batch of TopHat bam files can be analyzed together but these files can also be processed concurrently in independent Cufflinks runs.

c) In the Reference Sequence section, select Arabidopsis thaliana from the pull-down menu.

d) Run Cufflinks. When it is complete, you will see the following outputs in cufflinks_out

The outputs of cufflinks are GTF and FPKM files. For example, look in one of the four output folders:

Examine the GTF file, transcripts.gtf. This file contains annotated transcripts assembled by cufflinks, using the annotated transcripts selected from the pulldown menu in the Cufflinks app as a guide.

1	Cufflinks	transcript	3631	5899	1	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
1	Cufflinks	exon	3631	3913	1	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
1	Cufflinks	exon	3996	4276	1	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "2"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
1	Cufflinks	exon	4486	4605	1	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "3"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
1	Cufflinks	exon	4706	5095	1	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "4"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
1	Cufflinks	exon	5174	5326	1	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "5"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
1	Cufflinks	exon	5439	5899	1	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "6"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";

Examine the FPKM file, genes.fpkm_tracking. This files gives the normalized abundance of each transcript, expressed in fragments per kilobase of exon per million fragments mapped (FPKM) and their upper and lower 95% confidence interval.

tracking_id	class_code	nearest_ref_id	gene_id	gene_short_name	tss_id	locus	length	coverage	FPKM	FPKM_conf_lo	FPKM_conf_hi	FPKM_status
AT1G01010	-	-	AT1G01010	ANAC001	-	1:3630-5899	-	-	0	0	0	OK
AT1G01020	-	-	AT1G01020	ARV1	-	1:5927-8737	-	-	6.22574	1.51952	10.9319	OK
AT1G01030	-	-	AT1G01030	NGA3	-	1:11648-13714	-	-	0.921783	0	2.22538	OK
AT1G01040	-	-	AT1G01040	DCL1	-	1:23145-31227	-	-	0.129701	0	0.389104	OK
AT1G01050	-	-	AT1G01050	AtPPa1	-	1:31169-33153	-	-	6.08883	1.11732	11.0603	OK
AT1G01046	-	-	AT1G01046	MIR838A	-	1:28499-28706	-	-	0	0	0	OK
AT1G01070	-	-	AT1G01070	AT1G01070	-	1:38751-40944	-	-	1.92728	0	4.75919	OK
AT1G01073	-	-	AT1G01073	AT1G01073	-	1:44676-44787	-	-	0	0	0	OK

Section 3: Merge all Cufflinks transcripts into a single transcriptome annotation file using Cuffmerge

Cuffmerge merges together several Cufflinks assemblies. It handles also handles running Cuffcompare. The main purpose of this application is to make it easier to make an assembly GTF file suitable for use with Cuffdiff. A merged, empirical annotation file will be more accurate than using the standard reference annotation, as the expression of rare or novel genes and alternative splicing isoforms seen in this experiment will be better reflected in the empirical transcriptome assemblies. For an informative discussion on this topic, see http://seqanswers.com/forums/showthread.php?t=16422.

a) Open the Cuffmerge2 app. Under 'Input Data', browse to the results of the cufflinks analyses (cufflinks_out). Select the four GTF files in cufflinks_out/gtf as input for Cuffmerge2.

b) Under Reference Data, select the Arabidopsis genome annotation and genome sequence using the pull-down menus.

c) Run the App. When it is complete, you should see the following outputs under cuffmerge_out:

Section 4: Compare expression analysis using CuffDiff

Cuffdiff is a program that uses the Cufflinks transcript quantification engine to calculate gene and transcript expression levels in more than one condition and test them for significant differences. Comprehensive documentation of this tool is given through the Nature Protocols paper linked at http://cole-trapnell-lab.github.io/cufflinks/cuffdiff/index.html.

a) Open a separate data window for the bam files generated by tophat (above).

b) In Apps select CuffDiff2

c) Enter the name of your first Sample (WT).

d) Add BAM files all replicates from this sample, either from your tophat output directory or the pre-staged data. Ignore the BAM index (.bai) files. Cuffdiff merges the replicate data and uses comparison between replicates as part of calculating the test statistic.

d) Repeat steps (b) and (c) for the second sample (hy5)

e) Next, open the Reference Data section and add a custom reference annotation file using the "merged_with_ref_ids.gtf" file from the cuffmerge output folder.

f) Launch the analysis

g) Examine the output files.

The folder cuffdiff_out contains the output files from cuffdiff.

The folder graphs contains some basic plots produced using the R package cummeRbund (http://compbio.mit.edu/cummeRbund/).

The folder sorted_data contains some summary documents, suitable for import into spreadsheets, that sort gene and isoform expression by fold-change or total expression.

CuffDiff Outputs

After a successful run you will find many output files under cuffdiff_out. We will focus on four files related to differential gene expression at the gene and transcript levels:

  1. genes.fpkm_tracking Transcript abundance at the gene level in units of fragments per Kb of exon per million mapped reads (FPKM)
  2. gene_exp.diff Results of differential expression testing between the two samples, WT and hy5, at the gene level
  3. isoforms.fpkm_tracking Transcript abundance at the transcript isoform level in units of fragments per Kb of exon per million mapped reads (FPKM)
  4. isoform_exp.diff Results of differential expression testing between the two samples, WT and hy5, at the transcript isoform level


Explanation of Outputs: 
gene_exp.diff contains gene FPKM values for the WT and hy5 samples and ln(fold change) values.

Statistical significance is given by the p_value and q_value and a determination of whether the there is significant differential expression is given in the last column (with values of 'yes' or 'no'). The q-value is a corrected p-value for multiple tests based on the user supplied false discovery rate (FDR: default 0.05).

test_id	gene_id	gene	locus	sample_1	sample_2	status	value_1	value_2	log2(fold_change)	test_stat	p_value	q_value	significant
AT1G01010	AT1G01010	ANAC001	1:3630-5899	WT	hy5	NOTEST	0	1.68901	1.79769e+308	1.79769e+308	0.0786496	1	no
AT1G01020	AT1G01020	ARV1	1:5927-8737	WT	hy5	OK	7.00799	8.85126	0.336881	-0.405048	0.685442	0.816558	no
AT1G01030	AT1G01030	NGA3	1:11648-13714	WT	hy5	NOTEST	1.44458	1.10952	-0.380717	0.22392	0.822819	1	no
AT1G01040	AT1G01040	DCL1	1:23145-33153	WT	hy5	NOTEST	0.312212	3.04201	3.28442	-2.31653	0.0205292	1	no
AT1G01046	AT1G01046	MIR838A	1:23145-33153	WT	hy5	NOTEST	0	0	0	0	1	1	no
AT1G01050	AT1G01050	AtPPa1	1:23145-33153	WT	hy5	OK	12.4652	24.0248	0.946618	-1.48078	0.138664	0.34761	no
AT1G01060	AT1G01060	LHY	1:33378-37871	WT	hy5	OK	5.2891	8.01668	0.599983	-0.961352	0.336375	0.55581	no
AT1G01070	AT1G01070	AT1G01070	1:38751-40944	WT	hy5	NOTEST	1.99104	3.22537	0.695944	-0.355638	0.722111	1	no
AT1G01073	AT1G01073	AT1G01073	1:44676-44787	WT	hy5	NOTEST	0	0	0	0	1	1	no
AT1G01080	AT1G01080	AT1G01080	1:45295-47019	WT	hy5	OK	15.4469	20.5669	0.413009	-0.750314	0.453065	0.647742	no
AT1G01090	AT1G01090	PDH-E1 ALPHA	1:47484-49286	WT	hy5	OK	71.0777	41.2999	-0.783261	2.9202	0.00349811	0.0284027	yes

Section 5: Examine the gene expression data

Below is an example of some differentially regulated genes from the output file genes.sorted_by_expression.sig.txt, sorted by total expression levels (hy5 + WT transcripts). Note the preponderance of down-regulated genes in hy5.

Below is another example of isoform (transcript)-level expression, this time sorted by fold-change. The source file is transcripts.sorted_by_fold.sig.txt.

b) Click on the WT_hy5_volcano_plot.png under graphs in the cuffdiff results. Note that there are a lot more significant values on left side of the graph (down-regulated genes).

RNA-Seq in Atmosphere

Section 1: Connect to an instance of an Atmosphere Image (virtual server)

1. Organize into groups of 2-3 people. Go to https://atmo.iplantcollaborative.org and log in with IPLANT TEST USER CREDENTIALS.

2. There should already be a VM named RNA-Seq Visualization launched and ready for you to connect to. The screen should resemble this one:

5. Launch the standalone VNC Viewer application on your desktop computer and enter YOUR VM's IP address following by :1 (The IP address is provided in the "My Instances" listing at the left, just underneath the name of the instance you are launching, e.g. 128.196.142.97 in this example.)

Once you have connected, you will see your virtual Desktop, a Linux server running in the Atmosphere cloud computing platform. You will interact with it the same way as you would a physical computer.

Section 2: Visualize TopHat outputs in the Integrated Genome Viewer (IGV).

TopHat output data from an RNA-Seq genome alignment using the full SRA accession SRR70570 from the above example data set has been pre-staged in the TopHat folder on your virtual desktop. The folder contains the alignment data in a bam file (SRR70570_WT1.bam) and split-read (intron/exon) junction data in the file junctions.bed.

1. Open IGV by double-clicking on the IGV icon on your dekstop. It will take a moment to load. Do not click the icon again if nothing appears to happen for a few seconds. This implementation of IGV is configured to load the Arabidopsis thaliana genome (TAIR10) by default.

2. Find the File->Load From File menu.

3. Navigate to tophat_out folder on your desktop and select the BAM and BED files. IGV will load the index file automatically, you do not need to select it also. Click OK. This will your TopHat BAM alignment files into IGV.

4. Maximize the IGV window. You should be looking at a whole genome summary view. You can focus on a chromosome by clicking on the chromosome name and zoom in to particular regions by drag-selecting on the ruler or using the navigation tool in the top-right corner of the window.

5. To focus in on a gene, enter Chr1:25048000-25049500. This will zoom to the highly expressed gene RBCS1A (ribulose bisphosphate carboxylase small chain 1A). Note the read coverage and the splicing and other data produced by TopHat. The alignments and splicing data were used by Cufflinks to assemble the transcripts.

6. Zoom to other areas of the genome to find other examples of expressed genes.

Section 3: Get started with R and cummeRbund

Notes on the command line interface

We will be working on the command line for parts of this tutorial

  • Attention to detail is critical. The most common errors we encounter are:
    • typos
    • case sensitivity
    • inappropriate spaces, for example, never put a space between the '-' and the following command line flag. For example

      -i

      is not the same as

      -  i
    • unbalanced quotes and brackets. For example, If you open a parenthesis with (, always make sure to close with ). While typing, if you always balance your quotes and brackets first, then insert the contents, you can avoid these common errors.
    • Never hit the Enter key until you are finished writing the whole command

Cuffdiff output data generated as described in the discovery environment are pre-staged in the cuffdiff_out folder on your desktop.

1. Open a terminal window by double clicking on the Terminal icon on your desktop.
2. Navigate to your Cuffdiff output data

cd Desktop/cuffdiff_out

3. View the contents of the folder with the ls command

$ ls
cds.count_tracking         isoforms.count_tracking
cds.diff                   isoforms.fpkm_tracking
cds.fpkm_tracking          isoforms.read_group_tracking
cds.read_group_tracking    promoters.diff
cds_exp.diff               read_groups.info
cuffData.db                run.info
gene_exp.diff              splicing.diff
genes.count_tracking       tss_group_exp.diff
genes.fpkm_tracking        tss_groups.count_tracking
genes.read_group_tracking  tss_groups.fpkm_tracking
isoform_exp.diff           tss_groups.read_group_tracking

4. Launch R

$ R

R version 2.15.3 (2013-03-01) -- "Security Blanket"
Copyright (C) 2013 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-unknown-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>

This is the R interactive shell, which we will use to enter commands specific to R.

5. Load the cummeRbund package

> library(cummeRbund)

6. Check the version

> citation("cummeRbund")

To cite package 'cummeRbund' in publications use:

  L. Goff, C. Trapnell and D. Kelley (2012). cummeRbund: Analysis,
  exploration, manipulation, and visualization of Cufflinks
  high-throughput sequencing data.. R package version 2.0.0.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {cummeRbund: Analysis, exploration, manipulation, and visualization of
Cufflinks high-throughput sequencing data.},
    author = {L. Goff and C. Trapnell and D. Kelley},
    year = {2012},
    note = {R package version 2.0.0},
  }

ATTENTION: This citation information has been auto-generated from the
package DESCRIPTION file and may need manual editing, see
'help("citation")' .

7. Load the cuffdiff data into a database named cuff

> cuff <- readCufflinks(rebuild = TRUE)

8. Have a look at the database summary:

> cuff
CuffSet instance with:
         2 samples
         33714 genes
         43481 isoforms
         35113 TSS
         32924 CDS
         33621 promoters
         35113 splicing
         27350 relCDS

Section 4: Plot data

  1. The squared coefficient of variation is a normalized measure of cross-replicate variability that can be useful for evaluating the quality your RNA-seq data. Run an SCV plot.

    > fpkmSCVPlot(genes(cuff))

  2. Draw a density plot of genes in the two samples.

    > dens <- csDensity(genes(cuff))

    This will load the results of the analysis into an object named dens. You can now plot the graph.

    > dens

  3. You can skip directly to the graph like so:

    >csDensity(genes(cuff))
  4. Draw a scatter plot. This command also requires X and Y arguments, 'WT' and 'hy5', respectively.

    > csScatter(genes(cuff), 'WT', 'hy5',smooth=T)

  5. Draw a Volcano matrix plot.

    > csVolcanoMatrix(isoforms(cuff), 'WT', 'hy5')

  6. Save a Volcano matrix plot to a PNG file

    > png('../volcano_plot.png')
    > csVolcanoMatrix(isoforms(cuff), 'WT', 'hy5')
    > dev.off()

Section 5: Explore the data at the gene level

  1. how many significantly up or down regulated genes are there at various thresholds?

    > sig <- getSig(cuff, alpha=0.05, level='genes')
    > length(sig)
    [1] 5701
    > sig <- getSig(cuff, alpha=0.01, level='genes')
    > length(sig)
    [1] 4189
    > sig <- getSig(cuff, alpha=10E-8, level='genes')
    > length(sig)
    [1] 1235
    > sig <- getSig(cuff, alpha=0, level='genes')
    > length(sig)
    [1] 466
    
  2. Have a look at the last 100 genes with a q-value (corrected p-value) of 0. Do you notice anything about the last few genes names?

    > tail(sig, 100)
      [1] "AT4G38250"   "AT4G38680"   "AT4G38770"   "AT4G38920"   "AT4G39200"
      [6] "AT4G39363"   "AT4G40040"   "AT5G01530"   "AT5G02380"   "AT5G02500"
     [11] "AT5G02960"   "AT5G03240"   "AT5G03545"   "AT5G03850"   "AT5G04140"
     [16] "AT5G04290"   "AT5G05370"   "AT5G06640"   "AT5G07322"   "AT5G08450"
     [21] "AT5G10450"   "AT5G10960"   "AT5G11740"   "AT5G12140"   "AT5G13930"
     [26] "AT5G14040"   "AT5G14120"   "AT5G14320"   "AT5G16250"   "AT5G17820"
     [31] "AT5G17870"   "AT5G18380"   "AT5G19120"   "AT5G19220"   "AT5G19510"
     [36] "AT5G20290"   "AT5G20630"   "AT5G20790"   "AT5G21020"   "AT5G21940"
     [41] "AT5G22580"   "AT5G22640"   "AT5G23820"   "AT5G24165"   "AT5G24490"
     [46] "AT5G24930"   "AT5G25630"   "AT5G26570"   "AT5G27330"   "AT5G27660"
     [51] "AT5G27700"   "AT5G27770"   "AT5G27850"   "AT5G28540"   "AT5G29890"
     [56] "AT5G35190"   "AT5G40450"   "AT5G41471"   "AT5G41700"   "AT5G41790"
     [61] "AT5G42110"   "AT5G42530"   "AT5G42650"   "AT5G44020"   "AT5G44190"
     [66] "AT5G44580"   "AT5G45350"   "AT5G45510"   "AT5G47020"   "AT5G47110"
     [71] "AT5G47560"   "AT5G47700"   "AT5G47890"   "AT5G49440"   "AT5G50920"
     [76] "AT5G52310"   "AT5G52650"   "AT5G53560"   "AT5G54270"   "AT5G54770"
     [81] "AT5G55660"   "AT5G57290"   "AT5G59030"   "AT5G59690"   "AT5G59850"
     [86] "AT5G60390"   "AT5G60850"   "AT5G63530"   "AT5G64040"   "AT5G64840"
     [91] "AT5G66040"   "AT5G66570"   "AT5G67030"   "AT5G67070"   "AT5G67360"
     [96] "XLOC_008325" "XLOC_011655" "XLOC_013774" "XLOC_033160" "XLOC_033162"
    
    > 
  3. Get the gene data from the cuff database.

    > sigGenes <- getGenes(cuff,sig)
    Getting gene information:
            FPKM
            Differential Expression Data
            Annotation Data
            Replicate FPKMs
            Counts
    Getting isoforms information:
            FPKM
            Differential Expression Data
            Annotation Data
            Replicate FPKMs
            Counts
    Getting CDS information:
            FPKM
            Differential Expression Data
            Annotation Data
            Replicate FPKMs
            Counts
    Getting TSS information:
            FPKM
            Differential Expression Data
            Annotation Data
            Replicate FPKMs
            Counts
    Getting promoter information:
            distData
    Getting splicing information:
            distData
    Getting relCDS information:
            distData
    >
    
  4. Examine the gene set.

    > sigGenes
    CuffGeneSet instance for  468  genes
    
    Slots:
             annotation
             fpkm
             repFpkm
             diff
             count
             isoforms        CuffFeatureSet instance of size 664
             TSS             CuffFeatureSet instance of size 470
             CDS             CuffFeatureSet instance of size 574
             promoters               CuffFeatureSet instance of size 468
             splicing                CuffFeatureSet instance of size 470
             relCDS          CuffFeatureSet instance of size 468
    >  

    Note that there are more isoforms (transcripts) than genes in the set.

Section 6: More Visual Analytics

  1. Let's see what the scatter plot looks with this subset of genes

    > csScatter(sigGenes, 'WT', 'hy5')

  2. We can also do a heat map plot of these genes. First, let's take a smaller subset.

    > sigGenes <- getGenes(cuff,tail(sig,100))
    

    .

  3. Clustered the genes by similar expression profiles.

    > csHeatmap(sigGenes, cluster='both')

  4. This graph is too dense, we can stretch it vertically using a PNG device.

    > png(filename = 'thin_heatmap.png', width = 400, height = 1000, units = 'px')
    > csHeatmap(sigGenes, cluster='both')
    > dev.off()

  5. Finally, consider gene FP3, we can find other genes in the database with similar expression patterns:

    > sim = findSimilar(cuff,'FP3',n=20)
    > sim
    CuffGeneSet instance for  20  genes
    
    Slots:
             annotation
             fpkm
             repFpkm
             diff
             count
             isoforms        CuffFeatureSet instance of size 30
             TSS             CuffFeatureSet instance of size 23
             CDS             CuffFeatureSet instance of size 26
             promoters               CuffFeatureSet instance of size 20
             splicing                CuffFeatureSet instance of size 23
             relCDS          CuffFeatureSet instance of size 20
    
  6. Now, plot the expression pattern(s)

    > expressionPlot(sim,logMode=T,showErrorbars=F)