RNA-Seq with CummeRbund

RNA-Seq with CummeRbund

This is not the official tutorial for these materials and will not be updated. Please see the CyVerse Learning Center for the latest tutorials and materials.

Section 3: Get started with R and cummeRbund

Notes on the command line interface

We will be working on the command line for parts of this tutorial

  • Attention to detail is critical. The most common errors we encounter are:
    • typos
    • case sensitivity
    • inappropriate spaces, for example, never put a space between the '-' and the following command line flag. For example

      -i

      is not the same as

      -  i
    • unbalanced quotes and brackets. For example, If you open a parenthesis with (, always make sure to close with ). While typing, if you always balance your quotes and brackets first, then insert the contents, you can avoid these common errors.
    • Never hit the Enter key until you are finished writing the whole command

Cuffdiff output data generated as described in the discovery environment are pre-staged in the cuffdiff_out folder on your desktop.

1. Open a terminal window by double clicking on the Terminal icon on your desktop.
2. Navigate to your Cuffdiff output data

cd Desktop/cuffdiff_out

3. View the contents of the folder with the ls command

$ ls
cds.count_tracking         isoforms.count_tracking
cds.diff                   isoforms.fpkm_tracking
cds.fpkm_tracking          isoforms.read_group_tracking
cds.read_group_tracking    promoters.diff
cds_exp.diff               read_groups.info
cuffData.db                run.info
gene_exp.diff              splicing.diff
genes.count_tracking       tss_group_exp.diff
genes.fpkm_tracking        tss_groups.count_tracking
genes.read_group_tracking  tss_groups.fpkm_tracking
isoform_exp.diff           tss_groups.read_group_tracking

4. Launch R

$ R

R version 2.15.3 (2013-03-01) -- "Security Blanket"
Copyright (C) 2013 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-unknown-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>

This is the R interactive shell, which we will use to enter commands specific to R.

5. Load the cummeRbund package

> library(cummeRbund)

6. Check the version

> citation("cummeRbund")

To cite package 'cummeRbund' in publications use:

  L. Goff, C. Trapnell and D. Kelley (2012). cummeRbund: Analysis,
  exploration, manipulation, and visualization of Cufflinks
  high-throughput sequencing data.. R package version 2.0.0.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {cummeRbund: Analysis, exploration, manipulation, and visualization of
Cufflinks high-throughput sequencing data.},
    author = {L. Goff and C. Trapnell and D. Kelley},
    year = {2012},
    note = {R package version 2.0.0},
  }

ATTENTION: This citation information has been auto-generated from the
package DESCRIPTION file and may need manual editing, see
'help("citation")' .

7. Load the cuffdiff data into a database named cuff

> cuff <- readCufflinks(rebuild = TRUE)

8. Have a look at the database summary:

> cuff
CuffSet instance with:
         2 samples
         33714 genes
         43481 isoforms
         35113 TSS
         32924 CDS
         33621 promoters
         35113 splicing
         27350 relCDS

Section 4: Plot data

  1. The squared coefficient of variation is a normalized measure of cross-replicate variability that can be useful for evaluating the quality your RNA-seq data. Run an SCV plot.

    > fpkmSCVPlot(genes(cuff))

  2. Draw a density plot of genes in the two samples.

    > dens <- csDensity(genes(cuff))

    This will load the results of the analysis into an object named dens. You can now plot the graph.

    > dens

  3. You can skip directly to the graph like so:

    >csDensity(genes(cuff))
  4. Draw a scatter plot. This command also requires X and Y arguments, 'WT' and 'hy5', respectively.

    > csScatter(genes(cuff), 'WT', 'hy5',smooth=T)

  5. Draw a Volcano matrix plot.

    > csVolcanoMatrix(isoforms(cuff), 'WT', 'hy5')

  6. Save a Volcano matrix plot to a PNG file

    > png('../volcano_plot.png')
    > csVolcanoMatrix(isoforms(cuff), 'WT', 'hy5')
    > dev.off()

Section 5: Explore the data at the gene level

  1. how many significantly up or down regulated genes are there at various thresholds?

    > sig <- getSig(cuff, alpha=0.05, level='genes')
    > length(sig)
    [1] 5701
    > sig <- getSig(cuff, alpha=0.01, level='genes')
    > length(sig)
    [1] 4189
    > sig <- getSig(cuff, alpha=10E-8, level='genes')
    > length(sig)
    [1] 1235
    > sig <- getSig(cuff, alpha=0, level='genes')
    > length(sig)
    [1] 466
    
  2. Have a look at the last 100 genes with a q-value (corrected p-value) of 0. Do you notice anything about the last few genes names?

    > tail(sig, 100)
      [1] "AT4G38250"   "AT4G38680"   "AT4G38770"   "AT4G38920"   "AT4G39200"
      [6] "AT4G39363"   "AT4G40040"   "AT5G01530"   "AT5G02380"   "AT5G02500"
     [11] "AT5G02960"   "AT5G03240"   "AT5G03545"   "AT5G03850"   "AT5G04140"
     [16] "AT5G04290"   "AT5G05370"   "AT5G06640"   "AT5G07322"   "AT5G08450"
     [21] "AT5G10450"   "AT5G10960"   "AT5G11740"   "AT5G12140"   "AT5G13930"
     [26] "AT5G14040"   "AT5G14120"   "AT5G14320"   "AT5G16250"   "AT5G17820"
     [31] "AT5G17870"   "AT5G18380"   "AT5G19120"   "AT5G19220"   "AT5G19510"
     [36] "AT5G20290"   "AT5G20630"   "AT5G20790"   "AT5G21020"   "AT5G21940"
     [41] "AT5G22580"   "AT5G22640"   "AT5G23820"   "AT5G24165"   "AT5G24490"
     [46] "AT5G24930"   "AT5G25630"   "AT5G26570"   "AT5G27330"   "AT5G27660"
     [51] "AT5G27700"   "AT5G27770"   "AT5G27850"   "AT5G28540"   "AT5G29890"
     [56] "AT5G35190"   "AT5G40450"   "AT5G41471"   "AT5G41700"   "AT5G41790"
     [61] "AT5G42110"   "AT5G42530"   "AT5G42650"   "AT5G44020"   "AT5G44190"
     [66] "AT5G44580"   "AT5G45350"   "AT5G45510"   "AT5G47020"   "AT5G47110"
     [71] "AT5G47560"   "AT5G47700"   "AT5G47890"   "AT5G49440"   "AT5G50920"
     [76] "AT5G52310"   "AT5G52650"   "AT5G53560"   "AT5G54270"   "AT5G54770"
     [81] "AT5G55660"   "AT5G57290"   "AT5G59030"   "AT5G59690"   "AT5G59850"
     [86] "AT5G60390"   "AT5G60850"   "AT5G63530"   "AT5G64040"   "AT5G64840"
     [91] "AT5G66040"   "AT5G66570"   "AT5G67030"   "AT5G67070"   "AT5G67360"
     [96] "XLOC_008325" "XLOC_011655" "XLOC_013774" "XLOC_033160" "XLOC_033162"
    
    > 
  3. Get the gene data from the cuff database.

    > sigGenes <- getGenes(cuff,sig)
    Getting gene information:
            FPKM
            Differential Expression Data
            Annotation Data
            Replicate FPKMs
            Counts
    Getting isoforms information:
            FPKM
            Differential Expression Data
            Annotation Data
            Replicate FPKMs
            Counts
    Getting CDS information:
            FPKM
            Differential Expression Data
            Annotation Data
            Replicate FPKMs
            Counts
    Getting TSS information:
            FPKM
            Differential Expression Data
            Annotation Data
            Replicate FPKMs
            Counts
    Getting promoter information:
            distData
    Getting splicing information:
            distData
    Getting relCDS information:
            distData
    >
    
  4. Examine the gene set.

    > sigGenes
    CuffGeneSet instance for  468  genes
    
    Slots:
             annotation
             fpkm
             repFpkm
             diff
             count
             isoforms        CuffFeatureSet instance of size 664
             TSS             CuffFeatureSet instance of size 470
             CDS             CuffFeatureSet instance of size 574
             promoters               CuffFeatureSet instance of size 468
             splicing                CuffFeatureSet instance of size 470
             relCDS          CuffFeatureSet instance of size 468
    >  

    Note that there are more isoforms (transcripts) than genes in the set.

Section 6: More Visual Analytics

  1. Let's see what the scatter plot looks with this subset of genes

    > csScatter(sigGenes, 'WT', 'hy5')

  2. We can also do a heat map plot of these genes. First, let's take a smaller subset.

    > sigGenes <- getGenes(cuff,tail(sig,100))
    

    .

  3. Clustered the genes by similar expression profiles.

    > csHeatmap(sigGenes, cluster='both')

  4. This graph is too dense, we can stretch it vertically using a PNG device.

    > png(filename = 'thin_heatmap.png', width = 400, height = 1000, units = 'px')
    > csHeatmap(sigGenes, cluster='both')
    > dev.off()

  5. Finally, consider gene FP3, we can find other genes in the database with similar expression patterns:

    > sim = findSimilar(cuff,'FP3',n=20)
    > sim
    CuffGeneSet instance for  20  genes
    
    Slots:
             annotation
             fpkm
             repFpkm
             diff
             count
             isoforms        CuffFeatureSet instance of size 30
             TSS             CuffFeatureSet instance of size 23
             CDS             CuffFeatureSet instance of size 26
             promoters               CuffFeatureSet instance of size 20
             splicing                CuffFeatureSet instance of size 23
             relCDS          CuffFeatureSet instance of size 20
    
  6. Now, plot the expression pattern(s)

    > expressionPlot(sim,logMode=T,showErrorbars=F)