RNA-Seq with CummeRbund
This is not the official tutorial for these materials and will not be updated. Please see the CyVerse Learning Center for the latest tutorials and materials.
Section 3: Get started with R and cummeRbund
Notes on the command line interface
We will be working on the command line for parts of this tutorial
- Attention to detail is critical. The most common errors we encounter are:
- typos
- case sensitivity
inappropriate spaces, for example, never put a space between the '-' and the following command line flag. For example
-i
is not the same as
- i
- unbalanced quotes and brackets. For example, If you open a parenthesis with (, always make sure to close with ). While typing, if you always balance your quotes and brackets first, then insert the contents, you can avoid these common errors.
- Never hit the Enter key until you are finished writing the whole command
Cuffdiff output data generated as described in the discovery environment are pre-staged in the cuffdiff_out folder on your desktop.
1. Open a terminal window by double clicking on the Terminal icon on your desktop.
2. Navigate to your Cuffdiff output data
cd Desktop/cuffdiff_out
3. View the contents of the folder with the ls command
$ ls cds.count_tracking isoforms.count_tracking cds.diff isoforms.fpkm_tracking cds.fpkm_tracking isoforms.read_group_tracking cds.read_group_tracking promoters.diff cds_exp.diff read_groups.info cuffData.db run.info gene_exp.diff splicing.diff genes.count_tracking tss_group_exp.diff genes.fpkm_tracking tss_groups.count_tracking genes.read_group_tracking tss_groups.fpkm_tracking isoform_exp.diff tss_groups.read_group_tracking
4. Launch R
$ R R version 2.15.3 (2013-03-01) -- "Security Blanket" Copyright (C) 2013 The R Foundation for Statistical Computing ISBN 3-900051-07-0 Platform: x86_64-unknown-linux-gnu (64-bit) R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details. R is a collaborative project with many contributors. Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications. Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R. >
This is the R interactive shell, which we will use to enter commands specific to R.
5. Load the cummeRbund package
> library(cummeRbund)
6. Check the version
> citation("cummeRbund") To cite package 'cummeRbund' in publications use: L. Goff, C. Trapnell and D. Kelley (2012). cummeRbund: Analysis, exploration, manipulation, and visualization of Cufflinks high-throughput sequencing data.. R package version 2.0.0. A BibTeX entry for LaTeX users is @Manual{, title = {cummeRbund: Analysis, exploration, manipulation, and visualization of Cufflinks high-throughput sequencing data.}, author = {L. Goff and C. Trapnell and D. Kelley}, year = {2012}, note = {R package version 2.0.0}, } ATTENTION: This citation information has been auto-generated from the package DESCRIPTION file and may need manual editing, see 'help("citation")' .
7. Load the cuffdiff data into a database named cuff
> cuff <- readCufflinks(rebuild = TRUE)
8. Have a look at the database summary:
> cuff CuffSet instance with: 2 samples 33714 genes 43481 isoforms 35113 TSS 32924 CDS 33621 promoters 35113 splicing 27350 relCDS
Section 4: Plot data
The squared coefficient of variation is a normalized measure of cross-replicate variability that can be useful for evaluating the quality your RNA-seq data. Run an SCV plot.
> fpkmSCVPlot(genes(cuff))
Draw a density plot of genes in the two samples.
> dens <- csDensity(genes(cuff))
This will load the results of the analysis into an object named dens. You can now plot the graph.
> dens
You can skip directly to the graph like so:
>csDensity(genes(cuff))
Draw a scatter plot. This command also requires X and Y arguments, 'WT' and 'hy5', respectively.
> csScatter(genes(cuff), 'WT', 'hy5',smooth=T)
Draw a Volcano matrix plot.
> csVolcanoMatrix(isoforms(cuff), 'WT', 'hy5')
Save a Volcano matrix plot to a PNG file
> png('../volcano_plot.png') > csVolcanoMatrix(isoforms(cuff), 'WT', 'hy5') > dev.off()
Section 5: Explore the data at the gene level
how many significantly up or down regulated genes are there at various thresholds?
> sig <- getSig(cuff, alpha=0.05, level='genes') > length(sig) [1] 5701 > sig <- getSig(cuff, alpha=0.01, level='genes') > length(sig) [1] 4189 > sig <- getSig(cuff, alpha=10E-8, level='genes') > length(sig) [1] 1235 > sig <- getSig(cuff, alpha=0, level='genes') > length(sig) [1] 466
Have a look at the last 100 genes with a q-value (corrected p-value) of 0. Do you notice anything about the last few genes names?
> tail(sig, 100) [1] "AT4G38250" "AT4G38680" "AT4G38770" "AT4G38920" "AT4G39200" [6] "AT4G39363" "AT4G40040" "AT5G01530" "AT5G02380" "AT5G02500" [11] "AT5G02960" "AT5G03240" "AT5G03545" "AT5G03850" "AT5G04140" [16] "AT5G04290" "AT5G05370" "AT5G06640" "AT5G07322" "AT5G08450" [21] "AT5G10450" "AT5G10960" "AT5G11740" "AT5G12140" "AT5G13930" [26] "AT5G14040" "AT5G14120" "AT5G14320" "AT5G16250" "AT5G17820" [31] "AT5G17870" "AT5G18380" "AT5G19120" "AT5G19220" "AT5G19510" [36] "AT5G20290" "AT5G20630" "AT5G20790" "AT5G21020" "AT5G21940" [41] "AT5G22580" "AT5G22640" "AT5G23820" "AT5G24165" "AT5G24490" [46] "AT5G24930" "AT5G25630" "AT5G26570" "AT5G27330" "AT5G27660" [51] "AT5G27700" "AT5G27770" "AT5G27850" "AT5G28540" "AT5G29890" [56] "AT5G35190" "AT5G40450" "AT5G41471" "AT5G41700" "AT5G41790" [61] "AT5G42110" "AT5G42530" "AT5G42650" "AT5G44020" "AT5G44190" [66] "AT5G44580" "AT5G45350" "AT5G45510" "AT5G47020" "AT5G47110" [71] "AT5G47560" "AT5G47700" "AT5G47890" "AT5G49440" "AT5G50920" [76] "AT5G52310" "AT5G52650" "AT5G53560" "AT5G54270" "AT5G54770" [81] "AT5G55660" "AT5G57290" "AT5G59030" "AT5G59690" "AT5G59850" [86] "AT5G60390" "AT5G60850" "AT5G63530" "AT5G64040" "AT5G64840" [91] "AT5G66040" "AT5G66570" "AT5G67030" "AT5G67070" "AT5G67360" [96] "XLOC_008325" "XLOC_011655" "XLOC_013774" "XLOC_033160" "XLOC_033162" >
Get the gene data from the cuff database.
> sigGenes <- getGenes(cuff,sig) Getting gene information: FPKM Differential Expression Data Annotation Data Replicate FPKMs Counts Getting isoforms information: FPKM Differential Expression Data Annotation Data Replicate FPKMs Counts Getting CDS information: FPKM Differential Expression Data Annotation Data Replicate FPKMs Counts Getting TSS information: FPKM Differential Expression Data Annotation Data Replicate FPKMs Counts Getting promoter information: distData Getting splicing information: distData Getting relCDS information: distData >
Examine the gene set.
> sigGenes CuffGeneSet instance for 468 genes Slots: annotation fpkm repFpkm diff count isoforms CuffFeatureSet instance of size 664 TSS CuffFeatureSet instance of size 470 CDS CuffFeatureSet instance of size 574 promoters CuffFeatureSet instance of size 468 splicing CuffFeatureSet instance of size 470 relCDS CuffFeatureSet instance of size 468 >
Note that there are more isoforms (transcripts) than genes in the set.
Section 6: More Visual Analytics
Let's see what the scatter plot looks with this subset of genes
> csScatter(sigGenes, 'WT', 'hy5')
We can also do a heat map plot of these genes. First, let's take a smaller subset.
> sigGenes <- getGenes(cuff,tail(sig,100))
.
Clustered the genes by similar expression profiles.
> csHeatmap(sigGenes, cluster='both')
This graph is too dense, we can stretch it vertically using a PNG device.
> png(filename = 'thin_heatmap.png', width = 400, height = 1000, units = 'px') > csHeatmap(sigGenes, cluster='both') > dev.off()
Finally, consider gene FP3, we can find other genes in the database with similar expression patterns:
> sim = findSimilar(cuff,'FP3',n=20) > sim CuffGeneSet instance for 20 genes Slots: annotation fpkm repFpkm diff count isoforms CuffFeatureSet instance of size 30 TSS CuffFeatureSet instance of size 23 CDS CuffFeatureSet instance of size 26 promoters CuffFeatureSet instance of size 20 splicing CuffFeatureSet instance of size 23 relCDS CuffFeatureSet instance of size 20
Now, plot the expression pattern(s)
> expressionPlot(sim,logMode=T,showErrorbars=F)