edgeR (multifactorial pairwise comparisons) in DE
For an introduction to using the DE, see Using the Discovery Environment.
Rationale and background
Currently, the edgeR app in the DE does not allow multifactorial pairwise comparison of RNA-Seq data for differential gene expression analysis. To provide this functionality, the edgeR (multifactorial pairwise comparisons) app has been added. Based on SARTools (R package dedicated to the differential analysis of RNA-Seq data), edgeR (multifactorial pairwise comparisons) allows multifactorial pairwise comparison of RNA-Seq data for differential gene expression analysis. It provides tools to generate descriptive and diagnostic graphs, to run the differential analysis with the edgeR package, and export the results into easily readable tab-delimited files. It also facilitates the generation of an HTML report that displays all the figures produced, explains the statistical methods, and gives the results of the differential analyses.
The SARTools R package has been developed at PF2 - Institut Pasteur by M.-A. Dillies and H. Varet (hugo.varet@pasteur.fr). Please cite H. Varet, L. Brillet-Guéguen, J.-Y. Coppee and M.-A. Dillies, SARTools: A DESeq2- and EdgeR-Based R Pipeline for Comprehensive Differential Analysis of RNA-Seq Data, PLoS One, 2016, doi: http://dx.doi.org/10.1371/journal.pone.0157022 when using this tool for any analysis published.
Introduction and overview
While there are many tools that determine differential expression for microarray data (such as limma), these tools assume a continuous expression response (fluorescence intensity), whereas RNA-Seq, ChIP-Seq, SAGE, DGE, and proteomics data generate expression counts. The edgeR (empirical analysis of DGE in R) app compares expression counts from different experimental data sets to identify differentially expressed gene products. edgeR models the counts with a negative binomial distribution (which simplifies to a Poisson distribution when there is no variation beyond sampling noise) and simplifies the estimation of over-dispersion by assuming that mean and variance are related, which allows it to be applied to experiments with small numbers of replicates. It uses empirical Bayes inference to moderate the dispersion estimates before applying an exact test, analogous to Fisher's Exact Test, to identify differential expression. At least one of the experimental conditions must be replicated.
Robinson, Mark D., Davis J. McCarthy, Gordon K. Smyth. "edgeR: a Bioconductor package for differential expression analysis of digital gene expression data." Bioinformatics. 2010 Jan 1;26(1):139-40. http://bioinformatics.oxfordjournals.org/content/26/1/139.long
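As a brief illustration of the statement above that the negative binomial model reduces to a Poisson model when there is no extra variation, the mean-variance relationship usually written for this model is shown below. The symbols are notation introduced here for illustration and do not come from this page: $y_{gi}$ is the count for feature $g$ in sample $i$, $\mu_{gi}$ is its expected value, and $\phi_g$ is the dispersion of feature $g$.

$$
\operatorname{Var}(y_{gi}) = \mu_{gi} + \phi_g \, \mu_{gi}^{2}
$$

Setting $\phi_g = 0$ leaves $\operatorname{Var}(y_{gi}) = \mu_{gi}$, which is the Poisson case. A positive $\phi_g$ adds variance that grows with the square of the mean, which is the over-dispersion that the dispersion estimates are meant to capture.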
Prerequisites
- A CyVerse account. (Register for a CyVerse account at https://user.cyverse.org/.)
- An up-to-date Java-enabled web browser. (Firefox recommended. If you wish to work with your own large datasets and upload them using iCommands, Chrome is not suitable due to its issues in utilizing 64-bit Java.)
- Input files:
Target file: The user has to supply a tab-delimited file which describes the experiment; i.e. one that contains the name of the biological condition associated with each sample. This file is called "target". This file has one row per sample and is composed of at least three columns with headers:
- first column: unique names of the samples (short but informative as they will be displayed on all the figures); (Ex: "label")
- second column: name of the count files; (Ex: "file")
- third column: biological conditions; (Ex: "group")
- optional columns: further information about the samples (day of library preparation for example). (Ex: "cond")
Table 1: Target file for the test dataset (see below):

| label | file | group | cond |
|---|---|---|---|
| 5_OP_1 | count3.txt | OP | 5 |
| 5_OP_2 | count3.txt | OP | 5 |
| 5_OP_3 | count3.txt | OP | 5 |
| 33_OP_1 | count3.txt | OP | 33 |
| 33_OP_2 | count3.txt | OP | 33 |
| 33_OP_3 | count3.txt | OP | 33 |
| 5_M_1 | count3.txt | M | 5 |
| 5_M_2 | count3.txt | M | 5 |
| 5_M_3 | count3.txt | M | 5 |
| 33_M_1 | count3.txt | M | 33 |
| 33_M_2 | count3.txt | M | 33 |
| 33_M_3 | count3.txt | M | 33 |
| 5_LL_1 | count3.txt | LL | 5 |
| 5_LL_2 | count3.txt | LL | 5 |
| 5_LL_3 | count3.txt | LL | 5 |
| 33_LL_1 | count3.txt | LL | 33 |
| 33_LL_2 | count3.txt | LL | 33 |
| 33_LL_3 | count3.txt | LL | 33 |

- Raw counts file or Raw counts folder: The edgeR statistical analysis assumes that reads have already been mapped and that counts per feature (gene or transcript) are available. There are two different ways to provide the counts to the app.
- A raw counts file that contains all the samples: the first column holds the unique IDs of the features (genes/transcripts) and each remaining column corresponds to one sample (see the example in Table 2). You can use Htseq-Count-Merge-0.6.1 to generate this kind of file.
- A directory containing one count file per sample, each with two tab-delimited columns and no header. The first column holds the unique IDs of the features and the second column holds the raw counts associated with these features (zero or positive integers) (see the example in Table 3). You can use HTSeq-count-0.6.1 to generate this type of directory.
Table 2: Raw counts file for the test dataset (see below):
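| Contig | 5_OP_1 | 5_OP_2 | 5_OP_3 | 33_OP_1 | 33_OP_2 | 33_OP_3 | 5_M_1 | 5_M_2 | 5_M_3 | 33_M_1 | 33_M_2 | 33_M_3 | 5_LL_1 | 5_LL_2 | 5_LL_3 | 33_LL_1 | 33_LL_2 | 33_LL_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| oystercontig_1 | 8 | 54 | 10 | 17 | 3 | 1 | 19 | 47 | 42 | 44 | 6 | 2 | 229 | 47 | 4 | 301 | 231 | 11 |
| oystercontig_2 | 16 | 4 | 16 | 56 | 2 | 1 | 2 | 3 | 0 | 28 | 0 | 0 | 2 | 19 | 1 | 8 | 0 | 5 |
| oystercontig_3 | 2 | 8 | 3 | 13 | 2 | 2 | 1 | 24 | 20 | 41 | 4 | 8 | 23 | 12 | 1 | 70 | 4 | 13 |
| oystercontig_4 | 7 | 2 | 24 | 139 | 2 | 2 | 3 | 1 | 2 | 10 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 3 |
| oystercontig_5 | 0 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 3 | 0 |
| oystercontig_6 | 0 | 0 | 0 | 3 | 0 | 0 | 7 | 0 | 0 | 2 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 0 |
| oystercontig_7 | 127 | 30 | 9 | 46 | 13 | 7 | 153 | 111 | 60 | 60 | 2 | 13 | 245 | 205 | 0 | 123 | 74 | 6 |
| oystercontig_8 | 154 | 386 | 57 | 561 | 91 | 123 | 566 | 693 | 503 | 851 | 47 | 129 | 634 | 928 | 17 | 375 | 788 | 126 |
| oystercontig_9 | 1 | 1 | 0 | 0 | 0 | 0 | 20 | 3 | 4 | 1 | 0 | 0 | 33 | 11 | 0 | 0 | 12 | 1 |

Table 3: An example of a counts file within the folder, with two tab-delimited columns and no header (see below; the column labels are shown for readability only and are not part of the file):

| (feature ID) | (raw count) |
|---|---|
| oystercontig_1 | 301 |
| oystercontig_2 | 8 |
| oystercontig_3 | 70 |
| oystercontig_4 | 0 |
| oystercontig_5 | 0 |
| oystercontig_6 | 2 |
| oystercontig_7 | 123 |
| oystercontig_8 | 375 |
| oystercontig_9 | 0 |

Note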
The number of count files in the directory must match the number of rows in the target file. If the counts and the target files are not supplied in the required formats, the app will not work and you will not be able to run the analysis.
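The format requirements above can be checked before launching a job. The following is a minimal sketch, not part of the app: the function names `read_target` and `missing_samples` are hypothetical, and the inline data is a three-sample excerpt in the layout of Tables 1 and 2. It parses a tab-delimited target file, verifies that it has at least three columns and unique sample names, and reports any sample that has no matching column in a merged raw counts file.

```python
import csv
import io


def read_target(handle):
    """Parse a tab-delimited target file; return (header, rows)."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = list(reader)
    header = reader.fieldnames or []
    if len(header) < 3:
        raise ValueError("target file needs at least three columns")
    labels = [row[header[0]] for row in rows]
    if len(set(labels)) != len(labels):
        raise ValueError("sample names in the first column must be unique")
    return header, rows


def missing_samples(counts_handle, labels):
    """Return target labels with no matching column in a merged counts file."""
    columns = counts_handle.readline().rstrip("\n").split("\t")
    sample_columns = set(columns[1:])  # the first column holds the feature IDs
    return [label for label in labels if label not in sample_columns]


# Small excerpt in the layout of Tables 1 and 2 (three of the 18 samples).
target_text = (
    "label\tfile\tgroup\tcond\n"
    "5_OP_1\tcount3.txt\tOP\t5\n"
    "5_M_1\tcount3.txt\tM\t5\n"
    "5_LL_1\tcount3.txt\tLL\t5\n"
)
counts_text = (
    "Contig\t5_OP_1\t5_M_1\t5_LL_1\n"
    "oystercontig_1\t8\t19\t229\n"
    "oystercontig_2\t16\t2\t2\n"
)

header, rows = read_target(io.StringIO(target_text))
labels = [row[header[0]] for row in rows]
missing = missing_samples(io.StringIO(counts_text), labels)
print("conditions:", sorted({row["group"] for row in rows}))
print("samples missing from counts file:", missing)
```

To check real files, the `io.StringIO` objects could be replaced with handles from `open(path, newline="")`. An empty list of missing samples means every row of the target file has a matching column in the counts file.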
- Parameters:
- Project name: name of the project (must be supplied by the user);
- Author Name: author of the analysis (must be supplied by the user);
- Reference biological condition: reference biological condition used to compute fold-changes (no default, must be one of the levels in the target file);
- Counts-per-million cut-off: counts-per-million cut-off to filter low counts (mandatory; default is 1, set to 0 to disable filtering);
- Replicates: number of replicates (default is 2);
- batch: adjustment variable to use as a batch effect, must be a column of the target file ("day" for example, or NULL if no batch effect needs to be taken into account);
- Variable of Interest: variable of interest, i.e. biological condition, in the target file (mandatory; "group" by default);
- FeaturesToRemove: character vector containing the IDs of the features to remove before running the analysis (default is "alignment_not_unique"). Other available values are "ambiguous", "no_feature", "not_aligned", and "too_low_aQual", which remove HTSeq-count specific rows;
- gene.selection: method of selection of the features for the MultiDimensional Scaling plot ("pairwise" by default or "common");
- Normalization Method: normalization method in calcNormFactors(): "TMM" (default) or "upperquartile";
- Significance threshold: significance threshold applied to the adjusted p-values to select the differentially expressed features (default is 0.05);
- p-value adjustment method: p-value adjustment method for multiple testing ("BH" by default, "BY" or any value of p.adjust.methods);
- colors: colors used for the figures (one per biological condition).
- Test/sample data
This tutorial uses the test data that is stored in the Data Store at Community Data > iplantcollaborative > example_data > edgeR_multi.
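To make the counts-per-million cut-off described under Parameters more concrete, here is a minimal sketch of CPM-based filtering. It rests on assumptions that this page does not confirm: a feature is kept when its CPM reaches the cut-off in at least `min_samples` samples (a rule commonly used with edgeR), and the link between `min_samples` and the Replicates parameter is a guess. The function names and the toy counts are hypothetical; the counts were chosen so that every library has exactly one million reads, which makes each CPM equal to the raw count.

```python
def counts_per_million(counts):
    """counts maps feature ID -> list of raw counts, one entry per sample."""
    n_samples = len(next(iter(counts.values())))
    lib_sizes = [sum(row[i] for row in counts.values()) for i in range(n_samples)]
    return {
        feature: [1e6 * c / size if size else 0.0 for c, size in zip(row, lib_sizes)]
        for feature, row in counts.items()
    }


def filter_low_counts(counts, cpm_cutoff=1.0, min_samples=2):
    """Keep features whose CPM is >= cpm_cutoff in at least min_samples samples.

    The keep rule is an assumption made for illustration, not the app's code.
    """
    cpm = counts_per_million(counts)
    return {
        feature: counts[feature]
        for feature, values in cpm.items()
        if sum(v >= cpm_cutoff for v in values) >= min_samples
    }


# Toy data: each of the three libraries sums to 1,000,000 reads.
counts = {
    "gene_a": [5, 0, 3],
    "gene_b": [0, 0, 2],
    "gene_c": [999995, 1000000, 999995],
}

kept = filter_low_counts(counts, cpm_cutoff=1.0, min_samples=2)
print(list(kept))
```

With a cut-off of 1, `gene_b` is dropped because it reaches 1 CPM in only one sample. With a cut-off of 0 every feature passes, which matches the statement that 0 disables filtering.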
Starting an edgeR (multifactorial pairwise comparisons) job in the DE
- In the DE Apps window, search for and open edgeR (multifactorial pairwise comparisons).
- In the Analysis Name field:
- Change the name for your analysis (optional).
- Enter any comments (optional).
- In the Select output folder field, click Browse and navigate to the folder of your choice, or leave the default name of iplant/home/username/analyses.
- To retain copies of the input files in your analysis results output folder, click the Retain Inputs checkbox.
- Click to open the Input files panel:
- Select the input files for either the file_type test data or the folder_type test data:
- To test the file_type test data:
- For the Target file, browse to select target3.txt inside file_type.
- For the Raw counts file, browse to select counts3.txt.
- To test the folder_type test data:
- For the Target file, browse to select target3.txt inside file_type.
- For the Raw counts folder, browse to select raw1 inside folder_type.
- Click on the Parameters panel and enter the following:
- Project name: test_edgeR_multi
- Author Name: Upendra
- Reference biological condition: OP
- Counts-per-million cut-off: 1
- Replicates: 3
- batch: cond
- Variable of Interest: group
- FeaturesToRemove: alignment_not_unique,ambiguous,no_feature,not_aligned,too_low_aQual
- gene.selection: pairwise
- Normalization Method: TMM
- Significance threshold: 0.05
- p-value adjustment method: BH
- colors: dodgerblue,orange,green
- Click Launch Analysis.
- After the app completes successfully, the following files and figures are generated from the test run:
- barplotTC.png: the total number of reads per sample;
- barplotNull.png: percentage of null counts per sample;
- densplot.png: estimation of the density of the counts for each sample;
- majSeq.png: percentage of reads caught by the feature having the highest count in each sample;
- pairwiseScatter.png: pairwise scatter plot between each pair of samples and SERE values (not produced if more than 30 samples);
- countsBoxplot.png: boxplots on raw and normalized counts;
- cluster.png: hierarchical clustering of the samples (CPM data);
- MDS.png: Multi-Dimensional Scaling plot of the samples;
- BCV.png: graph of the estimations of the tagwise, trended and common dispersions;
- rawpHist.png: histogram of the raw p-values for each comparison;
- MAplot.png: MA-plot for each comparison (log ratio of the means vs intensity);
- volcanoPlot.png: volcano plot for each comparison ($-\log_{10}\text{(adjusted P value)}$ vs log ratio of the means).
Some tab-delimited files are exported in the tables directory. They store information on the features, such as $\log_2\text{(FC)}$ or p-values, and can be read easily in a spreadsheet:
- TestVsRef.complete.txt: contains all the features studied;
- TestVsRef.down.txt: contains only significant down-regulated features, i.e. less expressed in Test than in Ref;
- TestVsRef.up.txt: contains only significant up-regulated features, i.e. more expressed in Test than in Ref.
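The split into up and down tables can be illustrated with a small sketch that applies the Benjamini-Hochberg ("BH") adjustment and then separates significant features by the sign of their $\log_2\text{(FC)}$. This is an illustration under stated assumptions rather than the code the app runs: the function names and feature IDs are hypothetical, and treating an adjusted p-value equal to the threshold as significant (`<=`) is an assumption.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def split_up_down(features, alpha=0.05):
    """features is a list of (feature ID, log2 fold-change, raw p-value)."""
    padj = benjamini_hochberg([p for _, _, p in features])
    up = [fid for (fid, lfc, _), q in zip(features, padj) if q <= alpha and lfc > 0]
    down = [fid for (fid, lfc, _), q in zip(features, padj) if q <= alpha and lfc < 0]
    return up, down, padj


# Hypothetical results for four features of one Test-vs-Ref comparison.
features = [
    ("f1", 2.0, 0.001),
    ("f2", -1.5, 0.01),
    ("f3", 0.3, 0.04),
    ("f4", -0.2, 0.5),
]

up, down, padj = split_up_down(features, alpha=0.05)
print("up:", up)
print("down:", down)
```

In this toy example `f3` has a raw p-value below 0.05 but an adjusted p-value slightly above it, so it appears in neither list. This is why the significance threshold is applied to the adjusted p-values and not to the raw ones.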
All these parameters will be saved and written at the end of the HTML report in order to keep track of what has been done.
For more information on how to interpret these figures and files, and for troubleshooting and FAQs, please refer to the SARTools vignette for the differential analysis of 2 or more conditions with DESeq2 or edgeR.