edgeR (test)

edgeR: Empirical analysis of digital gene expression data in R

• The package edgeR is freely available under the LGPL license from the Bioconductor web site (http://bioconductor.org).
• To install this package, start R and enter: source("http://www.bioconductor.org/biocLite.R"); biocLite("edgeR")
Dependency should be installed automatically.

• Sample dataset to input and expected results to output

- edgeR requires three pieces of information for input:

counts matrix: row for a genomic feature (gene/exon); column for sample.

group factor: denoting the experimental group of each sample.

"Kidney" "Kidney" "Kidney" "Kidney" "Kidney" "Liver" "Liver" "Liver" "Liver" "Liver"

lib.size vector: giving the total number of reads sequenced for each sample. This can be inferred from the counts and group.

The sample dataset used as example hereafter is the test data of R package DEGseq; it was extracted from the supplementary data of Marioni and et al. (2008).

- The main output is a list: 1) a table of containing the elements logConc, the log-average concentration/abundance for each tag in the two groups being compared, logFC, the log-abundance ratio, i.e. fold change, for each tag in the two groups being compared, p.value, exact p-value for differential expression using the NB model, adj.p.val, the p-value adjusted for multiple testing as found using p.adjust using the method specified;
and 2) a vector giving the names of the two groups being compared.

"Kidney" "Liver"

• Major steps:

*1. Calculating normalization factors*
> d<- DGEList(counts =dat, lib.size = NULL, norm.factors = NULL, group = groups, genes = NULL, remove.zeros = FALSE)
Calculating library sizes from column totals.
> d <- calcNormFactors(d)

Arguments:	DGEList()
counts	numeric matrix containing the read counts.
lib.size	numeric vector containing the total to normalize against for each sample (optional)
norm.factors	numeric vector containing normalization factors (optional, defaults to all 1)
group	vector giving the experimental group/condition for each sample/library
genes	data frame containing annotation information for the tags/transcripts/genes for which we have count data (optional).
remove.zeros	whether to remove rows that have 0 total count; default is FALSE so as to retain all information in the dataset

*2. Producing an MDS plot*
Before proceeding with the computations for differential expression, it is possible to produce a plot showing the sample relations based on multidimensional scaling, as demonstrated for the data above. Outliers can be identified.

*3. Estimating dispersions*
- Method 1: Analysis by estimating common dispersion_

> d <- estimateCommonDisp(d)

- Method 2: Analysis by estimating moderated tagwise dispersions:_
Moderating the tagwise dispersion to estimate the dispersion separately for each tag, while `squeezing' these estimates towards the common dispersion estimate. The goal of this moderation of the dispersion estimates is to improve inference by sharing information between tags.

To run the moderated analysis, we need to determine how much moderation is necessary. Currently we prefer to choose a sensible value for the smoothing parameter a priori.

As for a small sample size, we should not be too confident about the accuracy of the tagwise dispersions. Therefore it is recommended to use a larger value for prior.n, which can be selected a priori, instead of being estimated.

This means that the common likelihood receives the weight of ten individual tags. Therefore, there will be a reasonable degree of `squeezing' towards the common dispersion estimate, but still enough scope to allow flexibility when estimating the individual dispersion for each gene.

> d <- estimateTagwiseDisp(d, prior.n = 10, grid.length = 500)

> d

*4. Estimating the smoothing parameter as an approximate eBayes rule*
> prior.n <- estimateSmoothing(d)
The weight parameter prior.n can be estimated by using an approximate empirical Bayes rule. This estimated value currently is recommended to be used as a guide, but not routinely in analysis.
*5. Testing*
- Method 1: Analysis by estimating common dispersion_
> de.com=exactTest(object, pair=NULL, dispersion=NULL, common.disp=TRUE)
- Method 2: Analysis by estimating moderated tagwise dispersion_
> de.tgw=exactTest(object, pair=NULL, dispersion=NULL, common.disp=FALSE)

Arguments:	exacTest()
object	a DGEList object, output of estimateCommonDisp, on which to compute Fisher-like exact statistics for the pair of groups specified.
pair	vector of length two, either numeric or character, providing the pair of groups to be compared; if a character vector, then should be the names of two groups (e.g. two levels of object$samples$group); if numeric, then groups to be compared are chosen by finding the levels of object$samples$group corresponding to those numeric values and using those levels as the groups to be compared; if NULL, then first two levels of object$samples$group (a factor) are used.
dispersion	optional vector either of length 1 or the same length as the number of tags. If not NULL (default), then the supplied value(s) will be used as the dispersion parameter for calculating p-values for differential expression. If NULL, then either the common or tagwise dispersion estimates from the DGEList object will be used, according to the value of common.disp. If dispersion is zero, then p-values are equivalent to exact Poisson rather than NB p-values.
common.disp	logical, if TRUE, then testing carried out using common dispersion for each tag/gene, if FALSE then tag-wise estimates of the dispersion parameter are used; default TRUE.
Value:	produces an object of class DGEExact containing the following elements
table	a data frame containing the elements logConc, the log-average concentration/abundance for each tag in the two groups being compared, logFC, the log-abundance ratio, i.e. fold change, for each tag in the two groups being compared, p.value, exact p-value for differential expression using the NB model
comparison	a vector giving the names of the two groups being compared
genes	a data frame containing information about each transcript; taken from object and can be NULL

> topTags(object, n=10, adjust.method="BH", sort.by="p.value")

Arguments:	topTags()
object	a DGEExact object (output from exactTest) or a DGELRT object (output from glmLRT), containing the (at least) the elements table: a data frame containing the log-concentration (i.e. expression level), the log-fold change in expression between the two groups/conditions and the p-value for differential expression, for each tag. If it is a DGEExact object, then topTags will also use the comparison element, which is a vector giving the two experimental groups/conditions being compared. The object may contain other elements that are not used by topTags.
n	scalar, number of tags to display/return
adjust.method	character string stating the method used to adjust p-values for multiple testing, passed on to p.adjust; one of “holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”, “fdr”, and “none”.
sort.by	character string, indicating whether tags should be sorted by p-value ("p.value") or absolute log-fold change ("logFC"); default is to sort by p-value.
Value:	an object of class TopTags containing the following elements for the top n most differentially expressed tags as determined by sort.by.
table	a data frame containing the elements logConc, the log-average concentration/abundance for each tag in the two groups being compared, logFC, the log-abundance ratio, i.e. fold change, for each tag in the two groups being compared, p.value, exact p-value for differential expression using the NB model, adj.p.val, the p-value adjusted for multiple testing as found using p.adjust using the method specified
comparison	a vector giving the names of the two groups being compared

*6. Visualizing DGE results*
To visualize the result, we plot the log-fold change (i.e. the log of the ratio of expression levels for each tag between two experimential groups) against the log-concentration (i.e. the overall average expression level for each tag across the two groups) using plotSmear function. plotSmear is a more sophisticated and superior way to produce an 'MA plot'. plotSmear resolves the problem of plotting tags that have a total count of zero for one of the groups by adding the 'smear' of points at low A value. The points to be smeared are identified as being equal to the minimum estimated concentration in one of the two groups. The smear is created by using random uniform numbers of width smearWidth to the left of the minimum A. plotSmear also allows easy highlighting of differentially expressed (DE) tags by assign the argument de.tags.
> detags500.com <- rownames(topTags(de.com, n = 500)$table)
> plotSmear(object, pair = NULL, de.tags= detags500.com, xlab = "logConc", ylab ="logFC", pch = 19, cex = 0.2, smearWidth = 0.5, panel.first=grid(), smooth.scatter=FALSE, lowess=FALSE, main = "FC plot using common dispersion", …)
> abline(h = c(-2, 2), col = "dodgerblue", lwd = 2)

Arguments:	plotSmear()
object	DGEList or DGELRT object containing data to produce an MA-plot.
air	pair of experimental conditions to plot (if NULL, the first two conditions are used)
de.tags	rownames for tags identified as being differentially expressed; use exactTest to identify DE genes
xlab	x-label of plot
ylab	y-label of plot
pch	scalar or vector giving the character(s) to be used in the plot; default value of 19 gives a round point.
cex	character expansion factor, numerical value giving the amount by which plotting text and symbols should be magnified relative to the default; default cex=0.2 to make the plotted points smaller
smearWidth	width of the smear
panel.first	an expression to be evaluated after the plot axes are set up but before any plotting takes place; the default grid() draws a background grid to aid interpretation of the plot
smooth.scatter	logical, whether to produce a 'smooth scatter' plot using the KernSmooth::smoothScatter function or just a regular scatter plot; default is FALSE, i.e. produce a regular scatter plot
lowess	logical, indicating whether or not to add a lowess curve to the MA-plot to give an indication of any trend in teh log-fold change with log-concentration
...	further arguments passed on to plot