Ontologizer

Ontologizer can be used to perform statistical analysis for over-representation of Gene Ontology (GO) terms in sets of genes or proteins derived from an experiment. Ontologizer implements the standard approach to statistical analysis based on the one-sided Fisher’s exact test, the novel parent-child method, and topology-based algorithms. A number of multiple-testing correction procedures are provided.
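As a rough illustration of the standard approach, the one-sided Fisher's exact test for a single GO term can be written as an upper-tail hypergeometric probability. The sketch below is not Ontologizer's own code; the function and argument names are hypothetical and chosen only for this example.

```python
from math import comb  # requires Python 3.8 or later


def overrepresentation_p(pop_total, pop_term, study_total, study_term):
    """Return P(X >= study_term) under the hypergeometric distribution.

    pop_total   -- number of genes in the population set
    pop_term    -- population genes annotated to the GO term
    study_total -- number of genes in the study set
    study_term  -- study genes annotated to the GO term
    """
    denom = comb(pop_total, study_total)
    upper = min(pop_term, study_total)
    # Sum the probabilities of seeing study_term or more annotated genes.
    tail = sum(
        comb(pop_term, k) * comb(pop_total - pop_term, study_total - k)
        for k in range(study_term, upper + 1)
    )
    return tail / denom


# Toy numbers: 10 genes in the population, 3 annotated to the term,
# 4 genes in the study set, 2 of them annotated to the term.
p_value = overrepresentation_p(10, 3, 4, 2)
```

A small p-value means the term appears in the study set more often than expected from a random draw of the same size from the population.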

Note that in this version of Ontologizer, additional scripts are used to ensure that the GO annotation file (.gaf) and the GO structure file (.obo) are up to date. If users choose to select a taxon ID to make their own GO annotation file, the script will query QuickGO and pull all annotations. Since QuickGO works off UniProtKB accessions, please make sure that your data sets use the same accessions or symbols. Users also have the option of uploading their own GO annotation file. For more information on creating your own GO annotation, refer to the InterProScan and InterProScan_Results_Function wikis.

References:

Bauer S, Grossmann S, Vingron M, Robinson PN. Ontologizer 2.0--a multifunctional tool for GO term enrichment analysis and data exploration. Bioinformatics. 2008 Jul 15;24(14):1650-1. http://bioinformatics.oxfordjournals.org/content/24/14/1650.full

Bauer S, Gagneur J, Robinson PN. GOing Bayesian: model-based gene set analysis of genome-scale data. Nucleic Acids Res. 2010 Jun;38(11):3523-32. http://nar.oxfordjournals.org/content/38/11/3523.long

Quick Start

  • To do a GO enrichment analysis, you need a list of differentially expressed genes, the population (or genome) set to compare them to and the GO annotations associated with the population gene set.
  • Note that the GO annotation files (gene association file, or gaf) and the GO OBO file (which defines the structure of the Gene Ontologies) are updated on a regular basis.
  • You may upload your data (a list of differentially expressed genes and the whole gene population list) as line-separated text files of symbols or accessions.
  • Resources: documentation

Test Data

All files are located in the Community Data directory of the iPlant Discovery Environment at the following path:

Community Data > iplantcollaborative > example_data > ontologizer

Input Files

Like all gene set enrichment analysis tools, Ontologizer requires two input data sets -

(a) the experimental set of differentially expressed genes - referred to in Ontologizer as the study set.

(b) the total set of genes for the same organism - referred to in Ontologizer as the population set.

In the example -

Gene population: arab_uniq_genes_srt.txt

Study set: genesuplist2.txt
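Before uploading, it can help to check that both lists are clean and that every study-set gene also appears in the population set. The following is a minimal sketch under the assumption that each file holds one identifier per line; the helper names and the identifiers in the example are hypothetical.

```python
def read_gene_list(lines):
    """Return unique, non-empty identifiers in order of first appearance."""
    seen = set()
    genes = []
    for line in lines:
        gene = line.strip()
        if gene and gene not in seen:
            seen.add(gene)
            genes.append(gene)
    return genes


def missing_from_population(study, population):
    """Return study-set identifiers that are absent from the population set."""
    population_set = set(population)
    return [gene for gene in study if gene not in population_set]


# Made-up identifiers, used only to show the behavior.
population = read_gene_list(
    ["AT1G01010\n", "AT1G01020\n", "\n", "AT1G01030\n", "AT1G01020\n"]
)
study = read_gene_list(["AT1G01020\n", "AT9G99999\n"])
missing = missing_from_population(study, population)
```

With real files, an open file handle can be passed in place of the lists, for example `read_gene_list(open("genesuplist2.txt"))`. Any identifier reported as missing usually points to a mismatch between symbols and accessions in the two files.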

Sample Parameters to use in App

When the app is run in the Discovery Environment, use the following parameters with the above input file(s) to get the output provided in the next section below (Parent-Child-Union with Benjamini-Hochberg multiple testing correction). Other methods may be selected. Please refer to Ontologizer manual for instructions on using those methods.

d value: this is used to write out an additional .dot file (GraphViz) containing a GOTerm graph with significant nodes colored. The value should be 0 - 0.5 (0.5 is default). The d value specifies the maximum level on which a term is considered as significantly enriched.

calculation method: specifies the enrichment calculation method to use. Possible values are: MGSA, Term-For-Term, None, Parent-Child-Union (default), Topology-Elim, Topology-Weighted

association: specifies the file containing associations from genes to GO terms. In this case the association file is selected by choosing the species of the population.

ignore genes to which no association exists within the calculation: genes with no GO annotation will be discounted.

MTC method: specifies the Multiple Testing Correction method. Options include Benjamini-Hochberg, Benjamini-Yekutieli, Bonferroni (default), Bonferroni-Holm, Westfall-Young-Step-Down.

create annotation: Create an additional file per study set which contains GO annotations used in this study.
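For readers who run Ontologizer outside the Discovery Environment, the parameters above correspond to command-line options of the Ontologizer jar. The sketch below only assembles an argument list. The long option names are written as recalled from the Ontologizer command-line help and should be treated as assumptions to verify against `--help`; the jar name and the association and OBO file names are hypothetical.

```python
def build_ontologizer_args(association, obo, population, study,
                           calculation="Parent-Child-Union",
                           mtc="Benjamini-Hochberg",
                           dot_threshold=None,
                           ignore_unannotated=False,
                           create_annotation=False,
                           outdir="outdir"):
    """Assemble an argument list for a hypothetical Ontologizer.jar call."""
    args = [
        "java", "-jar", "Ontologizer.jar",   # jar name is an assumption
        "--association", association,        # GO annotation file (gaf)
        "--go", obo,                         # GO structure file (.obo)
        "--population", population,
        "--studyset", study,
        "--calculation", calculation,
        "--mtc", mtc,
        "--outdir", outdir,
    ]
    if dot_threshold is not None:
        args += ["--dot", str(dot_threshold)]  # d value, between 0 and 0.5
    if ignore_unannotated:
        args.append("--ignore")
    if create_annotation:
        args.append("--annotation")
    return args


command = build_ontologizer_args(
    "annotations.gaf", "go.obo",             # hypothetical file names
    "arab_uniq_genes_srt.txt", "genesuplist2.txt",
    dot_threshold=0.5, create_annotation=True,
)
```

The resulting list could be handed to `subprocess.run(command, check=True)` on a machine where Java and the jar are available.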

Output File(s)

  • Expect a directory of outputs named outdir. Within that directory, you will find three files: one containing the GO annotations for the study set, one containing a table of genes, and a .dot file for viewing a graph (this will need to be opened with another program).
    • For the example data above, you will receive the following files as output:
      • anno-arab_uniq_genes_srt-Parent-Child-Union-Benjamini-Hochberg.txt (7.68MB)
      • table-arab_unique-genes_srt-Parent-Child-Union-Benjamini-Hochberg.txt (431.85KB)
      • view-arab_uniq_genes_srt-Parent-Child-Union-Benjamini-Hochberg.dot (26 bytes)

More About Parameters and Options

Gene association file:

Taxon ID: enter the taxon number of the species being studied to have a GAF calculated for you from GeneOntology.org files
OR
Location of GAF: upload your own gaf file. For more information on GO annotation file formats see:

http://geneontology.org/page/go-annotation-file-formats
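To check that the identifiers in a GAF match those in your gene lists, the file can be reduced to a mapping from gene product ID to GO terms. This is a minimal sketch assuming the tab-separated GAF 2.x layout (column 2 is the DB Object ID, column 4 the qualifier, column 5 the GO ID, and comment lines start with "!"). The function name and the sample records are made up for illustration.

```python
from collections import defaultdict


def parse_gaf(lines):
    """Map each DB Object ID to the set of GO IDs annotated to it."""
    annotations = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and header comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 15:
            continue  # not a complete GAF record
        object_id, qualifier, go_id = cols[1], cols[3], cols[4]
        if "NOT" in qualifier.split("|"):
            continue  # negated annotations are left out of this sketch
        annotations[object_id].add(go_id)
    return dict(annotations)


def make_record(object_id, qualifier, go_id):
    """Build a made-up 17-column GAF line for the example below."""
    cols = ["UniProtKB", object_id, "SYMBOL", qualifier, go_id,
            "PMID:0000000", "IEA", "", "P", "", "", "protein",
            "taxon:3702", "20200101", "UniProt", "", ""]
    return "\t".join(cols) + "\n"


sample = [
    "!gaf-version: 2.1\n",
    make_record("P00001", "", "GO:0006412"),
    make_record("P00001", "", "GO:0008150"),
    make_record("P00002", "NOT", "GO:0006412"),
]
gene_to_terms = parse_gaf(sample)
```

Comparing the keys of the resulting dictionary with the study and population lists shows quickly whether the two use the same accessions.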

Gene Sets:

Differentially Expressed Genes (Study set): list of all differentially expressed products
Population Set: list of all identified products from the experiment. Alternatively you can use a list of all proteins/genes in the organism as your background.

Both files are line-separated lists of gene identifiers. Accessions are preferred over gene symbols.

Additional Options:

Output file Name (Optional): this is used to create the folder where the results are stored

Calculation method: the method used to generate the complete background set, compare the study set to the population set and determine which GO terms are over- or under-represented.

Options are -

    MGSA (Model-based gene set analysis) - Bauer et al. 2010 (see: http://nar.oxfordjournals.org/content/38/11/3523)

    Parent-Child-Union - see Grossmann et al. (2007) (see: http://www.ncbi.nlm.nih.gov/pubmed/17848398)
    Parent-Child-Intersection - see Grossmann et al. (2007) (see: http://www.ncbi.nlm.nih.gov/pubmed/17848398)
    Term-For-Term (default) - see Ontologizer Overrepresentation Analysis
    Topology-Elim - see Alexa et al. (2006) (see: http://www.ncbi.nlm.nih.gov/pubmed/16606683)
    Topology-Weighted - see Alexa et al. (2006) (see: http://www.ncbi.nlm.nih.gov/pubmed/16606683)
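To give a feel for how the parent-child variants differ from Term-For-Term, the sketch below computes a parent-child p-value for one term. The idea, following Grossmann et al. (2007), is to restrict both the population and the study set to genes annotated to the term's parents (their union or their intersection) before applying the hypergeometric test. This is a simplified illustration with hypothetical function names and toy gene sets, not Ontologizer's implementation.

```python
from math import comb  # requires Python 3.8 or later


def hypergeom_upper_tail(pop_total, pop_term, study_total, study_term):
    """Return P(X >= study_term) under the hypergeometric distribution."""
    upper = min(pop_term, study_total)
    tail = sum(
        comb(pop_term, k) * comb(pop_total - pop_term, study_total - k)
        for k in range(study_term, upper + 1)
    )
    return tail / comb(pop_total, study_total)


def parent_child_p(term_genes, parent_gene_sets, population, study,
                   mode="union"):
    """Parent-child p-value for one term that has at least one parent."""
    population = set(population)
    study = set(study) & population
    parent_sets = [set(genes) for genes in parent_gene_sets]
    if mode == "union":
        parents = set().union(*parent_sets)
    else:
        parents = set.intersection(*parent_sets)
    # Condition on the genes annotated to the parents of the term.
    pop_parents = parents & population
    study_parents = parents & study
    pop_term = set(term_genes) & pop_parents
    study_term = pop_term & study
    return hypergeom_upper_tail(len(pop_parents), len(pop_term),
                                len(study_parents), len(study_term))


# Toy data: ten genes, one term with two parent terms.
population = ["g%d" % i for i in range(1, 11)]
study = ["g1", "g2", "g4", "g7"]
term = ["g1", "g2", "g3"]
parents = [["g1", "g2", "g3", "g4", "g5"], ["g1", "g2", "g3", "g6"]]

p_union = parent_child_p(term, parents, population, study, mode="union")
p_intersection = parent_child_p(term, parents, population, study,
                                mode="intersection")
```

In this toy case the intersection of the parents contains exactly the term's own genes, so the term cannot be enriched relative to its parents and the intersection p-value is 1.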

MTC (Multiple Testing Correction) method:

Options are -

    Benjamini-Hochberg (default)
    Benjamini-Yekutieli
    Bonferroni
    Bonferroni-Holm
    Westfall-Young-Single-Step
    Westfall-Young-Step-Down
    None
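As an illustration of what two of these corrections do to a list of raw p-values, here is a minimal sketch of the Bonferroni and Benjamini-Hochberg adjustments. The function names are hypothetical, and Ontologizer's own implementation may differ in details such as how the number of tests is counted.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
bonf = bonferroni(raw)
bh = benjamini_hochberg(raw)
```

Bonferroni controls the family-wise error rate and is the more conservative of the two; Benjamini-Hochberg controls the false discovery rate and typically leaves more terms significant.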

GraphViz Threshold: specifies the threshold used to identify interesting nodes in the generated graph file (.dot). It is set at 0.5.

Check boxes:
ignore genes where no association exists
create annotation file

Tool Source for App