KOBAS 2.0-09052014 (Atmosphere Images Tutorial)


This Atmosphere image is a build of KOBAS 2.0, an update of KOBAS (KEGG Orthology Based Annotation System). It identifies statistically enriched pathways, diseases, and GO terms for a set of genes or proteins, drawing on pathway, disease, and GO knowledge from multiple well-known databases.

KOBAS 2.0 integrates information from the Gene Ontology (GO); from pathway databases (KEGG PATHWAY, the Pathway Interaction Database (PID, including BioCarta pathways), Reactome, BioCyc, and PANTHER); and from disease databases (including OMIM, KEGG DISEASE, FunDO, GAD, and the NHGRI GWAS Catalog). KOBAS 2.0 can annotate queries to either KEGG genes or KEGG Orthology (KO) terms.

Learn about allocations

Learn about CyVerse's allocation policies here.

How KOBAS 2.0 works

KOBAS 2.0 works in two stages:


Annotate takes a set of genes or proteins as input and maps them to genes with known pathways in the KEGG PATHWAY database

  • Supports ID mapping or sequence similarity mapping
  • ID mapping: Input IDs are mapped directly to genes using the cross-links parsed from KEGG GENES and then, if necessary, IDs are mapped to KO terms
  • Sequence similarity mapping: Each input sequence is BLAST searched against all sequences in KEGG GENES. An input sequence is assigned the KO term(s) of the first BLAST hit that (i) has known KO assignments; (ii) has a BLAST E-value < 10^-5; and (iii) has fewer than five other hits with a lower E-value that do not have KO assignments
  • Users can map against genes in user-specified species instead of all genes
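
The three criteria for sequence similarity mapping can be sketched in Python as follows. This is a minimal illustration, not the KOBAS source code. It assumes BLAST hits are supplied as (E-value, KO terms) pairs, and it reads "first BLAST hit" as the annotated hit with the lowest E-value. The function name and the hit representation are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

# A hit is (e_value, ko_terms); ko_terms is empty when the hit has no KO assignment.
# This representation is an assumption made for illustration.
Hit = Tuple[float, Sequence[str]]


def assign_ko(hits: List[Hit],
              evalue_cutoff: float = 1e-5,
              max_unannotated: int = 5) -> Optional[List[str]]:
    """Return the KO terms for one input sequence, or None if no hit qualifies."""
    # (i) consider only hits that have known KO assignments
    annotated = [hit for hit in hits if hit[1]]
    if not annotated:
        return None
    # The "first" such hit is taken to be the one with the lowest E-value.
    best_evalue, best_kos = min(annotated, key=lambda hit: hit[0])
    # (iii) count hits without KO assignments that have a strictly lower E-value
    better_unannotated = sum(
        1 for e_value, kos in hits if not kos and e_value < best_evalue
    )
    # (ii) E-value below the cutoff, and (iii) fewer than five better unannotated hits
    if best_evalue < evalue_cutoff and better_unannotated < max_unannotated:
        return list(best_kos)
    return None


if __name__ == "__main__":
    example_hits = [(1e-50, []), (1e-40, ["K00001"]), (1e-10, ["K00002"])]
    print(assign_ko(example_hits))
```

Because every later annotated hit has an equal or higher E-value and at least as many better unannotated hits, checking only the best annotated hit is enough under this reading of the rule.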


Identify finds the pathways and diseases that are statistically significantly enriched among the annotated genes produced by Annotate

  • Considers only pathways and diseases to which at least two input genes are mapped
  • Statistical tests for enrichment: binomial test, chi-square test, Fisher's exact test, and hypergeometric test
  • Performs FDR correction to reduce Type I errors (false positives) using QVALUE, Benjamini-Hochberg, or Benjamini-Yekutieli
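
To make the Identify stage concrete, here is a minimal Python sketch of one of the listed combinations: a hypergeometric test followed by Benjamini-Hochberg correction, applied only to pathways with at least two mapped input genes. It is not the KOBAS implementation. The function names, the gene and pathway sets, and the choice of background are all assumptions for illustration.

```python
from math import comb
from typing import Dict, List, Set, Tuple


def hypergeom_pvalue(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) when drawing n genes from N, of which K belong to the pathway."""
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return favourable / comb(N, n)


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def enrich(input_genes: Set[str],
           pathways: Dict[str, Set[str]],
           background: Set[str]) -> Dict[str, Tuple[float, float]]:
    """Map pathway name -> (p-value, BH-adjusted p-value)."""
    query = input_genes & background
    names: List[str] = []
    pvals: List[float] = []
    for name, members in pathways.items():
        members = members & background
        mapped = len(query & members)
        if mapped < 2:  # skip pathways with fewer than two mapped input genes
            continue
        names.append(name)
        pvals.append(hypergeom_pvalue(mapped, len(query), len(members), len(background)))
    return dict(zip(names, zip(pvals, benjamini_hochberg(pvals))))


if __name__ == "__main__":
    background = {f"g{i}" for i in range(10)}          # hypothetical gene IDs
    pathways = {"A": {f"g{i}" for i in range(5)}, "B": {"g0", "g9"}}
    print(enrich({f"g{i}" for i in range(5)}, pathways, background))
```

In this toy run, pathway B is skipped because only one input gene maps to it, which mirrors the two-gene minimum described above.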

For more information about KOBAS 2.0, see "KOBAS 2.0: a web server for annotation and identification of enriched pathways and diseases": http://nar.oxfordjournals.org/content/39/suppl_2/W316.full

For more information about KEGG Orthology, see "Automated genome annotation and pathway identification using the KEGG Orthology (KO) as a controlled vocabulary": http://bioinformatics.oxfordjournals.org/content/21/19/3787.long

Using the KOBAS 2.0 Image

The following information can also be found in /opt/kobas2.0-20140801/docs once you launch the image.

  • The KEGG Orthology and Human datasets will be automatically loaded upon launching this image.
  • An example dataset is located in the /opt/kobas2.0-20131201/docs directory; you can copy it to your home directory to run the following example:

To run the example dataset

  • Copy KOBAS_seq_example.txt to your home directory (cd to /opt/kobas2.0-20140801/docs, then cp KOBAS_seq_example.txt to /home/your username)
  • For the annotate stage, use the following command: annotate.py -i KOBAS_seq_example.txt -t fasta:pro -s hsa -o (name your output file) -e 1e-8
  • For the identify stage, use the following command: identify.py -f (your annotate output file) -b hsa -d K/k -o (name your output file)
    • For reference: K = KEGG PATHWAY, k = KEGG DISEASE
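
If you prefer to script the two stages, the commands above can be assembled in Python as in the sketch below. It uses only the options shown in this tutorial. The output file names are hypothetical placeholders, and the call that would actually launch the scripts is left commented out so the sketch can be inspected away from the instance.

```python
import shlex
from typing import List


def annotate_cmd(fasta: str, output: str,
                 species: str = "hsa", evalue: str = "1e-8") -> List[str]:
    """Build the annotate.py command line used in the example run."""
    return ["annotate.py", "-i", fasta, "-t", "fasta:pro",
            "-s", species, "-o", output, "-e", evalue]


def identify_cmd(annotate_output: str, output: str,
                 background: str = "hsa", databases: str = "K/k") -> List[str]:
    """Build the identify.py command line (K = KEGG PATHWAY, k = KEGG DISEASE)."""
    return ["identify.py", "-f", annotate_output, "-b", background,
            "-d", databases, "-o", output]


if __name__ == "__main__":
    # Output names below are placeholders; choose your own.
    steps = [
        annotate_cmd("KOBAS_seq_example.txt", "example_annotate.txt"),
        identify_cmd("example_annotate.txt", "example_identify.txt"),
    ]
    for cmd in steps:
        print(shlex.join(cmd))
        # On the instance, run each step with:
        # import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems if a file name contains spaces.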

The example run annotates the protein sequences against human (hsa) genes and then reports the information found for them in the KEGG PATHWAY and KEGG DISEASE databases.

Information for future runs

  • Make sure your input data is in the same directory as the annotate.py and identify.py scripts.
    • For your convenience, a symbolic link to the scripts has been placed in your home directory. You can then either mount your volume to this location or scp a local copy of your data to your instance to run KOBAS.
    • For a full listing of the command line parameters available for KOBAS, including instructions for running identify, see the tutorial.txt file in the /opt/kobas2.0-20131201/docs directory.
  • The following databases are preloaded in this image and available for comparison:
    • Homo sapiens (human) - code hsa
    • KEGG Orthology - code ko

IMPORTANT NOTE: KOBAS will NOT work if the header lines of your FASTA file contain extraneous information.

  • Please remove any extra details so that each header contains only the gene ID.
    • Example line that will fail:
      >gi|351720993|ref|NP_001235915.1| transcriptional factor NAC51 Glycine max
    • Example line that will not fail:
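
Header cleanup can be scripted before submitting a file. The Python sketch below strips the free-text description from each FASTA header and, optionally, keeps a single field of a pipe-delimited header. This tutorial does not state which identifier form KOBAS expects, so the choice of field is an assumption you should check against your data; the function names are hypothetical.

```python
from typing import Iterable, Iterator, Optional


def simplify_header(line: str, field: Optional[int] = None) -> str:
    """Reduce a FASTA header to one identifier.

    Drops everything after the first whitespace. If `field` is given, the
    remaining token is split on '|' and only that field (0-based) is kept.
    """
    parts = line[1:].split()
    token = parts[0] if parts else ""
    if field is not None:
        token = token.split("|")[field]
    return ">" + token


def clean_fasta(lines: Iterable[str], field: Optional[int] = None) -> Iterator[str]:
    """Yield FASTA lines with simplified headers; sequence lines pass through."""
    for line in lines:
        line = line.rstrip("\n")
        yield simplify_header(line, field) if line.startswith(">") else line


if __name__ == "__main__":
    header = ">gi|351720993|ref|NP_001235915.1| transcriptional factor NAC51 Glycine max"
    # field=3 keeps the fourth pipe-delimited field; whether this is the ID
    # KOBAS needs for your data is an assumption.
    print(simplify_header(header, field=3))
```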

Import of additional databases

  • To import another database to compare your data against, run the select_org command (at the command prompt, enter select_org).
  • This command lets you enter a 3- or 4-character organism abbreviation, or view the list of all organisms available via the iPlant data store. It then imports the selected datasets and builds the databases that KOBAS annotate and identify need.

Currently available databases and abbreviations

aag = Aedes aegypti (yellow fever mosquito)
acs = Anolis carolinensis (green anole)
ame = Apis mellifera (honey bee)
ath = Arabidopsis thaliana (thale cress)
bta = Bos taurus (cow)
cin = Ciona intestinalis (sea squirt)
csv = Cucumis sativus (cucumber)
dme = Drosophila melanogaster (fruit fly)
dosa = Oryza sativa japonica (Japanese rice) (RAPDB)
dre = Danio rerio (zebrafish)
ecb = Equus caballus (horse)
gga = Gallus gallus (chicken)
gmx = Glycine max (soybean)
hsa = Homo sapiens (human)
mdo = Monodelphis domestica (opossum)
mmu = Mus musculus (mouse)
ola = Oryzias latipes (Japanese medaka)
sce = Saccharomyces cerevisiae (budding yeast)
tca = Tribolium castaneum (red flour beetle)
zma = Zea mays (maize)

ama = Anaplasma marginale St. Maries
baa = Brucella abortus A13334
bbo = Babesia bovis
bmb = Brucella abortus 9-941
bms = Brucella suis 1330
bov = Brucella ovis
cjr = Campylobacter jejuni RM1221
ece = Escherichia coli O157:H7 EDL933 (EHEC)
eco = Escherichia coli K-12 MG1655
eic = Edwardsiella ictaluri
hie = Haemophilus influenzae R2846 (nontypeable)
hpr = Haemophilus parainfluenzae
hps = Helicobacter pylori Shi470
hsm = Haemophilus somnus 2336
lma = Leishmania major
lmc = Listeria monocytogenes Clip81459
mao = Mycobacterium avium subsp. paratuberculosis MAP4
mbh = Mycoplasma bovis Hubei-1
mgf = Mycoplasma gallisepticum F
mhae = Mannheimia haemolytica D153
pap = Pseudomonas aeruginosa PA7
pul = Pasteurella multocida subsp. multocida 3480
req = Rhodococcus equi
sab = Staphylococcus aureus RF122
tbr = Trypanosoma brucei
tcr = Trypanosoma cruzi
tsp = Trichinella spiralis
vce = Vibrio cholerae O1 2010EL-1786
wbm = Wolbachia wBm
