BUSCO in the Discovery Environment

Alert:

 

The iPlant App Store is currently being restructured, and apps are being moved to an HPC environment. During this transition, users may occasionally be unable to locate or use apps that are listed in our tutorials. In many cases, these apps can be located by searching for them using the search bar at the top of the Apps window in the DE. To increase the chance of a successful search, do not search for the entire app name and version number; search only for the portion that refers to the app's function or origin (e.g. 'SOAPdenovo' instead of 'SOAPdenovo-Trans 1.01'). In critical cases, please report your concern to the iPlant Ask forum or to support@iplantcollaborative.org. Thank you for your patience.

The DE Quick Start tutorial provides an introduction to basic DE functionality and navigation.

Please work through the tutorial and add your comments at the bottom of this page, or send comments by email to upendra@cyverse.org. Thank you.

Rationale and background:

BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs 

Felipe A. Simão, Robert M. Waterhouse, Panagiotis Ioannidis, Evgenia V. Kriventseva, & Evgeny M. Zdobnov (Zdobnov's Computational Evolutionary Genomics Group)

Bioinformatics, published online June 9, 2015 (doi: 10.1093/bioinformatics/btv351)

BUSCO (Benchmarking Universal Single-Copy Orthologs) is a tool that provides measures for quantitative assessment of genome assembly, gene set, and transcriptome completeness based on evolutionarily informed expectations of gene content from near-universal single-copy orthologs selected from OrthoDB. BUSCO assessments are implemented in open-source software, with comprehensive lineage-specific sets of Benchmarking Universal Single-Copy Orthologs for arthropods, vertebrates, metazoans, fungi, eukaryotes, and bacteria. These conserved orthologs are ideal candidates for large-scale phylogenomics studies, and the annotated BUSCO gene models built during genome assessments provide a comprehensive gene predictor training set for use as part of genome annotation pipelines. BUSCO assessments offer intuitive metrics, based on evolutionarily informed expectations of gene content from hundreds of species, to gauge completeness of rapidly accumulating genomic data and satisfy an Iberian's quest for quality - "Busco calidad/qualidade". The software is freely available to download at http://busco.ezlab.org/.


Pre-Requisites

  1. A CyVerse account. (Register for a CyVerse account here - user.cyverse.org)
  2. Mandatory arguments 
    1. Output folder name
    2. Input file (genome assembly, gene set, or transcriptome) in FASTA format
    3. Lineage data (You can select the BUSCO profile files for your species of interest from here: /iplant/home/shared/iplantcollaborative/example_data/BUSCO.sample.data). For version 2.0, there is a new lineage "plantae".
    4. Mode of analysis (genome, ogs, or trans. Default: genome)
  3. Optional arguments
    1. Species (Select from the pre-computed Augustus metaparameters. Selecting a closely related species usually produces better results. Valid options: see the Augustus help for the list of options - http://augustus.gobics.de/binaries/README.TXT. Default: generic). In the new version 2.0, there are several new species that users can pick from.
    2. E-value (Use a custom BLAST e-value cutoff. Default: 0.01)
    3. Custom flanking genomic regions in base pairs (bp), used when extending selected candidate regions before gene prediction. Default: flank sizes are calculated automatically based on genome size and range from 5 to 20 kbp
    4. Perform full optimization for Augustus gene-finding training. Default: Off
    5. Force overwriting of result files from a previous run with the same name
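
Before filling in the app form, it can help to write the settings down and check them against the defaults and valid values listed above. The following is a minimal sketch in Python: the BuscoSettings class and its field names are hypothetical (they are not part of BUSCO or the DE), and only the defaults and mode names come from the list above.

```python
from dataclasses import dataclass
from typing import List, Optional

# Modes of analysis as listed in the pre-requisites above.
VALID_MODES = ("genome", "ogs", "trans")


@dataclass
class BuscoSettings:
    """Hypothetical record of the app's fields; not a BUSCO or DE API."""
    output_name: str                  # mandatory: output folder name
    input_fasta: str                  # mandatory: input file in FASTA format
    lineage: str                      # mandatory: lineage data folder
    mode: str = "genome"              # default taken from the list above
    species: str = "generic"          # Augustus species, default "generic"
    evalue: float = 0.01              # BLAST e-value cutoff
    flank_bp: Optional[int] = None    # None means "calculate automatically"
    full_optimization: bool = False   # Augustus training optimization
    force_overwrite: bool = False

    def problems(self) -> List[str]:
        """Return human-readable problems; an empty list means the settings look usable."""
        issues = []
        if not self.output_name.strip():
            issues.append("output folder name is required")
        if not self.input_fasta.strip():
            issues.append("input file is required")
        if not self.lineage.strip():
            issues.append("lineage data is required")
        if self.mode not in VALID_MODES:
            issues.append("mode must be one of: " + ", ".join(VALID_MODES))
        if self.evalue <= 0:
            issues.append("e-value cutoff must be positive")
        if self.flank_bp is not None and self.flank_bp <= 0:
            issues.append("flank size must be a positive number of bp")
        return issues


settings = BuscoSettings("SAMPLE", "target.fa", "eukaryota")
print(settings.problems())
```

With only the three mandatory values supplied, every optional field falls back to the default documented above, and problems() returns an empty list.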

Test/sample data

The following test data for BUSCO are provided in /iplant/home/shared/iplantcollaborative/example_data/BUSCO.sample.data:

  1. Input file - target.fa (genome sequences in fasta format)
  2. Lineage data
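
Because the input must be in FASTA format, a quick sanity check on a local copy of the file can catch obvious problems before a job is submitted. The sketch below is an illustration only: summarize_fasta is a hypothetical helper, and the two short sequences are made-up stand-ins, not the contents of target.fa.

```python
def summarize_fasta(text: str) -> dict:
    """Count header lines and sequence characters in FASTA-formatted text."""
    records = 0
    residues = 0
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            records += 1
        elif records == 0:
            raise ValueError("sequence data found before the first '>' header")
        else:
            residues += len(line)
    if records == 0:
        raise ValueError("no FASTA headers found")
    return {"records": records, "residues": residues}


# Made-up example input, for illustration only.
example = ">contig1\nACGTACGT\nACGT\n>contig2\nGGCC\n"
print(summarize_fasta(example))  # {'records': 2, 'residues': 16}
```

For a real file, the text can be read with open(path).read() and passed to the same function.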

Run a BUSCO assessment on the sequence file 'target.fa' in genome mode using the 'eukaryota' lineage

Results 

Successful execution of the BUSCO assessment pipeline will create a directory named run_<output folder name>. The directory will contain several files and directories:

1- Files

  1. short_summary_ - Summary results in BUSCO notation and a brief breakdown of the metrics
  2. full_table_ - Complete results in tabular format with coordinates, scores, and lengths of BUSCO matches
  3. training_set_ - Set of complete BUSCO matches used for training Augustus. Only created during genome assessment
  4. _tblastn - Results of tBLASTn searches with BUSCO consensus sequences, in tabular format
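
The summary line in BUSCO notation can be turned into numbers for comparing runs. The sketch below assumes the notation is a series of single-letter keys followed by a colon and a value, such as C (complete), D (duplicated), F (fragmented), M (missing) and n (number of BUSCOs); the exact layout may differ between BUSCO versions, and the example line uses made-up values, not results from the sample data.

```python
import re


def parse_busco_notation(line: str) -> dict:
    """Extract 'key:value' pairs from a BUSCO notation line.

    Percentages are returned as floats, plain counts as ints.
    The assumed layout is described in the text above.
    """
    pairs = re.findall(r"([A-Za-z]):(\d+(?:\.\d+)?)(%?)", line)
    result = {}
    for key, number, percent in pairs:
        if percent or "." in number:
            result[key] = float(number)
        else:
            result[key] = int(number)
    return result


# Made-up line, for illustration only.
example_line = "C:90%[D:5%],F:6%,M:4%,n:100"
print(parse_busco_notation(example_line))
# {'C': 90.0, 'D': 5.0, 'F': 6.0, 'M': 4.0, 'n': 100}
```

The regular expression is deliberately loose so that extra keys inside the brackets are picked up as well.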

2- Directories

  1. augustus_ - Augustus-predicted genes. Only created during genome assessment
  2. augustus_proteins - Corresponding Augustus-predicted proteins. Only created during genome assessment
  3. selected - Complete BUSCO matches, used for training Augustus
  4. gb - Complete BUSCO matches, GenBank format
  5. gffs - Complete BUSCO matches, GFF format
  6. hmmer_output - Tabular format HMMER output of searches with BUSCO HMMs
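
After a genome-mode run, the contents of the results directory can be compared with the files and directories listed above. This is a rough sketch: missing_outputs is a hypothetical helper, the wildcard positions in the patterns are assumptions about where the run name appears, and the example listing uses invented names.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List

# Patterns derived from the names listed above; wildcard positions are assumptions.
EXPECTED_PATTERNS = [
    "short_summary_*", "full_table_*", "training_set_*", "*_tblastn*",
    "augustus_*", "augustus_proteins*", "selected", "gb", "gffs",
    "hmmer_output*",
]


def missing_outputs(names: Iterable[str]) -> List[str]:
    """Return the expected patterns that match none of the given names."""
    names = list(names)
    missing = []
    for pattern in EXPECTED_PATTERNS:
        matches = [n for n in names if fnmatchcase(n, pattern)]
        if pattern == "augustus_*":
            # The proteins directory must not stand in for the gene predictions.
            matches = [n for n in matches if not n.startswith("augustus_proteins")]
        if not matches:
            missing.append(pattern)
    return missing


# Invented listing for a run called SAMPLE; the gene prediction directory is absent.
listing = [
    "short_summary_SAMPLE", "full_table_SAMPLE", "training_set_SAMPLE",
    "SAMPLE_tblastn", "augustus_proteins", "selected", "gb", "gffs",
    "hmmer_output",
]
print(missing_outputs(listing))  # ['augustus_*']
```

For a downloaded results folder, os.listdir on that folder gives the list of names to pass in. The check only applies to genome mode, since some entries are created only during genome assessment.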