BUSCO in the Discovery Environment
The DE Quick Start tutorial provides an introduction to basic DE functionality and navigation.
Please work through the tutorial and add your comments at the bottom of this page, or send comments by email to upendra@cyverse.org. Thank you.
Rationale and background:
BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs
Felipe A. Simão, Robert M. Waterhouse, Panagiotis Ioannidis, Evgenia V. Kriventseva, & Evgeny M. Zdobnov (Zdobnov’s Computational Evolutionary Genomics Group)
Bioinformatics, published online June 9, 2015 (doi: 10.1093/bioinformatics/btv351)
BUSCO (Benchmarking Universal Single-Copy Orthologs) is a tool that provides measures for quantitative assessment of genome assembly, gene set, and transcriptome completeness based on evolutionarily informed expectations of gene content from near-universal single-copy orthologs selected from OrthoDB. BUSCO assessments are implemented in open-source software, with comprehensive lineage-specific sets of Benchmarking Universal Single-Copy Orthologs for arthropods, vertebrates, metazoans, fungi, eukaryotes, and bacteria. These conserved orthologs are ideal candidates for large-scale phylogenomics studies, and the annotated BUSCO gene models built during genome assessments provide a comprehensive gene predictor training set for use as part of genome annotation pipelines. BUSCO assessments offer intuitive metrics, based on evolutionarily informed expectations of gene content from hundreds of species, to gauge completeness of rapidly accumulating genomic data and satisfy an Iberian's quest for quality - "Busco calidad/qualidade". The software is freely available to download at http://busco.ezlab.org/.
Pre-Requisites
A CyVerse account. (Register for a CyVerse account here - user.cyverse.org)
Mandatory arguments
Output folder name
Input file (genome assembly, gene set, or transcriptome) in FASTA format
Lineage data (You can select the BUSCO profile files for your species of interest from here: /iplant/home/shared/iplantcollaborative/example_data/BUSCO.sample.data). For version 2.0, there is a new lineage "plantae".
Mode of analysis (genome for genome assemblies, ogs for gene sets, trans for transcriptomes. Default: genome)
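Because the input must be in FASTA format, it can help to sanity-check a file before uploading it to the DE. The following is a minimal sketch, not part of BUSCO or the DE app: `summarize_fasta` is a hypothetical helper that only checks that the text starts with a `>` header and counts records and residues.

```python
import io


def summarize_fasta(handle):
    """Return (number_of_records, total_residues) for FASTA text read from handle.

    Hypothetical helper for illustration; it does not validate the alphabet.
    """
    records = 0
    residues = 0
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            records += 1  # each '>' header starts a new record
        elif records == 0:
            raise ValueError("sequence data found before the first '>' header")
        else:
            residues += len(line)
    if records == 0:
        raise ValueError("no FASTA records found")
    return records, residues


# Small in-memory example with made-up sequences
example = io.StringIO(">contig1\nACGTAC\nGT\n>contig2\nTTGA\n")
print(summarize_fasta(example))  # (2, 12)
```

For a real file, pass an open file object instead, for example `with open("target.fa") as handle: summarize_fasta(handle)`.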
Optional arguments
Species (Select from the pre-computed Augustus metaparameters. Selecting a closely related species usually produces better results. Valid options: see the Augustus help for the list of options - http://augustus.gobics.de/binaries/README.TXT. Default: generic). In the new version 2.0, there are several new species that users can pick from.
E-value (Use a custom BLAST e-value cutoff. Default: 0.01)
Custom flanking genomic regions in base pairs (bp). Used when extending selected candidate regions before gene prediction. Default: flank sizes are calculated automatically based on genome size and range from 5 to 20 Kbp
Perform full optimization for Augustus gene-finding training (Default: Off)
Force overwriting of results files from a previous run with the same name
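The arguments and defaults listed above can be summarized as a small settings check. This is only an illustrative sketch: `build_run_settings` and its dictionary keys are hypothetical names, not part of BUSCO or the DE app, and the only values taken from this page are the three modes and the defaults (genome, generic, 0.01).

```python
VALID_MODES = ("genome", "ogs", "trans")  # modes listed on this page


def build_run_settings(output_name, input_file, lineage,
                       mode="genome", species="generic", evalue=0.01):
    """Collect app settings in a dict and reject obviously invalid values.

    Hypothetical helper: the key names are for illustration only.
    """
    for label, value in (("output folder name", output_name),
                         ("input file", input_file),
                         ("lineage data", lineage)):
        if not value:
            raise ValueError(f"missing mandatory argument: {label}")
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    if evalue <= 0:
        raise ValueError("e-value cutoff must be a positive number")
    return {
        "output_name": output_name,
        "input_file": input_file,
        "lineage": lineage,
        "mode": mode,
        "species": species,
        "evalue": evalue,
    }


# Settings for the sample run: only the mandatory arguments are given,
# so the optional ones fall back to the defaults stated above.
settings = build_run_settings("SAMPLE", "target.fa", "eukaryota")
print(settings["mode"], settings["species"], settings["evalue"])
```

Writing the settings down this way makes it easy to see which values were chosen explicitly and which were left at their defaults before launching the analysis.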
Test/sample data
The following test data are provided for testing BUSCO here: /iplant/home/shared/iplantcollaborative/example_data/BUSCO.sample.data
Input file - target.fa (genome sequences in FASTA format)
Lineage data
Run BUSCO assessment on sequence file ‘target.fa’ in genome mode using 'eukaryota' lineage
Results
Successful execution of the BUSCO assessment pipeline will create a directory named run_<output folder name>. The directory will contain several files and directories:
1- Files
short_summary_ - Contains summary results in BUSCO notation and a brief breakdown of the metrics
full_table_ - Complete results in tabular format with coordinates, scores and lengths of BUSCO matches
training_set_ - Set of complete BUSCO matches used for training Augustus. Only created during genome assessment
_tblastn - Results in tabular format of tBLASTn searches with BUSCO consensus sequences
2- Directories
augustus_ - Augustus-predicted genes. Only created during genome assessment
augustus_proteins - Corresponding Augustus-predicted proteins. Only created during genome assessment
selected - Complete BUSCO matches, used for training Augustus
gb - Complete BUSCO matches, GenBank format
gffs - Complete BUSCO matches, GFF format
hmmer_output - Tabular format HMMER output of searches with BUSCO HMMs
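The summary file reports results in BUSCO notation. As a rough sketch of how that one-line summary could be read programmatically, the snippet below assumes the layout `C:..%[D:..%],F:..%,M:..%,n:..` (complete, duplicated, fragmented, missing, total number of BUSCOs) and optionally accepts an `S:` (single-copy) field inside the brackets. The exact layout written by a given BUSCO version may differ, so treat the pattern as an assumption. `parse_busco_notation` is a hypothetical helper and the example string contains made-up values, not real results.

```python
import re

# Assumed layout of the BUSCO notation string; adjust if your summary differs.
BUSCO_NOTATION = re.compile(
    r"C:(?P<complete>[\d.]+)%"
    r"\[(?:S:(?P<single>[\d.]+)%,)?D:(?P<duplicated>[\d.]+)%\],"
    r"F:(?P<fragmented>[\d.]+)%,"
    r"M:(?P<missing>[\d.]+)%,"
    r"n:(?P<total>\d+)"
)


def parse_busco_notation(text):
    """Return the percentages (float) and total count (int) found in text."""
    match = BUSCO_NOTATION.search(text)
    if match is None:
        raise ValueError("no BUSCO notation string found")
    result = {}
    for key, value in match.groupdict().items():
        if value is None:
            result[key] = None  # optional field not present
        elif key == "total":
            result[key] = int(value)
        else:
            result[key] = float(value)
    return result


# Made-up example values, for illustration only
summary = parse_busco_notation("C:90.0%[D:2.0%],F:6.0%,M:4.0%,n:200")
print(summary["complete"], summary["missing"], summary["total"])
```

Complete, fragmented, and missing percentages should add up to roughly 100, which is a quick consistency check on any parsed summary.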