BUSCO-v3.0 in the Discovery Environment
The DE Quick Start tutorial provides an introduction to basic DE functionality and navigation.
Rationale and background:
Felipe A. Simão, Robert M. Waterhouse, Panagiotis Ioannidis, Evgenia V. Kriventseva, & Evgeny M. Zdobnov. Zdobnov’s Computational Evolutionary Genomics Group
Pre-Requisites
- A CyVerse account. (Register for a CyVerse account at user.cyverse.org)
- Mandatory arguments
- Output folder name (name to use for the run and all temporary files (appended))
- Input file (genome assembly/gene set/transcript set file in FASTA format)
- Lineage data (Location of the BUSCO lineage data to use. You can select the BUSCO profile files for your species of interest from the Data window under Community Data -> iplantcollaborative -> example_data -> BUSCO.sample.data )
- Mode of analysis (genome, protein, or trans (transcriptome). Default: genome)
- Optional arguments
- threads (Number of CPUs to use for the job. The maximum is 4)
- species (Choose from the list. If your species is not in the list, selecting a closely related species usually produces better results)
- e-value (Use a custom BLAST e-value cutoff. Default: 0.001)
- region_limit (How many candidate regions (contig or transcript) to consider per BUSCO (Default: 3))
- augustus_parameters (Additional parameters for fine-tuning the Augustus run. Do not use this option to set the species; use the species argument instead. Use single quotes as follows: '--param1=1 --param2=2'. See the Augustus documentation for available options)
- Force tblastn (Force tblastn to run on a single core and ignore the threads argument for this step only. Useful if inconsistencies when using multiple threads are noticed. Default: Off)
- long (Performs full optimization for Augustus gene finding training. Default: Off)
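For readers who want to see how the arguments above relate to a command line, the following is a minimal sketch that assembles an equivalent invocation. It assumes the app wraps the standalone BUSCO v3 script `run_BUSCO.py` and that the app arguments map onto that script's flags (`-i`, `-o`, `-l`, `-m`, `-c`, `-sp`, `-e`, `--limit`, `--augustus_parameters`, `--blast_single_core`, `--long`). The function name `build_busco_command` and the mode mapping are hypothetical and are not part of the app.

```python
import shlex

# Assumed mapping from the app's mode labels to standalone BUSCO v3 mode names.
MODE_NAMES = {"genome": "genome", "protein": "proteins", "trans": "transcriptome"}
MAX_THREADS = 4  # maximum number of threads stated for the DE app


def build_busco_command(input_file, output_name, lineage_dir, mode="genome",
                        threads=1, species=None, evalue=None, region_limit=None,
                        augustus_parameters=None, force_tblastn=False,
                        long_mode=False):
    """Return the argument list for a BUSCO run (hypothetical helper)."""
    if mode not in MODE_NAMES:
        raise ValueError(f"mode must be one of {sorted(MODE_NAMES)}")
    if not 1 <= threads <= MAX_THREADS:
        raise ValueError(f"threads must be between 1 and {MAX_THREADS}")

    # Mandatory arguments
    cmd = ["python", "run_BUSCO.py",
           "-i", input_file,
           "-o", output_name,
           "-l", lineage_dir,
           "-m", MODE_NAMES[mode],
           "-c", str(threads)]

    # Optional arguments are only added when the user sets them.
    if species is not None:
        cmd += ["-sp", species]
    if evalue is not None:
        cmd += ["-e", str(evalue)]
    if region_limit is not None:
        cmd += ["--limit", str(region_limit)]
    if augustus_parameters:
        # Passed as one list element, so no extra shell quoting is needed here.
        cmd += ["--augustus_parameters", augustus_parameters]
    if force_tblastn:
        cmd.append("--blast_single_core")
    if long_mode:
        cmd.append("--long")
    return cmd


if __name__ == "__main__":
    example = build_busco_command("target.fa", "run_example", "example",
                                  species="fly", evalue=0.001, region_limit=3)
    # shlex.join (Python 3.8+) quotes the arguments for display in a shell.
    print(shlex.join(example))
```

Building the command as a list rather than a single string keeps a value such as the Augustus parameter string intact as one argument.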
Test data for this app appears directly in the Discovery Environment in the Data window under Community Data -> iplantcollaborative -> example_data -> BUSCO.sample.data
Execute BUSCO with the following input data
- Output folder - run_example
- Input file - target.fa
- lineage data - example
- mode - genome (default)
- species - fly
- e-value - 0.001 (default)
- region_limit - 3 (default)
Results
Successful execution of the BUSCO assessment pipeline will create a directory named run_example along with a logs directory. The run_example directory will contain several files and directories:
1- Files
- short_summary_run_sample.txt - Contains a plain text summary of the results in BUSCO notation. Also gives a brief breakdown of the metrics.
- full_table_run_sample.txt - Contains the complete results in a tabular format with scores and lengths of BUSCO matches, and coordinates (for genome mode) or gene/protein IDs (for transcriptome or proteins mode).
- missing_busco_list_run_sample.tsv - Contains a list of missing BUSCOs.
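The summary line in BUSCO notation can be read programmatically. The sketch below assumes the usual one-line form `C:..%[S:..%,D:..%],F:..%,M:..%,n:..` (complete, single-copy, duplicated, fragmented, missing, total BUSCOs searched). The function name `parse_busco_notation` is hypothetical, and the percentages in the example are made up for illustration; they are not results from the test data.

```python
import re

# Pattern for the one-line BUSCO notation, e.g.
# C:95.0%[S:90.0%,D:5.0%],F:3.0%,M:2.0%,n:100
BUSCO_NOTATION = re.compile(
    r"C:(?P<complete>[\d.]+)%"
    r"\[S:(?P<single_copy>[\d.]+)%,D:(?P<duplicated>[\d.]+)%\],"
    r"F:(?P<fragmented>[\d.]+)%,"
    r"M:(?P<missing>[\d.]+)%,"
    r"n:(?P<total>\d+)"
)


def parse_busco_notation(text):
    """Extract the percentages and total count from a summary string."""
    match = BUSCO_NOTATION.search(text)
    if match is None:
        raise ValueError("no BUSCO notation line found")
    fields = match.groupdict()
    total = int(fields.pop("total"))
    result = {name: float(value) for name, value in fields.items()}
    result["total"] = total
    return result


if __name__ == "__main__":
    # Illustrative values only.
    line = "\tC:95.0%[S:90.0%,D:5.0%],F:3.0%,M:2.0%,n:100"
    print(parse_busco_notation(line))
```

To use it on a real run, read the contents of the short summary file and pass the text to the function; the search finds the notation line wherever it appears in the file.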
2- Directories
- augustus_output - Augustus-predicted genes, only created during genome assessment.
  - File: augustus.log = full details on Augustus jobs
  - File: training_set_XXXX.txt = genes used for Augustus training
  - Folder: predicted_genes = Augustus raw gene output
  - Folder: extracted_proteins = Augustus protein FASTA output
  - Folder: retraining_parameters = Augustus training results
  - Folder: gb = GenBank format complete BUSCOs
  - Folder: gffs = General Feature Format complete BUSCOs
- blast_output - tBLASTn results, not created for assessment of proteins.
  - File: tblastn_XXXX.txt = tabular tBLASTn results
  - File: coordinates_XXXX.txt = locations of BUSCO matches (genome mode)
- hmmer_output - Tabular format HMMER output of searches with BUSCO HMMs.
- single_copy_busco_sequences - FASTA format file for each complete single-copy BUSCO identified.
  - .faa files contain protein sequences
  - .fna files contain coding sequences (DNA, genome mode only)
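As a quick check on a downloaded single_copy_busco_sequences directory, the sketch below counts the protein (.faa) and coding-sequence (.fna) files it contains. The function name `count_busco_sequences` is hypothetical, and the path in the usage example is an assumption based on the example run above.

```python
from collections import Counter
from pathlib import Path


def count_busco_sequences(directory):
    """Count .faa and .fna files directly inside the given directory."""
    suffixes = Counter(
        path.suffix for path in Path(directory).iterdir() if path.is_file()
    )
    return {"faa": suffixes.get(".faa", 0), "fna": suffixes.get(".fna", 0)}


if __name__ == "__main__":
    # Assumed location after downloading the example run's output.
    target = Path("run_example") / "single_copy_busco_sequences"
    if target.is_dir():
        print(count_busco_sequences(target))
```

In genome mode the two counts would normally be expected to match, since each complete single-copy BUSCO has both a protein and a coding sequence file.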
More information on BUSCO-v2 inputs, outputs and parameters can be found in this manual