Evaluate Genome Annotation Quality

Rationale and background:


The GenomeQC pipeline integrates several quantitative measures to characterize genome assemblies and annotations. We used the containerized version of GenomeQC to create an app in the Discovery Environment.

GenomeQC: A quality assessment tool for genome assemblies and gene structure annotations. Nancy Manchanda, John L. Portwood II, Margaret R. Woodhouse, Arun S. Seetharam, Carolyn J. Lawrence-Dill, Carson M. Andorf, Matthew B. Hufford. bioRxiv, posted October 8, 2019 (doi: 10.1101/795237)

 

Pre-Requisites

  1. A CyVerse account. (Register for a CyVerse account here - https://user.cyverse.org/register)
  2. Mandatory arguments 
    1. Genome annotation file (in GFF format)
    2. Transcript file (transcript set in FASTA format)
    3. Lineage data (location of the BUSCO lineage data to use. You can select the BUSCO profile files for your species of interest from the Data window under Community Data -> iplantcollaborative -> example_data -> BUSCO.sample.data)
  3. Optional arguments
    1. Annotation metrics output file name (Default: out_GenomeMetrics)
    2. BUSCO output directory (Default: out_BUSCO)

Test with sample data

Test data for this app is available directly in the Discovery Environment, in the Data window under Community Data -> iplantcollaborative -> example_data -> genomeqc_annotation. Run GenomeQC with the following input data.

  1. Genome annotation file - GCF_000001735.4_TAIR10.1_genomic.gff
  2. Transcript file - GCF_000001735.4_TAIR10.1_rna.fna
  3. BUSCO dataset - eukaryota_odb9
  4. Annotation metrics output file - out_GenomeMetrics (default)
  5. BUSCO output directory - out_BUSCO (default)
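
Before launching the app, it can help to confirm locally that the annotation file actually contains the gene and exon features that the metrics rely on. The following is a minimal sketch, not part of GenomeQC: the function name `tally_features` is hypothetical, and it assumes a standard nine-column, tab-separated GFF file with 1-based inclusive coordinates.

```python
from collections import Counter


def tally_features(gff_lines):
    """Count features by type (column 3) and sum their lengths in bases.

    Illustrative helper, not GenomeQC code. Assumes nine tab-separated
    columns per feature line; comment and malformed lines are skipped.
    """
    counts = Counter()
    lengths = Counter()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # skip lines that are not well-formed feature records
        feature_type, start, end = cols[2], int(cols[3]), int(cols[4])
        counts[feature_type] += 1
        # GFF coordinates are 1-based and inclusive on both ends
        lengths[feature_type] += end - start + 1
    return counts, lengths
```

Any iterable of lines works as input, so an open file handle for the downloaded GFF file can be passed directly to `tally_features`.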

Output

  1. out_GenomeMetrics - This file contains metrics for the gene models in the genome annotation file, such as the number and length of genes and the number and length of exons.
  2. run_out_BUSCO/short_summary_out_BUSCO.txt - This file reports the BUSCO metrics: the number of complete, fragmented, and missing BUSCO genes. To put these numbers in context, also run the pipeline on a reference genome annotation, then compare the metrics for your genome of interest against the reference metrics.
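
The comparison against a reference run can be sketched in a few lines of Python. This is an illustration only, not part of GenomeQC: the function names are hypothetical, and it assumes the short summary file contains BUSCO's usual one-line notation of the form `C:..%[S:..%,D:..%],F:..%,M:..%,n:..`. Check this against your own output file before relying on it.

```python
import re

# Assumed one-line BUSCO notation: C:..%[S:..%,D:..%],F:..%,M:..%,n:..
SUMMARY_RE = re.compile(
    r"C:(?P<complete>[\d.]+)%"
    r"\[S:(?P<single>[\d.]+)%,D:(?P<duplicated>[\d.]+)%\],"
    r"F:(?P<fragmented>[\d.]+)%,M:(?P<missing>[\d.]+)%,n:(?P<total>\d+)"
)


def parse_busco_summary(text):
    """Extract BUSCO percentages and the total BUSCO count from summary text."""
    match = SUMMARY_RE.search(text)
    if match is None:
        raise ValueError("no BUSCO summary line found")
    values = {
        key: float(match.group(key))
        for key in ("complete", "single", "duplicated", "fragmented", "missing")
    }
    values["total"] = int(match.group("total"))
    return values


def compare_to_reference(sample, reference):
    """Return sample minus reference, in percentage points, per category."""
    return {
        key: round(sample[key] - reference[key], 2)
        for key in ("complete", "fragmented", "missing")
    }
```

To use it, read the text of each `short_summary_out_BUSCO.txt` file (one from your annotation, one from the reference run), pass each to `parse_busco_summary`, and hand the two results to `compare_to_reference`. A negative value for `complete` means your annotation recovers fewer complete BUSCO genes than the reference.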

 
