B) Assess Assembly

Assess assembly (app: Assess assembly vs whole genome)

The Assess assembly vs whole genome app compares an assembly to a reference genome and evaluates the accuracy of the assembly. (Basic documentation: https://pods.iplantcollaborative.org/wiki/display/DEapps/Assess+Assembly+vs+whole+genome)

  1. Log into the Discovery Environment: https://de.iplantcollaborative.org/de/.
  2. Open the Assess assembly vs whole genome app (Public Applications > NGS > Assembly Annotation > Assess assembly vs whole genome).
    1. Change 'Analysis Name' to Assess_Assembly_45kmer, add a 'Description' (optional), and use the default 'output folder'.
  3. Click on the “Inputs” tab.
    1. In the ‘reference fasta’ window, enter a fasta-formatted reference genome sequence (Sample data: Community Data > iplant_training > genome_assembly_soapdenovo > B_Assess_Assembly > rhodobacter_genome.fna. This file contains R. sphaeroides genome NC_007493.2(1) from NCBI).
    2. Note: Reference files for a number of genomes are available in the Data Store (Community Data > iplantcollaborative > genomeservices > builds > 0.2.1).
    3. In the ‘assembly.fasta’ window, enter a fasta-formatted scaffold file produced by the previous section (Assemble reads). (Sample data: Community Data > iplant_training > genome_assembly_soapdenovo > B_Assess_Assembly > SoapdenovoOut45.scafSeq. This file was derived from the .scafSeq output file of the 45 kmer assembly run by inserting the kmer number as an identifier to give it a unique name.)
    4. In the ‘header’ window, enter '45rhodobacter'. This will become the prefix for the output files.
  4. Click on "Launch Analysis".
  5. Repeat steps 1-4 with the .scafSeq output from any other assembly to be assessed. (Sample data: repeat with the scaffold files produced from Assemble_Reads_47kmer and Assemble_Reads_49kmer.) Change the settings to match the kmer value, i.e. change both the 'Analysis Name' and the 'header' value.
  6. Click on 'Analyses' from the DE workspace and monitor the 'Status' of the analysis (e.g., Idle, Submitted, Pending, Running, Completed, Failed).
    1. Once launched, an analysis will continue whether the user remains logged in or not.
    2. Email notifications report on the progress of the analysis; they can be switched off under 'Preferences'.
    3. If the analysis fails or does not proceed in the anticipated timeline, check these tips for troubleshooting. (Using the sample data, the analysis should be complete in 5 minutes.)
    4. To re-run an analysis, click the analysis "App" in the 'Analyses' window.
  7. Access analysis results in one of two ways:
    1. In the 'Analyses' window click on the analysis "Name" to open the output folder.
    2. In the 'Data' window, click on your user name, then navigate to the folder that holds the output of the analysis. (Find the output for the sample data at Community Data > iplant_training > genome_assembly_soapdenovo > B_Assess_Assembly > output_for_sample_data.)
  8. Expect text files named after the input value for 'header' as output. The output of interest is a .report file, a text file listing statistics about how the assembled scaffold sequences (or any other input fasta file) compare to the reference genome.
  9. Compare the results for the different assemblies. Some have larger N50 values for the scaffolds. Some have fewer insertions and deletions. Compare the representation values (the amount of the genome that is accurately reproduced in the assembly). Based on your own project’s needs, which value is more important to you, the N50 value or the representation?
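
Step 9 compares N50 values across assemblies. As a reminder of what that number measures, here is a minimal Python sketch. It assumes the common definition of N50: the length of the scaffold at which the running total of lengths, taken from longest to shortest, first reaches half of the total assembly length. The scaffold lengths below are made up for illustration and are not taken from the sample data; the .report file produced by the app remains the source for real values.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first scaffold that pushes the running total to half of
        # the assembly length defines the N50.
        if running >= half_total:
            return length
    return 0


# Hypothetical scaffold lengths for three assembly runs (illustration only).
assemblies = {
    "45kmer": [100, 80, 60, 40, 20],
    "47kmer": [150, 50, 40, 30, 30],
    "49kmer": [90, 90, 60, 30, 30],
}

for name, lengths in assemblies.items():
    print(name, "N50 =", n50(lengths))
```

A larger N50 means the assembly is made of longer pieces, but it says nothing about whether those pieces are correct, which is why the representation value is compared alongside it.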

Note: To assess assemblies if a reference genome is not available:

  • Follow the workflow above, using a genome from a species that is as closely related to the assembled genome as possible.
  • Use QUAST (Quality Assessment Tool for Genome Assemblies) to evaluate genome assemblies. It works both with and without a reference genome, and a web server (beta) is also available. (Basic documentation: http://bioinf.spbau.ru/QUAST)
  • Use REAPR (Recognising Errors in Assemblies using Paired Reads), a tool that evaluates the accuracy of a genome assembly using mapped paired end reads, without the use of a reference genome for comparison. (Basic documentation: https://www.sanger.ac.uk/resources/software/reapr/)
  • Use CEGMA (Core Eukaryotic Genes Mapping Approach) to build a set of gene annotations for an assembly in the absence of experimental data. It is used for assessing the completeness of a eukaryotic genome. CEGMA can be found in the DE (Public Apps > Beta > CEGMA). (Basic documentation: https://pods.iplantcollaborative.org/wiki/display/DEapps/CEGMA)
  • Use Compute Contig Statistics to analyze basic statistics on the lengths of the assembled scaffolds. It can be found in the DE (Public Apps > NGS > Utilities > Compute contig statistics). (Basic documentation: https://pods.iplantcollaborative.org/wiki/display/DEapps/Compute+Contig+Statistics)
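
To give a feel for the kind of length statistics meant in the last bullet, the following Python sketch reads fasta-formatted lines and summarizes the sequence lengths. It is not the implementation of the Compute contig statistics app, and the function names (fasta_lengths, length_summary) and the two example records are hypothetical.

```python
import statistics


def fasta_lengths(lines):
    """Yield the length of each record in an iterable of FASTA lines."""
    length = None  # None until the first header line is seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def length_summary(lengths):
    """Summarize a collection of sequence lengths in a dictionary."""
    lengths = list(lengths)
    if not lengths:
        return {"count": 0, "total": 0}
    return {
        "count": len(lengths),
        "total": sum(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
    }


# Two made-up records stand in for a real scaffold file.
example = [">scaffold1", "ACGTAC", "GT", ">scaffold2", "GGCC"]
print(length_summary(fasta_lengths(example)))
```

To run this on a real file, pass an open file handle (for example the .scafSeq file from an assembly run) to fasta_lengths in place of the example list.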

Note: For resources on how to evaluate assemblies see:

Next Steps

Refine assembly using GapCloser

The GapCloser app fills gaps in scaffolds produced with the SOAPdenovo assembler. (Basic documentation: http://soap.genomics.org.cn/soapdenovo.html)

GapCloser can be found in the DE at Apps > High-Performance Computing > gapcloser 1.12.0. The input for this app is the entire output directory of the SOAPdenovo2 assembly run, including the input files for the assembly. gapcloser 1.12.0 attempts to identify matches at the internal gap margins of the output scaffold and to reduce the number of gaps. The output will be a text file named after the input scaffold file with a "gapcloser.fa" extension added.

After running GapCloser, assess the re-assembled genome again using the Assess assembly vs whole genome app described in this section to determine how the two assemblies compare. Is the N50 value higher? Is the representation higher? How have other metrics shifted in the new assembly? Based on your needs, is the new assembly better?
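
One quick way to see whether gaps were reduced is to count them before and after the run. The Python sketch below assumes that gaps in the scaffold sequences are written as runs of the letter N, which is the usual convention for scaffold files but should be checked against your own output. The function name gap_stats and the two short sequences are hypothetical and only illustrate the idea.

```python
import re

# Assumption: a gap is any uninterrupted run of N (upper or lower case).
GAP = re.compile(r"[Nn]+")


def gap_stats(sequence):
    """Return (number of gaps, total gap length) for one sequence."""
    gap_lengths = [m.end() - m.start() for m in GAP.finditer(sequence)]
    return len(gap_lengths), sum(gap_lengths)


# Made-up scaffold before and after gap closing (illustration only).
before = "ACGTNNNNACGTNNAC"
after = "ACGTTTGAACGTNNAC"

print("before:", gap_stats(before))
print("after: ", gap_stats(after))
```

Applied to every scaffold in the original .scafSeq file and in the gap-closed output, the two totals show how many gaps were filled, as a complement to the statistics reported by the Assess assembly vs whole genome app.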

For other SOAPdenovo related tools, see the SOAPdenovo website.

Improve assembly

Depending on the amount of data and the depth of coverage, an initial assembly may still consist of thousands of contigs. Several strategies can be employed to bring this number closer to the number of chromosomes in the organism.

  • More data: increase sequence coverage to fill gaps.
  • Align assembled contigs against large sequence contigs generated by PacBio or similar sequencing technologies.
  • Use chromosomal mapping or similar comparative visual techniques to form larger contigs of the smaller fragments assembled from your Illumina run.
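
When deciding how much more data to collect, it can help to estimate the depth of coverage. The Python sketch below uses the standard approximation that coverage equals the number of reads times the read length, divided by the genome size. The read count, read length, and genome size are illustrative numbers only, not values from the sample data, and the function names are hypothetical.

```python
import math


def expected_coverage(read_count, read_length, genome_size):
    """Approximate average depth of coverage (reads * length / genome)."""
    return read_count * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Number of reads required to reach a target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


# Illustrative values: 1 million reads of 100 bp on a 4.6 Mb genome.
genome_size = 4_600_000
print(round(expected_coverage(1_000_000, 100, genome_size), 1))
print(reads_needed(50, 100, genome_size))
```

This estimate ignores uneven coverage and reads lost to quality filtering, so treat the result as a lower bound on the sequencing effort rather than a guarantee.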