A) Assemble reads

Assemble reads (app: SOAPdenovo 2.0.4)

The SOAPdenovo 2.0.4 app is used to assemble contigs from Illumina reads (Basic documentation: http://soap.genomics.org.cn/).

Sample data: general information on the reads used as sample data in this tutorial was provided with the reads on the GAGE website. For other data sets, this information can be provided by the core facility that performs the sequencing. The GAGE data was trimmed and cleaned up with the applications Scythe and Sickle, as described in a separate tutorial. During trimming, many culled reads left unpaired mates behind; these were moved into a separate single-reads file. Paired-end reads are used to build short contigs and are commonly provided as forward/reverse reads. Mate-pair reads are used to connect contigs and are provided as reverse/forward reads. Scaffolding relies on pairing information to link contigs together, so unpaired reads are not used in that step. For illustration and practice, the tutorial uses a paired-end library (fragScSi_1.fq and fragScSi_2.fq), a mate-pair library (shrtjmpScSi_1.fq and shrtjmpScSi_2.fq), and a single-reads library (fragScSi_s.fq).
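
The way trimming leaves unpaired mates behind can be pictured with a small sketch. This is only an illustration of the idea, not how Scythe or Sickle are implemented; the function name `split_pairs` and the read IDs are invented for the example.

```python
def split_pairs(forward_ids, reverse_ids):
    """Partition read IDs into those still paired and those left without a mate."""
    forward_set = set(forward_ids)
    reverse_set = set(reverse_ids)
    # A read stays paired only if its mate also survived trimming.
    paired = [rid for rid in forward_ids if rid in reverse_set]
    # Everything else becomes a single read.
    singles = [rid for rid in forward_ids if rid not in reverse_set]
    singles += [rid for rid in reverse_ids if rid not in forward_set]
    return paired, singles


# Toy IDs (invented): r2 lost its reverse mate, r4 lost its forward mate.
paired, singles = split_pairs(["r1", "r2", "r3"], ["r1", "r3", "r4"])
print(paired)   # ['r1', 'r3']
print(singles)  # ['r2', 'r4']
```

In the sample data, the reads that end up in the second group correspond to the single-reads file fragScSi_s.fq.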

  1. Log into the Discovery Environment: https://de.iplantcollaborative.org/de/.
  2. Open the SOAPdenovo app (Apps > High-Performance Computing > SOAPdenovo 2.0.4).
    1. Change 'Analysis Name' to Assemble_Reads_45kmer, add a 'Description' (optional), and use the default 'output folder'. (By default, the app will include the input files in the output directory.)
  3. Click on the “Inputs” tab.
    1. Enter the trimmed reads into the SOAPdenovo app in the Discovery Environment (DE) with the library settings adjusted appropriately for each library of reads. The trimmed reads used for the tutorial are available in the Data Store: (Community Data > iplant_training > genome_assembly_soapdenovo > A_Assemble_Reads).
      1. Into the '_reads1' field for any of the 'Paired Library' rows, enter the path to the FASTQ or FASTA file that contains the first set of trimmed reads of a paired-end or mate-pair library.
      2. Into the '_reads2' field for the same 'Paired Library' row, enter the path to the FASTQ or FASTA file that contains the second set of trimmed reads of that paired-end or mate-pair library.
      3. If single-read files are available, combine them into one file and enter its path into the ‘Paired Library5, reads1’ field. (Only the 'Library5' row can handle single-read files. If a single-read file is entered here, leave the second row for 'Library5' empty.) Multiple single-read files can be consolidated with the Concatenate Multiple Files app (Apps > General Utilities > Text and Tabular Data > Concatenate Multiple Files).
      4. For the sample data:
        In the ‘Paired Library1, reads1’ field, enter Community Data > iplant_training > genome_assembly_soapdenovo > A_Assemble_Reads > fragScSi_1.fq
        In the ‘Paired Library1, reads2’ field, enter Community Data > iplant_training > genome_assembly_soapdenovo > A_Assemble_Reads > fragScSi_2.fq
        In the ‘Paired Library2, reads1’ field, enter Community Data > iplant_training > genome_assembly_soapdenovo > A_Assemble_Reads > shrtjmpScSi_1.fq
        In the ‘Paired Library2, reads2’ field, enter Community Data > iplant_training > genome_assembly_soapdenovo > A_Assemble_Reads > shrtjmpScSi_2.fq
        In the ‘Paired Library5, reads1’ field, enter Community Data > iplant_training > genome_assembly_soapdenovo > A_Assemble_Reads > fragScSi_s.fq
  4. Click on the “Parameters” tab.
    1. In the ‘maximum read length’ field, enter the length of the longest read used for the assembly. (Sample data: enter 101.)
    2. For each read library entered, set the following parameters:
      1. average insert size - This value indicates the average insert size of this library or the peak value position in the insert size distribution figure. (Sample data: Enter 180 for library 1, enter 3500 for library 2, leave this value blank for library 5.)
      2. read orientations - This setting tells the assembler whether the read sequences need to be reverse-complemented. Set 0 for forward/reverse reads, 1 for reverse/forward reads. (Sample data: Enter 0 for library 1, enter 1 for library 2, leave this value blank for library 5.)
      3. assembly steps - This indicator decides in which part(s) the reads are used. It takes values 1 (for a contig assembly), 2 (for a scaffold assembly), or 3 (for both a contig and scaffold assembly). (Sample data: Enter 3 for both library 1 and library 2, enter 1 for library 5.)
      4. order for scaffolding - This integer value ("rank") decides in which order the libraries are used for scaffold assembly. Libraries with the same rank are used at the same time during scaffold assembly. (Sample data: Enter 1 for library 1, enter 2 for library 2, leave this value blank for library 5.)
      5. sequence file format - Indicates the file format for the libraries. Enter q for FASTQ files or a for FASTA files. (For the sample data, all of the read files are FASTQ.)
    3. Enter an odd number into the ‘kmer size’ field, typically half the read length plus 1. (For the sample data, enter 45, which is a little less than half the read length plus 1, since many of the reads may be shorter than 101 nucleotides after trimming.) The 'k-mer size' parameter determines the length of the subsequences (k-mers) into which the reads are split before they enter the assembly process. Smaller k-mer settings provide a higher degree of sensitivity, which is desirable for genomes with a higher degree of repetition. Larger k-mer settings provide a higher degree of specificity, which works better for more heterogeneous genomes. Transcripts with lower read depths are represented better using lower k-mer values, while transcripts with higher read depths are represented better using higher k-mer values. (It is sometimes suggested to use a range of k-mer settings and to merge the results.)
    4. In the 'kmer size range' field, enter either 63 or 127. (Sample data: enter 63.)
  5. Click on "Launch Analysis".
  6. Click on 'Analyses' from the DE workspace and monitor the 'Status' of the analysis (e.g., Idle, Submitted, Pending, Running, Completed, Failed).
    1. Once launched, an analysis will continue whether the user remains logged in or not.
    2. Email notifications report on the progress of the analysis; they can be switched off under 'Preferences'.
    3. If the analysis fails or does not finish within the anticipated time, check these tips for troubleshooting. (Using the sample data, the analysis should be complete in 3-5 hours.)
    4. To re-run an analysis, click the analysis "App" in the 'Analyses' window.
  7. Access analysis results in one of two ways:
    1. In the 'Analyses' window click on the analysis "Name" to open the output folder.
    2. In the 'Data' window, click on your user name, then navigate to the folder that holds the output of the analysis. (Find the output for the sample data at Community Data > iplant_training > genome_assembly_soapdenovo > A_Assemble_Reads > output_for_sample_data.)
  8. The output files will be labeled "SoapdenovoOut". Output files of interest include:
    1. .scafSeq: a FASTA file containing the assembled sequence scaffolds (includes large gaps filled with N’s).
    2. .contig: a FASTA file containing the final assembly of contiguous sequences (only small gaps).
    3. .scafStatistics: a text file listing statistics for the assembled sequence scaffolds and contigs, including the N50 values for the different k-mer settings. (The N50 value indicates that at least 50% of the assembly is in contigs of size N50 or larger.)
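
The library settings entered in steps 3 and 4 can also be written in the plain-text library configuration format described in the SOAPdenovo2 documentation. The sketch below writes the sample-data values in that form. How the DE app builds its configuration internally is not covered in this tutorial, so the mapping from form fields to configuration keys (`max_rd_len`, `avg_ins`, `reverse_seq`, `asm_flags`, `rank`, `q1`/`q2`/`q`) is an assumption, and the helper names `Library` and `render_config` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Library:
    """Settings for one read library (hypothetical helper, not part of the app)."""
    asm_flags: int                     # 'assembly steps': 1, 2, or 3
    reads1: str                        # first (or only) read file
    reads2: Optional[str] = None       # second read file; None for single reads
    avg_ins: Optional[int] = None      # 'average insert size'
    reverse_seq: Optional[int] = None  # 'read orientations': 0 or 1
    rank: Optional[int] = None         # 'order for scaffolding'
    fmt: str = "q"                     # config key prefix: q = FASTQ, f = FASTA


def render_config(max_rd_len: int, libraries: List[Library]) -> str:
    """Return the text of a SOAPdenovo-style library configuration file."""
    lines = [f"max_rd_len={max_rd_len}"]
    for lib in libraries:
        lines.append("[LIB]")
        # Settings left blank in the form are simply omitted.
        for key in ("avg_ins", "reverse_seq", "asm_flags", "rank"):
            value = getattr(lib, key)
            if value is not None:
                lines.append(f"{key}={value}")
        if lib.reads2 is None:
            lines.append(f"{lib.fmt}={lib.reads1}")   # single-read file
        else:
            lines.append(f"{lib.fmt}1={lib.reads1}")  # paired files
            lines.append(f"{lib.fmt}2={lib.reads2}")
    return "\n".join(lines) + "\n"


# Values from the sample-data steps; bare file names stand in for full paths.
sample_libraries = [
    Library(asm_flags=3, reads1="fragScSi_1.fq", reads2="fragScSi_2.fq",
            avg_ins=180, reverse_seq=0, rank=1),
    Library(asm_flags=3, reads1="shrtjmpScSi_1.fq", reads2="shrtjmpScSi_2.fq",
            avg_ins=3500, reverse_seq=1, rank=2),
    Library(asm_flags=1, reads1="fragScSi_s.fq"),
]
config_text = render_config(101, sample_libraries)
print(config_text)
```

The DE app handles this step for the user, so nothing here needs to be run for the tutorial; the sketch only shows how the per-library values fit together.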

Note: Creating useful assemblies generally requires running assemblers multiple times with different k-mer settings. A k-mer setting that is too small produces many short contigs, whereas a setting that is too large produces only a few long contigs and requires deeper sequence coverage and longer reads to guarantee an overlap at every genomic location. Generally, optimal k-mer settings will result in larger N50 values. (For the sample data, the assembly was repeated with k-mer settings of 47 and 49; see the respective folders in Community Data > iplant_training > genome_assembly_soapdenovo > A_Assemble_Reads > output_for_sample_data.) To repeat an analysis, click on the Analyses tab in the DE interface, identify the analysis in the 'Name' column and, for this analysis, click on the app in the 'App' column. This will open a window to re-run the app (preferably with a new analysis name) with changed k-mer settings.
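
The N50 comparison between runs can be sketched as follows. The function names `n50` and `fasta_lengths` are hypothetical, the tiny FASTA text and the run labels are invented for illustration, and for a real run the lengths would come from the .contig or .scafSeq file (or the values could be read directly from .scafStatistics).

```python
import io


def n50(lengths):
    """Return the length L such that sequences of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # no sequences at all


def fasta_lengths(handle):
    """Yield the sequence length of each record in a FASTA file-like object."""
    length = None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


# Toy input (invented): three records of 8, 6, and 3 bases.
toy_fasta = io.StringIO(">c1\nACGTACGT\n>c2\nACGT\nAC\n>c3\nACG\n")
toy_lengths = list(fasta_lengths(toy_fasta))
print(toy_lengths, n50(toy_lengths))  # [8, 6, 3] 6

# Comparing runs: keep the label whose assembly has the largest N50.
n50_by_run = {"run_a": n50(toy_lengths), "run_b": n50([2, 2, 2, 3, 3, 4, 8, 8])}
best_run = max(n50_by_run, key=n50_by_run.get)
print(best_run)  # run_b
```

N50 alone does not show whether an assembly is correct, so it is best read as one indicator when choosing between k-mer settings.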

For general background reading, see:

  • De novo assembly and analysis of RNA-seq data. Nature Methods 7, 909-912 (2010) doi:10.1038/nmeth.1517
