Assembling a Genome

With the growth of next-generation sequencing, many researchers are sequencing the genome of their own organism of interest, either to open it up to more intensive study or to contribute to broader efforts in comparative biology.

There are a number of effective whole-genome sequence assemblers, but unless you have great reads, a lot of coverage, and a relatively simple genome (e.g., one without a lot of repeats), the result is unlikely to be whole chromosomes. Genome sequence assembly is a process rather than a single step, but a good assembly is still very useful for many studies and, if nothing else, it's a starting point.

If you haven't done an excellent job of cleaning up your reads, don't even think about starting assembly. Your final sequence accuracy depends on minimizing the misassemblies that are created, and contaminated or error-filled reads make it very easy to create misassemblies or to block connections between sequences. Take a look at the HTProcess read cleanup pipeline if you want a good place to start.

Determining which kmer setting to use

An important consideration for many assemblers is which kmer setting to use. A kmer is a subsequence of k bases: each read is divided into overlapping kmers, and the assembler searches for kmers shared between reads to find matches.

A general rule of thumb is to use a kmer value of approximately half the length of a read, plus one or a little more. The choice is a balance: with a smaller kmer value, each read yields more kmers and it is easier to find matches; a longer kmer value matches more selectively, but yields fewer kmers and fewer matches altogether.
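The trade-off above comes down to simple arithmetic: a read of length L contains L - k + 1 overlapping kmers. The following is a minimal sketch, not part of any assembler. The function names are hypothetical, and rounding up to an odd value is an assumption (many de Bruijn graph assemblers prefer or require an odd kmer value).

```python
def rule_of_thumb_kmer(read_length: int) -> int:
    """Half the read length plus one, bumped to the next odd number."""
    k = read_length // 2 + 1
    # Assumption: prefer an odd k, as many assemblers do.
    if k % 2 == 0:
        k += 1
    return k


def kmers_per_read(read_length: int, k: int) -> int:
    """Number of overlapping kmers of length k in one read."""
    return max(read_length - k + 1, 0)


if __name__ == "__main__":
    read_length = 100  # example read length in bp
    suggested = rule_of_thumb_kmer(read_length)
    print(f"suggested kmer value: {suggested}")
    # Compare a short, a suggested, and a long kmer value.
    for k in (21, suggested, 91):
        print(f"k={k}: {kmers_per_read(read_length, k)} kmers per read")
```

Running it with a few candidate values shows how quickly the number of kmers per read drops as the kmer value approaches the read length.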

The best choice depends on the read coverage of the genome, the accuracy of the reads, the complexity of the genome, and the number of repeats. In other words, try a value, or maybe a few. Often, as long as you stay within a reasonable range of values, you will see only small differences in the N50 value of the assembly.
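To compare trial assemblies, it helps to know how N50 is computed: sort the contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the total assembly length. A minimal sketch follows; the function name and the contig lengths are made up for illustration, and in practice the lengths would come from each assembler's output.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total, taken from longest
    to shortest, first reaches half of the total assembly length."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0  # no contigs


if __name__ == "__main__":
    # Hypothetical contig lengths (bp) from two runs with different kmer values.
    assemblies = {
        "k=41": [100, 80, 60, 40, 20],
        "k=51": [120, 70, 60, 30, 20],
    }
    for label, lengths in assemblies.items():
        print(label, "N50 =", n50(lengths))
```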

Another approach to choosing a kmer value is to use a kmer-counting application. Two of these are available in the DE: HTProcess-kmergenie and HTProcess-jellyfish. Both are used for guiding the assembly.
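To give a feel for what a kmer-counting application reports, here is a toy sketch in plain Python. It is not how jellyfish or kmergenie are implemented: the function names and reads are made up, and it ignores reverse complements and sequencing errors, which real tools have to handle.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every overlapping kmer of length k across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def abundance_histogram(kmer_counts):
    """Map each abundance to the number of distinct kmers seen that many times."""
    return Counter(kmer_counts.values())


if __name__ == "__main__":
    reads = ["ACGTACGT", "CGTACGTA", "TTTTACGT"]  # made-up toy reads
    histogram = abundance_histogram(count_kmers(reads, k=4))
    for abundance in sorted(histogram):
        print(f"{histogram[abundance]} distinct kmers occur {abundance} time(s)")
```

In a histogram like this from real data, kmers that occur only once or twice are often sequencing errors, while the main peak gives an idea of the kmer coverage.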

Genome Assembly Applications

Ray: An easy-to-use yet powerful assembler that can also handle longer reads, such as 454 reads. Make a folder in the DE, put your reads in it, drag it into the input, set the kmer value, and run. That's it. Just make sure your reads are really clean, and also make sure your paired read files are named in a way that makes it easy to identify that they are paired.
AllpathsLG: One of the most demanding genome assemblers to use, yet very powerful. You need at least 2 paired read libraries: one with fragment reads that overlap in the middle (e.g. 100 bp reads with 180 bp spacing), and one with mate pair reads with over 1000 bp spacing. If you don't have these libraries, forget it. If you do, it can produce some of the best assemblies that you can hope for. It uses a built-in error-correction routine, so you may not want to do too much quality trimming first. There are essentially 2 versions of the AllpathsLG app. One is labelled small because it runs on 1 normal node of (currently) the Stampede server at TACC, so it has just 32 GB of RAM available. The regular AllpathsLG app (like most of the assemblers) runs on a largemem node, so the current app on Stampede has 1 TB of RAM to work with. To judge the amount of memory required by your assembly, assume that all your reads need to go into memory, and then add the approximate amount of memory occupied by your assembled genome. This is a rough guess, but may help guide you. Err on the small side if you are thinking of using the AllpathsLG-small app: it will start its job much sooner on average because it uses a normal node, which is not in nearly as short supply. It works very well with bacterial genomes, but other small genomes can also be assembled with it.
Soapdenovo2: An excellent genome assembler for Illumina reads. The paired-end read files must be entered in the proper order as any of the 5 possible library inputs. Any single reads go in the 5th library input1 only. Enter a maximum read length for all of the libraries, set the kmer value, and make sure all of the parameters are set for the libraries that have data in them. Once it has run, you can drag the output directory into Gapcloser to help finish the assembly.
Velvet: An old standard at this point, but still an effective assembler. Velvet can be used with short or long reads, and even with SAM or BAM files (sorted by name) when the assembly is reference-guided. First enter your data into VelvetH with the appropriate settings, including a kmer value, and run. Then drag the Velvet output directory into the input of VelvetG, and set the parameters, e.g. paired end insert lengths, and run.
Newbler: Roche's supported assembler for 454 data. It is a very effective assembler and uses the SFF file format natively.
SPAdes: A small genome assembler that has been popular for bacterial genomes. Works with Illumina, Ion Torrent, Oxford Nanopore and even PacBio data. There are essentially 2 versions of the SPAdes app. One is labelled high-mem because it runs on the largemem node of (currently) the Stampede server at TACC, so it has 1 TB of RAM available. The regular SPAdes-3.8.0 app runs on a single, normal node on the Lonestar 5 server, so it has 64 GB of RAM to work with, which is plenty for most bacterial genome assemblies. To judge the amount of memory required by your assembly, assume that all your reads need to go into memory, and then add the approximate amount of memory occupied by your assembled genome. This is a rough guess, but may help guide you. Err on the small side if you are thinking of using the SPAdes-3.8.0 app: it will start its job much sooner on average because it uses a normal node, which is not in nearly as short supply as the largemem node used by the high-mem app.
HTProcess-jellyfish: A relatively fast kmer-counting program. It provides information about the abundance vs. copy number of different kmers.
HTProcess-kmergenie: Though technically not an assembler, this kmer-counting application is used for guiding assembly, and is an alternative to HTProcess-jellyfish. Both help give you an idea of what your kmer coverage is like.
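The memory rule of thumb given above for AllpathsLG and SPAdes (all of your reads in memory, plus the assembled genome) can be turned into a quick back-of-the-envelope check. This is only a sketch of that rough guess: the function names are hypothetical, one byte per base is an assumption, and real assemblers may need considerably more or less memory than this suggests. The node sizes are the ones quoted above.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def estimate_memory_bytes(total_read_bases, genome_size_bases, bytes_per_base=1):
    """Rough guess: every read base plus every assembled genome base in memory.

    Assumption: each base occupies bytes_per_base bytes (1 by default).
    """
    return (total_read_bases + genome_size_bases) * bytes_per_base


def fits_on_node(estimate_bytes, node_ram_gib):
    """True if the rough estimate is no larger than the node's RAM."""
    return estimate_bytes <= node_ram_gib * GIB


if __name__ == "__main__":
    genome_size = 5_000_000        # e.g. a 5 Mbp bacterial genome
    coverage = 100                 # example read coverage
    read_bases = genome_size * coverage
    estimate = estimate_memory_bytes(read_bases, genome_size)
    print(f"rough estimate: {estimate / GIB:.2f} GiB")
    # Node sizes as quoted in the app descriptions above.
    for label, ram_gib in (("32 GB normal node", 32),
                           ("64 GB normal node", 64),
                           ("1 TB largemem node", 1024)):
        print(label, "->", "fits" if fits_on_node(estimate, ram_gib) else "too small")
```

If the estimate comes out anywhere near the limit of a normal node, treat the smaller app as a gamble and consider the largemem version instead.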
