Backgrounder, NGS Alignment

On Alignment of Second-Generation Sequence Reads

Many analytical applications of DNA sequencing rest on the ability to identify the best match between a given sequence read and a reference sequence (genome, contig, etc.). If this is done for every read in a library, the distribution of reads along the positions to which they align in the reference can be interpreted quantitatively, provided the user knows how and by what method the sequence reads were produced.
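
As a toy illustration of this quantitative interpretation, the sketch below (Python; the input values are hypothetical) tallies per-base read depth from a list of alignment start positions, which is the raw material for coverage-based analyses.

    # Toy per-base coverage counter: given 0-based alignment start
    # positions and a fixed read length, tally how many reads cover
    # each base of a reference. Input values here are hypothetical.

    def coverage(starts, read_len, ref_len):
        depth = [0] * ref_len
        for s in starts:
            for pos in range(s, min(s + read_len, ref_len)):
                depth[pos] += 1
        return depth

    # e.g. three 5-base reads aligned to a 20-base reference
    print(coverage([0, 3, 3], 5, 20))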

DNA sequence alignment was initially implemented as an optimal dynamic-programming solution: Needleman-Wunsch for global alignment and Smith-Waterman for local alignment. These algorithms are computationally intensive and are suited to pairwise alignment of sequences. The BLAST and WU-BLAST algorithms were developed to allow a query sequence to be aligned against a large library of candidate sequences, returning the best hits. BLAST performs local alignments based on the identification of short seed matches, which are then algorithmically extended; it is faster than Smith-Waterman or Needleman-Wunsch for pairwise alignment, but does not guarantee an optimal alignment over the entire length of the sequences. SSAHA and BLAT were developed to handle larger data sets; BLAT is faster than BLAST but less sensitive. For data sets with tens of thousands of query sequences and anywhere from one to a few million subject sequences, BLAST and BLAT provided quite acceptable performance.
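
For concreteness, here is a minimal Python sketch of the Smith-Waterman recurrence, returning only the best local alignment score (the scoring values are illustrative, not taken from any particular aligner). The full m-by-n dynamic-programming table it builds is exactly what makes these methods computationally intensive for large inputs.

    # Minimal Smith-Waterman local alignment score. Scoring
    # parameters (match/mismatch/gap) are illustrative defaults.

    def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
        rows, cols = len(a) + 1, len(b) + 1
        H = [[0] * cols for _ in range(rows)]   # (m+1) x (n+1) DP table
        best = 0
        for i in range(1, rows):
            for j in range(1, cols):
                diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
                # local alignment: scores never drop below zero
                H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
                best = max(best, H[i][j])
        return best

    print(smith_waterman("ACACACTA", "AGCACACA"))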

However, the advent of Illumina and SOLiD sequencing, which routinely generate 10-150 million short (36-100 base) reads in a single 4-day run, and the deployment of multiple sequencing instruments per Sequencing Center, means that literally billions of sequences per year can be generated, almost all of which need to be aligned against a reference sequence. So, why not scale up BLAST or BLAT by adding more CPUs to the clusters they run on? There are a couple of reasons why this is impractical: first, both algorithms are tuned to align longer sequences (500 bp and up). Their parameters can be adjusted to align short reads, but doing so severely impacts speed. Second, these algorithms are designed to identify similar sequences from a large pool of subject sequences, rather than to perform alignments against a small collection of reference chromosomes or contigs.

A whole series of short-read-optimized aligners has sprung up in recent months; they are catalogued in this SEQanswers thread: http://seqanswers.com/forums/showthread.php?t=43. Some focus on precision, others on speed, and others on providing specific types of alignment in support of particular downstream analyses. The general consensus in the informatics community is that BWA and BOWTIE are the best open-source general-purpose aligners, while SOAP2 and Novoalign are the best closed-source ones. Much discussion has ensued in the NGS workgroup about which aligners to support, and the conclusion has been that any aligner whose output is either in the universal SAM format or can be converted to SAM with the 'samtools' utilities is a reasonable target for inclusion. To start, however, the NGS group has chosen to adopt BWA and BOWTIE, with preference given to BWA if a choice must be made.
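
As a sketch of what such a SAM-centric workflow looks like in practice, the Python fragment below drives a classic BWA pipeline (bwa aln / bwa samse) and then uses samtools to convert and sort the output. File names are hypothetical, and samtools option syntax varies somewhat between versions, so treat this as an outline rather than a drop-in script.

    # Sketch of a BWA-to-sorted-BAM pipeline driven via subprocess.
    # Inputs (ref.fa, reads.fq) are hypothetical placeholders.
    import subprocess

    ref, fastq = "ref.fa", "reads.fq"

    def run(cmd, stdout=None):
        print(" ".join(cmd))
        subprocess.run(cmd, stdout=stdout, check=True)

    run(["bwa", "index", ref])                               # build the reference index once
    with open("reads.sai", "wb") as sai:
        run(["bwa", "aln", ref, fastq], stdout=sai)          # align reads (writes SAI to stdout)
    with open("aln.sam", "wb") as sam:
        run(["bwa", "samse", ref, "reads.sai", fastq], stdout=sam)  # emit SAM
    with open("aln.bam", "wb") as bam:
        run(["samtools", "view", "-bS", "aln.sam"], stdout=bam)     # SAM -> BAM
    run(["samtools", "sort", "-o", "aln.sorted.bam", "aln.bam"])    # coordinate sort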

Alignment speeds for NGS algorithms vary by more than an order of magnitude. The fastest can align a single channel (12 million reads) of Illumina sequence to a human-sized genome in 2-4 hours, while the slowest (but most sensitive) require a day or more of CPU time. Fortunately, NGS alignment is data-parallel: the list of query sequences can be broken into chunks and distributed to compute nodes, each running its own copy of the algorithm, and the results combined into a single output once all nodes have finished. A key issue in data parallelization is choosing how many nodes to use. Too few, and much of the benefit of parallelization is lost to long run times; too many, and efficiency is lost in the I/O overhead of distributing a small job across many nodes, each of which computes for only a few seconds. Sensible job sizing can be implemented given the alignment speed of each algorithm as a function of a) the size of the target genome, b) the length of the query sequences, and c) the number of reads to be aligned, as sketched below.
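
A back-of-the-envelope version of that job-size estimate might look like the following. The per-node alignment rate and target runtime are placeholders, not benchmarks: in practice the rate would be measured for each aligner as a function of genome size and read length.

    # Rough job-sizing sketch: pick a chunk size so each node runs
    # long enough to amortize I/O and startup, given a measured
    # per-node alignment rate. Numbers below are placeholders.

    def plan_chunks(total_reads, reads_per_sec_per_node,
                    target_secs=1800, max_nodes=100):
        chunk = reads_per_sec_per_node * target_secs   # reads per node
        nodes = -(-total_reads // chunk)               # ceiling division
        return min(nodes, max_nodes), chunk

    # e.g. one Illumina channel at a hypothetical 1,500 reads/sec/node
    nodes, chunk = plan_chunks(total_reads=12_000_000,
                               reads_per_sec_per_node=1500)
    print(f"{nodes} nodes, ~{chunk} reads per chunk")

Aiming for chunks of roughly half an hour of compute, as here, is one way to keep the distribution overhead small relative to useful work; the right target depends on the cluster's scheduling and I/O characteristics.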

An example of running one of these aligners in a data-parallel environment is posted at the NGS Confluence site.