Aligners are commonly used to analyze sequencing data. An alignment can show where a genome sequencing read falls on a reference genome sequence, and can help determine whether the reads contain sequence polymorphisms or how similar a newly sequenced genome is to a previously sequenced one. Alignments can help identify the genes that produce the RNA sequences found in RNA-Seq data and, with an exception or two, are essential to determining gene expression levels for gene products.
There are many sequence aligners, and their differing properties determine how they are commonly used. Some are designed for fast matching of short, accurate NGS reads to a genome reference. Some are adapted to finding splice sites and exons. Many are fast, but the fastest implementations are the HPC versions that run on an XSEDE supercomputer at the Texas Advanced Computing Center. These are the best or only choice for mapping very large sequence files, e.g. 25 GB and larger. Bowtie2, BWA-Mem, GeneSeqer, GMAP, GSNAP and Tophat (HTProcess_tophat2) all have HPC versions for those working on larger sequencing projects.
List of Sequence Aligners
- Blast. "The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families." NCBI's Blast (info here) is one of the oldest bioinformatics tools, but it is still often the preferred choice for scientists looking to identify a new sequence or learn what it is similar to. It comes in several flavors: blastn (nucleotide to nucleotide), blastp (peptide to peptide), blastx (search the 6 translated reading frames of a nucleotide query against a peptide database), tblastn (search a translated nucleotide database with a peptide query), and tblastx (search a translated nucleotide query against a translated nucleotide database). Blast is an especially sensitive gapped mapping tool, but it is also one of the slowest aligners. The versions that use translation, e.g. the very popular blastx, are especially slow. There are usually more efficient (read: less wasteful of our precious computational time) and therefore better choices than the translating blast versions. Before you ask why Blastx is not available in the Discovery Environment, read the previous statement and then check out this tutorial. Blast offers a long list of alignment output formats; some include a visual representation of the alignment and some are tab-delimited. SAM file output is not one of the choices.
- Blat. The Blast-Like Alignment Tool (info here) differs from Blast in that it is mostly used for DNA to DNA and peptide to peptide mapping, and it is faster. It is also not as sensitive: it requires higher identity between the query and target sequences. It does not require a database to be built before searching; instead it indexes a genome or other reference at the beginning of an alignment job. Blat produces its own informative, tab-delimited output format (psl), which can often be used with genome mapping/visualization software. The psl file can also be crudely converted to the SAM format, which is very commonly used. Blat is a good choice if you need to align, for example, a long read or transcript sequence against a long genome sequence. It is slower than the fast short-read mappers, but better at dealing with gaps and long sequences.
- Bowtie, Bowtie2. Bowtie is the first main version of these two aligners. See the manual here. It is fast, does not produce gapped alignments, and is sensitive to errors at times. It was designed especially for mapping short NGS reads to genome sequences. Bowtie2 (manual here) is considered even faster than the first Bowtie, and it adds support for gapped alignments. It can also map much longer reads, even thousands of bases long. Bowtie and Bowtie2 produce a SAM format output file, a detailed, tab-delimited file commonly used for RNA-Seq studies and variant analysis. The binary version of the SAM file (BAM file) is often the preferred form to work with.
- BWA (BWA-backtrack), BWA-SW, BWA-MEM (manual here). These aligners are designed to map sequences accurately to a close-matching large reference sequence. BWA-backtrack is intended for reads up to 100 bp. The other two versions can handle much longer reads and chimeric alignments. BWA-MEM is the newest version and is considered faster and more accurate. BWA-MEM is available in 2 apps in the Discovery Environment: one runs on the main iPlant Condor servers, and one runs on an XSEDE system server at the Texas Advanced Computing Center.
- GeneSeqer. A spliced alignment tool (info here) intended for mapping transcripts, especially ESTs, to a genome sequence. It uses a splice prediction tool to help find splice sites, and its very informative output groups the exons for each gene that is found. GeneSeqer also includes scripts to convert its output to .gff format, which provides a great start on building an annotation for a newly studied genome and transcriptome.
- GMAP. The Genomic Mapping and Alignment Program (info here) is another spliced alignment tool for mapping ESTs and transcripts to a genome. It also provides a .gff format output file and, again, is a great start on building an annotation for a newly studied genome and transcriptome. More info here: gmap_build-2018-03-25.
- GSNAP. The Genomic Short-read Nucleotide Alignment Program (info here) is a relative of GMAP. It is considered to be exceptionally fast, tolerates some gapped alignment, and can include SNPs, bisulfite conversions (for detecting C-methylation in bisulfite-treated DNA), and adenine-to-inosine modifications in its index to increase the sensitivity of mapping. GSNAP has a SAM output format. GSNAP is a good choice for RNA-Seq studies and variant analysis. The Discovery Environment app for this tool runs on 8 cores of an XSEDE system server, so it should be a good, fast choice for many users. See more info here: Using the GSNAP aligner.
- Stampy. A short-read mapper (info here), Stampy is a good general-use mapper that is set apart by its ability to map across species, or in other cases where matching is a little more difficult. It also has a SAM output format, which allows it to feed into many different downstream analyses.
- Star. This aligner (info here) is considered exceptionally fast, and well-suited to mapping RNA-Seq reads to a genome reference. It can be used for finding splicing in transcript sequences. It also has a SAM output format.
- Tophat2. A key component of the Tuxedo RNA-Seq workflow (info here), Tophat2 uses Bowtie2 to find matches and splice sites. It keeps a database of splice junctions and produces a list as an output, as well as a BAM output. The output is intended to be used with Cufflinks for Tuxedo's quasi-assembly process, and then the BAM files are used with the .gtf format assembly to determine differential expression with the CuffDiff tool.
- HiSat2. HiSat2 (info here) is a newer, more advanced mapper compared to Tophat2, and the Tophat2 page says: "TopHat has entered a low maintenance, low support stage as it is now largely superseded by HISAT2". HiSat2 is approximately as fast as GSNAP.
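
Several of the aligners listed above (Bowtie2, GSNAP, Stampy, Star) write SAM output, the tab-delimited format mentioned throughout this page. As a rough illustration of what that file holds, here is a minimal Python sketch that reads the first six mandatory SAM columns and uses the FLAG bits to count mapped and reverse-strand reads. The helper names (SamRecord, parse_sam_line, summarize) and the example reads are hypothetical and written only for this illustration; they are not part of any aligner or of the Discovery Environment. For real projects, an established tool such as samtools is the usual choice.

```python
from typing import Iterable, NamedTuple


class SamRecord(NamedTuple):
    """First six mandatory SAM columns (hypothetical helper type)."""
    qname: str   # read name
    flag: int    # bitwise FLAG
    rname: str   # reference sequence name, "*" if unmapped
    pos: int     # 1-based leftmost mapping position, 0 if unmapped
    mapq: int    # mapping quality
    cigar: str   # CIGAR string, "*" if unavailable


def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line; SAM requires at least 11 tab-separated fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 fields, got {len(fields)}")
    return SamRecord(fields[0], int(fields[1]), fields[2],
                     int(fields[3]), int(fields[4]), fields[5])


def is_mapped(record: SamRecord) -> bool:
    """FLAG bit 0x4 is set when the read is unmapped."""
    return not (record.flag & 0x4)


def is_reverse_strand(record: SamRecord) -> bool:
    """FLAG bit 0x10 is set when the read aligns to the reverse strand."""
    return bool(record.flag & 0x10)


def summarize(lines: Iterable[str]) -> dict:
    """Count total, mapped, and mapped reverse-strand reads."""
    counts = {"total": 0, "mapped": 0, "reverse": 0}
    for line in lines:
        # Header lines start with "@"; skip them and any blank lines.
        if line.startswith("@") or not line.strip():
            continue
        record = parse_sam_line(line)
        counts["total"] += 1
        if is_mapped(record):
            counts["mapped"] += 1
            if is_reverse_strand(record):
                counts["reverse"] += 1
    return counts


# Made-up example data: one header line and three reads.
EXAMPLE_SAM = [
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "read2\t16\tchr1\t250\t42\t8M\t*\t0\t0\tTTGCAAGC\tIIIIIIII",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tGGGGCCCC\tIIIIIIII",
]

if __name__ == "__main__":
    print(summarize(EXAMPLE_SAM))
```

To try this on a real file, the same function can be given an open file handle, e.g. `summarize(open("alignments.sam"))`, where the file name is a placeholder. A BAM file is binary and would first need to be converted to SAM text.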