Narrative, Variant Detection

Variant Detection Workflow

The scientific objective of this workflow is to use sequencing reads derived from an individual member of a species whose genome has a known sequence to quantitatively identify differences relative to the reference sequence. These differences can be further interpreted in several subsequent workflows. For the time being, let us assume that determining an individual's genome sequence relative to a control is an atomic action within the G2P conceptual framework.

Detection of variations is based on interpretation of alignment of sequencing reads to a reference sequence. For each position in the genome, the null hypothesis is that the alignment algorithm will be able to stack sequence read alignments atop the reference in such a way that every base aligns at every position. This is confounded by:

Presence of sequencing errors in the reads themselves
Genuinely different nucleotide bases in the sample relative to the reference
Presence of extra bases in the sample relative to the reference (real or artifactual)
Absence of some bases in the sample sequence relative to the reference (real or artifactual)

Algorithms exist (MAQ, implemented in SAMtools) which use the alignment to call a consensus base at each position in the genome relative to the reference. The MAQ algorithm calls the base which maximizes the posterior probability and calculates a phred-scaled quality at each position along the consensus sequence. SAMtools basecaller is also capable of detecting small (some fraction of the sequence read size) insertions and deletions relative to the reference sequence. SAMtools is a well-documented community project with wide adoption. The NGS group advocates using it as directly as possible so that advances and updates in the algorithms can be incorporated into future iterations of the iPlant CI.

Turning to variant detection as a workflow: let's revisit Andres:

Andres has a FASTQ file containing his sequencing sample, which is derived from a maize plant. It is the product of the foundational preprocessing and quality control workflow. The FASTQ file can thus be assumed to be Sanger quality-scaled and all nucleotide sequence in it can be considered potentially biologically relevant. Andres' FASTQ file has 3e+7 51 bp reads in it, after QC and pre-processing.

Andres must align the sequence reads to his reference genome. For the purposes of running most alignment algorithms, this means pointing to a simple multi-entry FASTA file where each entry represents a single chromosome, scaffold, or contig. Conceivably, Andres could create his own reference genome sequence via another workflow, but for now we will consider the case that he may select from reference genomes accessible via the Discovery Environment. If metadata exists that describes the sample's source organism, this could be used to suggest an appropriate reference genome for Andres to align against.

The NGS group has indicated that a two-step alignment is probably most appropriate for variant detection. The first alignment is essentially a bulk alignment using thw BWA aligner. Reads which do not map to the reference genome via BWA will be mapped using the free, but not open source Novoalign algorithm. BWA is a sophisticated Burrows-Wheeler seed/extend algortithm capable of even limited gapped alignment, lack of which is a commonly cited weakness of BW-based systems. It is extremely fast and sensitive. However, Novoalign, which provides the only Needleman-Wunsch global alignment algorithm in NGS, is apparently more capable to align reads containing errors or indels. From personal experience (MWV), integrating input and output of these two aligners should be trivial since they accept the same inputs, use the same general command and execution model, and support the universal SAM format for output. Configuring them both to run in a cluster configuration should be trivial as well.

Thus, Andres will perform an Alignment to Genome step for variant detection, which will include a primary alignment with BWA outputting a SAM alignment, interpretation of the SAM report to identify and subselect from the sample library unmapped reads, alignment of those reads via Novoalign outputting to SAM format, and simple merging of the BWA- and Novoalign-generated SAM into a single, unified alignment SAM. Part of the alignment process is specifying the parameters via which the algorithms are invoked as well as specifying a policy that governs reporting of alignments based on their quality, repeat context, etc. Reasonable defaults can be established, while the user can be (perhaps in the future) provided opportunity to change these parameters and policies. Default behavior could be based on sample and experiment metadata provided upstream.

This primary SAM alignment file is a derived data product of great interest to some users. It also represents the sum of a non-trivial amount of calculation in a robust and lossless format. NGS supports treating the SAM file as an intermediate deliverable, suitable for archiving and retrieval at a later date. It should be accessible, as a file, from the outside via a simple URI. Cast in terms of our user persona: Andres downloads the SAM file derived from his genome alignment so that he can test a new algorithm he has been thinking about... but his advisor will be annoyed that he's not doing variant analysis using iPlant's proven, tested Discovery Environment, so...

Andres now needs to call bases in the sample by comparing the SAM output to the reference genome, and he wishes to use the MAQ basecalling model implemented in SAMtools. Andres must specify the following parameters to 'samtools pileup' to generate a draft set of basecalls:

Number of haplotypes in the sample, which can be reasonably suggested from the metadata but must be user-configurable
Expected fraction of differences between a pair of haplotypes. A default value for specific organisms can be pre-selected but this should be user-configurable
Phred-scaled expected probably of finding an insertion/deletion. A default value for specific organisms can be pre-selected but this should be user-configurable.

In addition, the basecaller needs to be passed the SAM file, and the FASTA formatted reference sequence.
The output of basecalling is a pileup file, which is a verbose, text-based format describing the base distribution and mapping quality at each position in the reference sequence. Pileup is a documented format, used by some groups in Statistical Genetics. The pileup contains base calls for entire set of bases in the reference, but includes some errors. As discussed with the SAM file, this primary pileup file may be considered a deliverable data product. It should be associated with provenance and metadata.

Andres currently filters the pileup via simple logic implemented by a combination Perl/Awk script (see Samtools page on SNP calling). Andres must be able to specify a minimum base quality above or below the reasonable and suggested default of Phred-scaled 20. He may need to specify a value for the neighborhood filtering step (we are still researching this). The result is that only basecalls likely to represent real, biologically validatable SNPs are reported in a new pileup file. A refined variation-specific pileup is the major deliverable data product for this workflow. It should be associated with provenance and metadata.

NOTE 10/30/09: There is ongoing discussion about the algorithm/filter using to call bases, as it's more complicated in plants than in some other organisms. A fair statement is that the pileup will be filtered via one of many possible approaches to produce a list of variants. The list of variants may be in the pileup or VCF format, depending on the outcome of some upcoming discussions.