Discover Variants Using SAM Tools (Workflow Tutorial)

The CyVerse App Store is currently being restructured, and apps are being moved to an HPC environment. During this transition, users may occasionally be unable to locate or use apps that are listed in our tutorials. In many cases, these apps can be located by searching them using the search bar at the top of the Apps window in the DE. To increase the chance for search success, try not searching the entire app name and version number but only the portion that refers to the app's function or origin (e.g. 'SOAPdenovo' instead of 'SOAPdenovo-Trans 1.01'). In critical cases, please report your concern to the iPlant Ask forum or to support@iplantcollaborative.org. Thank you for your patience.

Introduction and overview

Author: Zhenyuan "Jerry" Lu (luj at cshl.edu) / CyVerse, Cold Spring Harbor Laboratory

Goal

The goal of this tutorial is to become familiar with a procedure to detect and call variants from sequence reads using Bowtie and SAM Tools.

The tutorial will take users through the following operations:

Align reads (app: Bowtie-2.2.1--Build-and-Map)
Reformat file (app: SAM to sorted BAM)
Identify variants (app: Calling SNPs INDELs with SAMtools BCFtools)
Verify variants (app: SAMTOOLS-0.1.19 VCF-Utilities varFilter)

Each operation takes less than 5 minutes.

Rationale and Background

Identifying variations among DNA sequences is an essential part of virtually all branches of biological research. This tutorial presents a workflow that is being used to identify single nucleotide polymorphisms (SNPs) and short insertions/deletions (indels).

Next Generation Sequencing (NGS) Large amounts of data being generated at incredibly low cost. NGS approaches involve taking a whole genome, splitting it into millions and millions of very short pieces, and sequencing them all.

SAM format SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments.

SAMtools SAMtools are a series of popular applications that are used in NGS analysis. SAM Tools provide various utilities for manipulating sequence alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per base position format.

Identifying variants using SAMtools involve the following applications:

Bowtie2 or BWA - Align sequence reads to a reference genome (For this tutorial we will be using Bowtie2). Aligned reads are stored in a SAM formatted file.

Samtools mpileup - Collects summary information from BAM files and calculates the genotype likelihoods supported by the aligned reads. Stores likelihoods in binary call format (BCF).

Bcftools - Calls variants using genotype likelihoods generated from mpileup. Stores variant information in the variant call format (VCF).

Vcfutils - Parses out false positives and filters results in a VCF file.

For more information see this page.

Prerequisites

An CyVerse account. (Register for a CyVerse account at http://user.cyverse.org.)
The DE Quick Start tutorial provides an introduction to basic DE functionality and navigation.

Test data

This tutorial uses fasta-formatted sequence reads taken from E. coli. and can be accessed directly in the Discovery Environment in the Data window under Community Data > iplantcollaborative > example_data > bowtie.

Results

The end result of this workflow is a file that is in variant call format (VCF).

Workflow

Operation 1: Align Reads (app: Bowtie-2.2.1--Build-and-Map)

Requires at least two input files, a FASTQ file containing raw sequence data and a FASTA a file containing a reference genome. Bowtie2 builds an index from the reference genome and then aligns the reads against the index. The index is saved in a compressed file (tar.gz). The alignment is saved in a SAM formatted file. (Basic documentation: http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml)

Log into the Discovery Environment: https://de.iplantcollaborative.org/de/.
Open the Bowtie-2.2.1-Build-and-Map app (Apps > Public Apps > NGS > Aligners > Bowtie-2.2.1-Build-and-Map).
1. Name your analysis (Sample data: Leave default name for sample data).
2. Select an "output folder" (Sample data: Leave default folder, "analyses").
Click on the "Reference Index" tab.
1. There are three options to upload a reference index, either SELECT a reference sequence from a drop down menu, INPUT a reference sequence fasta from the CyVerse data store OR INPUT an archived set of index files (also from the data store) for your reference. For sample data, input a reference sequence.
2. For the sample data click "Browse" under the "Reference fasta" field and select Community Data > iplantcollaborative > example_data > bowtie > e_coli.fa.
Click on the "Inputs" tab.
1. Enter a fastq file. (Sample data: Community Data > iplantcollaborative > example_data > bowtie > e_coli_10000snp.fq)
Click "Launch Analysis".
Once the analysis is complete (less than 5 minutes with sample data), select the output file generated (Sample data: "Bowtie-2.2.1-Build-and-Map_analysis1" folder found in "analyses"). The output file is called "BowtieOut.sam". The output is in SAM format, the human readable equivalent of a BAM file. For more information on SAM files see http://samtools.github.io/hts-specs/SAMv1.pdf.

Operation 2: Reformat File (app: SAM to sorted BAM)

The output from the Bowtie2 operation is in SAM format and needs to be converted to a BAM file before being processed by SAMTools. This operation creates a position-sorted BAM file from a SAM alignment file. (Basic documentation: http://samtools.sourceforge.net/samtools.shtml)

Open the SAM to sorted BAM app (Apps > Public Apps > NGS > SAMTools > SAM to sorted BAM).
1. Name your analysis (Sample data: Leave default name for sample data).
2. Select an "output folder" (Sample data: Leave default folder, "analyses").
Click on the "Select input data" tab.
1. Select a SAM file (Sample data: analyses > Bowtie-2.2.1-Build-and-Map_analysis1 > BowtieOut.sam)
Click "Launch Analysis".
Once the analysis is complete (less than 5 minutes with sample data), select the output file generated (Sample data: "SAM_to_sorted_BAM_analysis1" folder found in "analyses"). The output file is called "BowtieOut.sorted.bam", a sorted BAM file created from "BowtieOut.sam". A BAM file is a compressed, binary version of a SAM file. Along with the bam file, the output also contains a file with a "bam.bai" extension. This file contains the index information for the BAM output.

Operation 3: Identify Variants (app: Calling SNPs INDELs with SAMtools BCFtools)

Samtools mpileup generates a binary call formatted (BCF) file from one or multiple BAM files. This tool calculates the genotype likelihoods supported by the aligned reads by scanning each position supported by an aligned read and computes the possible genotypes supported by these reads. Once the BCF file is generated, bcftools uses the genotype likelihoods calculated to call variants. Variants are outputted into a variant call formatted (VCF) file. (Basic documentation: http://samtools.sourceforge.net/mpileup.shtml)

Open the Calling SNPs INDELs with SAMtools BCFtools app (Apps > Public Apps > NGS > Variant Identification > Calling SNPs INDELs with SAMtools BCFtools).
1. Name your analysis (Sample data: Leave default name for sample data).
2. Select an "output folder" (Sample data: Leave default folder, "analyses").
Click on the "SAMTools mpileup - Input data" tab.
1. Select a BAM file (Sample data: analyses > SAM_to_sorted_BAM_analysis1 > BowtieOut.sorted.bam)
2. Select a reference sequence file. (Sample data: Community Data > iplantcollaborative > example_data > bowtie > e_coli.fa)
Click on the "bcftools view - Consensus/variant calling option" tab.
1. Check "SNP Calling".
2. Check "call genotypes at variant sites".
3. Check "output potential variant sites only".
Click "Launch Analysis".
Once the analysis is complete (less than 5 minutes with sample data), select the outpute file generated (Sample data: "Calling_SNPs_INDELs_with_SAMtools_BCFtools_analysis1" folder found in "analyses"). There are two output files, "samtools_mpileup_output" (a BCF file) and "bcftools_view_output" (a raw VCF file). BCF, the binary call format, is the compressed, binary form of VCF, the variant call format. The VCF format is the standard format for storing multiple types of variant data (SNPs, indels, structural variants). VCF consists of a header section and a data section. The header must contain a line starting with one '#', showing the name of each field, and then the sample names starting at the 10th column. The data section is TAB delimited with each line consisting of at least 8 mandatory fields (the first 8 fields in the table below). The FORMAT field and sample information are allowed to be absent. For more information on VCF format see http://vcftools.sourceforge.net/specs.html.

Operation 4: Verify Variants (app: SAMTOOLS-0.1.19 VCF-Utilities varFilter)

Used to filter the list of variants according to some set of specific criteria. Filtering after variant calling can be useful for eliminating false positives. Outputs selected variants in a variant call formatted (VCF) file. (Basic documentation: http://vcftools.sourceforge.net/specs.html)

Open the SAMTOOLS-0.1.19 VCF-Utilities varFilter app (Apps > Public Apps > NGS > Variant Identification > SAMTOOLS-0.1.19 VCF-Utilities varFilter).
1. Name your analysis (Sample data: Leave default name for sample data).
2. Select an "output folder" (Sample data: Leave default folder, "analyses").
Click on the "Input data" tab.
1. Select a VCF file. (Sample data: analyses > Calling_SNPs_INDELs_with_SAMtools_BCFtools_analysis1 > bcftools_view_output)
Click "Launch Analysis".
Once the analysis is complete (less than 5 minutes with sample data), select the output file generated (Sample data: "SAMTOOLS-0.1.19_VCF-Utilities_varFilter_analysis1" folder found in "analyses"). The output file is called "Outputfile", a filtered VCF file.

The .vcf file will have the following columns:

Next Steps

Generating a .vcf file with variant data is just the first step. Here are some suggestions for what could be done with the data generated in this tutorial.

Visualize Variant Data

A common use for variant information is to visualize the data. Utilizing only a few steps, the information in .vcf files can be viewed in a number of genome browsers, such as:

GBrowse (Documentation: http://gmod.org/wiki/GBrowse)
When viewing .vcf files in GBrowse, convert the output file into GFF3 format using the VCF to GFF3 app in the DE (Public Apps > NGS > Variant Identification > VCF to GFF3). Depending on which database/GBrowse installation is used, the output file may require editing the data of the first column. For instance, in order to read an output file on GBrowse in TAIR, the chromosome numbers in the first column need to be preceded by "chr" as that is what the GBrowse installation "expects to see."

Databases that use GBrowse include:

Flybase: Drosophila melanogaster genomics portal.
Maize GDB: corn genomics portal.
TAIR: Arabidopsis thaliana genomics portal.
Wormbase: Caenorhabditis elegans genomics portal.

JBrowse (Documentation: http://gmod.org/wiki/JBrowse)
When viewing .vcf files in JBrowse, convert the output file into GFF3 format using the VCF to GFF3 app in the DE (Public Apps > NGS > Variant Identification > VCF to GFF3).

Databases that use JBrowse include:

Phytozome: database for multiple plant genomes at JGI in Walnut Creek, CA. JBrowse can be found under "Tools" at the top of the page.
CoGe: comparative genomics platform, powered by CyVerse. To view variants in CoGe, click on "What do you want to do?," then on "LoadExperiment" (Documentation). Once the data is loaded click on "Continue to ExperimentView". On the ExperimentView page, click on "View" to be taken to the genome browser.

WebApollo (Documentation: http://gmod.org/wiki/WebApollo)
An extension for JBrowse designed to generate and edit sequence annotations. Requires conversion of .vcf file into GFF3 format.

Integrative Genomics Viewer (IGV; Documentation: http://www.broadinstitute.org/software/igv/home)
Prior to uploading to IGV, .vcf output files need to be indexed. See instructions here.

UCSC Browser (Documentation: can be found here and here)
A genome browser that can visualize files in both VCF and GFF3 format. Files in VCF require indexing by using tabix.

Ensembl (Documentation: http://www.ensembl.org/index.html)
Another genome browser that can display files in both, VCF and GFF3 format.

Analyze Variant Data

VCFtools (Documentation: http://vcftools.sourceforge.net/)
A program package that can process .vcf files. Some basic operations performed by VCFtools include:

Filter out specific variants
Compare files
Summarize variants
Convert to different file types
Validate and merge files
Create intersections and subsets of variants

VCFtools is can also calculate basic statistics such as allele frequencies, FST, and Tajima's D.

PLINK (Documentation: can be found here and here)
PLINK is a free, open-source whole genome association analysis toolset, designed to perform a range of basic, large-scale analyses in a computationally efficient manner. PLINK format (PED/MAP) is often used to look at SNP data in population genetics and can be used for several tools, such as the clustering program ADMIXTURE or PCA (part of the EIGENSOFT package). PLINK is also useful for running GWAS (Genome-wide Association Studies).

R/Bioconductor (Documentation: can be found here, here and here)
There are multiple R packages that can read data from .vcf files in order to annotate variants. There are also R packages available to run GWAS and QTL studies.

MATLAB (Documentation: can be found here and here)
A high-level language and interactive environment for numerical computation, visualization, and programming. MATLAB has the capability to process and annotate GFF files.