Annotation for Variant Detection Workflow

This used to be housed at Google Docs

Second-Generation Sequencing-based Genotyping

Persona: Andres is a bioinformatics-savvy postdoctoral fellow. He is command-line proficient and writes scripts and advanced analytical routines. He is looking for a cluster environment where there are effective, standard tools and algorithms in place to facilitate his work.
Persona: Blanca is a graduate student who needs to analyse some natural variation sequencing data. She has used the command line before, but would prefer a visual workflow manager interface for her work.
Persona: Carlos is new to second-generation quantitative sequencing. He requires extensive education and desires a web-based interface to interact with his data.
Persona: Dolores is a PI who was trained in molecular biology and is familiar with "traditional" DNA sequencing methods and RNA detection methods. She is driven by biological questions and has little computational expertise. She has used microarrays in her research, but has not been deeply involved in the statistical analysis. She would like to use RNA-seq methods to quantitate gene expression and examine the potential for alternative splicing of her favorite genes. She is most interested in defining the regulatory networks that her favorite genes are a part of and in identifying new components of these networks. The phenotype Dolores is interested in is drought tolerance, and she wants to extend her basic studies in Arabidopsis into a crop that really matters: corn.

Problem Statement: Andres wants to discover sequence variants in his sample relative to an established reference genome sequence.

NHH-is this the overall goal of this group? Do they want to do this in the discovery environment? Let's see what the workflow is for this.

MWV 10-22-2009
1. See 00_NGS_Framework_v03.pdf at the G2P wiki
2. This is a direct goal for Andres, but in the service of a more encompassing project. For the purpose of development, this is an appropriate level of abstraction. The NGS workflow is a tool.
3. Define discovery environment. If you mean a slick GUI system - no. If you mean by accessing a robust interactive command line environment, yes. Not all biologists want GUIs. At this stage in the game, Andres would probably feel constrained by a GUI.

Generation

Andres can generate his own quantitative sequencing data using Illumina, SOLID, 454, or other sources.
Andres can use someone else's quantitative sequencing data derived from Illumina, SOLID, 454, or other sources.
NHH-the discovery environment should serve as a "database" for these sequences?

MWV 10-22-2009
iPlant's foundational infrastructure, upon which this NGS system runs, should provide archiving and curation of these primary data sources. It should also provide the same service for key derived data products arising from the NGS genotyping workflow.

Sequencing data is usually arranged in terms of lanes or channels and is packaged as a "run" (SOLID) or a "flowcell" (Illumina). Flowcells and runs are usually assigned a unique ID by their sequencing facility. There is no nomenclature standard. Channels or lanes are usually specified by number (1-8 for Illumina).

NHH-would a standard need to be implemented? How would Andres be able to search for a particular dataset? Would we need to tag it in some way? Is this where the visualization comes into play?

MWV 10-22-2009
Eventually, some sort of standardization might be needed. Alternatively, data organization would take place entirely based on user-specified tags, and we would impose no structure beyond the minimum to allow the work to take place. I favor more user-centric control in this respect. We are sidestepping a lot of this for "1.0" by assuming that Andres knows exactly what his input file is, and how he needs to treat it, and is basically using our system as a way of accessing cheap cycles and robust analysis.

More than one biological sample can be measured in an individual channel by use of multiplexing facilitated by barcoding, in which case DNA of known sequence is added to the 5' end of the sequencing target. Samples are discriminated by a combination of channel ID and presence of a specific barcode in their sequence.
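
As an illustration of how barcode-based demultiplexing might work, here is a minimal sketch in Python that splits a FASTQ file into per-sample files by exact match to a 5' barcode. The barcode sequences and sample names are hypothetical placeholders, and real demultiplexers also tolerate sequencing errors in the barcode.

    # Minimal demultiplexing sketch: split a FASTQ file into per-sample files
    # based on an exact match to a 5' barcode. Barcodes and sample names are
    # hypothetical placeholders.
    BARCODES = {
        "ACGT": "sample_A",
        "TGCA": "sample_B",
    }

    def demultiplex(fastq_path):
        # One output handle per sample, plus one for unassigned reads.
        handles = {name: open(name + ".fastq", "w") for name in BARCODES.values()}
        handles["unassigned"] = open("unassigned.fastq", "w")
        with open(fastq_path) as fq:
            while True:
                record = [fq.readline() for _ in range(4)]  # FASTQ: 4 lines per read
                if not record[0]:
                    break
                header, seq, plus, qual = record
                for barcode, sample in BARCODES.items():
                    if seq.startswith(barcode):
                        out = handles[sample]
                        out.write(header)
                        out.write(seq[len(barcode):])   # trim barcode from sequence
                        out.write(plus)
                        out.write(qual[len(barcode):])  # and from the quality string
                        break
                else:
                    handles["unassigned"].writelines(record)
        for handle in handles.values():
            handle.close()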

The primary processing workflows for Illumina, 454, and SOLID generate a directory of files corresponding to base calls for the sequences, arranged as described above.

FASTQ is a standard interchange format used by the major sequencing centers and supported by all sequencing vendors. Andres can convert the proprietary formats from Illumina, 454, and SOLID systems into FASTQ.

FASTA is a less expressive interchange format used by the major sequencing centers. Andres can convert the proprietary formats from Illumina or SOLID systems into FASTA. Andres can convert FASTQ to FASTA. Andres cannot convert FASTA to FASTQ, because FASTA carries no per-base quality information.
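
A minimal sketch of the FASTQ-to-FASTA direction, which simply discards the quality line of each record; since that information cannot be recovered, the reverse conversion is not possible.

    # Minimal FASTQ -> FASTA conversion sketch. The quality line (line 4 of
    # each FASTQ record) is discarded, which is why the conversion cannot be
    # reversed.
    def fastq_to_fasta(fastq_path, fasta_path):
        with open(fastq_path) as fq, open(fasta_path, "w") as fa:
            while True:
                header = fq.readline()
                if not header:
                    break
                seq = fq.readline()
                fq.readline()  # '+' separator line, ignored
                fq.readline()  # quality line, discarded
                fa.write(">" + header[1:])  # '@name' becomes '>name'
                fa.write(seq)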

For a given sample, Andres needs to know and be able to specify:
the correspondence between sample identity and channel, barcode (if it exists), and flowcell/run identifier
NHH-What is the significance of his being able to know the channel, barcode and run identifier? Is this for his own data or with regard to the data he would search for?

MWV 10-22-2009
Again, initially, Andres won't be searching but will be working directly on a file he places into a command line discovery environment. He does need to specify some of these terms so that the workflow can proceed properly.

whether sequences in the sample are from single-end or paired-end runs.
the number of cycles for the sample (it will be the same for all samples on a given flow cell).
NHH-Why is this important to Andres?

MWV: so that the workflow can proceed properly.

the range of insert sizes for a paired-end sample.
a 5' adapter that needs to be trimmed.
a 3' adapter that needs to be trimmed.
metadata that describes the primary generation and preparation of the sample
metadata explaining the relationship of the sample to the larger experimental design
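
A sketch of how the per-sample specification above might be captured for the workflow; every field name and value here is hypothetical and intended only to make the list concrete, not to propose a standard.

    # Hypothetical per-sample specification covering the fields listed above.
    # Field names and values are illustrative, not a proposed standard.
    sample_spec = {
        "sample_id": "drought_line_17",
        "flowcell_or_run": "FC12345",       # facility-assigned flowcell/run ID
        "channel": 3,                       # lane/channel number (1-8 for Illumina)
        "barcode": "ACGT",                  # None if the sample is not multiplexed
        "paired_end": True,                 # single- versus paired-end
        "cycles": 36,                       # same for all samples on a flowcell
        "insert_size_range": (200, 400),    # only meaningful for paired-end samples
        "adapter_5prime": None,             # 5' adapter to trim, if any
        "adapter_3prime": "GATCGGAAGAGC",   # 3' adapter to trim, if any
        "generation_metadata": {"platform": "Illumina", "facility": "example"},
        "experiment_metadata": {"design": "drought time course", "replicate": 1},
    }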
NHH-Could I see some examples of this please? Could you point me to any reference sites where this data is currently stored? I will be looking at the Galaxy tutorial, which may answer these questions for me.

Staging sequence data to a remote system

Andres can place sequencing data in his working directory on a remote file system

Transports
Andres can use SFTP (SSH file transfer protocol)
Andres can use FTP (file transfer protocol)
Andres can specify HTTP or HTTPS URIs
Andres can specify an SRA (Short Read Archive) series or sample ID
Andres can specify the ID of an existing data set
Andres can use browser upload (Aspera) NHH- is this open source?
Formats
Individual FASTQ files
A directory of FASTQ files
An archived directory of FASTQ files
Individual FASTA files
A directory of FASTA files
An archived directory of FASTA files
An archived directory of the output from the Illumina/SOLID/454 basecalling workflow
A specially packaged SRA archive of the output from Illumina/SOLID/454 basecalling
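
As one concrete example of the transports listed above, the following sketch stages a local FASTQ file into a remote working directory over SFTP. It assumes the third-party paramiko library and key-based authentication; the host name, username, and paths are placeholders.

    # Sketch: stage a local FASTQ file into a remote working directory over
    # SFTP. Assumes the paramiko library and key-based authentication; host,
    # user, and paths are placeholders.
    import os
    import paramiko

    def stage_fastq(local_path, remote_path,
                    host="hpc.example.org", user="andres",
                    key_file="~/.ssh/id_rsa"):
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        client.connect(host, username=user,
                       key_filename=os.path.expanduser(key_file))
        try:
            sftp = client.open_sftp()
            sftp.put(local_path, remote_path)  # upload the file
            sftp.close()
        finally:
            client.close()

    # stage_fastq("lane3.fastq", "/scratch/andres/lane3.fastq")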

Specifying a reference genome
NHH-will need to see overall workflow for this group.

MWV 10-22-2009
My assumption is that there will be a collection of reference genomes, stored as FASTA, on an iPlant-hosted system. Andres will specify them either directly by filename/path or via a name that we assign to the reference sequence, which hides that bit of complexity from him.

Andres can select a genome against which sequence data will be compared.

Constraints
Reference genome can be a finished genome
Reference genome can be a contig- or scaffold-based assembly
Transports
Andres can use SFTP
Andres can use FTP
Andres can specify HTTP or HTTPS URIs
Andres can specify the ID of an existing reference genome
Andres can use browser upload (Aspera)
Formats
FASTA file
A compressed FASTA file
A directory of compressed or uncompressed FASTA files
An archived directory of compressed or uncompressed FASTA files
A binary formatted genome file in a format such as 2bit
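
A sketch of the "name hides the path" idea from the MWV comment above: a small registry that maps assigned reference names to FASTA locations on the hosted system, while still accepting a direct filename or path. The registry contents are hypothetical placeholders.

    # Sketch: resolve an assigned reference-genome name to its FASTA path on
    # the hosted file system. Registry contents are hypothetical placeholders.
    import os

    REFERENCE_REGISTRY = {
        "arabidopsis_tair": "/data/references/arabidopsis/TAIR.fasta",
        "maize_b73": "/data/references/maize/B73.fasta",
    }

    def resolve_reference(name_or_path):
        # Accept either an assigned name or a direct filename/path.
        if name_or_path in REFERENCE_REGISTRY:
            return REFERENCE_REGISTRY[name_or_path]
        if os.path.exists(name_or_path):
            return name_or_path
        raise ValueError("Unknown reference genome: %s" % name_or_path)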

Primary operations on the sequence data

Andres needs to convert proprietary base call files into a standard format
Andres can trim 3' adapters
Andres can trim 5' adapters
Andres needs to split paired-end read records into separate sequences
Andres needs to filter sequence artifacts
Andres needs to filter optical and PCR artifacts
Andres needs to filter low-quality sequences
Andres needs to partition samples from channels by barcode
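
To make one of the operations above concrete, here is a sketch of 3' adapter trimming that assumes an exact match to the adapter sequence; production trimmers also handle partial matches at the end of the read and allow mismatches.

    # Sketch: trim a 3' adapter from a read, assuming an exact adapter match.
    # Real trimmers also handle partial matches at the read end and tolerate
    # mismatches; this only illustrates the basic operation.
    def trim_3prime_adapter(seq, qual, adapter):
        pos = seq.find(adapter)
        if pos == -1:
            return seq, qual          # adapter not present, read unchanged
        return seq[:pos], qual[:pos]  # drop the adapter and everything after it

    # Example: everything from the adapter onward is removed.
    # trim_3prime_adapter("ACGTACGTGATCGG", "IIIIIIIIIIIIII", "GATCGG")
    # -> ("ACGTACGT", "IIIIIIII")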

Curatorial operations on the primary sample data

Andres can tag a sample with arbitrary keys
Andres can tag a sample using a controlled vocabulary
Andres can remove a tag from a sample
Andres can search for and retrieve samples by tag
Andres can organize samples by tag
Andres can organize samples in a traditional 'directory' structure
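
A sketch of a minimal tag index supporting the curatorial operations above (tag, untag, search and retrieve); the in-memory mapping is purely illustrative of the data structure, not of how tags would actually be stored.

    # Sketch: a minimal in-memory tag index for samples, supporting the
    # curatorial operations listed above. Purely illustrative storage.
    from collections import defaultdict

    class TagIndex:
        def __init__(self):
            self._samples_by_tag = defaultdict(set)

        def tag(self, sample_id, tag):
            self._samples_by_tag[tag].add(sample_id)

        def untag(self, sample_id, tag):
            self._samples_by_tag[tag].discard(sample_id)

        def find(self, tag):
            # Retrieve all samples carrying a given tag.
            return sorted(self._samples_by_tag[tag])

    # index = TagIndex()
    # index.tag("drought_line_17", "drought")
    # index.find("drought")  -> ["drought_line_17"]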

Analytical operations on the sequence data

Andres can perform de novo assembly on sequence reads - this will be discussed in a separate document

Andres can align the reads to a reference sequence

Parameter specification
algorithm: BWA, SOAP, ELAND, BOWTIE, VMATCH
reference sequence
some algorithms (Burrows-Wheeler-based) require pre-processing of the reference sequence
gapped/ungapped
number of allowable mismatches
if paired-end alignment, library insert size range
seed length (some algorithms)
allow iterative matching
allow Smith-Waterman alignment
and more, depending on algorithm
Constraints
CPU time
Memory utilization
Output file size
Outputs
Aligners export proprietary but documented formats, with the exception of BWA, which emits SAM
All alignments are tab-delimited text files sharing some common fields
Points of variability among alignment outputs, besides physical formatting, include
statistical basis and representation of alignment quality
treatment of multiple best matches
treatment of non-matches
representation of opposite-strand matches
0- versus 1-based coordinate system
The common emergent formats are SAM (text-based, requires access to the reference sequence) and BAM (binary, comprehensive representation of alignment results)
Converters exist as part of SAMTOOLS to convert BLAST, SOAP, BOWTIE, ELAND, and MAQ output formats to SAM
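
As a concrete sketch of this stage, the following drives BWA and SAMTOOLS from a script to go from reads and a reference to a sorted, indexed BAM. Exact command-line options differ between tool versions, so treat the calls as an outline rather than authoritative invocations; file paths are placeholders.

    # Sketch: align reads with BWA and produce a sorted, indexed BAM with
    # SAMTOOLS. Options vary by tool version; paths are placeholders.
    import subprocess

    def run(cmd, stdout=None):
        print(" ".join(cmd))
        subprocess.check_call(cmd, stdout=stdout)

    def align_and_convert(reference="ref.fasta", reads="sample.fastq",
                          prefix="sample"):
        # Burrows-Wheeler-based aligners need the reference pre-processed once.
        run(["bwa", "index", reference])
        with open(prefix + ".sai", "wb") as sai:
            run(["bwa", "aln", reference, reads], stdout=sai)
        with open(prefix + ".sam", "w") as sam:
            run(["bwa", "samse", reference, prefix + ".sai", reads], stdout=sam)
        # Convert SAM to BAM, then sort and index for downstream tools.
        with open(prefix + ".bam", "wb") as bam:
            run(["samtools", "view", "-bS", prefix + ".sam"], stdout=bam)
        run(["samtools", "sort", "-o", prefix + ".sorted.bam", prefix + ".bam"])
        run(["samtools", "index", prefix + ".sorted.bam"])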

Andres can extract a reference-based genotype from a sequence alignment
Method
...
Constraints
Outputs
HapMap format (See http://tagzilla.nci.nih.gov/docs.html for details)
Genotype Likelihood Format (GLF; see http://www.1000genomes.org/)
Simple tab-delimited file: reference sequence, position, reference base, called consensus, quality/confidence, depth
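
A sketch of reading that simple tab-delimited output; the column order follows the description above, and the quality threshold is an arbitrary illustration.

    # Sketch: parse the simple tab-delimited genotype output described above.
    # Columns: reference sequence, position, reference base, called consensus,
    # quality/confidence, depth. The quality threshold is arbitrary.
    def read_genotype_calls(path, min_quality=20):
        calls = []
        with open(path) as handle:
            for line in handle:
                if not line.strip() or line.startswith("#"):
                    continue
                fields = line.rstrip("\n").split("\t")
                ref_seq, pos, ref_base, consensus, quality, depth = fields
                if float(quality) < min_quality:
                    continue
                calls.append({
                    "reference": ref_seq,
                    "position": int(pos),
                    "reference_base": ref_base,
                    "consensus": consensus,
                    "quality": float(quality),
                    "depth": int(depth),
                })
        return calls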