FALCON (Small Genomes) 0.4.2

The FALCON diploid assembler is an exciting tool, and TACC and CyVerse have worked to install it on Stampede, currently the 10th fastest supercomputer in the world. Each job takes advantage of 30 nodes (30x16=480 cores) and optimized data access to finish your assembly as fast as possible. The FALCON workflow consists of four main sections:

Error Correction - EC

Because Pacific Biosciences reads have a high error rate, primary reads are first aligned pairwise against the other reads and then corrected through consensus. FALCON performs the pairwise alignment in parallel using daligner. The parameters for this step are prefixed with EC and are explained below.

| Label | Default Value | Description |
|---|---|---|
| Minimum primary read length | 12000 | All reads longer than 12,000 base pairs (bp) are corrected. Depending on the distribution of read lengths in your data, you may want to set this value lower for older protocols and higher for future protocols. |
| Minimum read length for pre-assembly | 500 | Align reads longer than 500 bp to the primary (12,000 bp) reads for consensus correction. Depending on the quality of your run, you may want to set this lower if the bulk of your reads are smaller than 500 bp. |
| Maximum number of occurrences for a k-mer during pre-assembly | 16 | If a k-mer occurs more than 16 times in a 200 megabase block, that k-mer is no longer used for seeding. This essentially masks low-complexity regions. This is a difficult parameter to choose and may require tuning since daligner finds the optimal k. |
| Minimum sequence correlation rate during pre-assembly | 0.7 | Only use alignments with greater than 70% correlation for consensus correction. Lowering this value could lead to over-correction, and raising it may leave reads uncorrected. |
| Minimum local alignment length during pre-assembly | 1000 | Require that the local alignment between the primary and secondary reads be at least 1000 bp. This should ideally be less than or equal to the "Minimum read length for pre-assembly". |
| Minimum k-mer base matches for pre-assembly | 35 | The primary and secondary reads must share a 35 bp contiguous run of k-mers for alignment. Lowering this will significantly increase computation time. Increasing it assumes you have at least one large stretch of perfect identity. |
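
Before changing the two length cutoffs, it can help to see how much of your data falls above each one. The following is a minimal sketch and not part of FALCON: the function name `summarize_cutoffs` and the sample read lengths are hypothetical, and "longer than" is applied as worded in the table above.

```python
def summarize_cutoffs(read_lengths, primary_min=12000, preassembly_min=500):
    """Report how many reads and bases pass each length cutoff.

    primary_min     -> "Minimum primary read length" (reads that get corrected)
    preassembly_min -> "Minimum read length for pre-assembly" (reads that are
                       aligned to the primary reads)
    """
    # "Longer than", following the wording of the parameter descriptions.
    primary = [n for n in read_lengths if n > primary_min]
    usable = [n for n in read_lengths if n > preassembly_min]
    total_bases = sum(read_lengths)
    return {
        "primary_reads": len(primary),
        "primary_bases": sum(primary),
        "usable_reads": len(usable),
        "primary_base_fraction": sum(primary) / total_bases if total_bases else 0.0,
    }


# Hypothetical read lengths in bp, for illustration only.
lengths = [300, 800, 5000, 12500, 15000, 20000]
print(summarize_cutoffs(lengths))
```

If only a small fraction of your bases sits in primary reads, lowering "Minimum primary read length" is one option to consider.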

Falcon Consensus - FC

After reads have been pairwise aligned to each other with daligner, error correction continues by correcting bases and major indels through candidate consensus with fc_consensus.py and LA4Falcon (manual). The parameters for this step are prefixed with FC and are explained below.

| Label | Default Value | Description |
|---|---|---|
| Minimum identity for correction | 0.7 | Require sequences have a minimum alignment identity of 0.7 to be used for correction. Set this lower if you expect many indels in your data. |
| Minimum coverage for consensus | 4 | Require at least 4 reads to break the consensus. |
| Maximum reads for consensus | 200 | For every primary read, use the 200 longest alignments for consensus correction. The maximum value for this setting is currently 400. |
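
The three FC parameters can be pictured as a filter, a ranking, and a coverage check. The sketch below is a simplified illustration, not FALCON's implementation: the function names and tuple layout are hypothetical, and it reads "break the consensus" as splitting the corrected read wherever per-base coverage drops below the minimum, which is an interpretation of the description above.

```python
def select_for_consensus(alignments, min_identity=0.7, max_reads=200):
    """Pick the alignments used to correct one primary read.

    alignments: list of (read_id, aligned_length, identity) tuples.
    Keeps alignments with identity >= min_identity, then the max_reads longest.
    """
    passing = [a for a in alignments if a[2] >= min_identity]
    passing.sort(key=lambda a: a[1], reverse=True)
    return passing[:max_reads]


def well_covered_segments(coverage, min_coverage=4):
    """Return half-open (start, end) intervals where coverage >= min_coverage."""
    segments, start = [], None
    for i, depth in enumerate(coverage):
        if depth >= min_coverage and start is None:
            start = i  # a well-covered stretch begins
        elif depth < min_coverage and start is not None:
            segments.append((start, i))  # coverage dropped: close the stretch
            start = None
    if start is not None:
        segments.append((start, len(coverage)))
    return segments


# Hypothetical alignments and per-base coverage, for illustration only.
hits = [("a", 9000, 0.65), ("b", 8000, 0.80), ("c", 11000, 0.90), ("d", 5000, 0.75)]
print(select_for_consensus(hits, max_reads=2))
print(well_covered_segments([5, 5, 2, 6, 6, 6, 1]))
```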

Overlap Assembly - OV

The corrected reads (preads) are aligned pairwise with daligner once again to find the overlaps used to construct the final string-graph assembly. The parameters for this step are prefixed with OV and are explained below. Although this step takes the same parameters as the error correction step, many of the default values have been changed to be more stringent for the final assembly.

| Label | Default Value | Description |
|---|---|---|
| Minimum primary read length | 12000 | All reads longer than 12,000 base pairs (bp) are candidates for final overlap. Depending on the distribution of read lengths in your data, you may want to set this value lower for older protocols and higher for future protocols. |
| Minimum read length for overlap | 500 | Align reads longer than 500 bp to the primary (12,000 bp) reads for overlap assembly. Depending on the quality of your run, you may want to set this lower if the bulk of your reads are smaller than 500 bp. |
| Maximum number of occurrences for a k-mer during overlap | 32 | If a k-mer occurs more than 32 times in a 200 megabase block, that k-mer is no longer used for seeding. This essentially masks low-complexity regions. This is a difficult parameter to choose and may require tuning since daligner finds the optimal k. |
| Minimum sequence correlation rate during overlap | 0.96 | Only use alignments with greater than 96% correlation for overlap assembly. Lowering this value could lead to an over-connected string-graph, but may be necessary for reads with high error rates. |
| Minimum local alignment length during overlap | 500 | Require that the local alignment between the primary and secondary reads be at least 500 bp. This should ideally be less than or equal to the "Minimum read length for overlap". |
| Minimum k-mer base matches for overlap | 60 | The primary and secondary reads must share a 60 bp contiguous run of k-mers for alignment. Lowering this will significantly increase computation time. Increasing it assumes you have at least one large stretch of perfect identity. |
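
To see at a glance which defaults differ between the EC and OV steps, the values from the two tables can be compared side by side. This is only a bookkeeping sketch: the dictionary keys are hypothetical shorthand for the labels above, not names used by the app or by FALCON.

```python
# Default values copied from the EC and OV tables; key names are made up here.
EC_DEFAULTS = {
    "min_primary_read_length": 12000,
    "min_read_length": 500,
    "max_kmer_occurrences": 16,
    "min_correlation": 0.70,
    "min_alignment_length": 1000,
    "min_kmer_base_matches": 35,
}
OV_DEFAULTS = {
    "min_primary_read_length": 12000,
    "min_read_length": 500,
    "max_kmer_occurrences": 32,
    "min_correlation": 0.96,
    "min_alignment_length": 500,
    "min_kmer_base_matches": 60,
}


def changed_parameters(first, second):
    """Return {name: (first_value, second_value)} for every value that differs."""
    return {k: (first[k], second[k]) for k in first if first[k] != second[k]}


for name, (ec, ov) in changed_parameters(EC_DEFAULTS, OV_DEFAULTS).items():
    print(name, "EC:", ec, "OV:", ov)
```

With these defaults, four of the six values differ between the two steps; the two read-length cutoffs are identical.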

Overlap Filtering - OF

After re-aligning the corrected reads, the final string-graph is assembled and filtered to minimize complexity. The final graph is then traversed, and the primary and alternate contigs are constructed. The parameters for this step are prefixed with OF and are explained below.

| Label | Default Value | Description |
|---|---|---|
| Maximum difference in strand coverage | 100 | An overlap is not used for assembly if the difference in coverage between the 5' and 3' strands exceeds 100. |
| Maximum strand coverage | 100 | Do not use a read for overlap if there are more than 100 reads aligned to either the 5' or 3' strands. |
| Minimum strand coverage | 20 | Require minimum overlap depth of 20 reads on either the 5' or 3' strands. |
| Number of best overlaps to output | 10 | Only use the 10 best overlaps. |
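
The OF rules above can be read as three coverage checks followed by a ranking. The sketch below is a simplified illustration under stated assumptions, not FALCON's filtering code: the function names are hypothetical, the minimum is applied to both the 5' and 3' counts, and ranking overlaps by length is an assumption because the table does not say how "best" is measured.

```python
def keep_read(cov_5p, cov_3p, max_diff=100, max_cov=100, min_cov=20):
    """Apply the three OF coverage rules to one read.

    cov_5p, cov_3p: number of reads overlapping the 5' and 3' side.
    """
    if abs(cov_5p - cov_3p) > max_diff:
        return False  # coverage too unbalanced between the two sides
    if cov_5p > max_cov or cov_3p > max_cov:
        return False  # too many overlaps, likely a repeat
    if cov_5p < min_cov or cov_3p < min_cov:
        return False  # too few overlaps to trust
    return True


def best_overlaps(overlaps, best_n=10):
    """Keep the best_n overlaps; "best" is assumed to mean longest.

    overlaps: list of (read_id, overlap_length) tuples.
    """
    return sorted(overlaps, key=lambda o: o[1], reverse=True)[:best_n]


# Hypothetical coverage counts and overlaps, for illustration only.
print(keep_read(50, 60))
print(keep_read(10, 50))
print(best_overlaps([("r1", 900), ("r2", 1500), ("r3", 1200)], best_n=2))
```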

Usage

To use FALCON, open the Apps window and either scroll down to FALCON (Small Genomes) 0.4.2 or simply search for falcon and open it.
 
 
After opening the app, you should get a prompt that looks like this.
 
 
Click the Inputs drop-down to specify the data you want to use for input.
 
 
For this example, we're going to use the raw E. coli example data that comes with FALCON (link). FALCON and DAZZ_DB expect each FASTA file to have a unique barcode, set, and part number. If you are unsure about the formatting of your data, run it through the FALCON-formatter tool. Using the directory tree on the left, or by entering the directory in the main panel, select a folder of FASTA inputs to use with FALCON.
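
If you want to check your files yourself before uploading, the barcode, set, and part number can be read from the first header of each FASTA file. The sketch below assumes PacBio RS II style headers of the form `>m<date>_<time>_<instrument>_c<barcode>_s<set>_p<part>/<well>/<start>_<end>`. That pattern, the function names, and the sample headers are assumptions made for illustration; this is not a replacement for the FALCON-formatter tool.

```python
import re

# Assumed header layout (not taken from this page):
#   >m<date>_<time>_<instrument>_c<barcode>_s<set>_p<part>/<well>/<start>_<end>
HEADER_RE = re.compile(
    r"^>?m\d+_\d+_[^_/]+_c(?P<barcode>\d+)_s(?P<set>\d+)_p(?P<part>\d+)"
    r"/(?P<well>\d+)/(?P<start>\d+)_(?P<end>\d+)"
)


def parse_header(line):
    """Return (barcode, set, part), or None if the header does not match."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        return None
    return match.group("barcode"), int(match.group("set")), int(match.group("part"))


def files_have_unique_ids(first_headers):
    """first_headers: {filename: first header line of that file}.

    True only if every header parses and no two files share an ID.
    """
    ids = [parse_header(h) for h in first_headers.values()]
    return None not in ids and len(set(ids)) == len(ids)


# Hypothetical headers, for illustration only.
headers = {
    "a.fasta": ">m000000_000000_00000_c1001_s1_p0/9/0_11000",
    "b.fasta": ">m000000_000000_00000_c1001_s1_p1/12/0_9500",
}
print(files_have_unique_ids(headers))
```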
 
 
After making your selection, confirm that the input field shows a folder before moving on to the Parameters section.
 
 
The current defaults are the recommended E. coli parameters and yield a single primary contig in the results. Based on the descriptions above, you can change the parameters in each section (EC, FC, OV, OF) to better fit your data.
 
 
While the execution has been optimized for running on Stampede, FALCON submits many separate jobs that all wait in line to run. Please be patient with these jobs as thousands of other users also use Stampede for their research. When your job completes, you should be left with a FALCON (Small Genomes) 0.4.2 run in your analyses folder.
 

Output 

| File | Description |
|---|---|
| 2-asm-falcon/ | Folder of final assembly |
| 2-asm-falcon/p_ctg.fa | Primary contigs |
| 2-asm-falcon/a_ctg.fa | Associated contigs |
| preads/ | Corrected preads that can be used for re-assembly if the EC and FC parameters do not change. |
| *.err | Error log from your job |
| *.out | Output log from your job |
| *.pid | Job process id |
| input.fofn | Files given as input to FALCON |
| job.cfg | FALCON job parameters |
| pypeflow.log | pypeFLOW log |
| sge_log.tar.gz | Tarball of job logs for each individual task |

If your input data was haploid, then all contigs should be primary (p_ctg.fa). Any associated contigs come from sequencing errors and segmental duplications. If the input data was diploid, then most of the associated contigs will be alternate alleles. We can check the status of the E. coli output with samtools.

$ iget p_ctg.fa
$ samtools faidx p_ctg.fa
$ cat p_ctg.fa.fai
000000F 4635638 85      4635638 4635639

FALCON left us with a single 4.6 megabase contig.
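
For assemblies with more than one contig, reading the index by eye gets tedious. The first two columns of a samtools .fai file are the sequence name and its length, so a few lines of Python can summarize it. This is a minimal sketch; `parse_fai` is a hypothetical helper, and the index line is the one shown above.

```python
def parse_fai(text):
    """Parse the text of a samtools .fai index into (name, length) pairs."""
    contigs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        contigs.append((fields[0], int(fields[1])))
    return contigs


# The index line from the E. coli example above.
fai_text = "000000F 4635638 85      4635638 4635639\n"
contigs = parse_fai(fai_text)
total_bp = sum(length for _, length in contigs)
print(len(contigs), "contig(s),", total_bp, "bp in total")
```

In practice you would pass in the contents of p_ctg.fa.fai, for example with `open("p_ctg.fa.fai").read()`.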

All records in a_ctg.fa are in the form

>[associated contig name] [starting primary contig] [ending primary contig]
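
Because the three fields are separated by whitespace, an a_ctg.fa header can be split into its parts directly. The sketch below assumes exactly the layout shown above; `parse_a_ctg_header` and the placeholder header are hypothetical, and real headers may carry additional fields after the first three.

```python
def parse_a_ctg_header(header):
    """Split an a_ctg.fa header into its first three whitespace-separated fields."""
    fields = header.lstrip(">").split()
    if len(fields) < 3:
        raise ValueError("expected at least three fields: %r" % header)
    return {"associated": fields[0], "start": fields[1], "end": fields[2]}


# Placeholder header following the layout above; not taken from real output.
print(parse_a_ctg_header(">associated_name starting_primary ending_primary"))
```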

More information on the output can be found at: