The FALCON diploid assembler is an exciting tool, and TACC and CyVerse have worked to install it on Stampede, currently the 10th fastest supercomputer in the world. Each job will take advantage of 30 nodes (30x16=480 cores) and optimized data-access to finish your assembly as fast as possible. The FALCON workflow consists of 4 main sections:

Error Correction - EC

Because Pacific Biosciences reads have a high error rate, primary reads are first aligned pairwise and then corrected through consensus. FALCON performs the pairwise alignment in parallel using daligner. The parameters for this step are prefixed with EC and are explained below.

Label	Default Value	Description
Minimum primary read length	12000	All reads longer than 12,000 basepairs (bp) are corrected. Depending on the distribution of read lengths in your data, you may want to set this value lower for older protocols, and higher for future protocols.
Minimum read length for pre-assembly	500	Align reads longer than 500bp to the primary (12,000bp) reads for consensus correction. Depending on the quality of your run, you may want to set this lower if the bulk of your reads are smaller than 500bp.
Maximum number of occurrences for a k-mer during pre-assembly	16	If a k-mer occurs more than 16 times in a 200 megabase block, that k-mer is no longer used for seeding. This essentially masks low-complexity regions. This is a difficult parameter to choose and may require tuning since daligner finds the optimal k.
Minimum sequence correlation rate during pre-assembly	0.7	Only use alignments with greater than 70% correlation for consensus correction. Lowering this value could lead towards over-correction, and raising it may lead to reads that are uncorrected.
Minimum local alignment length during pre-assembly	1000	Require that the local alignment between the primary and secondary reads be at least 1000 bp. This should ideally be less than or equal to the "Minimum read length for pre-assembly".
Minimum k-mer base matches for pre-assembly	35	The primary and secondary reads must share a 35 bp contiguous run of k-mers for alignment. Lowering this will significantly increase computation time. Increasing it assumes you have at least one large stretch of perfect identity.
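
Since the right "Minimum primary read length" depends on the read-length distribution of your data, it can help to see how many reads and bases survive a candidate cutoff before submitting a job. The following is a minimal sketch, not part of FALCON: the function name reads_above_cutoff is hypothetical, and treating the cutoff as inclusive is an assumption.

```shell
# Hypothetical helper: count reads and bases at or above a length cutoff.
# Usage: reads_above_cutoff reads.fasta 12000
# Prints: total_reads <TAB> kept_reads <TAB> total_bp <TAB> kept_bp
reads_above_cutoff() {
    awk -v cutoff="$2" '
        # Add the record that just ended to the running totals.
        function tally() {
            total++; total_bp += len
            if (len >= cutoff) { kept++; kept_bp += len }
        }
        # A header line closes the previous record and starts a new one.
        /^>/ { if (seen) tally(); seen = 1; len = 0; next }
        # Sequences may wrap over several lines, so accumulate the length.
        { len += length($0) }
        END {
            if (seen) tally()
            printf "%d\t%d\t%d\t%d\n", total, kept, total_bp, kept_bp
        }
    ' "$1"
}
```

Running the function with a few different cutoffs (for example 8000, 12000, and 15000) shows how much of the data would be kept as primary reads at each setting.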

Falcon Consensus - FC

After reads have been pairwise aligned to each other with daligner, error correction continues by correcting bases and major indels through candidate consensus with fc_consensus.py and LA4Falcon (manual). The parameters for this step are prefixed with FC and are explained below.

Label	Default Value	Description
Minimum identity for correction	0.7	Require sequences have a minimum alignment identity of 0.7 to be used for correction. Set this lower if you expect many indels in your data.
Minimum coverage for consensus	4	Break the consensus sequence wherever fewer than 4 reads cover a position.
Maximum reads for consensus	200	For every primary read, use the 200 longest alignments for consensus correction. The maximum value for this setting is currently 400.

Overlap Assembly - OV

The corrected reads (preads) are aligned pairwise with daligner once more to compute the final overlaps, from which the string-graph assembly is constructed. The parameters for this step are prefixed with OV and are explained below. Although this step uses the same parameters as the error correction step, many of the default values are more stringent for the final assembly.

Label	Default Value	Description
Minimum primary read length	12000	All reads longer than 12,000 basepairs (bp) are candidates for final overlap. Depending on the distribution of read lengths in your data, you may want to set this value lower for older protocols, and higher for future protocols.
Minimum read length for overlap	500	Align reads longer than 500bp to the primary (12,000bp) reads for overlap assembly. Depending on the quality of your run, you may want to set this lower if the bulk of your reads are smaller than 500bp.
Maximum number of occurrences for a k-mer during overlap	32	If a k-mer occurs more than 32 times in a 200 megabase block, that k-mer is no longer used for seeding. This essentially masks low-complexity regions. This is a difficult parameter to choose and may require tuning since daligner finds the optimal k.
Minimum sequence correlation rate during overlap	0.96	Only use alignments with greater than 96% correlation for overlap assembly. Lowering this value could lead to an over-connected string-graph, but may be necessary for reads with high error rates.
Minimum local alignment length during overlap	500	Require that the local alignment between the primary and secondary reads be at least 500 bp. This should ideally be less than or equal to the "Minimum read length for overlap".
Minimum k-mer base matches for overlap	60	The primary and secondary reads must share a 60 bp contiguous run of k-mers for alignment. Lowering this will significantly increase computation time. Increasing it assumes you have at least one large stretch of perfect identity.

Overlap Filtering - OF

After re-aligning the corrected reads, the final string-graph is assembled and filtered to minimize complexity. The final graph is then traversed, and the primary and alternate contigs are constructed. The parameters for this step are prefixed with OF and are explained below.

Label	Default Value	Description
Maximum difference in strand coverage	100	An overlap is not used for assembly if the difference in coverage between the 5' and 3' strands exceeds 100.
Maximum strand coverage	100	Do not use a read for overlap if there are more than 100 reads aligned to either the 5' or 3' strands.
Minimum strand coverage	20	Require minimum overlap depth of 20 reads on either the 5' or 3' strands.
Number of best overlaps to output	10	Only use the 10 best overlaps.
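
The three coverage rules in the table can be read as a simple per-read decision. The sketch below illustrates that reading with the default values; it is not FALCON's own filter. The function name filter_by_coverage and the three-column input (read id, 5' overlap count, 3' overlap count) are hypothetical, and requiring both counts to reach the minimum is an assumption about how "Minimum strand coverage" is applied.

```shell
# Hypothetical helper: apply the coverage rules to a tab-separated table of
#   read_id <TAB> 5' overlap count <TAB> 3' overlap count
# The input layout is an assumption for this sketch, not a FALCON file format.
# Usage: filter_by_coverage counts.tsv [max_diff] [max_cov] [min_cov]
filter_by_coverage() {
    awk -F '\t' -v max_diff="${2:-100}" -v max_cov="${3:-100}" -v min_cov="${4:-20}" '
        {
            five = $2 + 0; three = $3 + 0
            # Absolute difference between the two counts.
            diff = five - three; if (diff < 0) diff = -diff
            if (diff > max_diff + 0)
                verdict = "drop:max_diff"
            else if (five > max_cov + 0 || three > max_cov + 0)
                verdict = "drop:max_cov"
            else if (five < min_cov + 0 || three < min_cov + 0)
                verdict = "drop:min_cov"
            else
                verdict = "keep"
            print $1 "\t" verdict
        }
    ' "$1"
}
```

With the defaults, a read with counts of 50 and 40 is kept, while a read with counts of 150 and 30 is dropped because the difference (120) exceeds 100.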

Usage

To use FALCON, open the Apps window and either scroll down to FALCON (Small Genomes) 0.4.2 or search for falcon and open it.

After opening the app, you should get a prompt that looks like this.

Click the Inputs drop-down to specify the data you want to use for input.

For this example, we're going to use the raw E. coli example data that comes with FALCON (link). FALCON and DAZZ_DB expect each fasta file to have a unique barcode, set, and part number. If you are unsure about the formatting of your data, run it through the FALCON-formatter tool. Select a folder of fasta inputs to use with FALCON, either by browsing the directory tree on the left or by entering the directory in the main panel.

After making your selection, confirm that the input is a folder rather than an individual file before moving on to the Parameters section.

The current defaults are the recommended E. coli parameters and yield a single primary contig in the results. Based on the descriptions above, you can change the parameters in each section (EC, FC, OV, OF) to better fit your data.

While the execution has been optimized for running on Stampede, FALCON submits many separate jobs that all wait in line to run. Please be patient with these jobs as thousands of other users also use Stampede for their research. When your job completes, you should be left with a FALCON (Small Genomes) 0.4.2 run in your analyses folder.

Output

File	Description
2-asm-falcon/	Folder of final assembly
2-asm-falcon/p_ctg.fa	Primary contigs
2-asm-falcon/a_ctg.fa	Associated contigs
preads/	Corrected preads that can be used for re-assembly if the EC and FC parameters do not change.
*.err	Error log from your job
*.out	Output log from your job
*.pid	Job process id
input.fofn	Files given as input to FALCON
job.cfg	FALCON job parameters
pypeflow.log	pypeFLOW log
sge_log.tar.gz	Tarball of job logs for each individual task

If your input data was haploid, then all contigs should be primary (p_ctg.fa). Any associated contigs come from sequencing errors and segmental duplications. If the input data was diploid, then most of the associated contigs will be alternate alleles. We can check the status of the E. coli output with samtools.

$ iget p_ctg.fa
$ samtools faidx p_ctg.fa
$ cat p_ctg.fa.fai
000000F 4635638 85      4635638 4635639

FALCON left us with a single 4.6 megabase contig.
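
For assemblies with more than one contig, reading the .fai file by eye becomes tedious. As a small sketch, the function below (summarize_fai, a hypothetical name) relies only on the first two columns of the samtools faidx index, which hold the sequence name and its length in bp.

```shell
# Hypothetical helper: report contig count, total length, and the longest
# contig from a samtools faidx index (column 1 = name, column 2 = length).
# Usage: summarize_fai p_ctg.fa.fai
summarize_fai() {
    awk '
        {
            n++; total += $2
            if ($2 + 0 > max) { max = $2 + 0; longest = $1 }
        }
        END {
            printf "contigs=%d total_bp=%d longest=%s longest_bp=%d\n", n, total, longest, max
        }
    ' "$1"
}
```

For the E. coli index shown above, summarize_fai p_ctg.fa.fai prints contigs=1 total_bp=4635638 longest=000000F longest_bp=4635638.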

All records in a_ctg.fa are in the form

>[associated contig name] [starting primary contig] [ending primary contig]
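
Given that header layout, the associated contigs can be listed next to the primary contig positions they connect. This is a minimal sketch that assumes each header carries the three space-separated fields shown above, in that order; any additional fields are ignored. The function name list_associated_contigs is hypothetical.

```shell
# Hypothetical helper: turn each header of the form
#   >name start_primary end_primary
# into a tab-separated row, skipping the sequence lines.
# Usage: list_associated_contigs a_ctg.fa
list_associated_contigs() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 "\t" $2 "\t" $3 }' "$1"
}
```

An empty result means a_ctg.fa contains no records, which is what you would hope for with haploid input.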

More information on the output can be found at:

https://github.com/PacificBiosciences/FALCON/tree/master