FALCON (Small Genomes) 0.4.2
The FALCON diploid assembler is an exciting tool, and TACC and CyVerse have worked to install it on Stampede, currently the 10th fastest supercomputer in the world. Each job uses 30 nodes (30 x 16 = 480 cores) and optimized data access to finish your assembly as fast as possible. The FALCON workflow consists of four main sections:
Error Correction - EC
Because Pacific Biosciences reads have a high error rate, primary reads are first aligned pairwise and then corrected through consensus. FALCON performs the pairwise alignment in parallel using daligner. The parameters for this step are prefixed with EC and are explained below.
Label | Default Value | Description |
---|---|---|
Minimum primary read length | 12000 | All reads longer than 12,000 basepairs (bp) are corrected. Depending on the distribution of read lengths in your data, you may want to set this value lower for older protocols, and higher for future protocols. |
Minimum read length for pre-assembly | 500 | Align reads longer than 500bp to the primary (12,000bp) reads for consensus correction. Depending on the quality of your run, you may want to set this lower if the bulk of your reads are smaller than 500bp. |
Maximum number of occurrences for a k-mer during pre-assembly | 16 | If a k-mer occurs more than 16 times in a 200 megabase block, that k-mer is no longer used for seeding. This essentially masks low-complexity regions. This is a difficult parameter to choose and may require tuning since daligner finds the optimal k. |
Minimum sequence correlation rate during pre-assembly | 0.7 | Only use alignments with greater than 70% correlation for consensus correction. Lowering this value could lead towards over-correction, and raising it may lead to reads that are uncorrected. |
Minimum local alignment length during pre-assembly | 1000 | Require that the local alignment between the primary and secondary reads be at least 1000 bp. This should ideally be less than or equal to the "Minimum read length for pre-assembly". |
Minimum k-mer base matches for pre-assembly | 35 | The primary and secondary reads must share a 35 bp contiguous run of k-mers for alignment. Lowering this will significantly increase computation time. Increasing it assumes you have at least one large stretch of perfect identity. |
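
The k-mer occurrence cutoff in the table above can be illustrated with a toy example. This is only a sketch of the idea (count k-mers, then drop the over-represented ones from seeding), not daligner's actual implementation; the function name `seed_kmers` and the tiny inputs are hypothetical.

```python
from collections import Counter


def seed_kmers(sequences, k, max_occurrences):
    """Return the set of k-mers that occur at most max_occurrences times.

    Toy illustration only: k-mers above the cutoff are treated as
    low-complexity and excluded from seeding.
    """
    counts = Counter()
    for seq in sequences:
        # Slide a window of width k across each sequence.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return {kmer for kmer, n in counts.items() if n <= max_occurrences}


# "AA" occurs 5 times in the first sequence, so a cutoff of 3 masks it.
print(sorted(seed_kmers(["AAAAAA", "ACGT"], k=2, max_occurrences=3)))
```

Raising the cutoff keeps more repetitive k-mers as seeds, which is why the value usually needs tuning per dataset.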
Falcon Consensus - FC
After reads have been pairwise aligned to each other with daligner, error correction continues by correcting bases and major indels through candidate consensus with fc_consensus.py and LA4Falcon (manual). The parameters for this step are prefixed with FC and are explained below.
Label | Default Value | Description |
---|---|---|
Minimum identity for correction | 0.7 | Require sequences have a minimum alignment identity of 0.7 to be used for correction. Set this lower if you expect many indels in your data. |
Minimum coverage for consensus | 4 | Require at least 4 reads to break the consensus. |
Maximum reads for consensus | 200 | For every primary read, use the 200 longest alignments for consensus correction. The maximum value for this setting is currently 400. |
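
As a rough illustration of how the identity and read-count settings interact, the sketch below filters candidate alignments by identity and then keeps the longest ones. It assumes each alignment is a hypothetical `(length, identity)` tuple and is not taken from the FALCON source.

```python
def select_alignments(alignments, min_identity=0.7, max_reads=200):
    """Keep alignments at or above min_identity, longest first, capped at max_reads.

    alignments: list of (length_bp, identity) tuples (hypothetical layout).
    """
    kept = [a for a in alignments if a[1] >= min_identity]
    # Longest alignments are preferred for consensus correction.
    kept.sort(key=lambda a: a[0], reverse=True)
    return kept[:max_reads]


candidates = [(1000, 0.90), (3000, 0.60), (2000, 0.80), (1500, 0.75)]
print(select_alignments(candidates, min_identity=0.7, max_reads=2))
```

In this example the 3000 bp alignment is dropped despite being the longest, because its identity falls below the cutoff.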
Overlap Assembly - OV
The corrected reads (preads) are pairwise mapped once again using daligner to produce the overlaps from which the final string-graph assembly is constructed. The parameters for this step are prefixed with OV and are explained below. Although the parameters for this step are the same as those of the error correction step, many of the default values are more stringent for the final assembly.
Label | Default Value | Description |
---|---|---|
Minimum primary read length | 12000 | All reads longer than 12,000 basepairs (bp) are candidates for final overlap. Depending on the distribution of read lengths in your data, you may want to set this value lower for older protocols, and higher for future protocols. |
Minimum read length for overlap | 500 | Align reads longer than 500bp to the primary (12,000bp) reads for overlap assembly. Depending on the quality of your run, you may want to set this lower if the bulk of your reads are smaller than 500bp. |
Maximum number of occurrences for a k-mer during overlap | 32 | If a k-mer occurs more than 32 times in a 200 megabase block, that k-mer is no longer used for seeding. This essentially masks low-complexity regions. This is a difficult parameter to choose and may require tuning since daligner finds the optimal k. |
Minimum sequence correlation rate during overlap | 0.96 | Only use alignments with greater than 96% correlation for overlap assembly. Lowering this value could lead to an over-connected string-graph, but may be necessary for reads with high error rates. |
Minimum local alignment length during overlap | 500 | Require that the local alignment between the primary and secondary reads be at least 500 bp. This should ideally be less than or equal to the "Minimum read length for overlap". |
Minimum k-mer base matches for overlap | 60 | The primary and secondary reads must share a 60 bp contiguous run of k-mers for alignment. Lowering this will significantly increase computation time. Increasing it assumes you have at least one large stretch of perfect identity. |
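
The guideline that the minimum local alignment length should not exceed the minimum read length is easy to check before submitting a job. The helper below is a minimal sketch with a hypothetical name; it is not part of FALCON or the app.

```python
def check_alignment_floor(min_read_length, min_local_alignment_length):
    """Return a warning message if the guideline is violated, otherwise None."""
    if min_local_alignment_length > min_read_length:
        return ("minimum local alignment length (%d bp) exceeds "
                "minimum read length (%d bp)"
                % (min_local_alignment_length, min_read_length))
    return None


print(check_alignment_floor(500, 500))    # OV defaults from the table above
print(check_alignment_floor(500, 1000))   # EC defaults from the EC table
```

With the OV defaults the check passes, while the EC defaults (500 and 1000) return a warning.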
Overlap Filtering - OF
After re-aligning the corrected reads, the final string-graph is assembled and filtered to minimize complexity. The final graph is then traversed, and the primary and alternate contigs are constructed. The parameters for this step are prefixed with OF and are explained below.
Label | Default Value | Description |
---|---|---|
Maximum difference in strand coverage | 100 | An overlap is not used for assembly if the difference in coverage between the 5' and 3' strands exceeds 100. |
Maximum strand coverage | 100 | Do not use a read for overlap if there are more than 100 reads aligned to either the 5' or 3' strands. |
Minimum strand coverage | 20 | Require minimum overlap depth of 20 reads on either the 5' or 3' strands. |
Number of best overlaps to output | 10 | Only use the 10 best overlaps. |
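
The three coverage filters can be read as a single per-read test. The sketch below is a simplified interpretation of the table, assuming a read is rejected when either end falls outside the coverage bounds; the function name and arguments are hypothetical and this is not the FALCON filtering code itself.

```python
def keep_read(cov_5p, cov_3p, max_diff=100, max_cov=100, min_cov=20):
    """Return True if a read passes all three coverage filters.

    cov_5p, cov_3p: number of reads overlapping the 5' and 3' ends.
    Assumption: a read fails if EITHER end violates a bound.
    """
    if abs(cov_5p - cov_3p) > max_diff:
        return False  # coverage too unbalanced between the two ends
    if cov_5p > max_cov or cov_3p > max_cov:
        return False  # likely repeat: too many overlapping reads
    if cov_5p < min_cov or cov_3p < min_cov:
        return False  # too little support on one end
    return True


print(keep_read(50, 60))    # within all bounds
print(keep_read(150, 60))   # exceeds max_cov on the 5' end
print(keep_read(10, 60))    # below min_cov on the 5' end
```

Note that with the defaults, any read inside the coverage bounds (20 to 100 on both ends) can differ by at most 80, so the maximum-difference filter only takes effect when it is set below that.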
Usage
Output
File | Description |
---|---|
2-asm-falcon/ | Folder of final assembly |
2-asm-falcon/p_ctg.fa | Primary contigs |
2-asm-falcon/a_ctg.fa | Associated contigs |
preads/ | Corrected preads that can be used for re-assembly if the EC and FC parameters do not change. |
*.err | Error log from your job |
*.out | Output log from your job |
*.pid | Job process id |
input.fofn | Files given as input to FALCON |
job.cfg | FALCON job parameters |
pypeflow.log | pypeFLOW log |
sge_log.tar.gz | Tarball of job logs for each individual task |
If your input data was haploid, then all contigs should be primary (p_ctg.fa). Any associated contigs come from sequencing errors and segmental duplications. If the input data was diploid, then most of the associated contigs will be alternate alleles. We can check the status of the E. coli output with samtools.
$ iget p_ctg.fa
$ samtools faidx p_ctg.fa
$ cat p_ctg.fa.fai
000000F 4635638 85 4635638 4635639
FALCON left us with a single 4.6 megabase contig.
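For assemblies with more than one contig, the same index can be summarized programmatically. The sketch below reads the first two columns of a `.fai` index (sequence name and length) and reports the contig count and total size; the function name `read_fai` is hypothetical, and the example line is the one shown above.

```python
def read_fai(lines):
    """Parse .fai index lines into (name, length) tuples.

    The first two whitespace-separated columns of a faidx index are the
    sequence name and its length in bases; the remaining columns are ignored.
    """
    contigs = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        contigs.append((fields[0], int(fields[1])))
    return contigs


index_lines = ["000000F 4635638 85 4635638 4635639"]
contigs = read_fai(index_lines)
total = sum(length for _, length in contigs)
print(len(contigs), "contig(s),", total, "bp total")
```

To run it on a real index, pass `open("p_ctg.fa.fai")` in place of `index_lines`.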
All records in a_ctg.fa are in the form
>[associated contig name] [starting primary contig] [ending primary contig]
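A header in this form can be split into its three fields as follows. This is a minimal sketch; the function name and the example header are hypothetical placeholders, not real FALCON output.

```python
def parse_a_ctg_header(header):
    """Split an a_ctg.fa header into its name, start, and end fields."""
    fields = header.lstrip(">").split()
    if len(fields) < 3:
        raise ValueError("expected at least 3 fields, got %d" % len(fields))
    return {
        "name": fields[0],   # associated contig name
        "start": fields[1],  # starting primary contig
        "end": fields[2],    # ending primary contig
    }


# Hypothetical header used only to show the field order.
print(parse_a_ctg_header(">a_ctg_example start_ctg end_ctg"))
```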
More information on the output can be found at: