FALCON-formatter

FALCON-formatter takes FASTQ and FASTA files from a Pacific Biosciences sequencer and formats them for de novo assembly with FALCON.

Description

Storing all reads in a single FASTA or FASTQ file is convenient, but Dazzler (and therefore FALCON) does not accept that kind of input. All inputs MUST be in FASTA format, with files split by barcode, set, and part number. In practice, every read in a file must share the same values for fields 1-6 of the header shown below, and no two input files may share that combination.

m140415_143853_42175_c100635972550000001823121909121417_s1_p0/553/3100_11230
1yymmdd_hhmmss 33333 4444444444444444444444444444444444 55 66 777 8888888888
  1. “m” = movie
  2. Time of Run Start (yymmdd_hhmmss)
  3. Instrument Serial Number
  4. SMRT Cell Barcode
  5. Set Number
  6. Part Number
  7. ZMW hole number*
  8. Subread Region (start_stop using polymerase read coordinates)*
  * These fields are only used in FASTA/FASTQ headers.

More information about file formats can be found at the SMRT-Analysis wiki.
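
As an illustration of the field layout above, the header can be broken into its eight fields with a regular expression. This is a minimal sketch, not FALCON-formatter's own code: the pattern, the function name `parse_header`, and the group names are assumptions derived only from the example header and the field list.

```python
import re

# Hypothetical pattern built from the field list above (not taken from FALCON-formatter).
HEADER_RE = re.compile(
    r"^>?m"                                  # field 1: "m" = movie
    r"(?P<run_start>\d{6}_\d{6})"            # field 2: yymmdd_hhmmss
    r"_(?P<instrument>[^_/]+)"               # field 3: instrument serial number
    r"_(?P<barcode>c[^_/]+)"                 # field 4: SMRT Cell barcode
    r"_(?P<set_number>s\d+)"                 # field 5: set number
    r"_(?P<part_number>p\d+)"                # field 6: part number
    r"/(?P<zmw>\d+)"                         # field 7: ZMW hole number
    r"/(?P<start>\d+)_(?P<stop>\d+)$"        # field 8: subread region
)


def parse_header(header):
    """Return a dict of header fields, or raise ValueError if the layout differs."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError("unexpected header layout: %r" % header)
    return match.groupdict()


fields = parse_header(
    ">m140415_143853_42175_c100635972550000001823121909121417_s1_p0/553/3100_11230"
)
```

Headers that deviate from this layout raise an error instead of being silently misparsed.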

The example below demonstrates this requirement by correctly splitting the file Example.fasta.

Example.fasta

>m140415_143853_42175_c100635972550000001823121909121417_s1_p0/553/3100_11230
>m140415_143853_42175_c324508543089230982134098587348034_s1_p0/553/103_725
>m140415_143853_42175_c324508543089230982134098587348034_s1_p0/553/973_13390
>m140415_143853_42175_c100635972550000001823121909121417_s1_p0/553/15030_17394

The four headers contain two unique combinations of fields 1-6:

>m140415_143853_42175_c100635972550000001823121909121417_s1_p0
>m140415_143853_42175_c324508543089230982134098587348034_s1_p0
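
The two combinations above can be found programmatically by keeping everything before the first "/" in each header. The helper name `movie_prefix` is hypothetical, and the sketch assumes fields 1-6 never contain a "/".

```python
def movie_prefix(header):
    """Return fields 1-6: the text before the first '/', without the leading '>'."""
    return header.lstrip(">").split("/", 1)[0]


headers = [
    ">m140415_143853_42175_c100635972550000001823121909121417_s1_p0/553/3100_11230",
    ">m140415_143853_42175_c324508543089230982134098587348034_s1_p0/553/103_725",
    ">m140415_143853_42175_c324508543089230982134098587348034_s1_p0/553/973_13390",
    ">m140415_143853_42175_c100635972550000001823121909121417_s1_p0/553/15030_17394",
]

# A set collapses repeated prefixes; sorting makes the order deterministic.
unique_prefixes = sorted({movie_prefix(h) for h in headers})
```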

All subreads corresponding to these headers need to be in their own files, so Example.fasta would be split accordingly:

m140415_143853_42175_c100635972550000001823121909121417_s1_p0.fasta

>m140415_143853_42175_c100635972550000001823121909121417_s1_p0/553/3100_11230
>m140415_143853_42175_c100635972550000001823121909121417_s1_p0/553/15030_17394

m140415_143853_42175_c324508543089230982134098587348034_s1_p0.fasta

>m140415_143853_42175_c324508543089230982134098587348034_s1_p0/553/103_725
>m140415_143853_42175_c324508543089230982134098587348034_s1_p0/553/973_13390

FALCON-formatter takes FASTA/FASTQ files, or folders of such files, as input, converts any FASTQ to FASTA, and writes each read to the file named after fields 1 through 6 of its header.
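
The conversion and splitting described above can be sketched roughly as follows. This is not the FALCON-formatter source: the function names `read_records` and `split_reads` are hypothetical, FASTQ records are assumed to span exactly four lines, and the format is guessed from the first character of the file.

```python
import os


def read_records(path):
    """Yield (header, sequence) pairs from a FASTA or four-line FASTQ file."""
    with open(path) as handle:
        first = handle.read(1)
        handle.seek(0)
        if first == "@":
            # FASTQ: header, sequence, '+' separator, qualities (assumed 4 lines each).
            while True:
                header = handle.readline().rstrip()
                if not header:
                    break
                seq = handle.readline().rstrip()
                handle.readline()  # '+' separator line
                handle.readline()  # quality line, dropped when converting to FASTA
                yield header[1:], seq
        else:
            # FASTA: a sequence may wrap over several lines.
            header, chunks = None, []
            for line in handle:
                line = line.rstrip()
                if line.startswith(">"):
                    if header is not None:
                        yield header, "".join(chunks)
                    header, chunks = line[1:], []
                elif line:
                    chunks.append(line)
            if header is not None:
                yield header, "".join(chunks)


def split_reads(input_path, out_dir):
    """Write each read to <out_dir>/<fields 1-6>.fasta and return the prefixes seen."""
    os.makedirs(out_dir, exist_ok=True)
    handles = {}
    try:
        for header, seq in read_records(input_path):
            prefix = header.split("/", 1)[0]  # fields 1-6
            if prefix not in handles:
                out_path = os.path.join(out_dir, prefix + ".fasta")
                handles[prefix] = open(out_path, "w")
            handles[prefix].write(">%s\n%s\n" % (header, seq))
    finally:
        for out in handles.values():
            out.close()
    return sorted(handles)
```

Calling `split_reads("Example.fasta", "split")` on the example file would be expected to produce the two per-movie files shown above. Because the sketch opens output files in write mode, processing a folder of inputs that share a prefix would need append mode or a single shared set of handles.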

Usage

First, find the FALCON-formatter app in the HPC app catalog and launch it. Then click the “Inputs” drop-down arrow to designate your inputs.

[Figure: iplant_formatter_01]

Next, click the browse button to open a file explorer and choose your input.

[Figure: iplant_formatter_02]

Select either a single FASTA/FASTQ file or a whole folder to process.

[Figure: iplant_formatter_03]

Click “Launch Analysis” to start your job. You’ll get notifications when the program starts and when it finishes.

[Figure: iplant_formatter_04]