E) Split RefSeq files

Split RefSeq files (app: Split FASTA file)

Description: The Split FASTA file app divides a FASTA file into smaller files. The refseq_protein database used here as sample data was downloaded from NCBI and is available in FASTA format in the DE. The file is very large, so subsequent operations such as the BLAT-based mapping can take a long time to complete because of the amount of memory needed to build the index required by BLAT. (Using BLAST instead of BLAT would take even longer.) Splitting the large FASTA file into smaller files lets BLAT map the sequences in a feasible time frame. This step is not essential; however, splitting the RefSeq database used in this example into 3 smaller files reduces the time required for the BLAT mapping from weeks to just over an hour. Documentation: http://kirill-kryukov.com/study/tools/fasta-splitter/.
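
For readers who want to see what splitting by record count amounts to, here is a minimal Python sketch. It is not the code behind the Split FASTA file app; the function names `split_fasta` and `write_chunks` are hypothetical, and the `<prefix>.<index>` file naming is an assumption modeled on the sample output names listed in this section.

```python
from typing import Iterable, Iterator, List


def split_fasta(lines: Iterable[str], records_per_file: int) -> Iterator[List[str]]:
    """Yield chunks of FASTA lines, each holding at most records_per_file records."""
    if records_per_file < 1:
        raise ValueError("records_per_file must be at least 1")
    chunk: List[str] = []
    records_in_chunk = 0
    for line in lines:
        if line.startswith(">"):
            # A header line starts a new record; emit the chunk first if it is full.
            if records_in_chunk == records_per_file:
                yield chunk
                chunk, records_in_chunk = [], 0
            records_in_chunk += 1
        chunk.append(line)
    if chunk:
        # Emit the last, possibly smaller, chunk.
        yield chunk


def write_chunks(in_path: str, prefix: str, records_per_file: int) -> List[str]:
    """Write each chunk to '<prefix>.<index>' (assumed naming) and return the names."""
    names = []
    with open(in_path) as handle:
        for index, chunk in enumerate(split_fasta(handle, records_per_file)):
            name = f"{prefix}.{index}"
            with open(name, "w") as out:
                out.writelines(chunk)
            names.append(name)
    return names
```

Because the input is read line by line and only one chunk is held in memory at a time, memory use is bounded by the largest chunk rather than by the whole file. A call such as `write_chunks("refseq_protein.fasta", "RefseqProtein", 14000000)` would mirror the settings used in the steps below, assuming a local copy of the file.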

  1. Log into the Discovery Environment: https://de.iplantcollaborative.org/de/.
  2. Open the Split FASTA file app (Public Applications > BLAST > Split FASTA file).
    1. Change 'Analysis Name' to Split_RefSeq_File, add a 'Description' (optional), and use the default 'output folder'.
  3. Click on the Inputs tab.
    1. Click on the 'Select a FASTA file' field. Browse to the folder that holds the FASTA file containing the refseq_protein database (Sample data: Community Data > iplant_training > rna-seq_without_genome > E_split_refseq_file > refseq_protein.fasta). Select the file, then click on OK.
    2. Click on the 'Number of sequence records per sub-file' field and enter 14000000.
    3. Click on the 'Output prefix' field. Set prefix to 'RefseqProtein'.
  4. Click on "Launch Analysis".
  5. Click on 'Analyses' from the DE workspace and monitor the 'Status' of the analysis (e.g., Idle, Submitted, Pending, Running, Completed, Failed).
    1. Once launched, an analysis will continue whether the user remains logged in or not.
    2. Email notifications report the progress of the analysis; they can be switched off under 'Preferences'.
    3. If the analysis fails or does not proceed in the anticipated timeline, check these tips for troubleshooting. (Using the sample data, the procedure should be complete in less than 10 minutes.)
    4. To re-run an analysis, click the analysis "App" in the 'Analyses' window.
  6. Access analysis results in one of two ways:
    1. In the 'Analyses' window click on the analysis "Name" to open the output folder.
    2. In the 'Data' window, click on user name, then navigate to the folder that holds the output of the analysis. (Find the output for the sample at Community Data > iplant_training > rna-seq_without_genome > E_split_refseq_file > output_from_sample_data.)
  7. The sample output contains 3 files named RefseqProtein.0, RefseqProtein.1, and RefseqProtein.2.
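
One optional sanity check, once the output files have been downloaded locally, is to count the records in each sub-file and confirm that the totals add up to the record count of the original file. The sketch below is an illustration only: the helper names `count_fasta_records` and `count_in_files` are hypothetical, and it assumes plain uncompressed FASTA files in which every record begins with a '>' header line.

```python
from typing import Dict, Iterable


def count_fasta_records(lines: Iterable[str]) -> int:
    """Count records by counting header lines, which start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


def count_in_files(paths: Iterable[str]) -> Dict[str, int]:
    """Return a mapping from each file path to its number of FASTA records."""
    counts = {}
    for path in paths:
        with open(path) as handle:
            counts[path] = count_fasta_records(handle)
    return counts
```

For example, `count_in_files(["RefseqProtein.0", "RefseqProtein.1", "RefseqProtein.2"])` returns one count per sub-file; with the setting used above, no sub-file should hold more than 14000000 records.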