fastaRename

A utility that replaces the headers in a fasta file with simple, incremental sequence IDs, to reduce the header-related problems that can arise when the file is used as input to other apps. fastaRename is intended as a step in the Cluster Orthologs and Paralogs and Assemble Custom Gene Sets workflow, but it can also be used to rename fasta sequences for other purposes. Sequences are renamed based on a user-defined two-letter genus-species abbreviation. Three files are produced:

1- .fasta - renamed sequences
2- .gg - new sequence names (for downstream OrthoMCL input in the Cluster Orthologs and Paralogs and Assemble Custom Gene Sets workflow).
3- .map - maps new sequence names to original fasta headers.  This will be useful to associate sequences with original fasta headers if needed.
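
The renaming step described above can be sketched roughly as follows. This is an illustration only, not the original Perl script: the ID pattern (abbreviation plus a zero-padded counter), the function name rename_fasta, the example headers, and the single-line "PF: id id ..." layout used for the .gg content are all assumptions.

```python
from typing import Iterable, List, Tuple


def rename_fasta(lines: Iterable[str], abbrev: str) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Replace each fasta header with an incremental ID built from `abbrev`.

    Returns the renamed fasta lines and a list of (new_id, original_header)
    pairs. The ID pattern is hypothetical; the real app may format IDs differently.
    """
    renamed: List[str] = []
    mapping: List[Tuple[str, str]] = []
    count = 0
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            count += 1
            new_id = f"{abbrev}{count:06d}"  # assumed format, e.g. PF000001
            mapping.append((new_id, line[1:]))  # keep the original header text
            renamed.append(f">{new_id}")
        elif line:
            renamed.append(line)  # sequence lines pass through unchanged
    return renamed, mapping


# Made-up input records, for illustration only.
example = [
    ">gene_A hypothetical protein one",
    "MVELAK",
    ">gene_B hypothetical protein two",
    "MKLHYT",
]
renamed, mapping = rename_fasta(example, "PF")

# Assumed .gg content: taxon label, a colon, then the new IDs separated by spaces.
gg_line = "PF: " + " ".join(new_id for new_id, _ in mapping)
```

Here `renamed` would hold the content of the .fasta output, `mapping` the content of the .map output, and `gg_line` the content of the .gg output, under the assumptions stated above.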

Notes: 

  • Within the workflow referenced above, each input fasta file is expected to contain sequences from a single species, presumably its entire protein-encoding gene repertoire. This is why a two-letter abbreviation is suggested. If you use the app to rename sequences outside this workflow, choose an abbreviation that makes sense for your experimental design.
  • It is a good idea to keep track of the number of sequences and headers in your input files and compare them with the outputs, to confirm that the output faithfully represents the input.
  • Please visit Cluster Orthologs and Paralogs and Assemble Custom Gene Sets to see how fastaRename fits into the larger workflow.
  • flattenClusters 1.0 can be used to map renamed sequences back to their original fasta headers.
  • App adapted from a Perl script originally written by Chih-Horng Kuo.
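
One simple way to compare sequence counts between input and output, as suggested in the notes, is to count header lines (lines starting with ">") in each file. The sketch below is a minimal example; the function names and the file names in the usage note are placeholders, not part of the app.

```python
from typing import Iterable


def count_fasta_records(lines: Iterable[str]) -> int:
    """Count fasta records by counting lines that start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


def count_file(path: str) -> int:
    """Count fasta records in the file at `path`."""
    with open(path, encoding="utf-8") as handle:
        return count_fasta_records(handle)
```

For example, `count_file("input.fasta") == count_file("PF.fasta")` should evaluate to True if every input sequence was carried over (both file names here are hypothetical).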

Quick Start

  • To use fastaRename, import your data in fasta format, enter a two-letter abbreviation, and choose the output directory.

Test Data

Input and output test data for this app appear directly in the Discovery Environment in the Data window under Community Data -> iplantcollaborative -> example_data -> homolog_clustering -> 2-fastaRename_input and 3-fastaRename_output

Input File(s)

Use any of the files from the input directory above for testing. To better understand the output, run the app several times with different test files, using a different two-letter abbreviation for each species. Test input files are from 4 species: Plasmodium falciparum, Toxoplasma gondii, Neospora caninum, and Theileria annulata. Each file represents a 'complete' protein-encoding gene repertoire for that species. These are Apicomplexan species, chosen for their relatively small gene repertoires. The data are for testing purposes only; the latest data for each species can be found at EuPathDB.

Parameters Used in App

When the app is run in the Discovery Environment, use the following parameters with the above input file(s) to get the output described in the next section.

  • Use these parameters within the DE app interface:
    •  User-defined two-letter taxon abbreviation: PF
      • This is a suggested abbreviation for a fasta file containing sequences from Plasmodium falciparum.

Output File(s)

Expect three output files, each named for the two-letter abbreviation used as input.

1- .fasta - renamed sequences

2- .gg - new sequence names (for downstream OrthoMCL input in the Cluster Orthologs and Paralogs and Assemble Custom Gene Sets workflow).

3- .map - maps new sequence names to original fasta headers.  This will be useful to associate sequences with original fasta headers if needed.
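
To look up an original header from a new sequence name, the .map file can be loaded into a dictionary. The sketch below assumes one record per line with the new name and the original header separated by a tab; that layout and the function name load_map are assumptions, so check an actual .map file before relying on it.

```python
from typing import Dict, Iterable


def load_map(lines: Iterable[str]) -> Dict[str, str]:
    """Parse lines of 'new_id<TAB>original header' into a dict (assumed layout)."""
    lookup: Dict[str, str] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so tabs inside the header are preserved.
        new_id, _, original = line.partition("\t")
        lookup[new_id] = original
    return lookup
```

With a real file, something like `load_map(open("PF.map"))` would return the lookup table (the file name is a placeholder).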