Create_BLAST_database-2.6.0__with_parse_seqids_option

Rationale:


Building a BLAST database with local sequences


The makeblastdb application produces BLAST databases from FASTA files. In the simplest case the FASTA definition lines are not parsed by makeblastdb and may be completely unstructured. The text in the definition line will be stored in the BLAST database and displayed in the BLAST report, but it will not be possible to fetch individual sequences using blastdbcmd. Use this app to enable retrieval of sequences based upon sequence identifiers.

In this case, each sequence must have a unique identifier, and that identifier must have a specific format. The identifier should begin right after the “>” sign on the definition line, contain no spaces, and follow the formats described in http://www.ncbi.nlm.nih.gov/toolkit/doc/book/ch_demo/#ch_demo.T5. User-supplied sequences should use the local or general identifiers described in that table. A FASTA file with general IDs would look like:

$ cat mydb.fsa
>gnl|MYDB|1 this is sequence 1
GAATTCCCGCTACAGGGGGGGCCTGAGGCACTGCAGAAAGTGGGCCTGAGCCTCGAGGATGACGGTGCTGCAGGAACCCG
TCCAGGCTGCTATATGGCAAGCACTAAACCACTATGCTTACCGAGATGCGGTTTTCCTCGCAGAACGCCTTTATGCAGAA
GTACACTCAGAAGAAGCCTTGTTTTTACTGGCAACCTGTTATTACCGCTCAGGAAAGGCATATAAAGCATATAGACTCTT
GAAAGGACACAGTTGTACTACACCGCAATGCAAATACCTGCTTGCAAAATGTTGTGTTGATCTCAGCAAGCTTGCAGAAG
GGGAACAAATCTTATCTGGTGGAGTGTTTAATAAGCAGAAAAGCCATGATGATATTGTTACTGAGTTTGGTGATTCAGCT
TGCTTTACTCTTTCATTGTTGGGACATGTATATTGCAAGACAGATCGGCTTGCCAAAGGATCAGAATGTTACCAAAAGAG
CCTTAGTTTAAATCCTTTCCTCTGGTCTCCCTTTGAATCATTATGTGAAATAGGTGAAAAGCCAGATCCTGACCAAACAT
TTAAATTCACATCTTTACAGAACTTTAGCAACTGTCTGCCCAACTCTTGCACAACACAAGTACCTAATCATAGTTTATCT
CACAGACAGCCTGAGACAGTTCTTACGGAAACACCCCAGGACACAATTGAATTAAACAGATTGAATTTAGAATCTTCCAA
>gnl|MYDB|2 this is sequence 2
GAATTCCCGCTACAGGGGGGGCCTGAGGCACTGCAGAAAGTGGGCCTGAGCCTCGAGGATGACGGTGCTGCAGGAACCCG
TCCAGGCTGCTATATGGCAAGCACTAAACCACTATGCTTACCGAGATGCGGTTTTCCTCGCAGAACGCCTTTATGCAGAA
GTACACTCAGAAGAAGCCTTGTTTTTACTGGCAACCTGTTATTACCGCTCAGGAAAGGCATATAAAGCATATAGACTCTT
GAAAGGACACAGTTGTACTACACCGCAATGCAAATACCTGCTTGCAAAATGTTGTGTTGATCTCAGCAAGCTTGCAGAAG
GGGAACAAATCTTATCTGGTGGAGTGTTTAATAAGCAGAAAAGCCATGATGATATTGTTACTGAGTTTGGTGATTCAGCT
TGCTTTACTCTTTCATTGTTGGGACATGTATATTGCAAGACAGATCGGCTTGCCAAAGGATCAGAATGTTACCAAAAGAG
CCTTAGTTTAAATCCTTTCCTCTGGTCTCCCTTTGAATCATTATGTGAAATAGGTGAAAAGCCAGATCCTGACCAAACAT
TTAAATTCACATCTTTACAGAACTTTAGCAACTGTCTGCCCAACTCTTGCACAACACAAGTACCTAATCATAGTTTATCT
CACAGACAGCCTGAGACAGTTCTTACGGAAACACCCCAGGACACAATTGAATTAAACAGATTGAATTTAGAATCTTCCAA
>gnl|MYDB|3 this is sequence 3
GAATTCCCGCTACAGGGGGGGCCTGAGGCACTGCAGAAAGTGGGCCTGAGCCTCGAGGATGACGGTGCTGCAGGAACCCG
TCCAGGCTGCTATATGGCAAGCACTAAACCACTATGCTTACCGAGATGCGGTTTTCCTCGCAGAACGCCTTTATGCAGAA
GTACACTCAGAAGAAGCCTTGTTTTTACTGGCAACCTGTTATTACCGCTCAGGAAAGGCATATAAAGCATATAGACTCTT
GAAAGGACACAGTTGTACTACACCGCAATGCAAATACCTGCTTGCAAAATGTTGTGTTGATCTCAGCAAGCTTGCAGAAG
GGGAACAAATCTTATCTGGTGGAGTGTTTAATAAGCAGAAAAGCCATGATGATATTGTTACTGAGTTTGGTGATTCAGCT
TGCTTTACTCTTTCATTGTTGGGACATGTATATTGCAAGACAGATCGGCTTGCCAAAGGATCAGAATGTTACCAAAAGAG
CCTTAGTTTAAATCCTTTCCTCTGGTCTCCCTTTGAATCATTATGTGAAATAGGTGAAAAGCCAGATCCTGACCAAACAT
TTAAATTCACATCTTTACAGAACTTTAGCAACTGTCTGCCCAACTCTTGCACAACACAAGTACCTAATCATAGTTTATCT
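Before building the database, it may help to confirm that the definition lines meet the requirements above. The following is a minimal sketch, not part of the app: `check_fasta_ids` is a hypothetical helper name, and it only checks that each identifier starts directly after ">" and is unique (whitespace ends the identifier, so an identifier cannot contain spaces). It does not validate the NCBI identifier formats themselves.

```shell
# Hypothetical helper (not part of BLAST+ or the app): verify that every
# FASTA definition line carries a unique identifier directly after ">".
check_fasta_ids() {
  awk '
    /^>/ {
      # The identifier is the text between ">" and the first whitespace.
      id = substr($1, 2)
      if (id == "") {
        printf "line %d: no identifier directly after \">\"\n", NR
        bad = 1
        next
      }
      if (seen[id]++) {
        printf "line %d: duplicate identifier %s\n", NR, id
        bad = 1
      }
    }
    # Exit status 0 when all identifiers pass, 1 otherwise.
    END { exit bad + 0 }
  ' "$1"
}
```

For example, `check_fasta_ids mydb.fsa && echo "identifiers OK"` prints the message only when no problems are found.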

Create_BLAST_database-2.6.0__with_parse_seqids_option is invoked with the following inputs and parameters:

  1. Input(s)
    1. Input file 
    2. Input Sequence Format
    3. Input type
    4. Prefix to use for database
  2. Parameters
    1. File containing masking data (csv format)
    2. Max per file size 

Please work through the documentation and add your comments at the bottom of this page, or email comments to support@cyverse.org. Thank you.

Test Data

All test data files are located in the Community Data directory of the iPlant Discovery Environment at the following path: Community Data > iplantcollaborative > example_data > makeblastdb (/iplant/home/shared/iplantcollaborative/example_data/makeblastdb/)

Run Create_BLAST_database-2.6.0__with_parse_seqids_option as follows:

  1. Input(s)
    1. Input file : /iplant/home/shared/iplantcollaborative/example_data/makeblastdb/plant.118.protein.faa
    2. Input Sequence Format: protein
    3. Input type: Fasta
    4. Prefix to use for database: blastdb_p
  2. Parameters
    1. File containing masking data (csv format): None
    2. Max per file size: 1GB
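
For readers who want to reproduce this run outside the Discovery Environment, the settings above roughly correspond to the makeblastdb command sketched below. The app's actual command line is not shown on this page, so the mapping from form fields to flags is an assumption, as is the presence of a local copy of the input file. A masking file of "None" is assumed to mean that the -mask_data flag is simply omitted.

```shell
# Sketch only: the field-to-flag mapping is an assumption, not the app's
# documented command line.
in_file="plant.118.protein.faa"   # Input file (assumed local copy of the test data)
db_type="prot"                    # Input Sequence Format: protein
input_type="fasta"                # Input type: Fasta
db_prefix="blastdb_p"             # Prefix to use for database
max_size="1GB"                    # Max per file size

# Assemble the command as positional parameters so each argument stays intact.
set -- makeblastdb -in "$in_file" -dbtype "$db_type" -input_type "$input_type" \
  -parse_seqids -out "$db_prefix" -max_file_sz "$max_size"
cmd_line="$*"
echo "$cmd_line"

# Run only when BLAST+ is installed and the input file is present.
if command -v makeblastdb >/dev/null 2>&1 && [ -f "$in_file" ]; then
  "$@"
  # Because identifiers were parsed, a single sequence can then be fetched
  # by ID. Replace SOME_ID with an identifier taken from the input file:
  # blastdbcmd -db "$db_prefix" -entry SOME_ID
fi
```

The script prints the assembled command in every case and executes it only when both the tool and the input file are available, so it can be inspected safely on a machine without BLAST+.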

Tool Source for App