parseBlastBpo

A utility that parses BLAST output into a 'BPO' file, which serves as input to the OrthoMCL program in the Cluster Orthologs and Paralogs and Assemble Custom Gene Sets workflow.

Notes:

If you are using this app as part of the workflow referenced above, the BLASTp output should contain the results of an All-by-All BLASTp of the entire protein-encoding gene repertoires of the species being investigated.
Please see Cluster Orthologs and Paralogs and Assemble Custom Gene Sets to learn how this app fits into the larger workflow.
App adapted from a Perl script originally written by Chih-Horng Kuo.

Quick Start

To use parseBlastBpo, simply select the BLAST output file you want to parse.

Test Data

Example input and output test data for this app appear directly in the Discovery Environment in the Data window under Community Data -> iplantcollaborative -> example_data -> homolog_clustering -> 6_Blastp_output and 7_parseBlastBpo_output, respectively.

Input File(s)

Use the BLASTp output file from the input directory above for testing.

Use the file in the parseBlastBpo output directory above to see example output for this app.

Parameters and Options Used in App

There are no parameters for this app.

Output File(s)

Expect 1 output file, named parsedBLASToutput_bpo.txt.

BPO format

1;At1g01190;535;At1g01190;535;0.0;97;1:1-535:1-535.
2;At1g01190;535;At1g01280;510;2e-56;29;1:69-499:28-474.
3;At1g01190;535;At1g11600;510;1e-45;27;1:59-531:21-509.

Each line represents one query-subject similarity relation. The fields are separated by ";" and appear in the following order:
(1) similarity id,
(2) query id,
(3) query length,
(4) subject id,
(5) subject length,
(6) BLAST E-value,
(7) percent identity,
(8) HSP info, where each HSP has the format
(8.1) HSP_id:
(8.2) query_start-query_end:
(8.3) subject_start-subject_end.
Multiple HSPs on one line are separated by ".".
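
To make the field layout concrete, here is a minimal Python sketch that splits one BPO line into the eight fields listed above. It is not part of the app, which is adapted from a Perl script. The names parse_bpo_line, BpoRecord, and Hsp are hypothetical and chosen only for illustration. The sketch assumes that IDs never contain ";" and keeps the E-value as text, because the exact spelling BLAST uses for it can vary.

```python
from typing import List, NamedTuple


class Hsp(NamedTuple):
    """One HSP entry: HSP_id:query_start-query_end:subject_start-subject_end."""
    hsp_id: int
    query_start: int
    query_end: int
    subject_start: int
    subject_end: int


class BpoRecord(NamedTuple):
    """One line of a BPO file (hypothetical container, not part of the app)."""
    similarity_id: int
    query_id: str
    query_length: int
    subject_id: str
    subject_length: int
    evalue: str            # kept as text, e.g. "2e-56" or "0.0"
    percent_identity: float
    hsps: List[Hsp]


def parse_bpo_line(line: str) -> BpoRecord:
    """Split a single BPO line into its eight ';'-separated fields."""
    fields = line.strip().split(";")
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, found {len(fields)}: {line!r}")
    sim_id, query_id, query_len, subject_id, subject_len, evalue, pident, hsp_info = fields

    hsps = []
    for chunk in hsp_info.split("."):
        if not chunk:
            continue  # the trailing "." leaves an empty chunk
        hsp_id, query_span, subject_span = chunk.split(":")
        q_start, q_end = query_span.split("-")
        s_start, s_end = subject_span.split("-")
        hsps.append(Hsp(int(hsp_id), int(q_start), int(q_end), int(s_start), int(s_end)))

    return BpoRecord(int(sim_id), query_id, int(query_len), subject_id,
                     int(subject_len), evalue, float(pident), hsps)


if __name__ == "__main__":
    print(parse_bpo_line("2;At1g01190;535;At1g01280;510;2e-56;29;1:69-499:28-474."))
```

Splitting on ";" first and on "." only inside the last field means that gene IDs containing a "." would still be read correctly.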

IMPORTANT:
1. The similarity ID is the line ID of the BPO file, so it must start at 1 on the first line and be consecutive through the whole file.
2. A BPO file is a parsing result from BLAST, so the hits for each query gene ID cannot be scattered through the file; they must be listed on adjacent lines.
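
The two requirements above can be checked mechanically before a BPO file is passed on to OrthoMCL. The following Python sketch is one possible check, not something the app itself provides. The function name check_bpo_lines is hypothetical, and the sketch assumes that the first two ";"-separated fields of each line are the similarity ID and the query ID, as described in the field list.

```python
from typing import Iterable, List


def check_bpo_lines(lines: Iterable[str]) -> List[str]:
    """Return a list of problems found; an empty list means both rules hold."""
    problems = []
    finished_queries = set()   # queries whose block of hits has already ended
    current_query = None
    expected_id = 1            # the similarity ID must equal the line count

    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        fields = line.split(";")
        similarity_id, query_id = int(fields[0]), fields[1]

        # Rule 1: IDs start at 1 and are consecutive.
        if similarity_id != expected_id:
            problems.append(
                f"expected similarity id {expected_id}, found {similarity_id}")
        expected_id += 1

        # Rule 2: all hits of one query sit on adjacent lines.
        if query_id != current_query:
            if query_id in finished_queries:
                problems.append(
                    f"hits for query {query_id} are not adjacent "
                    f"(similarity id {similarity_id})")
            if current_query is not None:
                finished_queries.add(current_query)
            current_query = query_id

    return problems


if __name__ == "__main__":
    # The file name is the output name given in this manual.
    with open("parsedBLASToutput_bpo.txt") as handle:
        for problem in check_bpo_lines(handle):
            print(problem)
```

The check reads the file once and remembers only the query IDs already seen, so it should remain practical for large All-by-All results.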