BLASTX Contig Annotation
BLASTX Contig Annotation identifies gene space within contigs of genome and transcriptome assemblies on the basis of homology to previously characterized sequences. The method runs in two stages: Stage 1 aligns contigs to a reference database, and Stage 2 characterizes those alignments and produces contig-level output and summary tables to help evaluate assembly quality and gene-space coverage.
Stage 2 takes as input the blastx alignment data produced by Stage 1 and returns several text files, detailed below.
Quick Start
- To use Stage 2, you must first perform Stage 1 (blastx alignment). Upload the results.txt file from Stage 1 before launching Stage 2.
Test Data
All files are located in the Community Data directory of the iPlant Discovery Environment at the following path:
Community Data > iplantcollaborative > example_data > blastx_contig_annotation
Input File(s)
As test data, use results.txt
Parameters Used in App
- Default parameters only, no further configuration needed.
- The default E-value cutoff for a significant match is 1e-02.
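As a rough illustration of how an E-value cutoff separates contigs with a significant match from those without, here is a minimal Python sketch. It is not the app's code: the function names and the input dictionary are hypothetical, and treating a hit exactly at the cutoff as significant (`<=`) is an assumption, since the comparison used by the app is not stated here.

```python
# Default cutoff stated in the parameters above.
E_VALUE_CUTOFF = 1e-02


def is_significant(e_value, cutoff=E_VALUE_CUTOFF):
    """Return True if an E-value passes the cutoff.

    Assumption: a hit at or below the cutoff counts as significant.
    """
    return e_value <= cutoff


def split_by_significance(top_hits, cutoff=E_VALUE_CUTOFF):
    """Split contigs into (matched, no_hit) lists.

    top_hits is a hypothetical dict mapping contig id to the E-value of
    its top hit, or None when the contig had no alignment at all.
    """
    matched, no_hit = [], []
    for contig_id, e_value in top_hits.items():
        if e_value is not None and is_significant(e_value, cutoff):
            matched.append(contig_id)
        else:
            no_hit.append(contig_id)
    return matched, no_hit


if __name__ == "__main__":
    example = {"contig_1": 1e-10, "contig_2": 0.5, "contig_3": None}
    print(split_by_significance(example))
```

In this sketch, contigs whose top hit fails the cutoff are grouped with contigs that had no alignment, which mirrors the split between match_report.txt and no_hit_contigs.txt only in spirit.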
Output File(s)
There are 5 output text files:
- match_report.txt
- Reports those contigs that gave a significant hit to the database
- example: stage_2 > output > match_report.txt
Output is a table with the following columns:
Sequence id of contig query
Length (bp) of contig
Number of database hits (always 1, because the current default method collects only the top hit)
Number of unique species hit (always 1, because the current default method collects only the top hit)
Sequence id of RefSeq top hit
Length (bp) of RefSeq top hit
E-value of top hit
Number of unique contigs in the data set that aligned to this RefSeq protein as top hit. This can serve as an indicator of repetitiveness.
Percent of matched amino acids within aligned region of subject RefSeq protein
Percent of nucleotides in the contig query that are contained in aligned HSPs
Percent of amino acids in the RefSeq top hit that are contained in aligned HSPs
Query Hit Length Ratio (QHLR): one third of the contig length (bp) divided by the length of the hit protein (amino acids). Provides an upper limit on the "completeness" of the contig.
Species of RefSeq top hit
Description on the fasta header line of the RefSeq top hit
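The Query Hit Length Ratio described above can be sketched in Python as follows. This is an illustrative calculation based on the column description, not the app's source; the function name and argument names are hypothetical.

```python
def query_hit_length_ratio(contig_length_bp, hit_length_aa):
    """Compute (contig length / 3) / hit protein length.

    contig_length_bp: contig length in nucleotides.
    hit_length_aa: length of the top-hit protein in amino acids.
    """
    if hit_length_aa <= 0:
        raise ValueError("hit_length_aa must be positive")
    # A contig of N nucleotides can encode at most N / 3 amino acids.
    return (contig_length_bp / 3) / hit_length_aa


if __name__ == "__main__":
    # A 900 bp contig against a 400 aa protein.
    print(query_hit_length_ratio(900, 400))  # 0.75
```

Read this way, a ratio below 1 means the contig is too short to span the full hit protein, which is why the ratio acts as an upper limit on completeness.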
- no_hit_contigs.txt
- Lists the IDs of contigs that failed to show a significant match to the RefSeq database
- example: stage_2 > output > no_hit_contigs.txt
- top_hit_species.txt
- Summed counts of contigs with significant hits broken down by hit species
- example: stage_2 > output > top_hit_species.txt
- diversity_report.txt
- Counts of unique reference genes hit across all contig alignments, broken down by species.
- example: stage_2 > output > diversity_report.txt
- quality_report.txt
- Summary statistics: distributions of Query Hit Length Ratio (QHLR), E-value, and hit coverage among top-hit alignments. Also gives length distributions for the set of contigs with significant alignments and the set of contigs without.
- example: stage_2 > output > quality_report.txt
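The per-species summaries in top_hit_species.txt and diversity_report.txt can be pictured with the short Python sketch below. It is an assumption-laden illustration: the input is a hypothetical list of (contig id, RefSeq id, species) tuples with one top hit per contig, and it does not read or reproduce the actual file layouts, which are not specified here.

```python
from collections import Counter, defaultdict


def summarize_top_hits(top_hits):
    """Summarize top hits by species.

    top_hits: iterable of (contig_id, refseq_id, species) tuples,
    one per contig with a significant hit (hypothetical input shape).

    Returns two dicts keyed by species:
      - number of contigs whose top hit is in that species
      - number of unique reference proteins hit in that species
    """
    contigs_per_species = Counter()
    genes_per_species = defaultdict(set)
    for contig_id, refseq_id, species in top_hits:
        contigs_per_species[species] += 1
        genes_per_species[species].add(refseq_id)
    unique_genes = {sp: len(ids) for sp, ids in genes_per_species.items()}
    return dict(contigs_per_species), unique_genes


if __name__ == "__main__":
    hits = [
        ("contig_1", "NP_0001", "Species A"),
        ("contig_2", "NP_0001", "Species A"),
        ("contig_3", "NP_0002", "Species B"),
    ]
    print(summarize_top_hits(hits))
```

In the example, two contigs hit the same protein in Species A, so that species has a contig count of 2 but only 1 unique reference gene.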
Tool Source for App