BLASTX Contig Annotation
BLASTX Contig Annotation identifies gene space within contigs of genome and transcriptome assemblies on the basis of homology to previously characterized sequences. The method is carried out in two stages. Stage 1 aligns contigs to a reference database. Stage 2 characterizes the resulting alignments and provides contig-level output and summary tables to help evaluate assembly quality and gene-space coverage.
In Stage 2, the user provides the blastx alignment data produced by Stage 1. Several text files are returned as output, as detailed below.
Quick Start
- To use Stage 2, you must first perform Stage 1 (blastx alignment). Upload the results.txt file from Stage 1 before launching Stage 2.
Test Data
All files are located in the Community Data directory of the iPlant Discovery Environment at the following path:
Community Data > iplantcollaborative > example_data > blastx_contig_annotation
Input File(s)
As test data, use results.txt
Parameters Used in App
- Default parameters only, no further configuration needed.
- The E-value cutoff for a significant match defaults to 1e-02.
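The cutoff described above can be pictured with a minimal sketch. It is not taken from the app's source: the helper name `is_significant`, the sample contig IDs, and the choice of `<=` rather than `<` for the comparison are all assumptions made for illustration.

```python
# Hypothetical helper; the app's actual implementation is not shown in this page.
DEFAULT_EVALUE_CUTOFF = 1e-02  # default value stated in the parameter list


def is_significant(evalue, cutoff=DEFAULT_EVALUE_CUTOFF):
    """Return True when an alignment's E-value passes the cutoff.

    Assumption: a hit counts as significant when evalue <= cutoff.
    The page does not say whether the comparison is strict.
    """
    return float(evalue) <= cutoff


# Made-up top-hit E-values keyed by contig ID, for illustration only.
hits = {"contig_1": "3e-45", "contig_2": "0.5", "contig_3": "1e-02"}

# Contigs that would be reported as matched vs. unmatched under this rule.
matched = sorted(c for c, e in hits.items() if is_significant(e))
unmatched = sorted(c for c, e in hits.items() if not is_significant(e))
print(matched, unmatched)
```

Under this reading, contigs that pass the test would appear in match_report.txt and the rest in no_hit_contigs.txt.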
Output File(s)
There are 5 output text files:
- match_report.txt
- Reports those contigs that gave a significant hit to the database
- example: blastx_contig_annotation/stage_2/output/match_report.txt
Output is a table with the following information:
Column header | Description
---|---
contig_id | Sequence ID of the contig query
contig_len | Length (bp) of the contig
hit_cnt | Number of database hits (always 1 in this case because the current default method collects the top hit only)
species_cnt | Number of unique species hit (always 1 in this case because the current default method collects the top hit only)
top_h_id | Sequence ID of the RefSeq top hit
top_h_len | Length (bp) of RefSeq top hit
e-value | E-value of the top hit
hit_freq | Number of unique contigs in the data set that aligned to this RefSeq protein as top hit. This could be used as an indicator of repetitiveness
pct_id | Percent of matched amino acids within the aligned region of the subject RefSeq protein
contig_cov | Percent of nucleotides in the contig query that are contained in aligned HSPs
top_h_cov | Percent of amino acids in the RefSeq top hit that are contained in aligned HSPs
qhlr | Query Hit Length Ratio: (contig length / 3) / length of hit protein. Provides an upper limit on the "completeness" of the contig
top_h_species | Species of the RefSeq top hit
hit_desc | Description on the FASTA header line of the RefSeq top hit
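As a worked illustration of the qhlr and coverage columns, the following is a rough sketch under stated assumptions, not the app's code. The function names are hypothetical, HSP coordinates are assumed to be 1-based and inclusive, and overlapping HSPs are assumed to be counted once. The page does not confirm how the app handles overlaps.

```python
def qhlr(contig_len, top_h_len):
    """Query Hit Length Ratio: (contig length / 3) / hit protein length."""
    return (contig_len / 3) / top_h_len


def covered_positions(intervals):
    """Count positions covered by 1-based inclusive intervals.

    Overlapping intervals are merged so that each position counts once
    (an assumption; the app's overlap handling is not documented here).
    """
    total = 0
    current_end = 0
    # Normalize reversed coordinates (minus-strand HSPs), then sort by start.
    for start, end in sorted((min(s, e), max(s, e)) for s, e in intervals):
        if end <= current_end:
            continue  # fully contained in what is already counted
        total += end - max(start, current_end + 1) + 1
        current_end = end
    return total


def percent_coverage(intervals, seq_len):
    """Percent of a sequence of length seq_len covered by the intervals."""
    return 100.0 * covered_positions(intervals) / seq_len


# Made-up example: a 1200 bp contig whose top hit is a 500 aa protein,
# with two overlapping HSPs on the contig.
ratio = qhlr(1200, 500)
contig_cov = percent_coverage([(1, 300), (201, 600)], 1200)
print(ratio, contig_cov)
```

A qhlr near or above 1 means the contig is long enough to encode the whole hit protein, which is why the ratio is described as an upper limit on completeness rather than a measure of it.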
- no_hit_contigs.txt
- Lists the IDs of contigs that showed no significant match to the RefSeq database
- example: stage_2 > output > no_hit_contigs.txt
- top_hit_species.txt
- Summed counts of contigs with significant hits broken down by hit species
- example: stage_2 > output > top_hit_species.txt
- diversity_report.txt
- Counts of unique reference genes hit across contig alignments, broken down by species.
- example: stage_2 > output > diversity_report.txt
- quality_report.txt
- Summary statistics. Gives the distribution of Query Hit Length Ratio (QHLR), E-value, and hit coverage among top-hit alignments, plus length distributions for the set of contigs with significant alignments and the set of contigs without significant alignment.
- example: stage_2 > output > quality_report.txt
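The per-species summaries described above can be approximated from a match report with a short script. This is a sketch only: it assumes match_report.txt is tab-delimited with a header row using the column names listed in the table, which this page does not confirm. The function names and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def read_rows(handle):
    """Read a match_report-style table (assumed tab-delimited with header)."""
    return list(csv.DictReader(handle, delimiter="\t"))


def contigs_per_species(rows):
    """Count contigs per top-hit species, in the spirit of top_hit_species.txt."""
    return Counter(row["top_h_species"] for row in rows)


def unique_hits_per_species(rows):
    """Count distinct top-hit IDs per species, in the spirit of diversity_report.txt."""
    seen = {}
    for row in rows:
        seen.setdefault(row["top_h_species"], set()).add(row["top_h_id"])
    return {species: len(ids) for species, ids in seen.items()}


# Made-up rows for illustration only; not real app output.
sample = io.StringIO(
    "contig_id\ttop_h_id\ttop_h_species\n"
    "c1\tp1\tZea mays\n"
    "c2\tp2\tOryza sativa\n"
    "c3\tp1\tZea mays\n"
)
rows = read_rows(sample)
contig_counts = contigs_per_species(rows)
gene_counts = unique_hits_per_species(rows)
for species, n in contig_counts.most_common():
    print(species, n, gene_counts[species])
```

To try this on a real report, replace the `io.StringIO` sample with `open("match_report.txt", newline="")` and check first that the delimiter and header names match the file.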
Tool Source for App
...