BLASTX Contig Annotation

BLASTX Contig Annotation identifies gene space within the contigs of genome and transcriptome assemblies on the basis of homology to previously characterized sequences. The method runs in two stages. Stage 1 aligns the contigs to a reference database. Stage 2 characterizes these alignments and produces contig-level output and summary tables that help evaluate assembly quality and gene-space coverage.

In Stage 2, the user provides the blastx alignment data produced by Stage 1. Several text files are returned as output, as detailed below.

Quick Start

  • To use Stage 2, you must first perform Stage 1 (blastx alignment).  Upload the results.txt file from Stage 1 before launching Stage 2.

Test Data

All files are located in the Community Data directory of the iPlant Discovery Environment at the following path:

Community Data > iplantcollaborative > example_data > blastx_contig_annotation

Input File(s)

As test data, use results.txt

Parameters Used in App

  • Default parameters only, no further configuration needed.
  • The E-value cutoff for a significant match defaults to 1e-02.
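
To illustrate how an E-value cutoff separates contigs with a significant match from those without, here is a minimal Python sketch. It is not the app's own code: the function name `split_by_evalue`, the contig names, and the E-values are hypothetical, and treating a hit as significant when its E-value is less than or equal to the cutoff (rather than strictly less) is an assumption.

```python
# Default cutoff stated on this page.
E_VALUE_CUTOFF = 1e-02


def split_by_evalue(top_hits, cutoff=E_VALUE_CUTOFF):
    """Split contigs into (matched, no_hit) lists.

    top_hits maps contig id -> E-value of its top hit, or None when the
    contig produced no alignment at all. Using "<=" is an assumption.
    """
    matched, no_hit = [], []
    for contig_id, evalue in top_hits.items():
        if evalue is not None and evalue <= cutoff:
            matched.append(contig_id)
        else:
            no_hit.append(contig_id)
    return matched, no_hit


# Hypothetical example values, for illustration only.
hits = {"contig_1": 3e-40, "contig_2": 0.5, "contig_3": None}
matched, no_hit = split_by_evalue(hits)
print("significant:", matched)
print("no hit:", no_hit)
```

In this sketch, contigs in the first list would correspond to rows of match_report.txt and contigs in the second list to entries of no_hit_contigs.txt.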

Output File(s)

There are 5 output text files:

  • match_report.txt
    • Reports those contigs that gave a significant hit to the database
    • example: stage_2 > output > match_report.txt
    • Output is a table with the following information:

      Column header    Description
      contig_id        Sequence ID of contig query
      contig_len       Length (bp) of contig
      hit_cnt          Number of database hits (always 1 here because the current default method collects the top hit only)
      species_cnt      Number of unique species hit (always 1 here because the current default method collects the top hit only)
      top_h_id         Sequence ID of RefSeq top hit
      top_h_len        Length (amino acids) of RefSeq top hit
      e-value          E-value of top hit
      hit_freq         Number of unique contigs in the data set that aligned to this RefSeq protein as top hit. This can be used as an indicator of repetitiveness.
      pct_id           Percent of matched amino acids within the aligned region of the subject RefSeq protein
      contig_cov       Percent of nucleotides in the contig query that are contained in aligned HSPs
      top_h_cov        Percent of amino acids in the RefSeq top hit that are contained in aligned HSPs
      qhlr             Query Hit Length Ratio: (contig length / 3) / length of hit protein. Provides an upper limit on the "completeness" of the contig.
      top_h_species    Species of RefSeq top hit
      hit_desc         Description on the FASTA header line of the RefSeq top hit

  • no_hit_contigs.txt
    • Lists the IDs of contigs that failed to show a significant match to the RefSeq database
    • example: stage_2 > output > no_hit_contigs.txt
  • top_hit_species.txt
    • Summed counts of contigs with significant hits broken down by hit species
    • example: stage_2 > output > top_hit_species.txt
  • diversity_report.txt
    • Counts of unique reference genes hit across all contig alignments, broken down by species
    • example: stage_2 > output > diversity_report.txt
  • quality_report.txt
    • Summary statistics: the distributions of Query Hit Length Ratio (QHLR), E-value, and hit coverage among top-hit alignments, plus length distributions for the set of contigs with significant alignments and the set of contigs without.
    • example: stage_2 > output > quality_report.txt
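
The relationships among the output files can be sketched in Python. The example below recomputes QHLR, hit_freq, and per-species contig counts from match_report-style rows. It is an illustration under stated assumptions, not the tool's source: it assumes a tab-delimited file with a header row using the column names listed above, it uses only a subset of those columns, and the sample rows, the function names `qhlr` and `summarize`, and all values are hypothetical.

```python
import csv
import io
from collections import Counter


def qhlr(contig_len, top_h_len):
    """Query Hit Length Ratio: (contig length / 3) / length of hit protein."""
    return (contig_len / 3) / top_h_len


def summarize(handle):
    """Summarize match_report-style rows read from a file-like object.

    Assumes tab-delimited text with a header row (an assumption about
    the file layout, not something this page specifies).
    """
    rows = list(csv.DictReader(handle, delimiter="\t"))
    # Contigs with a significant hit, counted per hit species.
    species_counts = Counter(row["top_h_species"] for row in rows)
    # Number of contigs whose top hit is each RefSeq protein.
    hit_freq = Counter(row["top_h_id"] for row in rows)
    # QHLR per contig, recomputed from the two length columns.
    ratios = {
        row["contig_id"]: qhlr(int(row["contig_len"]), int(row["top_h_len"]))
        for row in rows
    }
    return species_counts, hit_freq, ratios


# Hypothetical rows with a subset of the columns, for illustration only.
SAMPLE = (
    "contig_id\tcontig_len\ttop_h_id\ttop_h_len\ttop_h_species\n"
    "c1\t900\tp1\t300\tZea mays\n"
    "c2\t450\tp1\t300\tZea mays\n"
    "c3\t600\tp2\t400\tOryza sativa\n"
)

species_counts, hit_freq, ratios = summarize(io.StringIO(SAMPLE))
print(dict(species_counts))
print(dict(hit_freq))
print(ratios)
```

To try this on real output, replace `io.StringIO(SAMPLE)` with an open file handle for match_report.txt, after checking that the file's delimiter and header names match the assumptions above.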

Tool Source for App

...