AgBase GOanna 2.1
GOanna allows users to quickly add more GO annotations to their gene or gene product list by transferring GO annotations based on sequence homology. GOanna is designed to be a special use case of BLAST, and normal BLAST rules about alignments and what constitutes a 'good' alignment apply.
In this version of GOanna, the databases you can search against have been re-designed to provide better results for a broader range of species.
To use GOanna 2.1, import your data as a fasta file of either nucleotide or protein sequences.
Test data for this app appears directly in the Discovery Environment in the Data window under Community Data -> iplantcollaborative -> example_data -> directory.
The input file is a fasta file of either nucleotide or protein sequences.
Parameters Used in App
BLAST method: select BlastP for protein inputs or BlastX for nucleotide inputs.
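When it is unclear whether a fasta file holds nucleotide or protein sequences, a quick check of the residue alphabet can suggest which BLAST method to select. The following is a minimal sketch, not part of GOanna: the function names and the 0.9 cutoff are assumptions chosen for illustration.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Letters that are typical of nucleotide sequences (N = unknown base).
NUCLEOTIDE_LETTERS = set("ACGTUN")


def suggest_blast_method(records, threshold=0.9):
    """Return 'BlastX' if the residues look like nucleotides, else 'BlastP'.

    threshold is an illustrative cutoff (an assumption), not a GOanna setting.
    """
    total = sum(len(seq) for _, seq in records)
    if total == 0:
        raise ValueError("no sequence data found")
    nucleotide_count = sum(
        ch in NUCLEOTIDE_LETTERS for _, seq in records for ch in seq.upper()
    )
    return "BlastX" if nucleotide_count / total >= threshold else "BlastP"
```

For example, suggest_blast_method(read_fasta(open("my_sequences.fasta").read())) would return the suggested method for a hypothetical file named my_sequences.fasta. Short peptides made up mostly of A, C, G and T could be misclassified, so treat the result as a hint only.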
Database: GOanna provides databases containing only sequences that have GO annotations.
AgBase_community: This database contains GO annotations held at AgBase that have not yet been included in the GO Consortium annotation files.
(a) more general databases
SwissProt: SwissProt is the subset of the UniProt database that contains manually annotated and reviewed proteins. It is the most highly annotated subset of UniProt.
UniProt: This database contains both SwissProt and TrEMBL proteins. TrEMBL is a subset of the UniProt database which contains proteins that are automatically annotated and these records may have less functional information.
AgBase-UniProt: This is the largest database, containing all GOA proteins from UniProt supplemented with additional proteins in the AgBase files.
(b) Taxonomic Subsets databases
These databases are created by subsetting the larger databases based upon taxonomy.
Current GOanna database statistics can be found here: http://www.agbase.msstate.edu/cgi-bin/agbase_species_blast_detail.pl
Please contact AgBase at email@example.com to request customized databases.
Filter database sequences with no GO annotations/only IEA/ND annotations: It is strongly recommended to select this feature if you are using GOanna to transfer GO annotations from one sequence to another. Non-experimental evidence codes may already be based upon sequence analyses, and transferring a function that was itself inferred from another sequence may result in poor functional information for your data set. If you wish to do functional prediction for a large data set, we recommend InterProScan as a complementary approach.
Number of descriptions: the number of matches returned for each query. The best value will change based upon database size; we recommend setting it to 3-5.
Word-size: The BLAST program works by finding local alignments, or 'word' matches, between the query and database sequences and extending these matches. For blastp and blastx, non-exact word matches are extended. Because the amount of similarity required can be varied, one normally uses only word sizes 2 and 3 for these searches.
EXPECT: This specifies the statistical significance threshold for reporting matches. The default value (10) means that 10 such matches are expected to be found merely by chance (Karlin and Altschul, 1990). Lower EXPECT thresholds are more stringent, leading to fewer chance matches being reported.
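To make the meaning of the EXPECT value concrete, the sketch below uses the standard relationship between a normalized bit score S' and the expected number of chance matches, E = m * n * 2^(-S'), where m is the query length and n is the total database length. The function names are illustrative assumptions. BLAST itself uses edge-corrected "effective" lengths, so these numbers approximate rather than reproduce the E-values that GOanna reports.

```python
def expected_chance_hits(bit_score, query_length, database_length):
    """Approximate E-value for an alignment with the given bit score.

    Uses E = m * n * 2**(-S'), ignoring BLAST's effective-length correction.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


def passes_expect(evalue, threshold=10.0):
    """True if a match would be reported at the given EXPECT threshold."""
    return evalue <= threshold


# A 1,024-residue query against a 1,048,576-residue database:
# a 30-bit alignment is expected about once by chance,
# while a 40-bit alignment is expected about 0.001 times.
for score in (30, 40):
    e = expected_chance_hits(score, 1024, 1048576)
    print(score, e, passes_expect(e, threshold=0.01))
```

Each additional bit of score halves the expected number of chance matches, which is why lowering the EXPECT threshold removes the weakest alignments first.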
Matrix: The matrix refers to the substitution matrix used to assign alignment scores for any possible pair of residues. More information about the types of substitution matrices is here. The Gap Cost for the matrix refers to the cost of making a gap in the alignment, as determined by the matrix selected.
Gap costs: Controls how scoring for an alignment will be penalized for inserting/extending a gap.
Low complexity: This function masks segments of the query sequence that have low compositional complexity by substituting the letter "N" in nucleotide sequences (e.g., "NNNNNNNNNNNNN") and the letter "X" in protein sequences (e.g., "XXXXXXXXX"). Filtering is only applied to the query sequence (or its translation products), not to database sequences. Filtering can eliminate spuriously high alignment scores that reflect compositional bias rather than significant position-by-position alignment (e.g., hits against proline-rich regions or poly-A tails).
This feature lets you choose what types of GO evidence codes to return for the sequence matches. The GO Consortium recommends that if you are using BLAST to transfer GO from one sequence to another, you should only transfer "Experimental Evidence Codes". Experimental Evidence Codes (EXP, IDA, IPI, IMP, IGI, IEP) is set as the default. More information about GO Evidence codes is available here:
The results are returned as a zipped file which contains:
.xls file - Opens in Excel and contains the inputs, their matches and GO annotations for the hits that passed filters for Expect value, Percent Identity and Query_Coverage.
align.html file - Optional file output. Depending on format selection, if present in the zip folder it contains the corresponding alignments for the blast matches. Note that this file might contain more hits than appear in the Excel (xls) output because the Percent_Identity and Query_Coverage filters are applied after blasting.
Note that the .xls file contains hyperlinks to the align file; to view these correctly, the files should be in the same folder.
.tsv file - Optional file output. Depending on format selection, if present it contains the corresponding blast matches in a tab-delimited format. Note that this file might contain more hits than appear in the Excel (xls) output because the Percent_Identity and Query_Coverage filters are applied after blasting. This file is useful for selecting results based upon % coverage and % identity of the alignment.
sliminput.txt - a GOSummary file for the results that can be used in the AgBase GOSlimViewer tool, which summarizes GO results.
Best Practices for using GOanna
GOanna is a Blast-based tool, and to get high-quality results you should follow general Blast guidelines for selecting only "good" alignments. Some hints for using GOanna effectively are:
GO annotation of a large data set
GO annotation of sequences that do not already have GO assigned allows researchers to do functional enrichment of their functional genomics data sets. To rapidly GO annotate a transcriptome/proteome data set, we recommend using a combined approach of
(a) identifying functional motifs/domains using InterProScan (see: https://pods.iplantcollaborative.org/wiki/display/DEapps/InterproScan5-44.0)
(b) transferring function from homologous sequences based upon Blast (using GOanna, Blast2GO, etc).
These two approaches are complementary and we recommend combining the results of each into a single gene association file.
GO Evidence Codes and transferring GO
GO evidence codes can be broadly grouped into
- Experimental Evidence codes
- Computational Analysis evidence codes
- Author Statement evidence codes
- Curatorial Statement codes
- Automatically-Assigned evidence code
GO Consortium best practice is to only transfer GO based upon Experimental evidence codes (direct, experimental evidence in a species).
Transferring annotations with a Qualifier (See: http://geneontology.org/page/go-annotation-conventions#qual) should also be avoided, as these are usually species-specific statements of function.
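The two recommendations above (experimental evidence codes only, no qualifiers) can be expressed as a simple filter. This is a minimal sketch under stated assumptions: the dictionary keys "go_id", "evidence" and "qualifier" are hypothetical field names, and the sample records are made up for illustration rather than taken from GOanna output.

```python
# Experimental evidence codes as listed for the GOanna default setting.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def transferable(annotations):
    """Keep annotations that have an experimental code and no qualifier."""
    return [
        a for a in annotations
        if a["evidence"] in EXPERIMENTAL_CODES and not a.get("qualifier")
    ]


# Made-up example records (hypothetical field names).
hits = [
    {"go_id": "GO:0005524", "evidence": "IDA", "qualifier": ""},
    {"go_id": "GO:0005524", "evidence": "IEA", "qualifier": ""},
    {"go_id": "GO:0005634", "evidence": "IMP", "qualifier": "NOT"},
]

for annotation in transferable(hits):
    print(annotation["go_id"], annotation["evidence"])
```

Only the first record survives: the second is electronically inferred (IEA), and the third carries a qualifier.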
Tips for using Blast
GOanna is designed to show alignments for matched sequences (align.html file); however, we realize that it is not practical to review every alignment for large data sets. We recommend testing the GOanna Blast results using a small subset of your query sequences (about 100-200 randomly picked sequences, seeded to include the shortest and longest sequences). This test set can be run with the default parameters and the alignments reviewed. By noting the parameters for a "good" alignment, researchers will be able to set more useful parameters for their full data set.
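One possible way to build such a test set is sketched below. It assumes the sequences have already been read into a list of (header, sequence) tuples; the function name, the default size of 150 and the fixed random seed are illustrative choices, not GOanna features.

```python
import random


def pick_test_subset(records, size=150, seed=0):
    """Randomly pick `size` records, always including the shortest and longest.

    records: list of (header, sequence) tuples.
    """
    if size < 2:
        raise ValueError("size must be at least 2")
    if len(records) <= size:
        return list(records)
    # Indices ordered by sequence length; first is shortest, last is longest.
    by_length = sorted(range(len(records)), key=lambda i: len(records[i][1]))
    seeded = {by_length[0], by_length[-1]}
    remaining = [i for i in range(len(records)) if i not in seeded]
    rng = random.Random(seed)  # fixed seed makes the subset reproducible
    chosen = seeded | set(rng.sample(remaining, size - len(seeded)))
    # Return records in their original file order.
    return [records[i] for i in sorted(chosen)]
```

The selected records can then be written back out as a fasta file and submitted to GOanna with the default parameters.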
In addition, GOanna now outputs a tsv file of results - this file can be sorted to ensure that only matches with set % identity and % coverage are used to transfer GO to your sequences. A good starting point for setting % identity and % coverage is 70/70 although these numbers will change based upon the species you are studying and how closely related it is to the species of the matching sequences.
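The tsv filtering step might look like the sketch below. The column headers "Query", "Percent_Identity" and "Query_Coverage" are assumptions made for illustration; check the header line of your own tsv file and adjust the names accordingly. The 70/70 defaults follow the starting point suggested above.

```python
import csv
import io


def filter_hits(tsv_text, min_identity=70.0, min_coverage=70.0):
    """Return rows whose % identity and % coverage meet both cutoffs.

    Assumes a header row with 'Percent_Identity' and 'Query_Coverage'
    columns (hypothetical names).
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row for row in reader
        if float(row["Percent_Identity"]) >= min_identity
        and float(row["Query_Coverage"]) >= min_coverage
    ]


# Made-up example data in the assumed layout.
example = (
    "Query\tPercent_Identity\tQuery_Coverage\n"
    "seq1\t85.2\t91.0\n"
    "seq2\t64.0\t95.5\n"
    "seq3\t72.3\t55.1\n"
)

for row in filter_hits(example):
    print(row["Query"])
```

With the example data, only seq1 passes both cutoffs. For a distantly related species, lower values can be passed through the min_identity and min_coverage arguments.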
To improve the run time for your Blast results and to decrease spurious results, we recommend using the more specific databases based upon taxonomic subsets.