tbl2asn (gapped)-22.9 using DE
The CyVerse App Store is currently being restructured, and apps are being moved to an HPC environment. During this transition, users may occasionally be unable to locate or use apps that are listed in our tutorials. In many cases, these apps can be found using the search bar at the top of the Apps window in the DE. To improve the chance of a match, search for only the portion of the app name that refers to the app's function or origin rather than the full name and version number (e.g. 'SOAPdenovo' instead of 'SOAPdenovo-Trans 1.01'). Also, as part of the 2.8 app categorization, a number of apps were deprecated and are no longer available, and there is no longer an Archive category. You can search for a suitable replacement in the List of Applications in this window, or search on an app name or tool used for an app in the Apps window search field. If you need an app reinstated, please contact support@cyverse.org.
The DE Quick Start tutorial provides an introduction to basic DE functionality and navigation.
Rationale and background:
Tbl2asn is a command-line program that automates the creation of sequence records for submission to GenBank. It uses many of the same functions as Sequin but is driven generally by data files. Tbl2asn generates .sqn files from a template file and the sequence data; these files can be submitted to GenBank without additional manual editing. If your contig sequences include runs of N's that represent gaps, you will need to include assembly_gap features with the appropriate linkage evidence. If the sequences meet certain requirements, you can generate a gapped submission with tbl2asn using the arguments -l (to add linkage evidence) and -a (to add assembly_gaps), as described below.
Pre-Requisites
A CyVerse account. (Register for a CyVerse account at user.cyverse.org.)
Mandatory arguments
- Template file containing a text ASN.1 Submit-block object (suffix .sbt).
- Nucleotide sequence data in FASTA format (suffix .fsa). This is a single FASTA file containing either one sequence or multiple sequences.
- Linkage Evidence: Type of evidence used to assert linkage across the gaps. These are the available options (they correspond to the options for column 9 of an AGP file):
- Output file name
Optional arguments
- Feature Table or Annotation file (suffix .tbl). [Required only if including annotation]
- Structured comment file (suffix .cmt)
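The suffixes listed above can be checked before launching an analysis. The following is a minimal sketch in Python; the role names (`template`, `fasta`, and so on) and the function `check_suffixes` are hypothetical names introduced for illustration, and this page does not state whether the app itself enforces the suffixes.

```python
from pathlib import Path
from typing import Dict, List

# Suffixes as listed on this page; the role names are illustrative only.
EXPECTED_SUFFIXES = {
    "template": ".sbt",
    "fasta": ".fsa",
    "feature_table": ".tbl",
    "structured_comment": ".cmt",
}


def check_suffixes(files: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means all suffixes match."""
    problems = []
    for role, name in files.items():
        expected = EXPECTED_SUFFIXES.get(role)
        if expected is None:
            problems.append(f"unknown input role: {role}")
        elif Path(name).suffix != expected:
            problems.append(f"{role}: expected suffix {expected}, got {name}")
    return problems
```

For example, `check_suffixes({"template": "template_BP_BS.sbt", "fasta": "contigs.fasta"})` reports one problem, for the FASTA file.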
Gap details
There are two types of gap lengths:
- Estimated Gap length: The approximate gap size is known. This is also used if the gap is known to be small (e.g. the gap could be between 10 and 50 N's).
- Unknown Gap length: The gap size is not known (e.g. gap could be 50 or 50000 N's) but the order and orientation of the contigs are known. We suggest using 100 N's to represent gaps of unknown length rather than a random number because it will allow you to add assembly_gap features using tbl2asn.
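To see how the two gap types relate to the N's in a sequence, it can help to list every run of N's and label it. The sketch below is an illustration only and is not how tbl2asn is implemented: the function name `classify_n_runs`, the default threshold of 10, and the label strings are assumptions made for this example. It follows the convention described above, in which a run of exactly 100 N's stands for a gap of unknown length.

```python
import re
from typing import List, Optional, Tuple


def classify_n_runs(
    seq: str,
    min_gap_run: int = 10,
    unknown_len: Optional[int] = 100,
) -> List[Tuple[int, int, str]]:
    """Return (start, length, kind) for every run of N's in seq.

    kind is "unknown_gap" for runs of exactly unknown_len N's,
    "estimated_gap" for other runs of at least min_gap_run N's,
    and "ambiguous_bases" for shorter runs.
    Pass unknown_len=None when no gaps of unknown length are expected.
    """
    runs = []
    for match in re.finditer(r"[Nn]+", seq):
        length = match.end() - match.start()
        if unknown_len is not None and length == unknown_len:
            kind = "unknown_gap"
        elif length >= min_gap_run:
            kind = "estimated_gap"
        else:
            kind = "ambiguous_bases"
        runs.append((match.start(), length, kind))
    return runs
```

Start positions are 0-based. Whether short runs really are ambiguous bases depends on your assembler, so the threshold should come from your assembly parameters rather than from this default.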
Parameters
- Master Genome Flags
- Discrepancy Report: Recommended only for annotated genome submissions, complete or WGS. See the Discrepancy Report page for information about its output.
- Modifiers for FASTA Definition Lines: Allows the addition of source qualifiers that will be the same for each submission
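The modifiers are written as bracketed name=value pairs, as in the test examples on this page (e.g. [organism=Helicobacter pylori ABC1] [strain=ABC1]). A minimal sketch for assembling such a string from a dictionary is shown here; the function name `format_source_modifiers` is hypothetical, and the sketch does not check that the qualifier names are ones GenBank accepts.

```python
from typing import Dict


def format_source_modifiers(modifiers: Dict[str, str]) -> str:
    """Join name/value pairs into '[name=value] [name=value]' form."""
    parts = []
    for name, value in modifiers.items():
        # Brackets or '=' inside a name, or brackets inside a value,
        # would make the string ambiguous.
        if any(ch in name for ch in "[]=") or any(ch in value for ch in "[]"):
            raise ValueError(f"unsupported character in modifier {name!r}")
        parts.append(f"[{name}={value}]")
    return " ".join(parts)
```

The pairs are emitted in the order they appear in the dictionary.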
Test/sample data:
Test data for tbl2asn (gapped)-22.9 are provided in /iplant/home/shared/iplantcollaborative/example_data/tbl2asn.sample.data
Use the following inputs/outputs and parameters for testing tbl2asn (gapped)-22.9:
1. All the gaps are of estimated lengths: Every run of 5 or more Ns represents a gap of estimated length, and the linkage evidence is paired-ends:
Note that you should only include an assembly_gap for runs of N's that represent gaps. Do not add assembly_gaps for single or short runs of N's that represent ambiguous bases. You will need to check your assembly parameters to determine what the N's represent.
Mandatory arguments
- Template file - template_BP_BS.sbt
- Fasta file - sample.gapped.unknown.fsa
- Linkage evidence - paired-ends (i.e., for paired ends or mate pairs)
- Output file - out.gapped.sqn
Optional arguments
- Annotation file - multiple.tbl
- Structured comment file - assembly.cmt
Gap details
- Estimated Gap length - r5k (Runs of 5 or more N's are estimated gaps and shorter runs of N's are ambiguous bases).
Parameters
- Organism name - [organism=Helicobacter pylori ABC1] [strain=ABC1] [host=Homo sapiens] [isolation-source=blood]
- Master Genome Flag - n (default)
- Run Discrepancy report - checked (Recommended)
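For readers who want to relate these DE settings to the underlying command-line tool, the sketch below assembles an argument list for the first test case. Only -a and -l are named on this page; the other flag letters (-t, -i, -o, -f, -w, -j, -M, -Z) are assumptions based on tbl2asn's usage message and should be checked against the version you run. The function name `build_tbl2asn_args` is hypothetical, and the DE app may invoke the tool differently.

```python
from typing import List, Optional


def build_tbl2asn_args(
    template: str,
    fasta: str,
    output: str,
    linkage: str,
    gap_spec: str,
    modifiers: str,
    feature_table: Optional[str] = None,
    comment_file: Optional[str] = None,
    discrepancy_report: Optional[str] = None,
    master_genome_flag: str = "n",
) -> List[str]:
    """Build an argument list for tbl2asn (flag letters are assumed, see above)."""
    args = [
        "tbl2asn",
        "-t", template,            # template file (.sbt)
        "-i", fasta,               # FASTA input (.fsa)
        "-o", output,              # output .sqn file
        "-a", gap_spec,            # gap handling, e.g. r5k
        "-l", linkage,             # linkage evidence, e.g. paired-ends
        "-j", modifiers,           # source modifiers for the definition lines
        "-M", master_genome_flag,  # master genome flag
    ]
    if feature_table is not None:
        args += ["-f", feature_table]       # annotation file (.tbl)
    if comment_file is not None:
        args += ["-w", comment_file]        # structured comment file (.cmt)
    if discrepancy_report is not None:
        args += ["-Z", discrepancy_report]  # discrepancy report output file
    return args
```

The list can be passed to `subprocess.run` on a machine where tbl2asn is installed; building a list rather than a single string avoids quoting problems with the spaces in the organism name.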
2. All of the gaps are of unknown length: Every gap is represented by exactly 100 N's, and the linkage evidence is alignment to another genome of the same genus:
Note that all of the unknown length gaps must be 100 N's. An assembly_gap will be added for every run of 100 N's. All other N's will be ignored. Please contact us for additional instructions if there are unknown length gaps of other sizes. Note that you must know the order and orientation of the contigs. You cannot randomly link contigs using unknown (or known) length gaps. If you do not have linkage evidence, submit the sequences as individual contigs.
Mandatory arguments
- Template file - template_BP_BS.sbt
- Fasta file - sample.gapped.known.fsa
- Linkage evidence - align-genus
- Output file - out.gapped.sqn
Optional arguments
- Annotation file - multiple.tbl
- Structured comment file - assembly.cmt
Gap details
- Unknown Gap length - r100u (Every run of 100 N's is a gap of unknown length; all other N's are ignored).
Parameters
- Organism name - [organism=Helicobacter pylori ABC1] [strain=ABC1] [host=Homo sapiens] [isolation-source=blood]
- Master Genome Flag - n (default)
- Run Discrepancy report - checked (Recommended)
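Because this case requires every gap to be exactly 100 N's, it can be useful to look for runs of N's of any other length before submitting. The following is a minimal sketch; the function name `runs_not_of_length` is hypothetical, and deciding whether a reported run is a mis-sized gap or just ambiguous bases is left to you.

```python
import re
from typing import List, Tuple


def runs_not_of_length(seq: str, expected_len: int = 100) -> List[Tuple[int, int]]:
    """Return (start, length) for each run of N's whose length differs from expected_len."""
    odd_runs = []
    for match in re.finditer(r"[Nn]+", seq):
        length = match.end() - match.start()
        if length != expected_len:
            odd_runs.append((match.start(), length))
    return odd_runs
```

An empty result means every run of N's in the sequence is exactly 100 long. Start positions are 0-based.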
3. There are both estimated length and unknown length gaps: Runs of 10 or more N's are estimated gaps, shorter runs of N's are ambiguous bases, all runs of exactly 100 N's are unknown gaps, and the linkage evidence is paired-ends:
Note that all of the unknown length gaps must be 100 N's. The number in the gap setting (10 in r10u) is the minimum number of N's in a run that is converted to an estimated length gap. If some runs of 100 N's are unknown length and others are estimated length, please contact us for more information.
Mandatory arguments
- Template file - template_BP_BS.sbt
- Fasta file - sample.gapped.unknown.fsa
- Linkage evidence - paired-ends (i.e., for paired ends or mate pairs)
- Output file - out.gapped.sqn
Optional arguments
- Annotation file - multiple.tbl
- Structured comment file - assembly.cmt
Gap details
- Estimated Gap length - r10u (Runs of 10 or more N's are estimated gaps, shorter runs of N's are ambiguous bases, and runs of exactly 100 N's are unknown gaps).
Parameters
- Organism name - [organism=Helicobacter pylori ABC1] [strain=ABC1] [host=Homo sapiens] [isolation-source=blood]
- Master Genome Flag - n (default)
- Run Discrepancy report - checked (Recommended)
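The three gap settings used in these tests (r5k, r100u, r10u) share one pattern: the letter r, a minimum run length, and a final k or u. The sketch below splits a setting into those parts. Reading the final letter as "runs of 100 N's are of known (k) or unknown (u) length" is an interpretation based on the examples on this page, not a full description of what tbl2asn accepts, and the names `GapSpec` and `parse_gap_spec` are hypothetical.

```python
import re
from typing import NamedTuple


class GapSpec(NamedTuple):
    min_run: int              # shortest run of N's treated as a gap
    hundred_is_unknown: bool  # True when runs of 100 N's are unknown-length gaps


def parse_gap_spec(spec: str) -> GapSpec:
    """Parse settings of the form r<number><k|u>, e.g. 'r10u'."""
    match = re.fullmatch(r"r(\d+)([ku])", spec)
    if match is None:
        raise ValueError(f"unrecognized gap setting: {spec!r}")
    return GapSpec(min_run=int(match.group(1)), hundred_is_unknown=match.group(2) == "u")
```

For the third test case, `parse_gap_spec("r10u")` gives a minimum run of 10 with 100-N runs treated as unknown length.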
Output Reports:
- out.gapped.sqn - .sqn file for WGS submission to GenBank
- multiple.val - verification report
- discrep - discrepancy report
- errorsummary.val - Summary file showing the number, severity and type of errors found in all the .val files.
More information about tbl2asn (gapped)-22.9 can be found at http://www.ncbi.nlm.nih.gov/genbank/tbl2asn2/ and http://www.ncbi.nlm.nih.gov/genbank/wgs_gapped/