taco-0.7.0
Please work through the documentation and add your comments on the bottom of this page, or email comments to support@cyverse.org. Thank you.
TACO: Multi-sample transcriptome assembly from RNA-Seq
Transcriptome assemblers reconstruct full-length transcripts from the short sequence fragments generated by RNA-Seq. Large consortia such as TCGA, ICGC, GTEx, ENCODE, the Cancer Cell Line Encyclopedia (CCLE), and others have performed RNA-Seq on thousands of human tissues and cell lines, providing an unparalleled resource for investigating transcriptional diversity and complexity. Transcriptome Assemblies Combined into One (TACO) is an algorithm that reconstructs a consensus transcriptome from a collection of individual assemblies. TACO employs change point detection to break apart complex loci and correctly delineate transcript start and end sites, and a dynamic programming approach to assemble transcripts from a network of splicing patterns. TACO vastly outperforms existing software tools such as Cuffmerge and StringTie merge.
Citation:
Mandatory arguments
Inputs:
- Input files: Paths to the individual GTF files
- Input list file: Path to the list file that contains the list of individual GTF files
Outputs: Directory where output files will be stored
Parameters
- GTF attribute field: GTF attribute field containing the expression estimate. The default setting is `FPKM` for Cufflinks GTF input.
- Filter min length: Pre-filters input transfrags with `length < N` prior to assembly. Set to `0` to disable this filter.
- Filter max expr: Pre-filters input transfrags with `expression < X`. The units of the expression cutoff value `X` correspond to the units specified by the `gtf-expr-attr` parameter, which is `FPKM` by default. Set to `0.0` to disable this filter.
- Isoform fraction: Report transcript isoforms with `expression fraction >= FRAC` relative to the most highly expressed isoform of the same gene. For each gene, the highest abundance isoform will be reported with a `FRAC` of `1.0`.
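
The thresholds described above can be illustrated with a short sketch. This is not TACO's implementation: the `Transfrag` record and the function names are hypothetical, and the code only mirrors the comparisons stated in the parameter descriptions (`length < N`, `expression < X`, and `expression fraction >= FRAC`).

```python
from typing import NamedTuple


class Transfrag(NamedTuple):
    """Hypothetical stand-in for one input transfrag."""
    name: str
    length: int   # transfrag length in bases
    expr: float   # expression in the units of the GTF attribute field (e.g. FPKM)


def prefilter(transfrags, min_length, expr_cutoff):
    """Drop transfrags with length < min_length or expression < expr_cutoff.

    A threshold of 0 (or 0.0) removes nothing, which disables that filter.
    """
    return [t for t in transfrags
            if t.length >= min_length and t.expr >= expr_cutoff]


def isoforms_to_report(isoform_exprs, frac):
    """Return names of isoforms whose expression fraction is >= frac.

    isoform_exprs maps isoform name -> expression for ONE gene; the fraction
    is taken relative to the most abundant isoform, which therefore gets 1.0.
    """
    top = max(isoform_exprs.values())
    if top <= 0:
        return []
    return sorted(name for name, expr in isoform_exprs.items()
                  if expr / top >= frac)
```

For example, with a cutoff of `0.05`, an isoform expressed at 0.4 in a gene whose major isoform is expressed at 10.0 (fraction 0.04) would not be reported.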
Test Run
All files are located in the Community Data directory of the CyVerse Discovery Environment at the following path:
Community Data > iplantcollaborative > example_data > taco (/iplant/home/shared/iplantcollaborative/example_data/taco)
Mandatory arguments:
- Inputs:
  - CCLE-CAL-51-RNA-08.gtf
  - CCLE-HDQ-P1-RNA-08.gtf
- Input list file:
  - list_of_files, which lists the two input files: CCLE-CAL-51-RNA-08.gtf and CCLE-HDQ-P1-RNA-08.gtf
Make sure the input files and the input list file are in the same directory.
Parameters:
Leave all parameters at their default values.
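
For your own data, the list file can be generated rather than typed by hand. The sketch below is only an illustration: the helper name `write_list_file` is hypothetical, and it assumes the list file is plain text with one GTF file name per line, stored in the same directory as the GTF files.

```python
from pathlib import Path


def write_list_file(gtf_dir, list_name="list_of_files"):
    """Write the names of all .gtf files in gtf_dir to a list file there.

    Assumption: the list file holds one GTF file name per line.
    """
    gtf_dir = Path(gtf_dir)
    names = sorted(p.name for p in gtf_dir.glob("*.gtf"))
    list_path = gtf_dir / list_name
    list_path.write_text("\n".join(names) + "\n")
    return list_path
```

Calling `write_list_file("my_gtfs")` on a directory holding the two example GTF files would produce a `list_of_files` naming both of them.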
Output
TACO writes output to the directory specified by the `-o` command line option. Within this directory, the important output files are:
Transcriptome assembly: assembly.gtf
This GTF file contains TACO's assembled isoforms. The first 8 columns are standard GTF, and the last column contains attributes, some of which are also standardized (“gene_id” and “transcript_id”). There is one GTF record per row, and each record represents either a transcript or an exon within a transcript. The columns are defined as follows:
Column number | Column name | Example | Description |
---|---|---|---|
1 | seqname | chrX | Chromosome or contig name |
2 | source | taco | The name of the program that generated this file (always taco) |
3 | feature | exon | The type of record (always either “transcript” or “exon”). |
4 | start | 77696957 | The leftmost coordinate of this record (where 1 is the leftmost possible coordinate) |
5 | end | 77712009 | The rightmost coordinate of this record, inclusive. |
6 | score | 1000 | The most abundant isoform for each gene is assigned a score of 1000. Minor isoforms are scored by the ratio (minor FPKM/major FPKM) |
7 | strand | + | TACO's guess for which strand the isoform came from. Always one of “+”, “-”, “.” |
8 | frame | . | TACO does not predict where the start and stop codons (if any) are located within each transcript, so this field is not used. |
9 | attributes | … | See below. |
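
As a rough illustration of the column layout, the sketch below splits one tab-separated GTF line into named fields. The function name `parse_gtf_record` is hypothetical, and the sample line in the usage note is assembled from the example values in the table rather than copied from a real `assembly.gtf`.

```python
GTF_COLUMNS = ("seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attributes")


def parse_gtf_record(line):
    """Split one GTF line into a dict keyed by column name.

    start and end are converted to int (1-based, end inclusive);
    all other fields, including the raw attribute string, stay as text.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GTF_COLUMNS):
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    record = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    return record


def record_length(record):
    """Number of bases covered by the record (coordinates are inclusive)."""
    return record["end"] - record["start"] + 1
```

For a line such as `chrX<TAB>taco<TAB>exon<TAB>77696957<TAB>77712009<TAB>...`, `record_length` adds 1 to the coordinate difference because both ends are inclusive.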
Each GTF record is decorated with the following attributes:
Attribute | Example | Description |
---|---|---|
gene_id | G7 | TACO gene id |
transcript_id | TU56 | TACO transcript id |
locus_id | L1 | TACO locus id |
tss_id | TSS31 | TACO transcription start site id |
expr | 2.441 | Isoform-level abundance. The units correspond to the expression units of the input transfrags (usually FPKM or TPM) |
rel_frac | 0.7647 | Relative abundance of isoform compared to the major isoform in the gene. The most abundant isoform for each gene is assigned a rel_frac of 1.0. Minor isoforms are scored by the ratio (minor expr/major expr) |
abs_frac | 0.7647 | Relative abundance of isoform compared to the total expression of all isoforms in the gene. Isoforms are scored by the ratio (expr / sum(expr(x) for each isoform x)). |
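
The two fraction attributes follow directly from the ratios given in the table. The sketch below is an illustration, not TACO's code: the function names are hypothetical, and the attribute parser assumes the common `key "value";` GTF attribute syntax with double-quoted values.

```python
import re

# Assumption: attributes are written as  key "value";  pairs.
ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"')


def parse_attributes(text):
    """Parse a GTF attribute string into a dict of string values."""
    return dict(ATTR_RE.findall(text))


def isoform_fractions(exprs):
    """Compute (rel_frac, abs_frac) per isoform for ONE gene.

    exprs maps transcript_id -> expr.
    rel_frac = expr / expr of the major isoform
    abs_frac = expr / sum of expr over all isoforms of the gene
    """
    major = max(exprs.values())
    total = sum(exprs.values())
    return {
        tid: (expr / major if major else 0.0,
              expr / total if total else 0.0)
        for tid, expr in exprs.items()
    }
```

With two isoforms expressed at 4.0 and 1.0, the major isoform gets a `rel_frac` of 1.0 and an `abs_frac` of 0.8, while the minor isoform gets 0.25 and 0.2.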
Transcriptome assembly: assembly.bed
This BED file contains TACO's assembled isoforms. Please refer to the UCSC genome browser's detailed description of the BED format. The `name` column (Column 4) contains a string of the format `gene_id|transcript_id(expr)`, e.g. `G7|TU56(2.77)`.
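
The name string can be split back into its parts with a small helper. This is a minimal sketch; `parse_bed_name` is a hypothetical name, and the pattern assumes the gene id contains no `|` and the transcript id contains no `(`.

```python
import re

# gene_id|transcript_id(expr), e.g. G7|TU56(2.77)
NAME_RE = re.compile(
    r"^(?P<gene_id>[^|]+)\|(?P<transcript_id>[^(]+)\((?P<expr>[^)]+)\)$"
)


def parse_bed_name(name):
    """Return (gene_id, transcript_id, expr) from a BED name field."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected name format: {name!r}")
    return match["gene_id"], match["transcript_id"], float(match["expr"])
```

For the example above, `parse_bed_name("G7|TU56(2.77)")` yields the gene id `G7`, the transcript id `TU56`, and the expression value as a float.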
Transfrag coverage profiles: expr.pos.bedgraph, expr.neg.bedgraph, expr.none.bedgraph
TACO outputs 3 bedGraph files with the coverage profile of the input transfrags on the forward, reverse, and unknown/unspecified strands. Please refer to the UCSC genome browser's detailed description of the bedGraph format. These files can be converted to bigWig format using the free conversion tool `bedGraphToBigWig` for viewing on genome browsers such as IGV or UCSC.
Transfrag splice junction profiles: splice_junctions.bed
A UCSC BED track of junctions reported by TACO. Each junction consists of two connected BED blocks. The score is the sum of the expression values of transfrags supporting the junction. This file can be converted to bigBed track format for viewing on genome browsers such as IGV or UCSC.
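
To recover the intron spanned by a junction, the gap between the two blocks can be computed from the BED block fields. The sketch below assumes the file uses the standard 12-column BED layout (`blockCount`, `blockSizes`, `blockStarts` in columns 10 to 12); that layout and the function name `junction_intron` are assumptions for illustration, not something confirmed by the description above.

```python
def junction_intron(bed_line):
    """Return (chrom, intron_start, intron_end, strand, score) for one junction.

    Expects a 12-column BED line with exactly two blocks. Coordinates are
    0-based and half-open, as in the BED format: the intron starts where the
    first block ends and stops where the second block begins.
    """
    f = bed_line.rstrip("\n").split("\t")
    if len(f) < 12:
        raise ValueError("expected a 12-column BED line")
    chrom, chrom_start, score, strand = f[0], int(f[1]), float(f[4]), f[5]
    # BED block lists may carry a trailing comma.
    block_sizes = [int(x) for x in f[10].rstrip(",").split(",")]
    block_starts = [int(x) for x in f[11].rstrip(",").split(",")]
    if len(block_sizes) != 2 or len(block_starts) != 2:
        raise ValueError("expected exactly two blocks per junction")
    intron_start = chrom_start + block_starts[0] + block_sizes[0]
    intron_end = chrom_start + block_starts[1]
    return chrom, intron_start, intron_end, strand, score
```

The returned score is simply the value in column 5, which the description above defines as the summed expression of the supporting transfrags.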
Tool Source for App