taco-0.7.0

Alert:

 

The CyVerse App Store is currently being restructured, and apps are being moved to an HPC environment. During this transition, users may occasionally be unable to locate or use apps that are listed in our tutorials. In many cases, these apps can be found with the search bar at the top of the Apps window in the DE. To improve your chances of finding an app, search for only the part of the name that refers to the app's function or origin rather than the full app name and version number (e.g., 'SOAPdenovo' instead of 'SOAPdenovo-Trans 1.01').

Also, as part of the 2.8 app categorization, a number of apps were deprecated and are no longer available, and there is no longer an Archive category. You can look for a suitable replacement in the List of Applications in this window, or search for an app name or the tool an app uses in the Apps window search field. If you need an app reinstated, please contact support@cyverse.org.

Please work through the documentation and add your comments at the bottom of this page, or email comments to support@cyverse.org. Thank you.

TACO: Multi-sample transcriptome assembly from RNA-Seq

Transcriptome assemblers reconstruct full-length transcripts from the short sequence fragments generated by RNA-Seq. Large consortia such as TCGA, ICGC, GTEx, ENCODE, the Cancer Cell Line Encyclopedia (CCLE), and others have performed RNA-Seq on thousands of human tissues and cell lines, providing an unparalleled resource for investigating transcriptional diversity and complexity. Transcriptome Assemblies Combined into One (TACO) is an algorithm that reconstructs a consensus transcriptome from a collection of individual assemblies. TACO employs change point detection to break apart complex loci and correctly delineate transcript start and end sites, and a dynamic programming approach to assemble transcripts from a network of splicing patterns. TACO vastly outperforms existing software tools such as Cuffmerge and StringTie merge.

Citation:

Niknafs, Y. S., Pandian, B., Iyer, H. K., Chinnaiyan, A. M. & Iyer, M. K. TACO produces robust multisample transcriptome assemblies from RNA-seq. Nat. Methods 14, 68–70 (2017)

Mandatory arguments

  • Inputs

    • Input files: Paths to the individual gtf files 

    • Input list file: Path to the list file that contains the list of individual gtf files
  • Outputs: Directory where output files will be stored
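
As a minimal sketch of preparing the input list file, the snippet below writes one GTF file name per line. It assumes the list file is plain text with one entry per line, as in the test-run example on this page; the helper name write_list_file is hypothetical and is not part of TACO or the DE.

```python
from pathlib import Path
import tempfile


def write_list_file(gtf_names, list_path):
    """Write one GTF file name per line (assumed list-file layout)."""
    lines = [str(name) for name in gtf_names]
    Path(list_path).write_text("\n".join(lines) + "\n")
    return lines


if __name__ == "__main__":
    # Illustrative run in a temporary directory, using the test-run file names.
    with tempfile.TemporaryDirectory() as workdir:
        list_path = Path(workdir) / "list_of_files"
        write_list_file(
            ["CCLE-CAL-51-RNA-08.gtf", "CCLE-HDQ-P1-RNA-08.gtf"], list_path
        )
        print(list_path.read_text())
```

Keeping the list file in the same directory as the GTF files it names matches the requirement stated in the test run.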

Parameters

  • GTF attribute field: GTF attribute field containing expression estimate. The default setting is FPKM for Cufflinks GTF input.
  • Filter min length: Pre-filters input transfrags with length < N prior to assembly. Set to 0 to disable this filter.
  • Filter max expr: Pre-filters input transfrags with expression < X. The units of the expression cutoff value X correspond to the units specified by the gtf-expr-attr parameter, which is FPKM by default. Set to 0.0 to disable this filter.
  • Isoform fraction: Report transcript isoforms with expression fraction >= FRAC relative to the most highly expressed isoform of the same gene. For each gene, the highest abundance isoform will be reported with a FRAC of 1.0.
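
The filter semantics described above can be illustrated with a short sketch. This is not TACO's implementation: the dictionary layout, the function names prefilter and report_isoforms, and the threshold values in the usage section are all assumptions made for illustration.

```python
def prefilter(transfrags, min_length, min_expr):
    """Drop transfrags with length < min_length or expr < min_expr.

    A threshold of 0 disables the corresponding filter, because no
    non-negative length or expression value is smaller than 0.
    """
    return [
        t for t in transfrags
        if t["length"] >= min_length and t["expr"] >= min_expr
    ]


def report_isoforms(isoforms, frac):
    """Keep isoforms whose expression is >= frac of the gene's major isoform."""
    major = max(i["expr"] for i in isoforms)
    if major <= 0:
        return []
    return [i for i in isoforms if i["expr"] / major >= frac]


if __name__ == "__main__":
    # Hypothetical transfrags; thresholds are illustrative, not TACO defaults.
    transfrags = [
        {"id": "a", "length": 150, "expr": 3.0},
        {"id": "b", "length": 500, "expr": 0.2},
        {"id": "c", "length": 500, "expr": 3.0},
    ]
    print(prefilter(transfrags, min_length=200, min_expr=0.5))
```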

Test Run

All files are located in the Community Data directory of the CyVerse Discovery Environment at the following path:

Community Data > iplantcollaborative > example_data > taco  (/iplant/home/shared/iplantcollaborative/example_data/taco)

Mandatory arguments: 

  • Inputs: 
    • CCLE-CAL-51-RNA-08.gtf
    • CCLE-HDQ-P1-RNA-08.gtf
  • Input list file: 
    • list_of_files
      CCLE-CAL-51-RNA-08.gtf
      CCLE-HDQ-P1-RNA-08.gtf

Make sure that the input files and the input list file are in the same directory.

Parameters:

Leave all parameters at their default values.

Output

TACO writes output to the directory specified by the -o command line option. Within this directory, the important output files are:

Transcriptome assembly: assembly.gtf

This GTF file contains TACO's assembled isoforms. The first 8 columns are standard GTF, and the last column contains attributes, some of which are also standardized (“gene_id” and “transcript_id”). There is one GTF record per row, and each record represents either a transcript or an exon within a transcript. The columns are defined as follows:

Column number | Column name | Example | Description
1 | seqname | chrX | Chromosome or contig name
2 | source | taco | The name of the program that generated this file (always taco)
3 | feature | exon | The type of record (always either “transcript” or “exon”)
4 | start | 77696957 | The leftmost coordinate of this record (where 1 is the leftmost possible coordinate)
5 | end | 77712009 | The rightmost coordinate of this record, inclusive
6 | score | 1000 | The most abundant isoform for each gene is assigned a score of 1000. Minor isoforms are scored by the ratio (minor FPKM/major FPKM)
7 | strand | + | TACO's guess for which strand the isoform came from. Always one of “+”, “-”, “.”
8 | frame | . | TACO does not predict where the start and stop codons (if any) are located within each transcript, so this field is not used
9 | attributes | | See below
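
To make the column layout concrete, here is a minimal sketch that splits one tab-separated GTF line into named fields. The example line is made up from the example values in the table and is not taken from real TACO output; parse_gtf_line is a hypothetical helper.

```python
GTF_COLUMNS = [
    "seqname", "source", "feature", "start", "end",
    "score", "strand", "frame", "attributes",
]


def parse_gtf_line(line):
    """Split a tab-separated GTF line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GTF_COLUMNS):
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    record = dict(zip(GTF_COLUMNS, fields))
    # Coordinates are 1-based and inclusive at both ends.
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    record["length"] = record["end"] - record["start"] + 1
    return record


if __name__ == "__main__":
    # Made-up line for illustration only.
    example = "\t".join([
        "chrX", "taco", "exon", "77696957", "77712009",
        "1000", "+", ".", 'gene_id "G7"; transcript_id "TU56";',
    ])
    print(parse_gtf_line(example))
```

Because both coordinates are inclusive, the record length is end - start + 1 rather than end - start.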

Each GTF record is decorated with the following attributes:

Attribute | Example | Description
gene_id | G7 | TACO gene id
transcript_id | TU56 | TACO transcript id
locus_id | L1 | TACO locus id
tss_id | TSS31 | TACO transcription start site id
expr | 2.441 | Isoform-level abundance. The units correspond to the expression units of the input transfrags (usually FPKM or TPM)
rel_frac | 0.7647 | Relative abundance of isoform compared to the major isoform in the gene. The most abundant isoform for each gene is assigned a rel_frac of 1.0. Minor isoforms are scored by the ratio (minor expr/major expr)
abs_frac | 0.7647 | Relative abundance of isoform compared to the total expression of all isoforms in the gene. Isoforms are scored by the ratio (expr / sum(expr(x) for each isoform x))
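
The following sketch shows how the attribute column might be read and how rel_frac and abs_frac follow from the expr values of a gene's isoforms. It assumes attributes are written as key "value"; pairs with quoted values, which is common GTF practice but is not confirmed on this page; both function names are hypothetical.

```python
import re

# Assumed attribute syntax: key "value"; key "value"; ...
ATTR_RE = re.compile(r'(\w+)\s+"([^"]*)"')


def parse_attributes(attr_field):
    """Return the attribute column as a dict of strings."""
    return dict(ATTR_RE.findall(attr_field))


def isoform_fractions(exprs):
    """Return (rel_frac, abs_frac) for each isoform expression of one gene.

    rel_frac = expr / expr of the major isoform
    abs_frac = expr / sum of expr over all isoforms of the gene
    """
    major = max(exprs)
    total = sum(exprs)
    if major <= 0 or total <= 0:
        raise ValueError("gene has no positive expression")
    return [(e / major, e / total) for e in exprs]


if __name__ == "__main__":
    attrs = parse_attributes('gene_id "G7"; transcript_id "TU56"; expr "2.441";')
    print(attrs["gene_id"], attrs["transcript_id"], float(attrs["expr"]))
    # Two hypothetical isoforms of one gene.
    print(isoform_fractions([4.0, 1.0]))
```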

Transcriptome assembly: assembly.bed

This BED file contains TACO's assembled isoforms. Please refer to the UCSC genome browser's detailed description of the BED format. The name column (Column 4) contains a string of the format gene_id|transcript_id(expr), e.g. G7|TU56(2.77).
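
As a small sketch, the name column can be split back into its parts with a regular expression. The pattern relies only on the gene_id|transcript_id(expr) format stated above; parse_bed_name is a hypothetical helper.

```python
import re

# Matches names of the form gene_id|transcript_id(expr), e.g. G7|TU56(2.77)
NAME_RE = re.compile(r"^([^|]+)\|([^(]+)\(([^)]+)\)$")


def parse_bed_name(name):
    """Return (gene_id, transcript_id, expr) from a BED name field."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected name format: {name!r}")
    gene_id, transcript_id, expr = match.groups()
    return gene_id, transcript_id, float(expr)


if __name__ == "__main__":
    print(parse_bed_name("G7|TU56(2.77)"))
```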


Transfrag coverage profiles: expr.pos.bedgraph, expr.neg.bedgraph, expr.none.bedgraph

TACO outputs 3 bedGraph files with the coverage profile of the input transfrags on the forward, reverse, and unknown/unspecified strands. Please refer to the UCSC genome browser's detailed description of the bedGraph format. These files can be converted to bigWig format using the free conversion tool bedGraphToBigWig for viewing on genome browsers such as IGV or UCSC.


Transfrag splice junction profiles: splice_junctions.bed

UCSC BED track of junctions reported by TACO. Each junction consists of two connected BED blocks. The score is the sum of the expression values of transfrags supporting the junction. This file can be converted to bigBed track format for viewing on genome browsers such as IGV or UCSC.
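
To illustrate how a junction could be read from the two BED blocks, the sketch below assumes the file uses the standard 12-column UCSC BED layout (blockCount, blockSizes, blockStarts in columns 10 to 12). That layout is an assumption, not something this page confirms, and the example line is made up.

```python
def junction_from_bed12(line):
    """Return (chrom, intron_start, intron_end, score) from a two-block BED12 line.

    Coordinates are 0-based, half-open, as in the BED format. The intron is
    the gap between the end of the first block and the start of the second.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("expected at least 12 BED columns")
    chrom, chrom_start, score = fields[0], int(fields[1]), float(fields[4])
    sizes = [int(x) for x in fields[10].strip(",").split(",")]
    starts = [int(x) for x in fields[11].strip(",").split(",")]
    if int(fields[9]) != 2 or len(sizes) != 2 or len(starts) != 2:
        raise ValueError("expected exactly two blocks per junction")
    intron_start = chrom_start + starts[0] + sizes[0]
    intron_end = chrom_start + starts[1]
    return chrom, intron_start, intron_end, score


if __name__ == "__main__":
    # Made-up junction record for illustration only.
    example = "\t".join([
        "chr1", "100", "300", "JUNC1", "12.5", "+",
        "100", "300", "0,0,0", "2", "50,60", "0,140",
    ])
    print(junction_from_bed12(example))
```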


Tool Source for App