taco-0.7.0

Please work through the documentation and add your comments on the bottom of this page, or email comments to support@cyverse.org. Thank you.

TACO: Multi-sample transcriptome assembly from RNA-Seq

Transcriptome assemblers reconstruct full-length transcripts from the short sequence fragments generated by RNA-Seq. Large consortia such as TCGA, ICGC, GTEx, ENCODE, the Cancer Cell Line Encyclopedia (CCLE), and others have performed RNA-Seq on thousands of human tissues and cell lines, providing an unparalleled resource for investigating transcriptional diversity and complexity. Transcriptome Assemblies Combined into One (TACO) is an algorithm that reconstructs a consensus transcriptome from a collection of individual assemblies. TACO employs change point detection to break apart complex loci and correctly delineate transcript start and end sites, and a dynamic programming approach to assemble transcripts from a network of splicing patterns. According to its authors, TACO outperforms existing software tools such as Cuffmerge and StringTie merge.

Citation:

Niknafs, Y. S., Pandian, B., Iyer, H. K., Chinnaiyan, A. M. & Iyer, M. K. TACO produces robust multisample transcriptome assemblies from RNA-seq. Nat. Methods 14, 68–70 (2017)

Mandatory arguments

  • Inputs

    • Input files: Paths to the individual gtf files 

    • Input list file: Path to the list file that contains the list of individual gtf files

  • Outputs: Directory where output files will be stored

Parameters

  • GTF attribute field: GTF attribute field containing expression estimate. The default setting is FPKM for Cufflinks GTF input.

  • Filter min length: Pre-filters input transfrags with length < N prior to assembly. Set to 0 to disable this filter.

  • Filter min expr: Pre-filters input transfrags with expression < X. The units of the expression cutoff value X correspond to the units of the GTF attribute field (gtf-expr-attr) parameter, which is FPKM by default. Set to 0.0 to disable this filter.

  • Isoform fraction: Report transcript isoforms with expression fraction >= FRAC relative to the most highly expressed isoform of the same gene. For each gene, the highest abundance isoform will be reported with a FRAC of 1.0.
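The three thresholds above can be pictured with a small Python sketch. This is only an illustration of the filter semantics as described in this list, not TACO's own implementation; the `Transfrag` record, the function names, and the threshold values in the example are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Transfrag:
    """Hypothetical minimal record for one input transfrag."""
    transcript_id: str
    length: int   # transfrag length in bases
    expr: float   # value of the chosen GTF expression attribute (e.g. FPKM)


def prefilter(transfrags, min_length=0, min_expr=0.0):
    """Drop transfrags with length < min_length or expression < min_expr.

    With non-negative inputs, 0 and 0.0 let everything through,
    which mirrors "set to 0 to disable this filter".
    """
    return [t for t in transfrags
            if t.length >= min_length and t.expr >= min_expr]


def report_isoforms(isoform_expr, frac):
    """Return ids of isoforms whose expression is >= frac of the gene's top isoform."""
    top = max(isoform_expr.values())
    if top <= 0:
        return []
    return sorted(iso for iso, expr in isoform_expr.items() if expr / top >= frac)


# Hypothetical example values, chosen only to exercise each filter.
transfrags = [
    Transfrag("t1", 150, 3.0),    # too short
    Transfrag("t2", 900, 0.2),    # too weakly expressed
    Transfrag("t3", 1200, 4.5),   # passes both filters
]
kept_ids = [t.transcript_id for t in prefilter(transfrags, min_length=200, min_expr=0.5)]
print(kept_ids)

gene = {"TU1": 8.0, "TU2": 2.0, "TU3": 0.2}   # fractions 1.0, 0.25, 0.025
reported = report_isoforms(gene, frac=0.05)
print(reported)
```

The major isoform always has a fraction of 1.0, so it is reported for any FRAC between 0.0 and 1.0.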

Test Run

All files are located in the Community Data directory of the CyVerse Discovery Environment at the following path:

Community Data > iplantcollaborative > example_data > taco  (/iplant/home/shared/iplantcollaborative/example_data/taco)

Mandatory arguments: 

  • Inputs: 

    • CCLE-CAL-51-RNA-08.gtf

    • CCLE-HDQ-P1-RNA-08.gtf

  • Input list file: 

    • list_of_files

      CCLE-CAL-51-RNA-08.gtf CCLE-HDQ-P1-RNA-08.gtf

Make sure the input files and the input list file are in the same directory.
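If you prepare the list file locally before uploading it, a short Python sketch like the following can generate it and check the same-directory requirement. It assumes the list file holds one GTF file name per line, which is an assumption about the expected layout; `write_gtf_list` is a hypothetical helper, not part of TACO or the Discovery Environment.

```python
import tempfile
from pathlib import Path


def write_gtf_list(gtf_paths, list_path):
    """Write the GTF file names to list_path, one per line (assumed layout).

    Raises ValueError if any GTF file is not in the list file's directory.
    """
    list_path = Path(list_path)
    gtf_paths = [Path(p) for p in gtf_paths]
    list_dir = list_path.parent.resolve()
    stray = [str(p) for p in gtf_paths if p.parent.resolve() != list_dir]
    if stray:
        raise ValueError(f"not in the same directory as the list file: {stray}")
    list_path.write_text("".join(f"{p.name}\n" for p in gtf_paths))
    return list_path


# Demonstration in a throwaway directory, using the test-run file names.
with tempfile.TemporaryDirectory() as workdir:
    gtfs = [
        Path(workdir) / "CCLE-CAL-51-RNA-08.gtf",
        Path(workdir) / "CCLE-HDQ-P1-RNA-08.gtf",
    ]
    list_file = write_gtf_list(gtfs, Path(workdir) / "list_of_files")
    contents = list_file.read_text()

print(contents, end="")
```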

Parameters:

Leave all parameters at their default values.

Output

TACO writes output to the directory specified by the -o command line option. Within this directory, the important output files are:

Transcriptome assembly: assembly.gtf

This GTF file contains TACO's assembled isoforms. The first 8 columns are standard GTF, and the last (ninth) column contains attributes, some of which are also standardized (“gene_id” and “transcript_id”). There is one GTF record per row, and each record represents either a transcript or an exon within a transcript. The columns are defined as follows:

| Column number | Column name | Example | Description |
|---|---|---|---|
| 1 | seqname | chrX | Chromosome or contig name |
| 2 | source | taco | The name of the program that generated this file (always taco) |
| 3 | feature | exon | The type of record (always either “transcript” or “exon”) |
| 4 | start | 77696957 | The leftmost coordinate of this record (where 1 is the leftmost possible coordinate) |
| 5 | end | 77712009 | The rightmost coordinate of this record, inclusive |
| 6 | score | 1000 | The most abundant isoform for each gene is assigned a score of 1000. Minor isoforms are scored by the ratio (minor FPKM/major FPKM) |
| 7 | strand | + | TACO's guess for which strand the isoform came from. Always one of “+”, “-”, or “.” |
| 8 | frame | . | TACO does not predict where the start and stop codons (if any) are located within each transcript, so this field is not used |
| 9 | attributes | | See below |
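As a rough illustration of this column layout, the Python sketch below splits one GTF row into its nine tab-separated fields. The example row is hypothetical: it is assembled from the example values in the table rather than copied from a real assembly.gtf, and `GtfRecord` and `parse_gtf_line` are names invented for this sketch.

```python
from typing import NamedTuple


class GtfRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int       # 1-based, inclusive
    end: int         # 1-based, inclusive
    score: str
    strand: str
    frame: str
    attributes: str  # raw attribute string, parsed separately


def parse_gtf_line(line):
    """Split one GTF row into its nine tab-separated columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, attributes = fields
    return GtfRecord(seqname, source, feature, int(start), int(end),
                     score, strand, frame, attributes)


# Hypothetical row built from the example values in the table.
line = ('chrX\ttaco\texon\t77696957\t77712009\t1000\t+\t.\t'
        'gene_id "G7"; transcript_id "TU56";')
rec = parse_gtf_line(line)

# Both coordinates are inclusive, so the feature length is end - start + 1.
feature_length = rec.end - rec.start + 1
print(rec.feature, rec.strand, feature_length)
```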

Each GTF record is decorated with the following attributes:

| Attribute | Example | Description |
|---|---|---|
| gene_id | G7 | TACO gene id |
| transcript_id | TU56 | TACO transcript id |
| locus_id | L1 | TACO locus id |
| tss_id | TSS31 | TACO transcription start site id |
| expr | 2.441 | Isoform-level abundance. The units correspond to the expression units of the input transfrags (usually FPKM or TPM) |
| rel_frac | 0.7647 | Relative abundance of isoform compared to the major isoform in the gene. The most abundant isoform for each gene is assigned a rel_frac of 1.0. Minor isoforms are scored by the ratio (minor expr/major expr) |
| abs_frac | 0.7647 | Relative abundance of isoform compared to the total expression of all isoforms in the gene. Isoforms are scored by the ratio (expr / sum(expr(x) for each isoform x)) |
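The two ratios behind rel_frac and abs_frac can be reproduced with a few lines of Python. This is a minimal sketch that assumes every attribute value is written as `key "value";` with double quotes; the attribute string and the isoform ids `TU_A` and `TU_B` are hypothetical examples, not real TACO output.

```python
import re

# Assumes attributes look like: key "value"; key "value"; ...
ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"')


def parse_attributes(attr_field):
    """Turn the GTF attribute column into a dict of strings."""
    return dict(ATTR_RE.findall(attr_field))


def isoform_fractions(isoform_expr):
    """Return {isoform: (rel_frac, abs_frac)} for the isoforms of one gene.

    rel_frac = expr / expr of the major isoform
    abs_frac = expr / sum of expr over all isoforms of the gene
    """
    major = max(isoform_expr.values())
    total = sum(isoform_expr.values())
    return {iso: (expr / major, expr / total)
            for iso, expr in isoform_expr.items()}


# Hypothetical attribute string using the example values from the table.
attrs = parse_attributes(
    'gene_id "G7"; transcript_id "TU56"; locus_id "L1"; tss_id "TSS31"; '
    'expr "2.441"; rel_frac "0.7647"; abs_frac "0.7647";'
)
print(attrs["gene_id"], attrs["transcript_id"], float(attrs["expr"]))

# Hypothetical gene with two isoforms.
fracs = isoform_fractions({"TU_A": 4.0, "TU_B": 1.0})
print(fracs)
```

In this example the major isoform has a rel_frac of 1.0, and the abs_frac values of all isoforms in the gene sum to 1.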

Transcriptome assembly: assembly.bed

This BED file contains TACO's assembled isoforms. Please refer to the UCSC genome browser's detailed description of the BED format. The name column (Column 4) contains a string of the format gene_id|transcript_id(expr), e.g. G7|TU56(2.77).
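The name column can be split back into its parts with a regular expression. The sketch below is based only on the format string given above; `parse_bed_name` is a hypothetical helper.

```python
import re

# Matches names of the form gene_id|transcript_id(expr), e.g. G7|TU56(2.77)
NAME_RE = re.compile(
    r'^(?P<gene_id>[^|]+)\|(?P<transcript_id>[^(]+)\((?P<expr>[^)]+)\)$'
)


def parse_bed_name(name):
    """Return (gene_id, transcript_id, expr) from a BED name field."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected name format: {name!r}")
    return match["gene_id"], match["transcript_id"], float(match["expr"])


parsed = parse_bed_name("G7|TU56(2.77)")
print(parsed)
```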



Transfrag coverage profiles: expr.pos.bedgraph, expr.neg.bedgraph, expr.none.bedgraph

TACO outputs 3 bedGraph files with the coverage profile of the input transfrags on the forward, reverse, and unknown/unspecified strands. Please refer to the UCSC genome browser's detailed description of the bedGraph format. These files can be converted to bigWig format using the free conversion tool bedGraphToBigWig for viewing on genome browsers such as IGV or UCSC.



Transfrag splice junction profiles: splice_junctions.bed

UCSC BED track of junctions reported by TACO. Each junction consists of two connected BED blocks. The score is the sum of the expression values of transfrags supporting the junction. This file can be converted to bigBed track format for viewing on genome browsers such as IGV or UCSC.
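To recover the junction (intron) coordinates from such a record, one can read the gap between the two blocks. The Python sketch below assumes the file uses the standard 12-column BED layout with blockCount, blockSizes, and blockStarts in columns 10 to 12; the example row is hypothetical and `junction_from_bed12` is a name invented for this sketch.

```python
def junction_from_bed12(line):
    """Return (chrom, intron_start, intron_end, score) for a two-block BED12 row.

    Coordinates follow BED conventions: 0-based start, exclusive end.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("expected a 12-column BED record")
    chrom, chrom_start, score = fields[0], int(fields[1]), float(fields[4])
    # Block lists may carry a trailing comma, so drop empty items.
    block_sizes = [int(x) for x in fields[10].split(",") if x]
    block_starts = [int(x) for x in fields[11].split(",") if x]
    if len(block_sizes) != 2 or len(block_starts) != 2:
        raise ValueError("expected exactly two blocks per junction")
    intron_start = chrom_start + block_starts[0] + block_sizes[0]
    intron_end = chrom_start + block_starts[1]
    return chrom, intron_start, intron_end, score


# Hypothetical junction: blocks at 100-150 and 260-300, so the intron is 150-260.
row = "chr1\t100\t300\tJ1\t5.5\t+\t100\t300\t0\t2\t50,40\t0,160"
junction = junction_from_bed12(row)
print(junction)
```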



Tool Source for App