StringTie1.3.3

Please work through the tutorial and add your comments at the bottom of this page, or send them by email to kchougul@cshl.edu. Thank you.

Rationale and background:

 

StringTie enables improved reconstruction of a transcriptome from RNA-seq reads

Mihaela Pertea, Geo M Pertea, Corina M Antonescu, Tsung-Cheng Chang, Joshua T Mendell & Steven L Salzberg

doi:10.1038/nbt.3122

 

StringTie is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts. It uses a novel network flow algorithm as well as an optional de novo assembly step to assemble and quantitate full-length transcripts representing multiple splice variants for each gene locus. Its input can include not only the alignments of raw reads used by other transcript assemblers, but also alignments of longer sequences that have been assembled from those reads. To identify differentially expressed genes between experiments, StringTie's output can be processed by specialized software such as Ballgown, Cuffdiff or other programs (DESeq2, edgeR, etc.).


Version: 1.3.3


Pre-Requisites

  1. A CyVerse account. (Register for a CyVerse account at user.cyverse.org)
  2. Mandatory arguments: StringTie takes as input a binary SAM (BAM) file sorted by reference position. This file contains spliced read alignments such as the ones produced by TopHat. If you have a text file in SAM format, you should first convert it to BAM format using the samtools view command and sort it by reference position.
    1. Input BAM files (in .bam format)
  3. Optional arguments
    1. Annotation: provide a GTF file or select one from the list (a reference annotation file in GTF/GFF3 format can be provided to StringTie)
    2. Minimum isoform fraction: 0.1 (Sets the minimum isoform abundance of the predicted transcripts as a fraction of the most abundant transcript assembled at a given locus. Lower-abundance transcripts are often artifacts of incompletely spliced precursors of processed transcripts.)
    3. Minimum assembled transcript length: 200 (Sets the minimum length allowed for the predicted transcripts.)
    4. Minimum reads per bp coverage to consider for transcript assembly: 2.5 (Sets the minimum read coverage allowed for the predicted transcripts. A transcript with a lower coverage than this value is not shown in the output.)
    5. Minimum locus gap separation value: 50 (Reads that are mapped closer than this distance are merged together in the same processing bundle.)
    6. Only estimate the abundance of given reference transcripts: check (With this option, read bundles with no reference transcripts are skipped entirely, which may provide a considerable speed boost when the given set of reference transcripts is limited to a set of target genes, for example.)
    7. Enable the output of Ballgown input table files (*.ctab) (needs a reference annotation): check (This switch enables the output of Ballgown input table files (*.ctab) containing coverage data for the reference transcripts given with the -G reference annotation option. See the Ballgown documentation for a description of these files. With this option StringTie can be used as a direct replacement for the tablemaker program included with the Ballgown distribution.)
    NOTE: Select options 6 and 7 only if you are providing a reference annotation file
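
For readers who want to see how the options above relate to a command-line run, the following is a minimal sketch rather than the app's actual wrapper. It only builds argument lists and executes nothing. The function names are hypothetical. The flags (-o, -G, -f, -m, -c, -g, -e, -B) are the ones documented in the StringTie manual, and the two-step SAM conversion assumes samtools 1.x (view to convert, then sort).

```python
def stringtie_command(bam, out_gtf, annotation=None, min_iso_fraction=0.1,
                      min_length=200, min_coverage=2.5, gap=50,
                      estimate_only=False, ballgown=False):
    """Build the argument list for one StringTie run (hypothetical helper)."""
    # Options 6 and 7 only make sense with a reference annotation.
    if (estimate_only or ballgown) and annotation is None:
        raise ValueError("-e and -B need a reference annotation (-G)")
    cmd = ["stringtie", bam, "-o", out_gtf,
           "-f", str(min_iso_fraction),   # minimum isoform fraction
           "-m", str(min_length),         # minimum transcript length
           "-c", str(min_coverage),       # minimum reads per bp coverage
           "-g", str(gap)]                # minimum locus gap separation
    if annotation is not None:
        cmd += ["-G", annotation]         # reference annotation (GTF/GFF3)
    if estimate_only:
        cmd.append("-e")                  # only estimate reference transcripts
    if ballgown:
        cmd.append("-B")                  # write Ballgown *.ctab tables
    return cmd


def sam_to_sorted_bam_commands(sam, unsorted_bam, sorted_bam):
    """Two samtools 1.x steps: convert SAM to BAM, then sort by position."""
    return [["samtools", "view", "-b", "-o", unsorted_bam, sam],
            ["samtools", "sort", "-o", sorted_bam, unsorted_bam]]


cmd = stringtie_command("bam_output/sample1.bam", "sample1.gtf",
                        annotation="reference.gtf", ballgown=True)
print(" ".join(cmd))
```

The defaults in the sketch mirror the default values listed above, so calling the helper with only a BAM file and an output name corresponds to running the app with every optional argument left unchanged.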

Test/sample data 

The following test data for StringTie1.3.3 are available at /iplant/home/shared/iplantcollaborative/example_data/StringTie/StringTie1.3.3:

  1. Reference genome file: reference.fasta
  2. Reference GTF file: reference.gtf
  3. Directory of BAM files (.bam format, sorted by genomic location):
    1. bam_output/sample1.bam
    2. bam_output/sample2.bam

Run StringTie1.3.3 on the BAM files using the reference files.

Results 

A successful run of StringTie1.3.3 produces several files and directories:

  1. StringTie_output: all the default output files from the StringTie1.3.3 run
    1. sample1
      1. e2t.ctab: table with two columns, e_id and t_id, denoting which exons belong to which transcripts
      2. e_data.ctab: exon-level expression measurements
      3. i2t.ctab: table with two columns, i_id and t_id, denoting which introns belong to which transcripts
      4. i_data.ctab: intron- (i.e., junction-) level expression measurements. One row per intron. Columns are i_id (numeric intron id), chr, strand, start, end (genomic location of the intron), and expression measurements for each sample
      5. t_data.ctab: transcript-level expression measurements. One row per transcript.
      6. sample1.abund.tab: gene abundances in tab-delimited format
      7. sample1.gtf: assembled transcripts in GTF format
      8. sample1.refs.gtf: all transcripts in the provided reference file that are fully covered by reads
  2. ballgown_input_files: use this directory as input to the StringTie-1.3.3_to_DESeq2_and_edegeR and Ballgown apps for differential expression analysis
    1. sample1: all *.ctab files for sample1 and sample1.gtf
    2. sample2: all *.ctab files for sample2 and sample2.gtf
  3. gtf_files: all transcript assembly files. Use this directory as input to the StringTie-1.3.3_merge app
    1. sample1.gtf
    2. sample2.gtf
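
As an illustration of how the two-column e2t.ctab table described above can be used, here is a minimal sketch that groups exon ids by transcript id. It assumes the file is tab-delimited with a header row naming the e_id and t_id columns. The function name and the three example rows are made up for illustration and are not taken from the test data.

```python
import csv
import io
from collections import defaultdict


def exons_by_transcript(handle):
    """Map each t_id to the list of e_id values that belong to it."""
    # Assumption: tab-delimited text with a header line "e_id<TAB>t_id".
    reader = csv.DictReader(handle, delimiter="\t")
    groups = defaultdict(list)
    for row in reader:
        groups[row["t_id"]].append(row["e_id"])
    return dict(groups)


# Made-up example content standing in for a real e2t.ctab file.
example = "e_id\tt_id\n1\t1\n2\t1\n3\t2\n"
mapping = exons_by_transcript(io.StringIO(example))
print(mapping)
```

To read a real file instead, pass an open file handle, for example `open("sample1/e2t.ctab", newline="")`. The same pattern should apply to i2t.ctab with i_id in place of e_id.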



More information on the tool can be found in the StringTie manual: https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual