This app runs Cufflinks (version 2+) to assemble transcripts from sequence alignments (BAM) generated by TopHat/Bowtie.
- Cufflinks 2.2.1 requires a text file of SAM alignments or a binary SAM (BAM) file as input. For more details, see the SAM format specification. It is recommended that you use alignments generated by TopHat as input files.
Test data for this app appears directly in the Discovery Environment in the Data window under Community Data -> iplant_training -> intro_rna-seq -> 02_tophat.
Use the accepted_hits.bam files from the hy5 and WT rep1 and rep2 directories for an example run. The notes below are taken directly from the Cufflinks user manual.
A general example of an input alignment record (one SAM line) follows:
s6.25mer.txt-913508 16 chr1 4482736 255 14M431N11M * 0 0 CAAGATGCTAGGCAAGTCTTGGAAG IIIIIIIIIIIIIIIIIIIIIIIII NM:i:0 XS:A:-
Note the use of the custom tag XS. This attribute, which must have a value of "+" or "-", indicates which strand the RNA that produced this read came from. While this tag can be applied to any alignment, including unspliced ones, it must be present for all spliced alignment records (those with an 'N' operation in the CIGAR string).
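The XS requirement above can be checked before running the app. The following is a minimal sketch, not part of the Cufflinks manual: it assumes a tab-separated text SAM file, and the function name check_xs and the file name hits.sam are hypothetical. It prints the read name of every spliced record that lacks an XS:A:+ or XS:A:- tag.

```shell
# check_xs FILE
# Hypothetical helper: lists spliced alignments (CIGAR contains N)
# that carry no valid XS strand tag.
check_xs() {
    awk -F '\t' '
        /^@/ { next }                      # skip SAM header lines
        $6 ~ /N/ {                         # column 6 is the CIGAR string
            ok = 0
            for (i = 12; i <= NF; i++)     # optional tags start at column 12
                if ($i == "XS:A:+" || $i == "XS:A:-") ok = 1
            if (!ok) print $1              # column 1 is the read name
        }
    ' "$1"
}
```

Running check_xs hits.sam should print nothing if every spliced record is tagged; any read names it prints point to records that need attention.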
The SAM file supplied to Cufflinks must be sorted by reference position. If you aligned your reads with TopHat, your alignments will be properly sorted already. If you used another tool, you may want to make sure they are properly sorted as follows:
sort -k 3,3 -k 4,4n hits.sam > hits.sam.sorted
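If the SAM file contains header lines (lines starting with "@"), a plain sort treats them like alignment records and moves them into the body of the file. The sketch below is one way to avoid that, under the assumption that the input is a text SAM file; the function name sort_sam is hypothetical, and the sort keys are the same as in the command above.

```shell
# sort_sam FILE
# Hypothetical helper: writes header lines first, unchanged, then the
# alignment records sorted by reference name (column 3) and by
# numeric position (column 4).
sort_sam() {
    awk '/^@/' "$1"                              # header lines, original order
    awk '!/^@/' "$1" | sort -k 3,3 -k 4,4n       # alignment records, sorted
}
```

For example, sort_sam hits.sam > hits.sam.sorted produces a file with the same name as in the command above, with any header lines kept at the top.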
Parameters Used in App
When the app is run in the Discovery Environment, use the following parameters with the above input file(s) to get the output provided in the section below.
- Reference annotation: Arabidopsis thaliana (Ensembl 14)
Leave all other parameters as default.
Cufflinks produces three main output files (notes taken directly from the Cufflinks user manual):
- transcripts.gtf: This GTF file contains Cufflinks' assembled isoforms. The first 7 columns are standard GTF, and the last column contains attributes, some of which are also standardized ("gene_id" and "transcript_id"). There is one GTF record per row, and each record represents either a transcript or an exon within a transcript.
- isoforms.fpkm_tracking: This file contains the estimated isoform-level expression values in the generic FPKM Tracking Format. Note, however, that as there is only one sample, the "q" format is not used.
- genes.fpkm_tracking: This file contains the estimated gene-level expression values in the generic FPKM Tracking Format. Note, however, that as there is only one sample, the "q" format is not used.
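Once an FPKM tracking file has been downloaded, its expression values can be inspected from the command line. The sketch below is illustrative only: it assumes a tab-separated file whose header row names the columns tracking_id, FPKM and FPKM_status, as in the single-sample format described in the Cufflinks manual, and the function name expressed is hypothetical. It looks columns up by header name and prints the ID and FPKM of every entry with status OK and a non-zero FPKM.

```shell
# expressed FILE
# Hypothetical helper: prints "tracking_id<TAB>FPKM" for entries whose
# FPKM_status is OK and whose FPKM is greater than zero.
expressed() {
    awk -F '\t' '
        NR == 1 {                                  # header row: map names to column numbers
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        $(col["FPKM_status"]) == "OK" && $(col["FPKM"]) + 0 > 0 {
            print $(col["tracking_id"]) "\t" $(col["FPKM"])
        }
    ' "$1"
}
```

For example, expressed genes.fpkm_tracking lists the expressed genes of one sample, assuming the gene-level file carries that name.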
In the directory Community Data -> iplant_training -> intro_rna-seq -> 03_cufflinks, you will see a directory for each of the BAM files used as input. These directories also contain a "skipped.gtf" file.