Rationale and background
RNA-Seq involves isolating mRNA, converting it to cDNA, and using the cDNA as input to a next-generation sequencing library preparation method. Before RNA-Seq, gene expression studies relied on hybridization-based microarrays, whose main drawback was poor quantification of lowly and highly expressed genes. RNA-Seq offers distinct advantages over microarrays: it provides better insight into alternative splicing, post-transcriptional modifications, gene fusions, and differentially expressed genes, which helps in understanding gene structure and expression patterns across samples, treatment conditions, and time points. The ease and low cost of sequencing have made RNA-Seq a workhorse of transcriptomic studies and a viable option even for small labs. The main challenge remains analyzing the sequenced data.
...
If you do not have access to the Discovery Environment, please register for a CyVerse account at http://user.cyverse.org
Specific Objectives
By the end of this module, you should
...
Alessandra Fracasso, Luisa M. Trindade and Stefano Amaducci; DOI: 10.1186/s12870-016-0800-x, May 2016
The staged FASTQ data can be found at:
No Format |
---|
Community Data -> iplantcollaborative -> example_data -> HISAT2_StringTie_Ballgown -> reads |
Section 1: Align reads to reference using HISAT2 aligner
HISAT2 is an efficient splice-aware aligner that replaces TopHat2 in the new Tuxedo protocol. Like TopHat2, it relies on FM indexing, but it combines one global FM index with many small local FM indexes. This data structure makes alignment several times faster than with TopHat2.
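To make the FM index idea more concrete, here is a toy Python sketch of a single, uncompressed FM index with backward search. It is only an illustration under simplifying assumptions: HISAT2's real hierarchical index (one global index plus many local ones) is far more elaborate, and the names `build_fm_index` and `find` are made up for this example.

```python
from collections import Counter


def build_fm_index(text):
    """Build a tiny, uncompressed FM index for `text` (toy example only)."""
    text += "$"  # sentinel that sorts before A, C, G and T
    # Suffix array: start positions of all suffixes, in sorted order.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    # c_table[ch] = number of characters in the text that sort before ch.
    counts = Counter(bwt)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[ch][i] = number of occurrences of ch in bwt[:i].
    occ = {ch: [0] for ch in counts}
    for b in bwt:
        for ch in occ:
            occ[ch].append(occ[ch][-1] + (ch == b))
    return sa, c_table, occ


def find(pattern, sa, c_table, occ):
    """Return sorted 0-based start positions of `pattern` via backward search."""
    lo, hi = 0, len(sa)
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


index = build_fm_index("GATTACAGATTACA")
print(find("ATTA", *index))
```

The search narrows a range of the suffix array one pattern character at a time, so its cost depends on the read length rather than on the genome size, which is what makes FM-index aligners fast.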
...
We will use the bam_output folder to assemble transcripts using StringTie.
Section 2: Assemble transcripts with StringTie-1.3.3
StringTie, like Cufflinks2, assembles transcripts from RNA-Seq reads aligned to the reference and quantifies them. It uses a network flow algorithm: it simultaneously assembles and quantifies the most highly expressed transcript, removes the reads associated with that transcript, and repeats the process until all reads are used. This algorithm lets StringTie run faster and use less memory than Cufflinks2. If a reference annotation file is provided, StringTie uses it to guide the assembly of low-abundance genes, but this is optional. Alternatively, you can skip the assembly of novel genes and transcripts and use StringTie simply to quantify the transcripts provided in an annotation file.
...
No Format |
---|
| # stringtie IS20351_DS_1_1.sorted.bam -o IS20351_DS_1_1.gtf -G Sorghum_bicolor.Sorbi1.20.gtf -p 4 -m 200 -c 2.5 -g 50 -f 0.1 -C IS20351_DS_1_1.refs.gtf -A IS20351_DS_1_1.abund.tab |
| # StringTie version 1.3.3 |
| 8 StringTie transcript 10354 14469 1000 + . gene_id "STRG.1"; transcript_id "STRG.1.1"; cov "3.489598"; FPKM "0.801415"; TPM "0.990349"; |
| 8 StringTie exon 10354 10640 1000 + . gene_id "STRG.1"; transcript_id "STRG.1.1"; exon_number "1"; cov "1.000000"; |
| 8 StringTie exon 13241 13900 1000 + . gene_id "STRG.1"; transcript_id "STRG.1.1"; exon_number "2"; cov "5.615151"; |
| 8 StringTie exon 13975 14469 1000 + . gene_id "STRG.1"; transcript_id "STRG.1.1"; exon_number "3"; cov "2.098990"; |
| 8 StringTie transcript 43183 43920 1000 + . gene_id "STRG.2"; transcript_id "STRG.2.1"; cov "2.972900"; FPKM "0.682751"; TPM "0.843710"; |
| 8 StringTie exon 43183 43920 1000 + . gene_id "STRG.2"; transcript_id "STRG.2.1"; exon_number "1"; cov "2.972900"; |
| 8 StringTie transcript 43997 44781 1000 + . gene_id "STRG.3"; transcript_id "STRG.3.1"; cov "6.772846"; FPKM "1.555440"; TPM "1.922136"; |
| 8 StringTie exon 43997 44216 1000 + . gene_id "STRG.3"; transcript_id "STRG.3.1"; exon_number "1"; cov "9.109091"; |
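If you want to inspect the StringTie GTF output outside the Discovery Environment, a small parser can help. The Python sketch below assumes the standard GTF layout of nine tab-separated columns, with `key "value";` pairs in the last column; the function name `parse_gtf_line` is hypothetical, and the example record is the first transcript line shown above, re-typed with tabs.

```python
import re

# Matches one `key "value";` pair in the GTF attribute column.
ATTRIBUTE_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_gtf_line(line):
    """Parse one GTF record into a dict; return None for '#' comment lines."""
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, attrs = fields
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "score": score,
        "strand": strand,
        "frame": frame,
        "attributes": dict(ATTRIBUTE_RE.findall(attrs)),
    }


EXAMPLE = (
    "8\tStringTie\ttranscript\t10354\t14469\t1000\t+\t.\t"
    'gene_id "STRG.1"; transcript_id "STRG.1.1"; cov "3.489598"; '
    'FPKM "0.801415"; TPM "0.990349";'
)

record = parse_gtf_line(EXAMPLE)
print(record["attributes"]["transcript_id"], record["attributes"]["TPM"])
```

Attribute values are kept as strings; convert `cov`, `FPKM` and `TPM` with `float()` when you need numbers.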
Section 3: Merge all StringTie-1.3.3 transcripts into a single transcriptome annotation file using StringTie-1.3.3_merge
StringTie-merge, like Cufflinks-merge, merges the transcript assemblies from all samples into a consolidated annotation set. This merging step helps restore the full-length structure of transcripts, especially those assembled with low coverage. The main purpose of this application is to make it easier to produce an assembly GTF file suitable for use with Ballgown. A merged, empirical annotation file will be more accurate than the standard reference annotation, because the expression of rare or novel genes and alternative splicing isoforms seen in this experiment will be better reflected in the empirical transcriptome assemblies.
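For readers who prefer the command line to the DE app, the merge step can be sketched roughly as follows. This Python snippet only builds the inputs for `stringtie --merge`; it does not run anything. The per-sample GTF names follow the `<sample>.gtf` pattern seen in Section 2, while the list file name `mergelist.txt` and the helper names are assumptions made for this example.

```python
SAMPLES = [
    "IS20351_DS_1_1", "IS20351_DS_1_2", "IS20351_DS_1_3",
    "IS20351_WW_1_1", "IS20351_WW_1_2", "IS20351_WW_1_3",
]


def merge_list_text(samples):
    """Contents of the merge list: one per-sample assembly GTF path per line."""
    return "".join(f"{sample}.gtf\n" for sample in samples)


def merge_command(list_path, reference_gtf, out_gtf="merged.out.gtf"):
    """Argument list for `stringtie --merge` (not executed here)."""
    return ["stringtie", "--merge", "-G", reference_gtf, "-o", out_gtf, list_path]


print(merge_list_text(SAMPLES), end="")
print(" ".join(merge_command("mergelist.txt", "Sorghum_bicolor.Sorbi1.20.gtf")))
```

To run the merge yourself, you would save the printed list as `mergelist.txt` and pass the argument list to `subprocess.run` on a machine where StringTie is installed.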
...
We will use merged.out.gtf with the StringTie-1.3.3 app again.
Section 4: Create Ballgown input files using StringTie-1.3.3
We will run StringTie-1.3.3 again, but this time we will use the consolidated annotation file from the StringTie-1.3.3_merge step. This re-estimates the transcript abundances using the merged structures; reads may need to be re-allocated for transcripts whose structures were altered by the merging step. We will set the -B and -e options of StringTie, which create table count files (*.ctab files) for each sample to be used in Ballgown for differential expression. Name the analysis "StringTie-1.3.3_from_merged_annotation".
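Outside the DE, this step corresponds to one `stringtie` call per sample with `-e` (estimate abundances only for the transcripts in the given annotation) and `-B` (write the Ballgown `*.ctab` tables next to the output GTF). The Python sketch below only prints the commands. The `<sample>.sorted.bam` naming follows Section 2; the output layout `ballgown_input_files/<sample>/<sample>.gtf`, the thread count and the helper name are assumptions for illustration.

```python
import shlex

SAMPLES = [
    "IS20351_DS_1_1", "IS20351_DS_1_2", "IS20351_DS_1_3",
    "IS20351_WW_1_1", "IS20351_WW_1_2", "IS20351_WW_1_3",
]


def ballgown_command(sample, merged_gtf="merged.out.gtf",
                     out_root="ballgown_input_files", threads=4):
    """Argument list for one StringTie run that emits Ballgown tables."""
    return [
        "stringtie", f"{sample}.sorted.bam",
        "-e",                      # quantify annotated transcripts only
        "-B",                      # write *.ctab files beside the -o file
        "-p", str(threads),
        "-G", merged_gtf,
        "-o", f"{out_root}/{sample}/{sample}.gtf",
    ]


for sample in SAMPLES:
    print(shlex.join(ballgown_command(sample)))
```

Giving each sample its own output folder matters because the `*.ctab` files have the same names for every sample.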
...
Run the App. When it is complete, you should see the following outputs under
Ballgown_input_files: count files (*.ctab files) for each sample
- e_data.ctab: exon-level expression measurements
- i_data.ctab: intron- (i.e., junction-) level expression measurements
- t_data.ctab: transcript-level expression measurements
- e2t.ctab: table with two columns, e_id and t_id, denoting which exons belong to which transcripts
- i2t.ctab: table with two columns, i_id and t_id, denoting which introns belong to which transcripts
For more details on the ctab files, please refer to the Ballgown documentation.
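As a quick illustration of how the mapping tables can be used, the Python sketch below groups exon IDs by transcript from an `e2t.ctab`-style table. It assumes the file is tab-delimited with a header row naming the `e_id` and `t_id` columns; the four example rows are invented, not taken from the tutorial data, and the function name is hypothetical.

```python
import csv
import io
from collections import defaultdict


def exons_by_transcript(handle):
    """Map each t_id to the list of e_id values assigned to it."""
    reader = csv.DictReader(handle, delimiter="\t")
    mapping = defaultdict(list)
    for row in reader:
        mapping[int(row["t_id"])].append(int(row["e_id"]))
    return dict(mapping)


# Invented example content in the assumed e2t.ctab layout.
EXAMPLE_E2T = "e_id\tt_id\n1\t1\n2\t1\n3\t1\n4\t2\n"

print(exons_by_transcript(io.StringIO(EXAMPLE_E2T)))
```

With a real file, you would pass `open("e2t.ctab", newline="")` instead of the `io.StringIO` object; the same pattern works for `i2t.ctab` with `i_id` in place of `e_id`.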
Section 5: Compare expression analysis using Ballgown
Ballgown is an R package that uses abundance data produced by StringTie to perform differential expression analysis at the gene, transcript, exon, or junction level. It supports both time-series and fixed-condition differential expression analysis.
...
ID condition reps
IS20351_DS_1_1 DS 1
IS20351_DS_1_2 DS 2
IS20351_DS_1_3 DS 3
IS20351_WW_1_1 WW 1
IS20351_WW_1_2 WW 2
IS20351_WW_1_3 WW 3
Here we are doing a pairwise comparison for differential expression in the sensitive genotype under Drought Stress (DS) and Well Watered (WW) conditions. We have three replicates for each condition.
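To double-check a design matrix before running Ballgown, you can group the sample IDs by condition. The Python sketch below assumes a whitespace-separated file with the `ID`, `condition` and `reps` header shown above; the function name `samples_by_condition` is hypothetical.

```python
from collections import defaultdict

DESIGN_MATRIX = """\
ID condition reps
IS20351_DS_1_1 DS 1
IS20351_DS_1_2 DS 2
IS20351_DS_1_3 DS 3
IS20351_WW_1_1 WW 1
IS20351_WW_1_2 WW 2
IS20351_WW_1_3 WW 3
"""


def samples_by_condition(text):
    """Return {condition: [sample IDs]} from a whitespace-separated table."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    header, records = rows[0], rows[1:]
    id_col = header.index("ID")
    condition_col = header.index("condition")
    groups = defaultdict(list)
    for record in records:
        groups[record[condition_col]].append(record[id_col])
    return dict(groups)


for condition, samples in samples_by_condition(DESIGN_MATRIX).items():
    print(condition, len(samples), samples)
```

A pairwise comparison like the one in this tutorial expects exactly two conditions, each with its replicates listed.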
...
- Rplots.pdf: boxplot of the FPKM distribution of each sample
- results_gene.tsv: gene-level differential expression with no filtering
- results_gene_filter.sig.tsv: genes with p value < 0.05
- results_gene_filter.tsv: gene-level results after filtering low-abundance genes; here we remove all genes with a variance across samples less than one
- results_trans.tsv: transcript-level differential expression with no filtering
- results_trans_filter.sig.tsv: transcripts with p value < 0.05
- results_trans_filter.tsv: transcript-level results after filtering low-abundance transcripts; here we remove all transcripts with a variance across samples less than one
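The two filters behind these files can be sketched in a few lines of Python. This is not the app's actual code: it is a minimal illustration that assumes the variance in question is the sample variance of the FPKM values across samples, and all expression values and p values below are invented.

```python
from statistics import variance


def filter_low_variance(expression, min_variance=1.0):
    """Drop features whose sample variance across samples is below the cutoff."""
    return {
        feature: values
        for feature, values in expression.items()
        if variance(values) >= min_variance
    }


def significant(pvalues, alpha=0.05):
    """Names of features with p value below alpha, sorted alphabetically."""
    return sorted(name for name, p in pvalues.items() if p < alpha)


# Invented FPKM values for six samples (3 DS followed by 3 WW).
EXPRESSION = {
    "gene_a": [0.1, 0.2, 0.1, 0.3, 0.2, 0.1],   # nearly flat: filtered out
    "gene_b": [1.0, 5.0, 9.0, 2.0, 6.0, 10.0],  # variable: kept
}
# Invented p values, as they might appear in a results table.
PVALUES = {"gene_b": 0.01, "gene_c": 0.20}

kept = filter_low_variance(EXPRESSION)
print(sorted(kept))
print(significant(PVALUES))
```

Filtering on variance before testing removes features that barely change across samples, so fewer uninformative tests are run.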
Running Ballgown for differential gene expression and visualization using RStudio-Ballgown
For this, we will first download the data needed to run Ballgown (the Section 4 output, "/iplant/home/shared/iplantcollaborative/example_data/HISAT2-StringTie-Ballgown/StringTie-1.3.3_from_merged_annotation/ballgown_input_files") as well as the design matrix file ("/iplant/home/shared/iplantcollaborative/example_data/HISAT2-StringTie-Ballgown/design_matrix"). We will use the Rstudio-Ballgown app in the DE to do the interactive analysis with the Ballgown R package.
1. Launch the Rstudio-Ballgown app in DE
Search for "Rstudio-Ballgown" app in the search window under Apps
...
Note |
---|
Rstudio-Ballgown will take a few minutes to launch, and you will sometimes see this screen. Don't worry, eventually you will be able to see the Rstudio-Ballgown app |
2. Enter the user name and password (`rstudio` and `rstudio`) to launch RStudio in the browser
When you log in, you will not see the files in the Files window. They are located under `/de-app-work`. You can navigate to that directory using the following steps
3. Paste the R commands below to begin the analysis. Go to File -> New File -> R Script, then paste the commands.
...