Rationale and background

RNA-Seq starts with the preparation of mRNA, which is converted to cDNA and used as input for next-generation sequencing library preparation. Before RNA-Seq, gene expression studies relied on hybridization-based microarrays, whose main drawback was poor quantification of lowly and highly expressed genes. RNA-Seq offers distinct advantages over microarrays: it gives better insight into alternative splicing, post-transcriptional modifications, gene fusions, and differentially expressed genes, and thus helps in understanding gene structure and expression patterns across samples, treatment conditions, and time points. The ease and low cost of sequencing have made RNA-Seq a workhorse of transcriptomic studies and a viable option even for small labs. The main challenge remains the analysis of the sequenced data.

...

If you do not have access to the Discovery Environment, please register for a CyVerse account at http://user.cyverse.org

Specific Objectives

By the end of this module, you should

...

Alessandra Fracasso,  Luisa M. Trindade and Stefano Amaducci; DOI: 10.1186/s12870-016-0800-x, May 2016

The staged FASTQ data can be found in the following location:

Community Data -> iplantcollaborative -> example_data -> HISAT2_StringTie_Ballgown -> reads

Section 1: Align reads to reference using HISAT2 aligner

HISAT2 is an efficient spliced aligner that replaces TopHat2 in the new Tuxedo protocol. Like TopHat2, it builds on the FM index, but it combines one global FM index with many small local FM indexes. This data structure makes its alignment several times faster than TopHat2.
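To give a feel for what an FM index does, here is a toy sketch in Python. It builds a single index over a short string and counts exact matches with backward search. It is only an illustration: the function names are made up for this example, the suffix array is built naively, and it does not reproduce the global-plus-local index layout that HISAT2 actually uses.

```python
def build_fm_index(text):
    """Build a toy FM index (BWT string + first-column offsets) for `text`."""
    text += "$"  # sentinel that sorts before A, C, G, T
    # Naive suffix array: fine for a toy example, far too slow for a genome.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT = character preceding each sorted suffix (wraps to "$" for i == 0).
    bwt = "".join(text[i - 1] for i in suffix_array)
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    # first[c] = number of characters in the text that sort before c.
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    return bwt, first


def count_occurrences(pattern, bwt, first):
    """Count exact occurrences of `pattern` using backward search."""
    lo, hi = 0, len(bwt)  # current suffix-array interval [lo, hi)
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        # Rank queries done by plain counting; real indexes precompute them.
        lo = first[ch] + bwt[:lo].count(ch)
        hi = first[ch] + bwt[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo


bwt, first = build_fm_index("ACGTACGTAC")
print(count_occurrences("AC", bwt, first))
```

The search walks the pattern from its last character to its first and narrows an interval of matching suffixes at each step, which is why lookup time depends on the read length rather than on the genome size.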

...

We will use the bam_output folder to assemble transcripts using StringTie.

Section 2: Assemble transcripts with StringTie-1.3.3

StringTie, like Cufflinks2, assembles transcripts from RNA-Seq reads aligned to the reference and quantifies them. It uses a network flow algorithm: it assembles and quantifies the most highly expressed transcript first, removes the reads associated with that transcript, and repeats the process until all reads are used. This approach lets StringTie run faster and use less memory than Cufflinks2. If a reference annotation file is provided, StringTie uses it to guide the assembly of low-abundance genes, but this is optional. Alternatively, you can skip the assembly of novel genes and transcripts and use StringTie simply to quantify the transcripts provided in an annotation file.

...

# stringtie IS20351_DS_1_1.sorted.bam -o IS20351_DS_1_1.gtf -G Sorghum_bicolor.Sorbi1.20.gtf -p 4 -m 200 -c 2.5 -g 50 -f 0.1 -C IS20351_DS_1_1.refs.gtf -A IS20351_DS_1_1.abund.tab
# StringTie version 1.3.3
8       StringTie       transcript      10354   14469   1000    +       .       gene_id "STRG.1"; transcript_id "STRG.1.1"; cov "3.489598"; FPKM "0.801415"; TPM "0.990349";
8       StringTie       exon    10354   10640   1000    +       .       gene_id "STRG.1"; transcript_id "STRG.1.1"; exon_number "1"; cov "1.000000";
8       StringTie       exon    13241   13900   1000    +       .       gene_id "STRG.1"; transcript_id "STRG.1.1"; exon_number "2"; cov "5.615151";
8       StringTie       exon    13975   14469   1000    +       .       gene_id "STRG.1"; transcript_id "STRG.1.1"; exon_number "3"; cov "2.098990";
8       StringTie       transcript      43183   43920   1000    +       .       gene_id "STRG.2"; transcript_id "STRG.2.1"; cov "2.972900"; FPKM "0.682751"; TPM "0.843710";
8       StringTie       exon    43183   43920   1000    +       .       gene_id "STRG.2"; transcript_id "STRG.2.1"; exon_number "1"; cov "2.972900";
8       StringTie       transcript      43997   44781   1000    +       .       gene_id "STRG.3"; transcript_id "STRG.3.1"; cov "6.772846"; FPKM "1.555440"; TPM "1.922136";
8       StringTie       exon    43997   44216   1000    +       .       gene_id "STRG.3"; transcript_id "STRG.3.1"; exon_number "1"; cov "9.109091"; 
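If you want to inspect this output outside the DE, the transcript rows can be read with a few lines of Python. The sketch below assumes the file is tab-delimited GTF with the attribute layout shown above (gene_id, transcript_id, cov, FPKM, TPM); the function name `parse_transcripts` is only illustrative.

```python
import re

# Matches `key "value";` pairs in the ninth (attribute) GTF column.
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_transcripts(gtf_lines):
    """Return one dict per transcript row with coordinates, cov, FPKM and TPM."""
    records = []
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip the command/version header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "transcript":
            continue  # exon rows carry no FPKM/TPM, so keep transcripts only
        attrs = dict(ATTR_RE.findall(fields[8]))
        records.append({
            "chrom": fields[0],
            "start": int(fields[3]),
            "end": int(fields[4]),
            "strand": fields[6],
            "gene_id": attrs["gene_id"],
            "transcript_id": attrs["transcript_id"],
            "cov": float(attrs["cov"]),
            "FPKM": float(attrs["FPKM"]),
            "TPM": float(attrs["TPM"]),
        })
    return records


# Example row taken from the output above, rebuilt with real tab separators.
example = "\t".join([
    "8", "StringTie", "transcript", "10354", "14469", "1000", "+", ".",
    'gene_id "STRG.1"; transcript_id "STRG.1.1"; cov "3.489598"; '
    'FPKM "0.801415"; TPM "0.990349";',
])
for rec in parse_transcripts([example]):
    print(rec["transcript_id"], rec["FPKM"], rec["TPM"])
```

To read a real file, pass an open file handle instead of the example list, for instance `parse_transcripts(open("IS20351_DS_1_1.gtf"))`.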

Section 3: Merge all StringTie-1.3.3 transcripts into a single transcriptome annotation file using StringTie-1.3.3_merge

StringTie-merge, like Cufflinks-merge, merges the transcript assemblies from individual samples into a consolidated annotation set. This merging step helps restore full-length transcript structures, especially for transcripts assembled from low coverage. The main purpose of this application is to make it easier to produce an assembly GTF file suitable for use with Ballgown. A merged, empirical annotation file can represent this experiment more accurately than the standard reference annotation, because rare or novel genes and alternative splicing isoforms seen in the experiment are better reflected in the empirical transcriptome assemblies.

...

We will use merged.out.gtf with the StringTie-1.3.3 app again.

Section 4: Create Ballgown input files using StringTie-1.3.3

We will use StringTie-1.3.3 again to assemble transcripts, but this time we will use the consolidated annotation file produced by the StringTie-1.3.3_merge step. This re-estimates the transcript abundances using the merged structures; reads may need to be re-allocated for transcripts whose structures were altered by the merging step. We will set the StringTie options -B and -e, which create table count files (*.ctab files) for each sample to be used in Ballgown for differential expression. Name the analysis "StringTie-1.3.3_from_merged_annotation".
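For readers who run StringTie outside the DE, the equivalent command line can be sketched as follows. This small Python helper only builds and prints the command; it does not run StringTie. The BAM naming (`<sample>.sorted.bam`) follows the example header shown in Section 2, while the helper name, the per-sample output directory, and the thread count are assumptions made for illustration.

```python
import shlex


def ballgown_stringtie_cmd(sample, merged_gtf="merged.out.gtf", threads=4):
    """Build a StringTie command that writes Ballgown tables for one sample."""
    # Assumed layout: one sub-folder per sample, which is what Ballgown expects.
    out_dir = f"ballgown_input_files/{sample}"
    return [
        "stringtie", f"{sample}.sorted.bam",
        "-G", merged_gtf,   # merged annotation from the StringTie merge step
        "-e",               # estimate abundances only for transcripts in -G
        "-B",               # write *.ctab tables next to the output GTF
        "-p", str(threads),
        "-o", f"{out_dir}/{sample}.gtf",
    ]


for sample in ["IS20351_DS_1_1", "IS20351_WW_1_1"]:
    print(shlex.join(ballgown_stringtie_cmd(sample)))
```

Building the command as a list keeps each argument separate, so it can be passed directly to `subprocess.run` if you do want to execute it.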

...

Run the App. When it is complete, you should see the following outputs:

  • Ballgown_input_files: count files (*ctab files) for each sample

    • e_data.ctab: exon-level expression measurements.

    • i_data.ctab: intron- (i.e., junction-) level expression measurements

    • t_data.ctab: transcript-level expression measurements

    • e2t.ctab: table with two columns, e_id and t_id, denoting which exons belong to which transcripts

    • i2t.ctab: table with two columns, i_id and t_id, denoting which introns belong to which transcripts

For more details on the ctab files, please refer to the Ballgown documentation.
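As a quick check of how these tables fit together, the sketch below reads an e2t.ctab-style table and groups exon IDs by transcript. It assumes a tab-delimited file with the header columns e_id and t_id, as described in the list above; the function name and the tiny in-memory table are made up for the example.

```python
import csv
import io
from collections import defaultdict


def exons_by_transcript(handle):
    """Map each t_id to the list of e_id values that belong to it."""
    mapping = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        mapping[row["t_id"]].append(row["e_id"])
    return dict(mapping)


# Hypothetical miniature e2t.ctab: exon 2 is shared by transcripts 1 and 2.
toy_e2t = io.StringIO("e_id\tt_id\n1\t1\n2\t1\n2\t2\n3\t2\n")
print(exons_by_transcript(toy_e2t))
```

The same pattern works for i2t.ctab by swapping e_id for i_id.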

Section 5: Compare expression analysis using Ballgown

Ballgown is an R package that uses abundance data produced by StringTie to perform differential expression analysis at the gene, transcript, exon, or junction level. It supports both time-series and fixed-condition differential expression analysis.

...

ID                condition   reps
IS20351_DS_1_1    DS          1
IS20351_DS_1_2    DS          2
IS20351_DS_1_3    DS          3
IS20351_WW_1_1    WW          1
IS20351_WW_1_2    WW          2
IS20351_WW_1_3    WW          3

Here we are doing a pairwise comparison for differential expression in the sensitive genotype between the Drought Stress (DS) and Well Watered (WW) conditions. We have 3 replicates for each condition.
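Before running the comparison, it can help to confirm that the design matrix really describes two conditions with three replicates each. The following sketch parses the table above from a whitespace-delimited string and groups sample IDs by condition; the function name is illustrative and the parsing assumes the three-column layout shown.

```python
from collections import defaultdict

DESIGN_MATRIX = """\
ID condition reps
IS20351_DS_1_1 DS 1
IS20351_DS_1_2 DS 2
IS20351_DS_1_3 DS 3
IS20351_WW_1_1 WW 1
IS20351_WW_1_2 WW 2
IS20351_WW_1_3 WW 3
"""


def samples_by_condition(text):
    """Group sample IDs by the value in the `condition` column."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    groups = defaultdict(list)
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        groups[row["condition"]].append(row["ID"])
    return dict(groups)


groups = samples_by_condition(DESIGN_MATRIX)
for condition, samples in groups.items():
    print(condition, len(samples), samples)
```

If one condition reports fewer samples than the other, check the design matrix for typos in the sample IDs before launching Ballgown.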

...

  1.   Rplots.pdf - Boxplot of the FPKM distribution of each sample
  2.   results_gene.tsv - Gene-level differential expression with no filtering
  3.   results_gene_filter.sig.tsv - Filtered genes with p value < 0.05
  4.   results_gene_filter.tsv - Gene-level results after filtering low-abundance genes; here we remove all genes with a variance across samples of less than one
  5.   results_trans.tsv - Transcript-level differential expression with no filtering
  6.   results_trans_filter.sig.tsv - Filtered transcripts with p value < 0.05
  7.   results_trans_filter.tsv - Transcript-level results after filtering low-abundance transcripts; here we remove all transcripts with a variance across samples of less than one
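The two filters described in this list can be sketched in a few lines. This is only an illustration of the logic, not the code the app runs: the function names and the expression values are made up, the variance is the sample variance from Python's statistics module, and the column name pval is assumed for the results table.

```python
from statistics import variance


def filter_low_variance(expression, min_var=1.0):
    """Drop features whose variance across samples is less than min_var."""
    return {
        feature: values
        for feature, values in expression.items()
        if variance(values) >= min_var
    }


def significant(results, alpha=0.05):
    """Keep result rows whose p value is below alpha."""
    return [row for row in results if row["pval"] < alpha]


# Hypothetical FPKM values for three features across six samples.
fpkm = {
    "STRG.1": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],   # not expressed, variance 0
    "STRG.2": [0.0, 1.0, 0.0, 1.0, 0.0, 1.0],   # low variance, removed
    "STRG.3": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],   # variance 3.5, kept
}
kept = filter_low_variance(fpkm)
print(sorted(kept))

# Hypothetical results rows for the features that passed the filter.
rows = [{"id": "STRG.3", "pval": 0.01}, {"id": "STRG.4", "pval": 0.20}]
print([row["id"] for row in significant(rows)])
```

Filtering on variance first reduces the number of tests, which is why the filtered tables are the ones used to pick significant genes and transcripts.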

Running Ballgown for differential gene expression and visualization using RStudio-Ballgown

For this, we will first download the data needed to run Ballgown (the Section 4 output, "/iplant/home/shared/iplantcollaborative/example_data/HISAT2-StringTie-Ballgown/StringTie-1.3.3_from_merged_annotation/ballgown_input_files") as well as the design matrix file ("/iplant/home/shared/iplantcollaborative/example_data/HISAT2-StringTie-Ballgown/design_matrix"). We will then use the Rstudio-Ballgown app in the DE to run an interactive analysis with the Ballgown R package.

1. Launch the Rstudio-Ballgown app in DE

Search for "Rstudio-Ballgown" app in the search window under Apps

...

Note

Rstudio-Ballgown will take a few minutes to launch, and you will sometimes see this screen. Don't worry; eventually you will be able to see the Rstudio-Ballgown app.

2. Enter the user name and password (`rstudio` and `rstudio`) to launch RStudio in the browser

When you log in, you will not see the files in the Files window. They are located under `/de-app-work`. You can navigate to that directory using the following steps:

3. Paste the R commands below to begin the analysis. Go to File -> New File -> R Script, then paste the commands.

R commands: http://rpubs.com/upendra_35/466542

...