Using Kallisto RNA seq tool in DE

This tutorial shows how to use the Kallisto and Sleuth RNA-seq tools in the iPlant Discovery Environment (DE).

Input Data:


The data are taken from the "Cuffdiff2 paper":

Differential analysis of gene regulation at transcript resolution with RNA-seq, by Cole Trapnell, David G Hendrickson, Martin Sauvageau, Loyal Goff, John L Rinn and Lior Pachter, Nature Biotechnology 31, 46–53 (2013).

The human fibroblast RNA-Seq data for the paper are available on GEO under accession GSE37704. The six samples to be analyzed are LFB_scramble_hiseq_repA, LFB_scramble_hiseq_repB, LFB_scramble_hiseq_repC, LFB_HOXA1KD_hiseq_repA, LFB_HOXA1KD_hiseq_repB, and LFB_HOXA1KD_hiseq_repC. These are three biological replicates in each of two conditions (scramble and HOXA1 knockdown) that will be compared with sleuth.

 

run_accession  experiment_accession  spots     condition  sequencer  sample
SRR493366      SRX145662             15117833  scramble   hiseq      A
SRR493367      SRX145663             17433672  scramble   hiseq      B
SRR493368      SRX145664             21830449  scramble   hiseq      C
SRR493369      SRX145665             17916102  HOXA1KD    hiseq      A
SRR493370      SRX145666             20141813  HOXA1KD    hiseq      B
SRR493371      SRX145667             23544153  HOXA1KD    hiseq      C
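Because the run-to-condition mapping is needed again when the samples are compared, it can help to keep it in a machine-readable form. The following is a minimal sketch, not part of the DE apps: it turns the run table into a tab-separated sample sheet. The column names "sample", "condition" and "replicate" and the function name are illustrative assumptions.

```python
import csv
import io

# Run metadata copied from the run table in this tutorial:
# (run_accession, experiment_accession, spots, condition, sequencer, sample)
RUNS = [
    ("SRR493366", "SRX145662", 15117833, "scramble", "hiseq", "A"),
    ("SRR493367", "SRX145663", 17433672, "scramble", "hiseq", "B"),
    ("SRR493368", "SRX145664", 21830449, "scramble", "hiseq", "C"),
    ("SRR493369", "SRX145665", 17916102, "HOXA1KD", "hiseq", "A"),
    ("SRR493370", "SRX145666", 20141813, "HOXA1KD", "hiseq", "B"),
    ("SRR493371", "SRX145667", 23544153, "HOXA1KD", "hiseq", "C"),
]


def sample_sheet_tsv(runs):
    """Return a tab-separated sheet with one row per run.

    The column names are illustrative choices, not a required format.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["sample", "condition", "replicate"])
    for run, _experiment, _spots, condition, _sequencer, replicate in runs:
        writer.writerow([run, condition, replicate])
    return buf.getvalue()


if __name__ == "__main__":
    print(sample_sheet_tsv(RUNS), end="")
```

Using the run accession as the sample name matches the output directory name used in the quantification step ("SRR493366").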

 

Location of data in iPlant data store:

/iplant/home/shared/iplantcollaborative/example_data/kallisto/Human

 

Kallisto RNA seq analysis

The kallisto pipeline is quite simple. There are basically two steps:

  1. build an index (once per organism or annotation)

  2. quantify against the index (once per experimental sample)

 

You may have noticed that there is no separate alignment step. This is because part of the kallisto algorithm performs a very fast "alignment" called pseudoalignment. This simplifies things from the user's point of view, since there are no extra intermediate files.

 

  1. Index

Before you can quantify with kallisto, you must create an index from an annotation file. For RNA-Seq, an annotation file is the set of cDNA transcripts in FASTA format (the "transcriptome"). If you are not working with a model organism, one option is to do a de novo assembly. Kallisto-index simply takes a FASTA file and outputs an index in a binary format that is designed for kallisto. There is no default extension for the index.

Open Kallisto-0.42.3-INDEX (Apps > Public Apps > Kallisto-0.42.3-INDEX)

  1. Name your analysis. (Give a name of your preference or keep the default.)

  2. Select "Input"

    1. Enter an index name: "human_trans"

    2. Select the FASTA file: Community Data -> iplantcollaborative -> example_data -> kallisto -> Human -> Index -> "Homo_sapiens.GRCh38.cdna.all.fa"

  3. Click on "Launch Analysis."


  4. Once the analysis is completed (approx. 5 min. with the sample data), click on "Analysis," and then click on the analysis "Name" to open the output folder.

  5. Output: this creates an index named human_trans from the annotation file with the default k-mer size (k = 31). Note that if you have very short reads (e.g. 35 bp), you should change k to something smaller (e.g. -k 21). We will use this index file in the quantification step.
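For readers who want to see what the index app corresponds to on the command line, here is a hedged sketch that only builds the `kallisto index` argument list; it does not run kallisto. The function names are hypothetical helpers, and the exact command the DE app issues internally is an assumption. The `-i` (index name) and `-k` (k-mer size) options are the ones kallisto documents for `index`.

```python
import shlex


def kallisto_index_cmd(fasta, index_name, kmer_size=31):
    """Build (but do not run) a `kallisto index` argument list."""
    # kallisto's help text describes k as an odd length with a maximum of 31.
    if kmer_size % 2 == 0 or not 1 <= kmer_size <= 31:
        raise ValueError("k-mer size must be odd and at most 31")
    return ["kallisto", "index", "-i", index_name, "-k", str(kmer_size), fasta]


def as_shell(cmd):
    """Render an argument list as a copy-pasteable shell command."""
    return " ".join(shlex.quote(part) for part in cmd)


if __name__ == "__main__":
    fasta = "Homo_sapiens.GRCh38.cdna.all.fa"
    # Default k = 31, as used in this tutorial.
    print(as_shell(kallisto_index_cmd(fasta, "human_trans")))
    # A smaller k for very short reads; the index name here is made up.
    print(as_shell(kallisto_index_cmd(fasta, "human_trans_k21", kmer_size=21)))
```

Returning a list rather than a string keeps the arguments safe to pass to `subprocess.run` later without shell quoting issues.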

 

  2. Quantification

Once you have indexed an annotation, you can quantify any number of samples against it. The quantification step includes pseudoalignment as well as running the EM algorithm to estimate transcript-level abundances.

The parameters are minimal. You must supply an index, an output location, and a set of reads. For single-end data, you must use the Kallisto-0.42.3-Qaunt-SE app and also provide the fragment length distribution of the reads, because it cannot be estimated from single-end data. For paired-end data, use Kallisto-0.42.3-Qaunt-PE; the tool can infer the fragment length distribution from the data.

There is one other important parameter: the number of bootstrap iterations. By default, kallisto runs zero bootstrap iterations. If you do not plan to run sleuth for differential expression analysis, this is fine. But if you plan to run sleuth, you must provide a nonzero number of bootstraps. In general, this number should be at least 30. In our human RNA-seq example we will set it to 100.
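The parameters described above map onto a short `kallisto quant` command line for paired-end data. The sketch below only assembles the argument lists, one per run, and prints them. The helper name is hypothetical, and the FASTQ naming for runs other than SRR493366 is an assumption that they follow the same "<run>.sra_1.fastq" / "<run>.sra_2.fastq" pattern. The `-i`, `-o` and `-b` options are kallisto's documented index, output directory and bootstrap options.

```python
import shlex

# Run accessions from the sample table in this tutorial.
RUN_ACCESSIONS = [
    "SRR493366", "SRR493367", "SRR493368",
    "SRR493369", "SRR493370", "SRR493371",
]


def kallisto_quant_pe_cmd(index, out_dir, read_pairs, bootstraps=100):
    """Build (but do not run) a paired-end `kallisto quant` argument list."""
    if bootstraps < 0:
        raise ValueError("bootstraps must be non-negative")
    cmd = ["kallisto", "quant", "-i", index, "-o", out_dir, "-b", str(bootstraps)]
    # Paired-end reads are passed as consecutive mate files.
    for mate1, mate2 in read_pairs:
        cmd += [mate1, mate2]
    return cmd


if __name__ == "__main__":
    for run in RUN_ACCESSIONS:
        # Assumed naming pattern, based on the SRR493366 example files.
        pair = (f"{run}.sra_1.fastq", f"{run}.sra_2.fastq")
        cmd = kallisto_quant_pe_cmd("human_trans", run, [pair], bootstraps=100)
        print(" ".join(shlex.quote(part) for part in cmd))
```

Each run gets its own output directory, so the six quantifications do not overwrite one another.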

 

Open Kallisto-0.42.3-Qaunt-PE (Apps > Public Apps > Kallisto-0.42.3-Qaunt-PE)

  1. Name your analysis. (Give a name of your preference or keep the default.)

  2. Select "Input"

    1. Select the index file: iplant -> home -> <your_iplant_username> -> analyses -> <your_index_analysis_name> -> "human_trans"

    2. Output directory name -> "SRR493366"

    3. Select the fastq files: Community Data -> iplantcollaborative -> example_data -> kallisto -> Human -> Reads -> "SRR493366.sra_1.fastq" and "SRR493366.sra_2.fastq"

    4. Set the number of bootstraps to 100.

    5. Click on "Launch Analysis."

    6. Once the analysis is completed (approx. 30 min. with the sample data), click on "Analysis," and then click on the analysis "Name" to open the output folder.

    7. Output:

      After quantification, you will get a number of files in the output directory.

      • run_info.json - some high-level information about the run, including the command and the version of kallisto used to generate the output.

      • abundance.tsv - a plain text file with transcript-level abundance estimates. This file can easily be read into R or any other statistical language (e.g. read.table('abundance.tsv', header = TRUE)).

      • abundance.h5 - an HDF5 file containing all of the quantification information, including bootstraps and other auxiliary information from the run. This file is read by sleuth.
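As an illustration of how simple abundance.tsv is to work with, here is a small sketch that parses it with the Python standard library and lists the transcripts with the highest TPM. The header row matches the columns kallisto writes (target_id, length, eff_length, est_counts, tpm); the three data rows are toy values invented for the example, not real results, and the function names are hypothetical.

```python
import csv
import io

# Toy stand-in for an abundance.tsv file; the rows are made-up values.
TOY_ABUNDANCE = (
    "target_id\tlength\teff_length\test_counts\ttpm\n"
    "tx_a\t1000\t850.0\t100.0\t300000.0\n"
    "tx_b\t2000\t1850.0\t50.0\t100000.0\n"
    "tx_c\t500\t350.0\t120.0\t600000.0\n"
)


def read_abundance(handle):
    """Parse a kallisto abundance.tsv stream into a list of typed dicts."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "target_id": row["target_id"],
            "length": int(row["length"]),
            "eff_length": float(row["eff_length"]),
            "est_counts": float(row["est_counts"]),
            "tpm": float(row["tpm"]),
        })
    return rows


def top_by_tpm(rows, n=10):
    """Return the n rows with the highest TPM, highest first."""
    return sorted(rows, key=lambda r: r["tpm"], reverse=True)[:n]


if __name__ == "__main__":
    rows = read_abundance(io.StringIO(TOY_ABUNDANCE))
    for r in top_by_tpm(rows, n=2):
        print(r["target_id"], r["tpm"])
```

To use it on a real run, replace the `io.StringIO(...)` stand-in with `open("SRR493366/abundance.tsv", newline="")`, assuming the output folder has been downloaded locally under that name.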