Kallisto Tutorial (TMP)

This is not the official tutorial for these materials and will not be updated. Please see the CyVerse Learning Center for the latest tutorials and materials.

Input Data:


Data taken from the "Cuffdiff2 paper"

"Differential analysis of gene regulation at transcript resolution with RNA-seq" by Cole Trapnell, David G Hendrickson, Martin Sauvageau, Loyal Goff, John L Rinn and Lior Pachter, Nature Biotechnology 31, 46–53 (2013).

The human fibroblast RNA-Seq data for the paper is available on GEO at accession GSE37704. The six samples to be analyzed are LFB_scramble_hiseq_repA, LFB_scramble_hiseq_repB, LFB_scramble_hiseq_repC, LFB_HOXA1KD_hiseq_repA, LFB_HOXA1KD_hiseq_repB, and LFB_HOXA1KD_hiseq_repC. These are three biological replicates in each of two conditions (scramble and HOXA1 knockdown) that will be compared with sleuth.

HOXA1 is a critical regulator of embryonic development and body patterning, and of the maintenance of adult cells. HOXA1 knockdown perturbs the expression of thousands of genes.

 

run_accession  experiment_accession  spots     condition  sequencer  sample
SRR493366      SRX145662             15117833  scramble   hiseq      A
SRR493367      SRX145663             17433672  scramble   hiseq      B
SRR493368      SRX145664             21830449  scramble   hiseq      C
SRR493369      SRX145665             17916102  HOXA1KD    hiseq      A
SRR493370      SRX145666             20141813  HOXA1KD    hiseq      B
SRR493371      SRX145667             23544153  HOXA1KD    hiseq      C
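
For scripting the later steps, it can help to hold this table in a machine-readable form. The following is a minimal Python sketch that encodes the six runs and groups them by condition. The names SAMPLES and runs_by_condition are illustrative only and are not part of kallisto or sleuth.

```python
# Sample sheet for the six runs, with values copied from the table above.
SAMPLES = [
    {"run": "SRR493366", "experiment": "SRX145662", "spots": 15117833, "condition": "scramble", "sample": "A"},
    {"run": "SRR493367", "experiment": "SRX145663", "spots": 17433672, "condition": "scramble", "sample": "B"},
    {"run": "SRR493368", "experiment": "SRX145664", "spots": 21830449, "condition": "scramble", "sample": "C"},
    {"run": "SRR493369", "experiment": "SRX145665", "spots": 17916102, "condition": "HOXA1KD", "sample": "A"},
    {"run": "SRR493370", "experiment": "SRX145666", "spots": 20141813, "condition": "HOXA1KD", "sample": "B"},
    {"run": "SRR493371", "experiment": "SRX145667", "spots": 23544153, "condition": "HOXA1KD", "sample": "C"},
]


def runs_by_condition(samples):
    """Group run accessions by experimental condition, keeping table order."""
    groups = {}
    for row in samples:
        groups.setdefault(row["condition"], []).append(row["run"])
    return groups


if __name__ == "__main__":
    for condition, runs in runs_by_condition(SAMPLES).items():
        print(condition, ", ".join(runs))
```

Each condition should come back with exactly three runs, which is a quick check that the design is balanced before any quantification is launched.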

 

Location of data in iPlant data store:

/iplant/home/shared/iplantcollaborative/example_data/kallisto/Human

 

Kallisto RNA seq analysis

The kallisto pipeline is quite simple. There are basically two steps:

  1. build an index (once per organism or annotation)
  2. quantify against the index (once per experimental sample)

 

You may have noticed that there is no alignment step. This is because part of the kallisto algorithm performs a very fast "alignment" which we call the pseudoalignment. This simplifies things from the user's point of view since there are no extra intermediate files.

 

  1. Index-

Before you can quantify with kallisto, you must create an index from an annotation file. For RNA-Seq, an annotation file is the set of cDNA transcripts in FASTA format (the "transcriptome"). If you are not working with a model organism, one option is to do a de novo assembly. Kallisto-index simply takes a FASTA file and outputs an index in a binary format that is designed for kallisto. There is no default extension for the index.

Open Kallisto-0.42.3-INDEX (Apps > Public Apps > Kallisto-0.42.3-INDEX)

  1. Name your analysis. (Give a name of your preference or keep the default.)
  2. Select "Input"
    1. Enter an index name: "human_trans"
    2. Select the FASTA file: Community Data -> iplantcollaborative -> example_data -> kallisto -> Human -> Index -> "Homo_sapiens.GRCh38.cdna.all.fa"
  3. Click on "Launch Analysis."

  4. Once the analysis is completed (approx. 5 min. with the sample data), click on "Analysis," and then click on the analysis "Name" to open the output folder.
  5. Output- This creates an index named human_trans from the annotation file with the default k-mer size (k = 31). Note that if you have very short reads (e.g. 35 bp), you should change k to something smaller (e.g. -k 21). We will use this index file in the next step of kallisto.
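
Outside the Discovery Environment, the same step corresponds to a `kallisto index` call on the command line. The sketch below, in Python, only assembles that command. It assumes kallisto is installed locally and that the FASTA file sits in the working directory. The helper name kallisto_index_cmd is hypothetical, and the exact options the app passes are not stated in this tutorial.

```python
import os
import shutil
import subprocess


def kallisto_index_cmd(fasta, index_name, kmer_size=31):
    """Return the argument list for a `kallisto index` call."""
    # kallisto expects an odd k-mer size no larger than 31.
    if kmer_size < 1 or kmer_size > 31 or kmer_size % 2 == 0:
        raise ValueError("k-mer size must be odd and at most 31")
    return ["kallisto", "index", "-i", index_name, "-k", str(kmer_size), fasta]


if __name__ == "__main__":
    cmd = kallisto_index_cmd("Homo_sapiens.GRCh38.cdna.all.fa", "human_trans")
    print(" ".join(cmd))
    # Run only when kallisto is installed and the FASTA file is present locally.
    if shutil.which("kallisto") and os.path.exists(cmd[-1]):
        subprocess.run(cmd, check=True)
```

For very short reads, calling kallisto_index_cmd(..., kmer_size=21) produces the `-k 21` variant mentioned above.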

 

  2. Quantification-

Once you index an annotation, you can quantify any number of samples against it. The quantification step includes pseudoalignment as well as running the EM algorithm to estimate transcript-level abundances.

The parameters are minimal. You must supply an index, an output location, and a set of reads. For single-end data, you must use the Kallisto-0.42.3-Qaunt-SE app and also provide the fragment length distribution of the reads. For paired-end data, use Kallisto-0.42.3-Qaunt-PE; the tool can then infer the fragment length distribution from the data.

There is one other important parameter: the number of bootstrap iterations. By default, kallisto runs zero bootstrap iterations. If you do not plan to run sleuth for differential expression analysis, this is fine. If you do plan to run sleuth, you must provide a nonzero number of bootstraps. In general, this number should be at least 30. In our human RNA-Seq example, we will set it to 100.
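
On the command line, these choices map to a paired-end `kallisto quant` call with `-i` (index), `-o` (output directory), and `-b` (bootstraps). The Python sketch below builds one such command per run. It assumes that every run follows the `<run>.sra_1.fastq` / `<run>.sra_2.fastq` naming of the example data, which is confirmed here only for SRR493366. The helper names are hypothetical, and the commands are printed rather than executed.

```python
RUNS = ["SRR493366", "SRR493367", "SRR493368",
        "SRR493369", "SRR493370", "SRR493371"]


def kallisto_quant_pe_cmd(index, out_dir, reads_1, reads_2, bootstraps=100):
    """Return the argument list for a paired-end `kallisto quant` call."""
    # sleuth needs bootstrap samples, so zero or negative values are rejected.
    if bootstraps < 1:
        raise ValueError("sleuth requires a nonzero number of bootstraps")
    return ["kallisto", "quant", "-i", index, "-o", out_dir,
            "-b", str(bootstraps), reads_1, reads_2]


def commands_for_runs(index, runs, bootstraps=100):
    """Build one quant command per run, using the run accession as output dir."""
    return [
        kallisto_quant_pe_cmd(
            index, run,
            f"{run}.sra_1.fastq",  # assumed file naming, see note above
            f"{run}.sra_2.fastq",
            bootstraps,
        )
        for run in runs
    ]


if __name__ == "__main__":
    for cmd in commands_for_runs("human_trans", RUNS):
        print(" ".join(cmd))
```

Using the run accession as the output directory mirrors the app settings in this tutorial and keeps the six result folders distinct for sleuth.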

 

Open Kallisto-0.42.3-Qaunt-PE (Apps > Public Apps > Kallisto-0.42.3-Qaunt-PE)

  1. Name your analysis. (Give a name of your preference or keep the default.)
  2. Select "Input"
    1. Select the index file: iplant -> home -> <your_iplant_username> -> analyses -> <your_index_analysis_name> -> "human_trans"
    2. Enter the output directory name: "SRR493366"
    3. Select the FASTQ files: Community Data -> iplantcollaborative -> example_data -> kallisto -> Human -> Reads -> "SRR493366.sra_1.fastq" and "SRR493366.sra_2.fastq"
    4. Set the number of bootstraps to 100.
    5. Click on "Launch Analysis."
    6. Once the analysis is completed (approx. 30 min. with the sample data), click on "Analysis," and then click on the analysis "Name" to open the output folder.
    7. Output:

      After quantification, you will get a number of files in the output directory.

      • run_info.json - high-level information about the run, including the command and the version of kallisto used to generate the output.
      • abundance.tsv - a plain-text file with transcript-level abundance estimates. This file can easily be read into R or any other statistical language (e.g. read.table('abundance.tsv')).
      • abundance.h5 - an HDF5 file containing all of the quantification information, including bootstraps and other auxiliary information from the run. This file is read by sleuth.
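
As a quick check of the results, abundance.tsv can also be inspected with a few lines of Python. The sketch below assumes the file is tab-separated with a header containing target_id, length, eff_length, est_counts, and tpm; verify this against your own output. The embedded EXAMPLE rows are made up for illustration and are not real results.

```python
import csv
import io


def read_abundance(handle):
    """Parse abundance.tsv rows into dicts, converting numeric fields."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = []
    for row in reader:
        rows.append({
            "target_id": row["target_id"],
            "length": int(row["length"]),
            "est_counts": float(row["est_counts"]),
            "tpm": float(row["tpm"]),
        })
    return rows


def top_by_tpm(rows, n=5):
    """Return the n transcripts with the highest TPM."""
    return sorted(rows, key=lambda r: r["tpm"], reverse=True)[:n]


# Made-up rows with hypothetical transcript IDs, used only to show the format.
EXAMPLE = "\n".join([
    "\t".join(["target_id", "length", "eff_length", "est_counts", "tpm"]),
    "\t".join(["tx_a", "1000", "820.5", "10", "2.5"]),
    "\t".join(["tx_b", "2000", "1820.5", "400", "45.0"]),
    "\t".join(["tx_c", "500", "320.5", "0", "0"]),
]) + "\n"

if __name__ == "__main__":
    # For a real run, replace the StringIO with open("SRR493366/abundance.tsv").
    for row in top_by_tpm(read_abundance(io.StringIO(EXAMPLE)), n=2):
        print(row["target_id"], row["tpm"])
```

The bootstrap estimates are stored only in abundance.h5, so this check covers the point estimates and does not replace the sleuth analysis.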