Kallisto Tutorial (TMP)
This is not the official tutorial for these materials and will not be updated. Please see the CyVerse Learning Center for the latest tutorials and materials.
Input Data:
Data taken from the "Cuffdiff2 paper"
"Differential analysis of gene regulation at transcript resolution with RNA-seq" by Cole Trapnell, David G Hendrickson, Martin Sauvageau, Loyal Goff, John L Rinn and Lior Pachter, Nature Biotechnology 31, 46–53 (2013).
The human fibroblast RNA-Seq data for the paper is available on GEO at accession GSE37704. The six samples to be analyzed are LFB_scramble_hiseq_repA, LFB_scramble_hiseq_repB, LFB_scramble_hiseq_repC, LFB_HOXA1KD_hiseq_repA, LFB_HOXA1KD_hiseq_repB, and LFB_HOXA1KD_hiseq_repC. These are three biological replicates in each of two conditions (scramble and HOXA1 knockdown) that will be compared with sleuth.
HOXA1 is a critical regulator of embryonic development and body patterning, and of the maintenance of adult cells. HOXA1 knockdown perturbs the expression of thousands of genes.
run_accession | experiment_accession | spots | condition | sequencer | sample |
---|---|---|---|---|---|
SRR493366 | SRX145662 | 15117833 | scramble | hiseq | A |
SRR493367 | SRX145663 | 17433672 | scramble | hiseq | B |
SRR493368 | SRX145664 | 21830449 | scramble | hiseq | C |
SRR493369 | SRX145665 | 17916102 | HOXA1KD | hiseq | A |
SRR493370 | SRX145666 | 20141813 | HOXA1KD | hiseq | B |
SRR493371 | SRX145667 | 23544153 | HOXA1KD | hiseq | C |
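Because the two conditions will later be compared with sleuth, it can help to keep the run-to-condition mapping in a small design table. The following is a minimal sketch that writes the rows of the table above as tab-separated text; the column names (sample, condition, replicate) and the helper name sample_sheet are illustrative choices, not something this tutorial or sleuth prescribes.

```python
import csv
import io

# Run accessions, conditions and replicate labels copied from the table above.
RUNS = [
    ("SRR493366", "scramble", "A"),
    ("SRR493367", "scramble", "B"),
    ("SRR493368", "scramble", "C"),
    ("SRR493369", "HOXA1KD", "A"),
    ("SRR493370", "HOXA1KD", "B"),
    ("SRR493371", "HOXA1KD", "C"),
]


def sample_sheet(runs):
    """Return a tab-separated design table (hypothetical column names)."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["sample", "condition", "replicate"])
    for run, condition, replicate in runs:
        writer.writerow([run, condition, replicate])
    return buf.getvalue()


print(sample_sheet(RUNS))
```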
Location of data in iPlant data store:
/iplant/home/shared/iplantcollaborative/example_data/kallisto/Human
Kallisto RNA seq analysis
The kallisto pipeline is quite simple. There are basically two steps:
- build an index (once per organism or annotation)
- quantify against the index (once per experimental sample)
You may have noticed that there is no alignment step. This is because part of the kallisto algorithm performs a very fast "alignment" which we call the pseudoalignment. This simplifies things from the user's point of view since there are no extra intermediate files.
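To give a feel for what pseudoalignment means, here is a toy sketch: instead of computing base-by-base alignments, it only asks which transcripts contain every k-mer of a read. This is an illustration of the idea under simplifying assumptions, not kallisto's actual implementation; the sequences, the k value of 5 and all function names are made up for the example.

```python
def kmers(seq, k):
    """Yield every substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_kmer_index(transcripts, k):
    """Map each k-mer to the set of transcript names that contain it."""
    index = {}
    for name, seq in transcripts.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(name)
    return index


def compatible_transcripts(read, index, k):
    """Intersect the transcript sets of all k-mers found in the read."""
    compatible = None
    for kmer in kmers(read, k):
        hits = index.get(kmer, set())
        compatible = hits if compatible is None else compatible & hits
    # Copy so callers cannot modify the sets stored in the index.
    return set(compatible) if compatible else set()


# Made-up toy transcripts that share a prefix and differ at the 3' end.
transcripts = {
    "tx1": "ACGTACGGTCA",
    "tx2": "ACGTACGGAAT",
}
index = build_kmer_index(transcripts, k=5)

print(sorted(compatible_transcripts("ACGTACG", index, k=5)))  # shared region
print(sorted(compatible_transcripts("ACGGTCA", index, k=5)))  # tx1-specific region
```

A read from the shared region is compatible with both toy transcripts, while a read covering the distinguishing end is compatible with only one; the quantification step then has to decide how to apportion the ambiguous reads.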
1. Index-
Before you can quantify with kallisto, you must create an index from an annotation file. For RNA-Seq, an annotation file is the set of cDNA transcripts in FASTA format (the "transcriptome"). If you are not working with a model organism, one option is to do a de novo assembly. Kallisto-index simply takes a FASTA file and outputs an index in a binary format that is designed for kallisto. There is no default extension for the index.
Open Kallisto-0.42.3-INDEX (Apps > Public Apps > Kallisto-0.42.3-INDEX)
- Name your analysis. (Give a name of your preference or keep the default.)
- Select "Input"
- Enter an index name: "human_trans"
- Select the FASTA file: Community Data -> iplantcollaborative -> example_data -> kallisto -> Human -> Index -> "Homo_sapiens.GRCh38.cdna.all.fa"
- Click on "Launch Analysis."
- Once the analysis is completed (approx. 5 min. with the sample data), click on "Analysis," and then click on the analysis "Name" to open the output folder.
- Output- This creates an index at human_trans from the annotation file with the default k-mer size (k = 31). Note that if you have very short reads (e.g. 35 bp), you should change k to something smaller (e.g. -k 21). We will use this index file in the next step of kallisto.
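For readers who want to see what the indexing step looks like on the command line, the sketch below assembles a kallisto index call with the same inputs used above. It assumes the app wraps the standard kallisto index subcommand with its -i (index name) and -k (k-mer size) options; the helper name kallisto_index_command is hypothetical, and the check on k reflects the odd, at-most-31 constraint stated in kallisto's usage text.

```python
import shlex


def kallisto_index_command(fasta, index_name, k=31):
    """Build the argument list for `kallisto index` (hypothetical helper)."""
    if k < 1 or k > 31 or k % 2 == 0:
        raise ValueError("k-mer size should be an odd number no larger than 31")
    cmd = ["kallisto", "index", "-i", index_name]
    if k != 31:
        # Only pass -k when overriding the default, e.g. for very short reads.
        cmd += ["-k", str(k)]
    cmd.append(fasta)
    return cmd


cmd = kallisto_index_command("Homo_sapiens.GRCh38.cdna.all.fa", "human_trans")
print(shlex.join(cmd))

# Variant for very short reads, as suggested in the note above.
print(shlex.join(kallisto_index_command(
    "Homo_sapiens.GRCh38.cdna.all.fa", "human_trans", k=21)))
```

The list form can be passed to subprocess.run on a machine where kallisto is installed; here it is only printed.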
2. Quantification-
Once you index an annotation, you can quantify any number of samples against it. The quantification step includes pseudoalignment as well as running the EM algorithm to do estimation of transcript level abundances.
The parameters are pretty minimal. You must supply an index, an output location, and a set of reads. For single-end data, you must use the Kallisto-0.42.3-Qaunt-SE app and also provide the fragment length distribution of the reads. For paired-end data, use Kallisto-0.42.3-Qaunt-PE; the tool can then infer the fragment length distribution from the data.
There is one other important parameter: the number of bootstrap iterations. By default, kallisto runs zero bootstrap iterations. If you do not plan to run sleuth for differential expression analysis, this is okay. But if you plan to run sleuth, you must provide a nonzero number of bootstraps. In general, this number should be at least 30. In our human RNA-Seq example we will set it to 100.
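As a rough command-line counterpart to the paired-end app, the sketch below builds one kallisto quant call per run with 100 bootstraps. It assumes the app wraps the standard kallisto quant subcommand with -i (index), -o (output directory) and -b (bootstraps), and that every run's read files follow the SRR493366.sra_1.fastq / SRR493366.sra_2.fastq naming used in the example data; the helper name kallisto_quant_command is hypothetical.

```python
import shlex

# Run accessions from the sample table at the top of this tutorial.
RUNS = ["SRR493366", "SRR493367", "SRR493368",
        "SRR493369", "SRR493370", "SRR493371"]


def kallisto_quant_command(index, run, bootstraps=100):
    """Build a paired-end `kallisto quant` argument list (hypothetical helper)."""
    if bootstraps < 0:
        raise ValueError("number of bootstraps cannot be negative")
    # Assumed naming pattern for the paired read files of each run.
    reads = [f"{run}.sra_1.fastq", f"{run}.sra_2.fastq"]
    return ["kallisto", "quant",
            "-i", index,            # index built in the previous step
            "-o", run,              # one output directory per run
            "-b", str(bootstraps),  # nonzero if sleuth will be used
            *reads]


for run in RUNS:
    print(shlex.join(kallisto_quant_command("human_trans", run)))
```

Writing each run to its own output directory keeps the six result sets separate, which is convenient when they are later loaded together for the comparison between conditions.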
Open Kallisto-0.42.3-Qaunt-PE (Apps > Public Apps > Kallisto-0.42.3-Qaunt-PE)
- Name your analysis. (Give a name of your preference or keep the default.)
- Select "Input"
- Enter the index file: iplant -> home -> <your_iplant_username> -> analyses -> <your_index_analysis_name> -> "human_trans"
- Output directory name -> "SRR493366"
- Select the fastq files: Community Data -> iplantcollaborative -> example_data -> kallisto -> Human -> Reads -> "SRR493366.sra_1.fastq" and "SRR493366.sra_2.fastq"
- Set the number of bootstraps to 100.
- Click on "Launch Analysis."
- Once the analysis is completed (approx. 30 min. with the sample data), click on "Analysis," and then click on the analysis "Name" to open the output folder.
- Output:
After quantification, you will get a number of files in the output directory.
- run_info.json - some high-level information about the run, including the command and the version of kallisto used to generate the output
- abundance.tsv - a plain-text file with transcript-level abundance estimates. This file can be read into R or any other statistical language easily (e.g. read.table('abundance.tsv', header = TRUE))
- abundance.h5 - an HDF5 file containing all of the quantification information, including bootstraps and other auxiliary information from the run. This file is read by sleuth.
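As an example of reading the plain-text output outside of R, the sketch below parses text in the abundance.tsv layout with Python's csv module. The column names (target_id, length, eff_length, est_counts, tpm) are the ones kallisto writes in the header line; the three rows are made-up illustrative values, not results from this dataset, and read_abundance is a hypothetical helper name.

```python
import csv
import io

# Made-up rows in the abundance.tsv layout (illustrative values only).
EXAMPLE_TSV = (
    "target_id\tlength\teff_length\test_counts\ttpm\n"
    "tx_a\t1000\t820.5\t41.0\t12.5\n"
    "tx_b\t2000\t1820.5\t0\t0\n"
    "tx_c\t500\t320.5\t8.0\t6.25\n"
)


def read_abundance(handle):
    """Return {target_id: {"est_counts": float, "tpm": float}} from a TSV handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row["target_id"]: {
            "est_counts": float(row["est_counts"]),
            "tpm": float(row["tpm"]),
        }
        for row in reader
    }


abundance = read_abundance(io.StringIO(EXAMPLE_TSV))

# Keep only transcripts with a nonzero abundance estimate.
expressed = [tid for tid, values in abundance.items() if values["tpm"] > 0]
print(expressed)
```

With a real run, the same function can be given an open file instead, for example the abundance.tsv inside the SRR493366 output directory.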