Protocols
Pre-workshop Workflow Overview
At the workshop we will go over all of these steps in detail, including more background information. If you have other protocols and useful reading (remember, we are all training each other!), please keep them handy as we build our knowledge base. For now...
Choose Dataset
The dataset you will assemble your transcriptome from should meet the criteria below. If you are fortunate enough to have a reference genome for your species of interest, try the assembly anyway so that you can compare stats and results.
- Paired-end Illumina reads; minimum 50 bp read length
- Stranded libraries are preferred
- High coverage is preferred...
- With a transcriptome we already know coverage is going to be uneven; however, if you have an estimate of genome size and the read length, you can make a rough guess at coverage given the number of Illumina reads you have. See the section on Coverage (RNA) here: https://genohub.com/next-generation-sequencing-guide/ (Spoiler: a Drosophila-sized genome would need 70-100 Mbp for a good transcriptome assembly.)
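The rough coverage arithmetic described above can be sketched as follows; the read count, read length, and genome size here are hypothetical examples, not values from any workshop dataset:

```python
def estimated_coverage(num_reads, read_length, genome_size, paired=True):
    """Rough sequencing coverage: total bases sequenced / genome size."""
    total_bases = num_reads * read_length * (2 if paired else 1)
    return total_bases / genome_size

# Hypothetical example: 50 million read pairs of 100 bp against a
# Drosophila-sized (~140 Mbp) genome.
cov = estimated_coverage(50_000_000, 100, 140_000_000)
print(f"~{cov:.0f}x coverage")  # ~71x coverage
```

Remember this is only a genome-relative ballpark; actual per-transcript coverage will vary by orders of magnitude with expression level.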
Collect metadata
Since you may not be using your own data, you may not have all of this information; we will work with the dataset providers to collect as much of it as possible. Here is what you need to get started, as well as other information you will ultimately want before you would feel comfortable analyzing and publishing the data.
Minimum
- Species name
- Sequence read length
- Paired/Single end sequence
Additional Metadata
- Collection protocol (RNA-Later, collected into liquid N2, etc.)
- Tissue type
- RNA-prep method (Trizol, Qiagen)
- Amplification method
- RNA-Prep (poly A selection, vs. Ribosomal depletion)
- RNA Quality control (RIN/Bioanalyzer)
- Insert size
- Illumina kit info including primer sets
- Sequencer
- Spike-ins (if used)
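One lightweight way to keep this metadata alongside the dataset is a small JSON file in the shared folder. A minimal sketch; the field names and values below are only suggestions, not a required schema:

```python
import json

# Hypothetical metadata record covering the minimum and additional fields
# listed above; adapt field names to your team's conventions.
metadata = {
    "species": "Apis mellifera",
    "read_length_bp": 100,
    "layout": "paired-end",
    "collection_protocol": "collected into liquid N2",
    "tissue": "whole head",
    "rna_prep": "Trizol, poly-A selection",
    "rin": 8.9,
    "insert_size_bp": 300,
    "sequencer": "Illumina HiSeq 2000",
    "spike_ins": None,
}

print(json.dumps(metadata, indent=2))
```

Dropping a file like this into the shared folder means every team member sees the same provenance information next to the reads.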
Data Upload and Sharing
Set up a Shared Folder
One person designated by the team (usually the person doing the upload) can set up the shared folder:
1. Log in to the Discovery Environment.
2. Create a folder that you will upload the dataset to; make sure your username (home folder) is highlighted and then click File > New Folder.
3. Share the folder with others in your team, and also with Kapeel (username: kapeelc) and Jason (username: williams): highlight the folder you just created and then click Share > Share with Collaborators. Search for your team members (by name or username) and also add Kapeel and Jason. For now, it will be simplest to assign ownership permission to everyone; if you don't feel comfortable with that, select a lower level of permission (e.g., read). Everyone should now see this folder in the Discovery Environment under Shared with me in the data window.
IMPORTANT TIP: When you do an analysis on the data, you can output results to this shared folder so everyone can see them (not everyone on the team needs to replicate every step!).
Put the Discovery Environment window aside for now until step 4.
Upload Data
Note: If your data is already in the iPlant Data Store/Discovery Environment, you can simply move it to the shared folder (drag and drop). Copying data (which is usually discouraged) is available through iCommands.
1. Download the latest version of iDrop (http://www.iplantcollaborative.org/learning-center/get-started/import-data) and follow the install instructions on this page (https://pods.iplantcollaborative.org/wiki/display/DS/Using+iDrop+Desktop). Using the installed iDrop program, drag the files to be uploaded into the shared folder you created above.
Optional: If you are comfortable with the command line, please download iCommands (http://www.iplantcollaborative.org/learning-center/get-started/import-data) and follow the instructions for setup on this page (https://pods.iplantcollaborative.org/wiki/display/DS/Using+iCommands). Use the iput command along with the -PVT option flags to see progress and reset your connection socket. Use the -r option for recursive upload of directories/folders.
Need help? Ask your teammates, as we have tried to balance the teams so that some members have experience using iPlant. You can of course email Jason and Kapeel.
Tip: All test datasets are (or will be) placed in
Community Data > iplant_training > ars_workshop > test_dataset
Please use these data for your work.
Quality control Reads
NOTE: If you haven't already done so, you may want to unzip/decompress your reads. See the apps in the Discovery Environment: Public Apps >> General Utilities >> Compress and Decompress
1. In the Discovery Environment use: Apps > FastQC 0.10.0 (multi-file)
2. Name your analysis and set your output to the shared folder
3. Add your files and then click Launch Analysis
See documentation here: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Check the Analysis panel in the Discovery Environment for the status. You will usually get emails when jobs complete (unless you have turned that function off in preferences). You may also need to refresh this window from time to time. This job should take 10-20 minutes depending on how many files/how much data. Once you have the results, you can confer with your team on how to trim and filter in the next step.
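FastQC's per-base quality plots are built from the Phred scores encoded in each FASTQ quality line. A minimal sketch of that decoding, assuming the Phred+33 (Sanger/Illumina 1.8+) encoding; check your FastQC report to confirm which encoding your data uses:

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]

def mean_quality(quality_line):
    """Mean Phred score of one read's quality string."""
    scores = phred_scores(quality_line)
    return sum(scores) / len(scores)

# Under the +33 offset, 'I' encodes Phred 40 and '#' encodes Phred 2.
print(phred_scores("II##"))   # [40, 40, 2, 2]
print(mean_quality("IIII"))   # 40.0
```

Seeing how a single quality string decodes makes the trimming thresholds in the next step (e.g., "trim bases below Q20") concrete.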
Trim/Filter Reads
How you trim and filter here is entirely up to you. We can't give many specifics without seeing the data, so working with the team will be valuable. Here is how to use the FASTX tools in the DE.
IMPORTANT: It's usually the case that the sequencing facility has already trimmed adapter sequences. A negligible fraction (<0.5%) of sequences may still not be fully trimmed and will show up in your FastQC report as overrepresented. If trimming has not been done, you will need to clip the sequences before any filtering; you can use Scythe or other apps to do this.
1. In the Discovery Environment use: Apps > FASTX (there are several Apps, e.g.: Clipper, quality filter, quality trimmer, trimmer).
- What Apps and parameters you will use will vary according to the sequence quality
2. For each tool you use, name your analysis and set your output to the shared folder.
See documentation here: http://hannonlab.cshl.edu/fastx_toolkit/
Optional: if you are filtering reads (i.e., dropping reads that do not meet a minimum quality score), you will need to re-pair the reads using the app rePair-Fix Read Pairing.
In the Discovery Environment use: Apps > rePair-Fix Read Pairing
- Select both sets of fastq files (e.g., left (_1) and right (_2) reads)
- For prefix, use the same prefix/name that was used in the original fastq files (e.g. "honeybee_control_replicate1")
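Conceptually, re-pairing keeps only the reads whose mates also survived filtering, matched by read ID. A toy sketch of the idea on in-memory FASTQ records (the rePair app's actual implementation and I/O handling may differ):

```python
def read_fastq(lines):
    """Yield (read_id, record_lines) from FASTQ text, 4 lines per record."""
    for i in range(0, len(lines), 4):
        rec = lines[i:i + 4]
        # Strip the leading '@' and any /1 or /2 pair suffix from the header.
        read_id = rec[0][1:].split()[0].split("/")[0]
        yield read_id, rec

def repair(left_lines, right_lines):
    """Keep only read pairs whose IDs survive filtering in both files."""
    left = dict(read_fastq(left_lines))
    right = dict(read_fastq(right_lines))
    shared = [rid for rid in left if rid in right]  # preserves left order
    return ([line for rid in shared for line in left[rid]],
            [line for rid in shared for line in right[rid]])

left = ["@r1/1", "ACGT", "+", "IIII", "@r2/1", "ACGT", "+", "IIII"]
right = ["@r2/2", "TTTT", "+", "IIII"]  # r1's mate was filtered out
new_left, new_right = repair(left, right)
print(new_left[0], new_right[0])  # @r2/1 @r2/2
```

Here r1 is dropped from the left file because its mate did not pass filtering, which is exactly the orphan-removal the assembler needs before it can trust read pairing.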
Normalize reads
If more than 200 million PE sequences are to be assembled, the user may consider performing an in silico normalization of the sequencing reads. Normalization improves run-time by reducing the total number of reads, while largely maintaining transcriptome complexity and capability for full-length transcript reconstruction.
1. In the Discovery Environment use: Apps > Trinity normalize by k-mer coverage r2013-08-14
- Choose the appropriate k-mer value (recommended 25-31) and coverage value (50% of read length + 1)
- Click on launch to submit the job
See documentation here: http://trinityrnaseq.sourceforge.net/trinity_insilico_normalization.html
2. Your output file names will include the k-mer and coverage values chosen. Use these normalized reads in the assembly steps below.
Optional: if you are filtering reads (i.e., dropping reads that do not meet a minimum quality score), you will need to re-pair the reads using the app rePair-Fix Read Pairing.
In the Discovery Environment use: Apps > rePair-Fix Read Pairing
- Select both sets of fastq files (e.g., left (_1) and right (_2) reads)
- For prefix, use the same prefix/name that was used in the original fastq files
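The core idea behind k-mer coverage normalization: estimate each read's coverage from the counts of its k-mers in a global k-mer table, and discard reads whose median k-mer coverage already exceeds the target. A toy single-pass sketch of that idea (Trinity's actual implementation differs in detail, e.g., it builds the full k-mer catalog first):

```python
from collections import Counter
from statistics import median

def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def normalize(reads, k=5, max_cov=2):
    """Keep a read only if its median k-mer count so far is below max_cov."""
    counts = Counter()
    kept = []
    for read in reads:
        ks = kmers(read, k)
        if median(counts[km] for km in ks) < max_cov:
            kept.append(read)
        counts.update(ks)
    return kept

reads = ["ACGTACGTAC"] * 5 + ["TTTTTGGGGG"]
print(normalize(reads))  # ['ACGTACGTAC', 'TTTTTGGGGG']
```

Note how the five identical high-coverage reads collapse to one while the rare read is kept untouched; that is why normalization cuts run time without losing much transcriptome complexity.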
Assemble transcriptome with Trinity
Trinity, developed at the Broad Institute and the Hebrew University of Jerusalem, represents a novel method for the efficient and robust de novo reconstruction of transcriptomes from RNA-seq data. Trinity combines three independent software modules: Inchworm, Chrysalis, and Butterfly, applied sequentially to process large volumes of RNA-seq reads. Trinity partitions the sequence data into many individual de Bruijn graphs, each representing the transcriptional complexity at a given gene or locus, and then processes each graph independently to extract full-length splicing isoforms and to tease apart transcripts derived from paralogous genes. (Documentation: http://trinityrnaseq.sourceforge.net/)
- Open Trinity r2013-08-14: Apps > Public Apps > Assemblers > Trinity r2013-08-14
- Name the analysis, add any comments, and output to your team's shared folder.
- Examine the readme tab, and then click on the "Inputs" tab to continue
- Provide the read1 and read2 input files along with type of file format and strand specific library type (Choose the data from your team's shared folder. Alternatively, our sample data are being provided in Community Data > iplant_training > ars_workshop >test_dataset)
- Click on the "Options" tab.
- Set the 'Min count for Kmers' option under the Inchworm options tab to 2 if your dataset has 200M PE reads or more.
- Optional: keep all other options at their defaults
- Click on "Launch Analysis."
- Once the analysis is completed, click on "Analyses," and then click on the analysis "Name" to open the output folder.
- Examine the contents of the results folder.
- In the output folder you should see a file named Trinity.fasta (the assembled transcriptome) and trinity_out.tgz (the entire output directory from the Trinity run, including intermediary files)
- Monitor the analysis progress - open 'Analyses' from the DE workspace and review 'Status' (e.g., Idle, Submitted, Pending, Running, Completed, Failed). Once launched, an analysis will continue whether the user remains logged in or not. Email notifications update on the analysis progress; they can be switched off under 'Preferences'.
- A description of what the expected outputs are/should be is here: http://trinityrnaseq.sourceforge.net/#Downstream_analyses (See 'Output of Trinity')
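Once Trinity.fasta is in the shared folder, you can compare assemblies (e.g., Trinity vs. SOAPdenovo-Trans, or different k-mer choices) with simple summary statistics. A minimal sketch of contig count, total length, and N50; the lengths below are hypothetical, not real output:

```python
def n50(lengths):
    """Smallest length L such that contigs >= L cover half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length

# Hypothetical transcript lengths parsed from an assembly FASTA.
lengths = [1000, 800, 500, 300, 200]
print("contigs:", len(lengths))     # contigs: 5
print("total bp:", sum(lengths))    # total bp: 2800
print("N50:", n50(lengths))         # N50: 800
```

Keep in mind that for transcriptomes N50 is a cruder metric than for genomes, since isoform redundancy inflates it; use it for comparison between runs, not as an absolute quality score.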
Assemble Transcriptome with SOAPdenovo-Trans
- Log into the Discovery Environment.
- Under Apps in the DE work space, click on Public Apps > NGS > Assemblers > Soapdenovo-Trans-1.0.
- Name the analysis, add any comments, and output to your team's shared folder.
- Click on the General Settings tab.
- Set 'maximum read size' according to the read length of your data.
- Set 'kmer range' to 63mer maximum, the initial 'kmer size' to an odd integer (many people start with 1/2 the read length - 1), and set the 'Output file Prefix' to soaptrans# where # is equal to your kmer choice.
- Click on the Paired Reads 1 tab.
- Set 'library1 insert size' to the insert size for your reads (this is often around 300-400 bp)
- Set 'library1 pair orientation' to Normal (fr) orientation, 'library1 steps' to 3, and set the 'library 1 rank for scaffolding' to 1.
- Click in the 'library1 seq1' field. Browse to the folder that holds the sequence data and select the FASTA or FASTQ file that contains the forward reads - usually labeled _1. (Choose the data from your team's shared folder. Alternatively, our sample data are provided in Community Data > iplant_training > ars_workshop > test_dataset.)
- Click in the 'library1 seq2' field. Browse to the folder that holds the sequence data and select the FASTA or FASTQ file that contains the reverse reads - usually labeled _2. (Choose the data from your team's shared folder. Alternatively, our sample data are provided in Community Data > iplant_training > ars_workshop > test_dataset.)
- Click on Launch Analysis to start the analysis.
- Monitor the analysis progress - open 'Analyses' from the DE workspace and review 'Status' (e.g., Idle, Submitted, Pending, Running, Completed, Failed). Once launched, an analysis will continue whether the user remains logged in or not. Email notifications update on the analysis progress; they can be switched off under 'Preferences'.
- A description of what the expected outputs are/should be is here: http://www.iplantcollaborative.org/learning-center/discovery-environment/assemble-transcripts (bottom of page)
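Behind the form fields above, SOAPdenovo-Trans reads a plain-text configuration file. The GUI settings map roughly to entries like the following; the file names and values here are hypothetical, and the annotations after each entry are notes for this document, not part of the file format:

```
max_rd_len=100        (maximum read size)
[LIB]
avg_ins=300           (library1 insert size)
reverse_seq=0         (Normal (fr) orientation)
asm_flags=3           (use reads for both contig and scaffold assembly)
rank=1                (library1 rank for scaffolding)
q1=reads_1.fastq      (forward reads, FASTQ)
q2=reads_2.fastq      (reverse reads, FASTQ)
```

Seeing this mapping helps when reading the SOAPdenovo-Trans documentation or debugging a failed run, since the tool's own error messages refer to these config keys rather than to the DE form labels.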