DNA Subway Followup Webinar 10092015
Collecting Metadata
Eventually, you will be submitting your sequence data. Think about some of the metadata you will need, and review the iPlant protocol if you will be using the DE to submit your data: NCBI Sequence Read Archive (SRA) Submission (Workflow Tutorial)
Since you may not be using your own data, you may not have all of this information. We will work with the dataset providers to collect as much of it as possible. Here is what you need to get started, along with other information you will ultimately want before you feel comfortable analyzing and publishing the data.
Minimum
- Species name
- Sequence read length
- Paired/Single end sequence
Additional Metadata
- Collection protocol (RNA-Later, collected into liquid N2, etc.)
- Tissue type
- RNA-prep method (Trizol, Qiagen)
- Amplification method
- RNA-Prep (poly A selection, vs. Ribosomal depletion)
- RNA Quality control (RIN/Bioanalyzer)
- Insert size
- Illumina kit info including primer sets
- Sequencer
- Spike-ins (if used)
Data Upload and Sharing
Set Up a Shared Folder
One person who is designated by the team (usually the person doing the upload) can set up the shared folder:
1. Log in to the Discovery Environment.
2. Create a folder that you will upload the dataset to; make sure your username (home folder) is highlighted and then click File > New Folder.
3. Share the folder with others on your team, and also with Kapeel (username: kapeelc) and Jason (username: williams). Highlight the folder you just created and then click Share > Share with Collaborators. Search for your team members (by name or username) and also add Kapeel and Jason. For now, it will be simplest to assign ownership permission to everyone; if you don't feel comfortable with that, select a lower level of permission (e.g. read). Everyone should now see this folder in the Discovery Environment under Shared with me in the data window.
IMPORTANT TIP: When you do an analysis on the data, since not everyone on the team needs to replicate every step, you can output results to this shared folder so everyone can see them!
Put the Discovery Environment window aside for now until step 4.
Upload Data
Note: If your data is already in the iPlant Data Store/Discovery Environment, you can simply move it to the shared folder (drag and drop). Copying data (which is usually discouraged) is available through iCommands.
1. Download the latest version of iDrop (http://www.iplantcollaborative.org/learning-center/get-started/import-data) and follow the install instructions on this page (https://pods.iplantcollaborative.org/wiki/display/DS/Using+iDrop+Desktop). Using the installed iDrop program, drag the files to be uploaded into the folder you created in step 3.
Optional: If you are comfortable with the command line, please download iCommands (http://www.iplantcollaborative.org/learning-center/get-started/import-data) and follow the instructions for setup on this page (https://pods.iplantcollaborative.org/wiki/display/DS/Using+iCommands). Use the iput command along with the -PVT option flags to see progress and reset your connection socket. Use the -r option for recursive upload of directories/folders.
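The iput upload described above can be sketched as follows. This is a dry run that only prints the command; the source folder and iPlant destination path are placeholders you must replace with your own.

```shell
# Hypothetical paths -- replace with your own local folder and iPlant destination.
SRC_DIR="my_reads"
DEST="/iplant/home/yourusername/shared_rnaseq"
# -P shows progress, -V is verbose, -T renews the connection socket, -r uploads recursively.
CMD="iput -PVT -r $SRC_DIR $DEST"
echo "$CMD"   # dry run: prints the command; run it directly (after iinit) to upload
```

Run `iinit` once first to authenticate against the iPlant Data Store before issuing `iput`.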
Need help? Ask your teammates, as we have tried to balance the teams so that some members have experience using iPlant. You can of course email Jason and Kapeel.
Tip: All test datasets are (or will be) placed in
/iplant/home/shared/iplant_DNA_subway
Please use these data for your work.
- /iplant/home/shared/iplant_DNA_subway/sample_data/fastq/zea_mays/paired_end
Quality Control Reads
NOTE: If you haven't already done so, you may want to unzip/decompress your reads. See the apps in the Discovery Environment: Public Apps >> General Utilities >> Compress and Decompress
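The decompression step can also be done from the command line with standard gzip tools; this sketch builds a tiny stand-in FASTQ file just to illustrate the round trip.

```shell
# Sketch: decompress gzipped FASTQ files from the command line
# (the DE's Compress and Decompress app does the same job).
printf '@read1\nACGT\n+\nIIII\n' > example.fastq   # tiny stand-in FASTQ file
gzip example.fastq                                  # produces example.fastq.gz
gunzip example.fastq.gz                             # restores example.fastq
head -n 1 example.fastq                             # first read header
```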
1. In the Discovery Environment use: Apps > FASTQC 0.10.0 (multi-file)
2. Name your analysis and set your output to the shared folder
3. Add your files and then click Launch Analysis
See documentation here: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Check the Analysis panel in the Discovery Environment for the status. You will usually get emails when jobs complete (unless you have turned that function off in preferences). You may also need to refresh this window from time to time. This job should take 10-20 minutes depending on how many files/how much data. Once you have the results, you can confer with your team on how to trim and filter in the next step.
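If you prefer the command line to the DE app, the same FastQC run can be sketched as a dry run; the read file names and output folder below are placeholders.

```shell
# Hypothetical input files and output folder -- substitute your own.
READS_1="honeybee_control_replicate1_1.fastq"
READS_2="honeybee_control_replicate1_2.fastq"
OUT_DIR="fastqc_out"
# -o sends the HTML/zip reports to the named (pre-existing) output directory.
CMD="fastqc -o $OUT_DIR $READS_1 $READS_2"
echo "$CMD"   # dry run: prints the command instead of executing it
```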
Trim/Filter Reads
How you trim and filter here is entirely up to you. We can't give many specifics without seeing the data, so working with the team will be valuable. Here is how to use the FASTX tools in the DE.
IMPORTANT: It is usually the case that the sequencing facility has already trimmed adapter sequences. A negligible fraction (<0.5%) of sequences may still not be fully trimmed and will show up in your FASTQC report as overrepresented. If trimming has not been done, you will need to clip the sequences before any filtering. You can use Scythe or other apps to do this.
Either use FASTX and the rePair App OR Trimmomatic. If you use Trimmomatic you do not need to do FASTX/rePair.
For FASTX
1. In the Discovery Environment use: Apps > FASTX (there are several Apps, e.g.: Clipper, quality filter, quality trimmer, trimmer).
- What Apps and parameters you will use will vary according to the sequence quality
2. For each tool you use, name your analysis and set your output to the shared folder.
See documentation here: http://hannonlab.cshl.edu/fastx_toolkit/
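As one example of a FASTX tool invocation, quality trimming can be sketched as a dry run; the cutoffs below are placeholders you should tune to your own FASTQC report.

```shell
# Hypothetical file name and quality cutoffs -- tune to your FASTQC report.
IN="reads_1.fastq"
# fastq_quality_trimmer: -t is the quality threshold, -l the minimum read length to keep,
# -Q 33 declares Sanger/Illumina 1.8+ quality encoding.
CMD="fastq_quality_trimmer -Q 33 -t 20 -l 30 -i $IN -o reads_1.trimmed.fastq"
echo "$CMD"   # dry run: prints the command instead of executing it
```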
If you are filtering reads (i.e. dropping reads that do not meet a minimum quality score), you will need to re-pair the reads with the rePair app. This is like the 'hidden' step in DNA Subway where read pairing is re-established.
In the Discovery Environment use: Apps > rePair-Fix Read Pairing
- Select both sets of fastq files (e.g. left (_1) and right (_2) reads)
- For prefix, use the same prefix/name that was used in the original fastq files (e.g. "honeybee_control_replicate1")
For Trimmomatic
- Use the app HTProcess_trimmomatic_0.32 (click the info button to verify you are using the copy installed in Feb 2015). See the Trimmomatic Manual.
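Outside the DE, a paired-end Trimmomatic run can be sketched as a dry run; the jar path, file names, and trimming parameters below are placeholders.

```shell
# Hypothetical jar path, file names, and parameters -- adjust to your data.
# PE mode takes two inputs and produces paired + unpaired output for each side.
CMD="java -jar trimmomatic-0.32.jar PE -phred33 reads_1.fastq reads_2.fastq reads_1.paired.fastq reads_1.unpaired.fastq reads_2.paired.fastq reads_2.unpaired.fastq SLIDINGWINDOW:4:20 MINLEN:36"
echo "$CMD"   # dry run: prints the command instead of executing it
```

SLIDINGWINDOW:4:20 drops the read tail once a 4-base window falls below average quality 20, and MINLEN:36 discards reads shorter than 36 bases afterward. Note that Trimmomatic keeps pairing intact, which is why the rePair step is unnecessary on this path.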
Align RNA-seq reads using Tophat
- Tophat is a specialized alignment software for RNA-seq reads that is aware of splice junctions when aligning to a reference assembly. It can be found by clicking on the Apps icon. The most recent App for Tophat will accept multiple fastq files.
- Navigate to the Apps window by clicking on the icon. Look for TopHat2-PE (TopHat2 for Illumina paired-end reads) in the 'Public Applications' category under NGS->transcriptome_profiling->Tuxedo RNA-Seq 2. Click on the name of the App to open it.
- In the Input Data section, you can click 'Add' at the top right of the FASTQ File(s) box to navigate to the folder containing the FASTQ files:
- In the Ref. Genome section, use the drop-down menu to select your reference genome.
- In the Ref. Annotations section, use the drop-down menu to select reference gene annotations.
- Start the analysis by clicking the Launch Analysis button, naming the analysis 'tophat' in the dialog box.
If your reference genome/annotation is not in the list look for it in:
/iplant/home/shared/iplant_DNA_subway/genomes
The .fas file will be your genome sequence
The .gtf file will be your annotation
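For teams running TopHat outside the DE, the equivalent invocation can be sketched as a dry run; the index prefix and file names below are hypothetical, and the Bowtie2 index must already have been built from the .fas genome (e.g. with bowtie2-build).

```shell
# Hypothetical index prefix and file names -- the bowtie2 index ("zea_mays_index")
# must already exist; -G supplies the .gtf annotation, -o names the output folder.
CMD="tophat2 -G annotation.gtf -o tophat_out zea_mays_index reads_1.fastq reads_2.fastq"
echo "$CMD"   # dry run: prints the command instead of executing it
```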
When the analysis is complete, navigate to the results under your 'analyses' folder. The principal outputs from TopHat are BAM files; each set of alignments is in its own folder, named after the original FASTQ file. This is the most time-consuming step of this training module.
Assemble transcripts with Cufflinks
Cufflinks assembles RNA-Seq alignments into a parsimonious set of transcripts, then estimates the relative abundances of these transcripts based on how many reads support each one, taking into account biases in library preparation protocols. A detailed manual can be found at http://cole-trapnell-lab.github.io/cufflinks/manual/. We will be using a reference annotation file for Arabidopsis thaliana to perform reference annotation based assembly (RABT), which facilitates discovery of known and novel transcripts. For further discussion, see http://cole-trapnell-lab.github.io/cufflinks/papers/
- Click on the Apps icon and find Cufflinks2. Open it.
- In the 'Select Input Data' section, open a separate data window for the 'bam' folder from tophat_out (above) and select and drag all four bam files into the input box. For convenience, a batch of TopHat bam files can be analyzed together, but these files can also be processed in independent Cufflinks runs.
- In the Reference Sequence section, select the reference genome used for the TopHat step.
The outputs of Cufflinks are GTF and FPKM files. For example, look in one of the four output folders:
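The same Cufflinks run can be sketched from the command line as a dry run; file names below are hypothetical, and -g supplies the reference annotation that enables RABT assembly.

```shell
# Hypothetical file names; -g gives the guide annotation (RABT), -o the output folder,
# and accepted_hits.bam is a TopHat alignment file.
CMD="cufflinks -g annotation.gtf -o cufflinks_out accepted_hits.bam"
echo "$CMD"   # dry run: prints the command instead of executing it
```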
Examine the GTF file, transcripts.gtf. This file contains annotated transcripts assembled by cufflinks, using the annotated transcripts selected from the pulldown menu in the Cufflinks app as a guide.
1 Cufflinks transcript 3631 5899 1 + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
1 Cufflinks exon 3631 3913 1 + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
1 Cufflinks exon 3996 4276 1 + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "2"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
1 Cufflinks exon 4486 4605 1 + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "3"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
1 Cufflinks exon 4706 5095 1 + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "4"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
1 Cufflinks exon 5174 5326 1 + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "5"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
1 Cufflinks exon 5439 5899 1 + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "6"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
Examine the FPKM file, genes.fpkm_tracking. This file gives the normalized abundance of each transcript, expressed in fragments per kilobase of exon per million fragments mapped (FPKM), with upper and lower 95% confidence intervals.
tracking_id class_code nearest_ref_id gene_id gene_short_name tss_id locus length coverage FPKM FPKM_conf_lo FPKM_conf_hi FPKM_status
AT1G01010 - - AT1G01010 ANAC001 - 1:3630-5899 - - 0 0 0 OK
AT1G01020 - - AT1G01020 ARV1 - 1:5927-8737 - - 6.22574 1.51952 10.9319 OK
AT1G01030 - - AT1G01030 NGA3 - 1:11648-13714 - - 0.921783 0 2.22538 OK
AT1G01040 - - AT1G01040 DCL1 - 1:23145-31227 - - 0.129701 0 0.389104 OK
AT1G01050 - - AT1G01050 AtPPa1 - 1:31169-33153 - - 6.08883 1.11732 11.0603 OK
AT1G01046 - - AT1G01046 MIR838A - 1:28499-28706 - - 0 0 0 OK
AT1G01070 - - AT1G01070 AT1G01070 - 1:38751-40944 - - 1.92728 0 4.75919 OK
AT1G01073 - - AT1G01073 AT1G01073 - 1:44676-44787 - - 0 0 0 OK
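Values in genes.fpkm_tracking can be pulled out with standard text tools. This sketch builds a tiny tab-delimited stand-in for the real file (which is also tab-delimited) and extracts the tracking_id and FPKM columns.

```shell
# Sketch: extract gene IDs and FPKM values from genes.fpkm_tracking.
# A two-line stand-in file is built here for illustration; values match the example above.
printf 'tracking_id\tclass_code\tnearest_ref_id\tgene_id\tgene_short_name\ttss_id\tlocus\tlength\tcoverage\tFPKM\tFPKM_conf_lo\tFPKM_conf_hi\tFPKM_status\n' > genes.fpkm_tracking
printf 'AT1G01020\t-\t-\tAT1G01020\tARV1\t-\t1:5927-8737\t-\t-\t6.22574\t1.51952\t10.9319\tOK\n' >> genes.fpkm_tracking
# tracking_id is column 1, FPKM is column 10 (see the header above); skip the header row.
awk -F'\t' 'NR > 1 { print $1, $10 }' genes.fpkm_tracking
```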
Merge all Cufflinks transcripts into a single transcriptome annotation file using Cuffmerge
Cuffmerge merges together several Cufflinks assemblies. It also handles running Cuffcompare. The main purpose of this application is to make it easier to make an assembly GTF file suitable for use with Cuffdiff. A merged, empirical annotation file will be more accurate than using the standard reference annotation, as the expression of rare or novel genes and alternative splicing isoforms seen in this experiment will be better reflected in the empirical transcriptome assemblies. For an informative discussion on this topic, see http://seqanswers.com/forums/showthread.php?t=16422.
- Open the Cuffmerge2 app. Under 'Input Data', browse to the results of the cufflinks analyses (cufflinks_out). Select the four GTF files in cufflinks_out/gtf as input for Cuffmerge2.
- Under Reference Data, select the reference genome and annotations used in the tophat step.
- Run the App.
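On the command line, Cuffmerge takes a text file listing the Cufflinks GTF assemblies. This dry-run sketch uses hypothetical paths for the assembly list, annotation, and genome.

```shell
# assembly_list.txt holds one transcripts.gtf path per line (hypothetical paths).
printf 'cufflinks_out1/transcripts.gtf\ncufflinks_out2/transcripts.gtf\n' > assembly_list.txt
# -g is the reference annotation, -s the reference genome FASTA, -o the output folder.
CMD="cuffmerge -g annotation.gtf -s genome.fas -o merged_asm assembly_list.txt"
echo "$CMD"   # dry run: prints the command instead of executing it
```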
Compare expression using Cuffdiff
Cuffdiff is a program that uses the Cufflinks transcript quantification engine to calculate gene and transcript expression levels in more than one condition and test them for significant differences. Comprehensive documentation of this tool is given through the Nature Protocols paper linked at http://cole-trapnell-lab.github.io/cufflinks/cuffdiff/index.html.
- Open a separate data window for the bam files generated by tophat (above).
- In Apps select CuffDiff2
- Enter the name of your first Sample (WT).
- Add BAM files for all replicates from this sample, either from your tophat output directory or the pre-staged data. Ignore the BAM index (.bai) files. Cuffdiff merges the replicate data and uses comparisons between replicates as part of calculating the test statistic.
- Repeat the previous two steps for the second sample
- Next, open the Reference Data section and add a custom reference annotation file using the "merged_with_ref_ids.gtf" file from the cuffmerge output folder. (This step is optional; if you skipped Cufflinks you can omit it.)
- Launch the analysis and examine the output files.
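The steps above map onto a single Cuffdiff invocation, sketched here as a dry run. The sample labels match this tutorial (WT and hy5); the BAM file names are hypothetical, and replicates are comma-separated within a condition while conditions are space-separated.

```shell
# Hypothetical BAM paths; -L names the conditions, the GTF is the cuffmerge output,
# replicates for one condition are joined with commas.
CMD="cuffdiff -o cuffdiff_out -L WT,hy5 merged_with_ref_ids.gtf WT_rep1.bam,WT_rep2.bam hy5_rep1.bam,hy5_rep2.bam"
echo "$CMD"   # dry run: prints the command instead of executing it
```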
After a successful run you will find many output files under cuffdiff_out. We will focus on four files related to differential gene expression at the gene and transcript levels:
- genes.fpkm_tracking Transcript abundance at the gene level in units of fragments per Kb of exon per million mapped reads (FPKM)
- gene_exp.diff Results of differential expression testing between the two samples, WT and hy5, at the gene level
- isoforms.fpkm_tracking Transcript abundance at the transcript isoform level in units of fragments per Kb of exon per million mapped reads (FPKM)
- isoform_exp.diff Results of differential expression testing between the two samples, WT and hy5, at the transcript isoform level
Explanation of Outputs:
gene_exp.diff contains gene FPKM values for the WT and hy5 samples and log2(fold change) values.
Statistical significance is given by the p_value and q_value, and a determination of whether there is significant differential expression is given in the last column (with values of 'yes' or 'no'). The q-value is a p-value corrected for multiple tests based on the user-supplied false discovery rate (FDR: default 0.05).
test_id gene_id gene locus sample_1 sample_2 status value_1 value_2 log2(fold_change) test_stat p_value q_value significant
AT1G01010 AT1G01010 ANAC001 1:3630-5899 WT hy5 NOTEST 0 1.68901 1.79769e+308 1.79769e+308 0.0786496 1 no
AT1G01020 AT1G01020 ARV1 1:5927-8737 WT hy5 OK 7.00799 8.85126 0.336881 -0.405048 0.685442 0.816558 no
AT1G01030 AT1G01030 NGA3 1:11648-13714 WT hy5 NOTEST 1.44458 1.10952 -0.380717 0.22392 0.822819 1 no
AT1G01040 AT1G01040 DCL1 1:23145-33153 WT hy5 NOTEST 0.312212 3.04201 3.28442 -2.31653 0.0205292 1 no
AT1G01046 AT1G01046 MIR838A 1:23145-33153 WT hy5 NOTEST 0 0 0 0 1 1 no
AT1G01050 AT1G01050 AtPPa1 1:23145-33153 WT hy5 OK 12.4652 24.0248 0.946618 -1.48078 0.138664 0.34761 no
AT1G01060 AT1G01060 LHY 1:33378-37871 WT hy5 OK 5.2891 8.01668 0.599983 -0.961352 0.336375 0.55581 no
AT1G01070 AT1G01070 AT1G01070 1:38751-40944 WT hy5 NOTEST 1.99104 3.22537 0.695944 -0.355638 0.722111 1 no
AT1G01073 AT1G01073 AT1G01073 1:44676-44787 WT hy5 NOTEST 0 0 0 0 1 1 no
AT1G01080 AT1G01080 AT1G01080 1:45295-47019 WT hy5 OK 15.4469 20.5669 0.413009 -0.750314 0.453065 0.647742 no
AT1G01090 AT1G01090 PDH-E1 ALPHA 1:47484-49286 WT hy5 OK 71.0777 41.2999 -0.783261 2.9202 0.00349811 0.0284027 yes
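A common follow-up is to pull out just the rows Cuffdiff marked significant. This sketch builds a small tab-delimited stand-in for gene_exp.diff (the real file is also tab-delimited) and keeps the header plus the rows whose last column, 'significant' (column 14), is 'yes'.

```shell
# Sketch: keep only genes Cuffdiff flagged significant in gene_exp.diff.
# A three-line stand-in file is built here for illustration; values match the table above.
printf 'test_id\tgene_id\tgene\tlocus\tsample_1\tsample_2\tstatus\tvalue_1\tvalue_2\tlog2(fold_change)\ttest_stat\tp_value\tq_value\tsignificant\n' > gene_exp.diff
printf 'AT1G01020\tAT1G01020\tARV1\t1:5927-8737\tWT\thy5\tOK\t7.00799\t8.85126\t0.336881\t-0.405048\t0.685442\t0.816558\tno\n' >> gene_exp.diff
printf 'AT1G01090\tAT1G01090\tPDH-E1 ALPHA\t1:47484-49286\tWT\thy5\tOK\t71.0777\t41.2999\t-0.783261\t2.9202\t0.00349811\t0.0284027\tyes\n' >> gene_exp.diff
# Keep the header (row 1) and any row whose 'significant' column (14) is 'yes'.
awk -F'\t' 'NR == 1 || $14 == "yes"' gene_exp.diff > significant_genes.diff
wc -l < significant_genes.diff   # header + significant rows
```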