DNA Subway Followup Webinar 10092015

DNA Subway Followup Webinar 10092015

 

Collecting Metadata

Eventually, you will be submitting your sequence data. Think about some of the metadata you will need, and review the iPlant protocol if you will be using the DE to submit your data: NCBI Sequence Read Archive (SRA) Submission (Workflow Tutorial)

Since you may not be using your own data, you may not have all of this information. We will work with the dataset providers to get as much of it collected. Here is what you need to get started, and other things that you will ultimately want before you would feel comfortable analyzing and publishing on the data.

Minimum

  • Species name
  • Sequence read length
  • Paired/Single end sequence

Additional Metadata

  • Collection protocol (RNA-Later, collected into liquid N2, etc.)
  • Tissue type
  • RNA-prep method (Trizol, Qiagen)
  • Amplification method
  • RNA-Prep (poly A selection, vs. Ribosomal depletion)
  • RNA Quality control (RIN/Bioanalyzer)
  • Insert size
  • Illumina kit info including primer sets
  • Sequencer
  • Spike-ins (if used)

Data Upload and Sharing

Setup a Shared folder

One person who is designated by the team (usually the person doing the upload) can setup the shared folder:

1. Log in to the Discovery Environment.

2. Create a folder that you will upload the dataset to; make sure your username (home folder) is highlighted and then click File > New Folder.

3. Share the folder with others in your team, and also with Kapeel (username: kapeelc) and Jason (username: williams); Highlight the folder you just created and then Click Share > Share with Collaborators.  Search for your team members (by name or username) and also add Kapeel and Jason. For now, it will be simplest to assign ownership permission to everyone. If you don't feel comfortable with that, then select a lower level of permission (e.g. read). Everyone should now see this folder in the Discovery Environment under Shared with me in the data window.

IMPORTANT TIP: When you do an analysis on the data (since not everyone on the team needs to do replicate every step, you can output results to this shared folder so everyone can see!

Put the Discovery Environment window aside for now until step 4.

Upload Data

Note: If your data is already in iPlant Data Store/Discovery Environment, you can simply move \it to the shared folder (drag and drop). Copying data (which is ususaly discouraged is available through iCommands).

1. Download the latest version of iDrop (http://www.iplantcollaborative.org/learning-center/get-started/import-data) and follow the install instructions on this page (https://pods.iplantcollaborative.org/wiki/display/DS/Using+iDrop+Desktop). Using the installed iDrop program, drag the files to be uploaded into the folder you created in step 3.

Optional: If you are comfortable with the command line, please download iCommands (http://www.iplantcollaborative.org/learning-center/get-started/import-data) and follow the instructions for setup on this page (https://pods.iplantcollaborative.org/wiki/display/DS/Using+iCommands). Use the iput command along with the -PVT option flags to see progress and reset your connection socket. Use the -r option for recursive upload of directories/folders.

Need help: Ask your team mates, as we have tried to balance the teams so that some have experience using iPlant. You can of course email Jason and Kapeel.

Tip: All test datasets are (or will be) placed in
/iplant/home/shared/iplant_DNA_subway
Please use these data for your work.

  • /iplant/home/shared/iplant_DNA_subway/sample_data/fastq/zea_mays/paired_end

Quality control Reads

NOTE: If you haven't already done so, you may want to unzip/decompress your reads. See the apps in the Discovery Environment: Public Apps >> General Utilities >> Compress and Decompress

1. In the Discovery Environment use: Apps > FASTQC 0.10.0 (multi-file)

2. Name your analysis and set your output to the shared folder

3. Add your files and the click Launch Analysis

See documentation here: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Check the Analysis panel in the Discovery Environment for the status. You will usually get emails when jobs complete (unless you have turned that function off in preferences). You may also need to refresh this window from time to time. This job should take 10-20 minutes depending on how many files/how much data. Once you have the results, you can confer with your team on how to trim and filter in the next step.

Trim/Filter Reads

How you trim and filter here is entirely up to you. We can't give many specifics without seeing the data, so working with the team will be valuable. Here is how to use the FASTX tools in the DE.

IMPORTANT: Its usually the case that the sequencing facility has trimmed adapter sequences. A negligible <0.5% of sequences may still not be fully trimmed and show up in your FASTQC report as overrepresented. If trimming has not be done, you will need to clip the sequences before any filtering. You can use Scythe or other apps to do this.

Either use FASTX and the rePAIR App OR Trimmommatic. If you use Trimmommatic you do not need to do FASTX/rePAIR

For FASTX

1. In the Discovery Environment use: Apps > FASTX (there are several Apps, e.g.: Clipper, quality filter, quality trimmer, trimmer).

  • What Apps and parameters you will use will vary according to the sequence quality

2. For each tool you use, name your analysis and set your output to the shared folder.

See documentation here: http://hannonlab.cshl.edu/fastx_toolkit/

if you are filtering reads (i.e. dropping reads that do not meet a minimum quality score you will need to re-pair the reads: use app: rePair Fix Read Paring). This is more like the 'hidden' step in DNA Subway where read paring is re-established.

In the Discovery Environment use: Apps > rePair-Fix Read Parining

  • select both sets of fastq files (e.g. left( _1) and right reads( _2) )
  • For prefix, use the same prefix/name that was used in the original fastq files (e.g. "honeybee_control_replicate1")

For Trimmomatic

  1. Use the app HTProcess_trimmomatic_0.32 (click the info button to verify you are using the copy installed in Feb 2015)See Trimmomatic Manual

Align RNA-seq reads using Tophat

  1. Tophat is a specialized alignment software for RNA-seq reads that is aware of splice junctions when aligning to a reference assembly.  It can be found by clicking on the Apps icon. The most recent App for Tophat will accept multiple fastq files.
  2. Navigate to the Apps window by clicking on the icon. Look for TopHat2-PE (TopHat2 for Illumina paired-end reads) in the 'Public Applications' category under NGS->transcriptome_profiling->Tuxedo RNA-Seq 2. Click on the name of the App to open it.
  3. In the Input Data section, you can click 'Add' at the top right of the FASTQ File(s) box to navigate to the folder containing the FASTQ files:
  4.  In the Ref. Genome section, Use the drop-down menu to select your reference genome. 
  5. In the Ref. Annotations section, use the drop-down menu to select reference gene annotations.
  6.  Start the analysis by clicking the Launch Analysis button, naming the analysis 'tophat' in the dialog box. 

If your reference genome/annotation is not in the list look for it in:

/iplant/home/shared/iplant_DNA_subway/genomes

The .fas file will be your genome sequence

The .gtf file will be your annotation

 When the analysis is complete, navigate to the results under your 'analyses' folder. The principle outputs from TopHat are BAM files, each set of alignments is in its own folder, corresponding to the name of the original Fastq file. This is the most time-consuming step of this training module.

Cufflinks assembles RNA-Seq alignments into a parsimonious set of transcripts, then estimates the relative abundances of these transcripts based on how many reads support each one, taking into account biases in library preparation protocols. A detailed manual can be found at http://cole-trapnell-lab.github.io/cufflinks/manual/. We will be using a reference annotation file for '''Arabidopsis thaliana''' to perform Reference annotation based assembly (RABT), which facilitates discovery of known and novel transcripts. For further discussion, see http://cole-trapnell-lab.github.io/cufflinks/papers/

  1. Click on the Apps icon and find Cufflinks2. Open it.
  2. In the 'Select Input Data' section, open a separate data window for the 'bam' folder from tophat_out (above) and select and drag all four bam files into the input box. For convenience, a batch of TopHat bam files can be analyzed together but these files can also be processed concurrently in independent Cufflinks runs.
  3. In the Reference Sequence section, select the reference genome used for the TopHat stepThe outputs of cufflinks are GTF and FPKM files. For example, look in one of the four output folders:

 

Examine the GTF file, transcripts.gtf. This file contains annotated transcripts assembled by cufflinks, using the annotated transcripts selected from the pulldown menu in the Cufflinks app as a guide.

1	Cufflinks	transcript	3631	5899	1	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
1	Cufflinks	exon	3631	3913	1	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
1	Cufflinks	exon	3996	4276	1	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "2"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
1	Cufflinks	exon	4486	4605	1	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "3"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
1	Cufflinks	exon	4706	5095	1	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "4"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
1	Cufflinks	exon	5174	5326	1	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "5"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
1	Cufflinks	exon	5439	5899	1	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "6"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";

Examine the FPKM file, genes.fpkm_tracking. This files gives the normalized abundance of each transcript, expressed in fragments per kilobase of exon per million fragments mapped (FPKM) and their upper and lower 95% confidence interval.

tracking_id	class_code	nearest_ref_id	gene_id	gene_short_name	tss_id	locus	length	coverage	FPKM	FPKM_conf_lo	FPKM_conf_hi	FPKM_status
AT1G01010	-	-	AT1G01010	ANAC001	-	1:3630-5899	-	-	0	0	0	OK
AT1G01020	-	-	AT1G01020	ARV1	-	1:5927-8737	-	-	6.22574	1.51952	10.9319	OK
AT1G01030	-	-	AT1G01030	NGA3	-	1:11648-13714	-	-	0.921783	0	2.22538	OK
AT1G01040	-	-	AT1G01040	DCL1	-	1:23145-31227	-	-	0.129701	0	0.389104	OK
AT1G01050	-	-	AT1G01050	AtPPa1	-	1:31169-33153	-	-	6.08883	1.11732	11.0603	OK
AT1G01046	-	-	AT1G01046	MIR838A	-	1:28499-28706	-	-	0	0	0	OK
AT1G01070	-	-	AT1G01070	AT1G01070	-	1:38751-40944	-	-	1.92728	0	4.75919	OK
AT1G01073	-	-	AT1G01073	AT1G01073	-	1:44676-44787	-	-	0	0	0	OK

Merge all Cufflinks transcripts into a single transcriptome annotation file using Cuffmerge

Cuffmerge merges together several Cufflinks assemblies. It handles also handles running Cuffcompare. The main purpose of this application is to make it easier to make an assembly GTF file suitable for use with Cuffdiff. A merged, empirical annotation file will be more accurate than using the standard reference annotation, as the expression of rare or novel genes and alternative splicing isoforms seen in this experiment will be better reflected in the empirical transcriptome assemblies. For an informative discussion on this topic, see http://seqanswers.com/forums/showthread.php?t=16422.

  1. Open the Cuffmerge2 app. Under 'Input Data', browse to the results of the cufflinks analyses (cufflinks_out). Select the four GTF files in cufflinks_out/gtf as input for Cuffmerge2.
  2.  Under Reference Data, select the reference genomes and annotations used in the tophat step.
  3.  Run the App.

Section 4: Compare expression analysis using CuffDiff

Cuffdiff is a program that uses the Cufflinks transcript quantification engine to calculate gene and transcript expression levels in more than one condition and test them for significant differences. Comprehensive documentation of this tool is given through the Nature Protocols paper linked at http://cole-trapnell-lab.github.io/cufflinks/cuffdiff/index.html.

  1. Open a separate data window for the bam files generated by tophat (above).
  2. In Apps select CuffDiff2
  3. Enter the name of your first Sample (WT).
  4. Add BAM files all replicates from this sample, either from your tophat output directory or the pre-staged data. Ignore the BAM index (.bai) files. Cuffdiff merges the replicate data and uses comparison between replicates as part of calculating the test statistic.
  5.  Repeat steps (b) and (c) for the second sample 
  6. Next, open the Reference Data section and add a custom reference annotation file using the "merged_with_ref_ids.gtf" file from the cuffmerge output folder. (This is optional, and if you skipped Cufflinks you can omit)
  7.  Launch the analysis and examine the output files.

After a successful run you will find many output files under cuffdiff_out. We will focus on four files related to differential gene expression at the gene and transcript levels:

  1. genes.fpkm_tracking Transcript abundance at the gene level in units of fragments per Kb of exon per million mapped reads (FPKM)
  2. gene_exp.diff Results of differential expression testing between the two samples, WT and hy5, at the gene level
  3. isoforms.fpkm_tracking Transcript abundance at the transcript isoform level in units of fragments per Kb of exon per million mapped reads (FPKM)
  4. isoform_exp.diff Results of differential expression testing between the two samples, WT and hy5, at the transcript isoform level


Explanation of Outputs:
gene_exp.diff contains gene FPKM values for the WT and hy5 samples and ln(fold change) values.

Statistical significance is given by the p_value and q_value and a determination of whether the there is significant differential expression is given in the last column (with values of 'yes' or 'no'). The q-value is a corrected p-value for multiple tests based on the user supplied false discovery rate (FDR: default 0.05).

test_id	gene_id	gene	locus	sample_1	sample_2	status	value_1	value_2	log2(fold_change)	test_stat	p_value	q_value	significant
AT1G01010	AT1G01010	ANAC001	1:3630-5899	WT	hy5	NOTEST	0	1.68901	1.79769e+308	1.79769e+308	0.0786496	1	no
AT1G01020	AT1G01020	ARV1	1:5927-8737	WT	hy5	OK	7.00799	8.85126	0.336881	-0.405048	0.685442	0.816558	no
AT1G01030	AT1G01030	NGA3	1:11648-13714	WT	hy5	NOTEST	1.44458	1.10952	-0.380717	0.22392	0.822819	1	no
AT1G01040	AT1G01040	DCL1	1:23145-33153	WT	hy5	NOTEST	0.312212	3.04201	3.28442	-2.31653	0.0205292	1	no
AT1G01046	AT1G01046	MIR838A	1:23145-33153	WT	hy5	NOTEST	0	0	0	0	1	1	no
AT1G01050	AT1G01050	AtPPa1	1:23145-33153	WT	hy5	OK	12.4652	24.0248	0.946618	-1.48078	0.138664	0.34761	no
AT1G01060	AT1G01060	LHY	1:33378-37871	WT	hy5	OK	5.2891	8.01668	0.599983	-0.961352	0.336375	0.55581	no
AT1G01070	AT1G01070	AT1G01070	1:38751-40944	WT	hy5	NOTEST	1.99104	3.22537	0.695944	-0.355638	0.722111	1	no
AT1G01073	AT1G01073	AT1G01073	1:44676-44787	WT	hy5	NOTEST	0	0	0	0	1	1	no
AT1G01080	AT1G01080	AT1G01080	1:45295-47019	WT	hy5	OK	15.4469	20.5669	0.413009	-0.750314	0.453065	0.647742	no
AT1G01090	AT1G01090	PDH-E1 ALPHA	1:47484-49286	WT	hy5	OK	71.0777	41.2999	-0.783261	2.9202	0.00349811	0.0284027	yes