ChIPseq Using the iPlant Discovery Environment

ChIPseq Using the iPlant Discovery Environment

ChIPseq analysis is used to answer the questions such as "where is my protein binding" and "where are specific histone modifications found". You generate a minimum of two sequence libraries: One is input DNA which serves as a baseline against which the enriched sample is compared. Its role is to illuminate positional biases in sequencing efficiency. The other library is generated by immunoprecipitating chromatin with an antibody raised against your protein epitope of choice. You then co-analyze the libraries to detect significant domains of enrichment (or depletion if that's your experimental design) of chromosomal domains in the immunoprecipitated sample.

According to Zhang et al (Genome Biology 2009, 10:R6), "H3K4me3 accumulates predominantly in promoters and 5' genic regions... In addition, H3K4me3-containing genes are highly expressed with low levels of tissue specificity...". In this tutorial, we will map a H3K4me3 ChIPseq data to confirm these findings.

  1. Import data from the SRA
  2. Convert the SRA data into FASTQ format
  3. Run a quality control report on the resulting files to confirm data quality
  4. Perform a clean-up task on our ChIPseq data
  5. Align the data to the Arabidopsis genome
  6. Use PeakRanger to find enriched domains
  7. Inspect the results in IGV using Atmosphere

The NCBI SRA

Despite the fact that this resource has been reduced in scope by funding cuts, it's an enornously valuable resource. Today, we're going to extract some data from it and learn to do basic ChIPseq analysis. I've already navigated to the SRA and identified an appopriate experiment Genome-wide maps of histone modifications in Arabidopsis, data for which can be found at http://www.ncbi.nlm.nih.gov/sra/SRP002650. In a window separate from the iPlant DE, we will navigate to this URL to begin the process of importing into the DE. We will be importing data sets GSM554342 (input DNA) and GSM554340 (H3K4me3 IP)

NCBI SRA Import

The NCBI SRA Import tool takes the FTP address for an sra or sra-lite data set and imports the data found there using the high-speed Aspera protocol. Resulting files are deposited in the analyses folder in the iPlant DE as though an algorithmic analysis had been performed.

  1. Click on the GSM554342 link. Right click on the sra-lite link and select Copy URL to Clipboard (or equivalent on your operating system and browser)
    1. It will look like so: ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/litesra/SRX/SRX022/SRX022329
  2. Switch browser windows to the iPlant DE and select the NCBI SRA Import tool from the Perform Analysis / Choose Analysis window. 
  3. Paste the URL into the box that reads Enter full SRA FTP URL and click 'Launch Job'. Name the job Import_GSM554342_input
    1. You'll be notified when the data has been imported. Because we are using the Aspera protocol in the background, this transfer takes place much more quickly than it would over FTP.
  4. Repeat this process with the GSM554340, naming the job Import_GSM554340_h3k4me3

After you have successfully imported these data sets, you will have two folders in your analyses directory: Import_GSM554342_input and Import_GSM554340_h3k4me3. Inside each of them will be a folder called SRA_Import, which contain folders that recapitulate the sample structure of the experiment. At the end of this line of folders you will find files called *.lite.sra. These are your imported data archives. Feel free to move them around, but for the sake of this excercise I will leave them in place.

The NCBI SRA Toolkit

NCBI uses a specialized file format called SRA for storing compressed, optimized versions of sequence data. You cannot use SRA files natively in any analyses: you must extract FASTQ sequence data from them first. We will now do this.

NCBI SRA Toolkit fastq-dump

  1. Select NCBI SRA Toolkit fastq-dump from the Choose Analysis window.
  2. Click on the 'Browse' button beneath 'Choose an SRA file' and navigate to one of your SRA import files. First, I will choose SRR058390.lite.sra which is the archive file for the input DNA. 
  3. Under Parameters enter the accession number into 'SRA Accession Name'. In this case, SRR058390.
  4. Click on Launch Job and name the job FASTQ_SRR058390_input
  5. Repeat the process for the H3K4me3 SRA file, naming its job FASTQ_SRR058388_h3k4me3

Quality Control

Sometimes sequence data doesn't come off the machine perfect. There can be fall-offs in quality, unanticipated nonbiological sequences, and so on that will prevent the data from being useful for downstream analysis. To help with this, iPlant has included several tools: Almost all of the FASTX Toolkit http://hannonlab.cshl.edu/fastx_toolkit/ as well as FASTQC from the Babraham Institute http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/. In the following steps, we will run FASTQC on our imported data and take appropriate steps to fix any issues we find.

FASTQC

  1. Select FastQC from the Choose Analysis window.
  2. Choose one of our recently imported FASTQ files. In this case, I will choose ~/analyses/FASTQ_SRR058390_input/SRR058390.fastq
  3. Launch the job, naming it FastQC_SRR058390_input
  4. Repeat these steps with the H3K4me3 FASTQ file

When the jobs complete, navigate to the output directories and download the fastqc_report.png files you see there. You can also download the PDF if you prefer. You can can also download a zipped HTML version of the report as fastqc_report.zip. Open one of the reports on your desktop (and zoom in so you can see the details if you are using the PNGs).

Notice that in both the Input and H3K4me3 data, there is a terrible decline in sequence quality after 45 bases. We should trim off that bit of sequence from each read before aligning to the genome.

FASTX_Trimmer

  1. Select FASTX Trimmer from the Choose Analysis window.
  2. Choose an imported FASTQ file. In this case, I will choose ~/analyses/FASTQ_SRR058390_input/SRR058390.fastq to start
  3. Indicate that base 1 is the first to keep and base 45 is the last.
  4. Indicate that the sequence file is in PHRED33 format (this is the default for SRA data)
  5. Name the job Trimmer_SRR058390_input
  6. Repeat with the H3K4me3 FASTQ file.

ChIPseq analysis

Now that the data is cleaned up and ready to go, align it to the Arabidopsis genome using BWA http://bio-bwa.sourceforge.net/, then look for significant differences between the input and IP channels using PeakRanger http://ranger.sourceforge.net/, an algorithm developed in partnership between iPlant and the modENCODE project.

BWA Aligner for Single-End Illumina Reads

  1. Select BWA Aligner for Single-End Illumina Reads from the Choose Analysis window.
  2. Indicate that you wish to align to Arabidopsis thaliana v10 (this is the TAIR10 release and is the most recent coordinate set)
  3. Select the trimmed FASTQ file (out.fastq) from Trimmer_SRR058390_input
  4. Leave the various Options (For now – Feel free to modifiy them in your own work)
  5. Launch Job, naming it 'BWA_SRR058390_input'
  6. Repeat this process with the H3K4me3 out.fq file

Because the input files are not massive, and because we are running BWA in 'threaded' mode, these alignments should not take more than about 30 minutes. This is a good time for a coffee break.

PeakRanger

  1. Select PeakRanger from the Choose Analysis window
  2. Select the SAM file from the BWA alignment of H3K4me3 data for 'Immunoprecipitated reads alignment file'
  3. Select the SAM file from the BWA alignment of input data for 'Control reads alignment file'
  4. Leave the parameters in Options and Run Modes alone for now. You are encouraged to read the PeakRanger paper if you decide to use it for your own analyses.
  5. Launch the job

The most interesting files resulting from this analysis are PeakRanger_Analysis_peaks.bed and *PeakRanger_Analysis_peaks_with_regions.bed. If you inspect them, you will see they look like so:

Chr1	7105091	7105092	1.959061e-12	1.249010e-10
Chr2	13523375	13523376	7.618219e-06	7.166121e-05
Chr1	8183093	8183094	3.727605e-09	1.273155e-07
Chr4	6101129	6101130	6.532641e-11	3.536254e-09
Chr4	7866955	7866956	2.324581e-06	2.700090e-05
Chr3	19311859	19311860	3.729861e-07	6.080098e-06
Chr1	1018749	1018750	9.650597e-06	8.652364e-05
Chr4	2297368	2297369	8.022027e-12	4.512784e-10
Chr3	17678553	17678554	1.414439e-06	1.827939e-05
Chr4	12646130	12646131	1.732726e-06	2.142754e-05
Chr2	7953412	7953413	2.863883e-08	6.734819e-07
Chr4	10308958	10308959	7.618219e-06	7.237308e-05
Chr1	8309916	8309917	4.667442e-09	1.487877e-07
Chr2	2460902	2460903	5.483627e-06	5.351200e-05
Chr3	17132848	17132849	3.229536e-06	3.496430e-05
Chr3 17132848 17132849 3.229536e-06 3.496430e-05

This is certainly interesting, but it would be nice to look at the results in a more graphical manner. The DE doesn't support (yet) complex graphical viewers, so let's download the BED files to your desktop and open them in the Broad Institute Integrated Genomics Viewer.

Integrated Genomics Viewer

  1. In a new browser window, navigate to the IGV launch page http://www.broadinstitute.org/software/igv/download
  2. If you are using Windows, click on 'Launch with 1.2 GB' and if you are using a Macintosh, click on 'Launch with 2 GB'
  3. The viewer will begin to launch via Java Webstart. You may be prompted to authorize acceptance of code from the Broad Institute. I think we can trust them, so please do accept it.
    1. Some browsers will not automatically launch IGV, but will instead download a file called 'igv_lm.jnlp'. Navigate to that file and double-click on it to commence launching IGV.
  4. The viewer will default to viewing the human genome, which is not what we want today. Select A. thaliana TAIR10 from the top left drop-down menu in the IGV interface to switch to the Arabidopsis genome.
    1. YOu will now be presented with a view of the genome with all 5 chromosomes lined up end-to-end and a Gene Density track along the bottom of the page. Like all other browsers, you can click and zoom to zero in on domains you are interested in.

Download files

  1. Return to the Discovery Environment browser window.
  2. In the Manage Data window, navigate to your PeakRanger output directory.
  3. For each file you wish to download, click on the select box next to the filename, then choose the 'Download' link in the bottom right panel of the Manage Data window. The file will be downloaded to your desktop.

Return to the IGV

  1. In the IGV application, select "File...Load from File..." and select one of your downloaded BED files. The intervals detected by the PeakRanger application will be displayed in the context of the Arabidopsis genome. Navigate to a few genes, promoters, and intergenic regions and see if your results match up with the Zhang findings.