ChIP-seq Tutorial - BWA, Picard, HOMER, and IntersectBed

ChIP-seq workflow using the Discovery Environment

 

The DE Quick Start tutorial provides an introduction to basic DE functionality and navigation.

Please work through the tutorial and add your comments at the bottom of this page, or send them by email to xwang@cshl.edu. Thank you.


Rationale and background

ChIP-seq is a method used to analyze protein interactions with DNA. It combines chromatin immunoprecipitation (ChIP) with massively parallel DNA sequencing to identify the binding sites of DNA-associated proteins, such as transcription factors, covalently modified histones, or other nuclear proteins (https://en.wikipedia.org/wiki/ChIP-sequencing). The current ChIP-seq software ecosystem provides various ways to identify these binding sites; among them, HOMER's routines cater to the analysis of ChIP-seq data. Here, we describe a pipeline suitable both for proteins that are expected to bind in a punctate manner, for example at specific DNA sequences or specific chromatin configurations, and for proteins that associate with DNA over longer regions or domains.


Pre-Requisites

  1. An iPlant account. (Register for an iPlant account at user.iplantcollaborative.org.)
  2. The DE Quick Start tutorial provides an introduction to basic DE functionality and navigation.

Test Data

This tutorial uses the sequencing data stored at /iplant/home/xiaofei_iplant/Sorghum_chr8/chr8_test/G3_P_K4me3_chr8.

Workflow

The tutorial will take users through the following operations:

Operation 1: Align reads to reference using BWA

  1. Mandatory arguments 
    1. Sequences folder for the protein of interest (Note: the files can be in FASTA or FASTQ format; for PE reads, the file names should indicate the read end, e.g., test_R1.fq and test_R2.fq)
    2. Sequences folder for the background control (same format and naming requirements as above)
    3. Reference genome sequence in FASTA format
    4. Read type: SR vs PE
  2. Optional arguments
    1. Minimum score: Don’t output alignments with a score lower than INT
    2. Type of sequencing reads: Illumina, PacBio, Oxford Nanopore, or intra-species contigs to reference
    3. Sort method for BAM: Sort alignments by leftmost coordinates, or by read name
    4. Mark shorter split: Mark shorter split hits as secondary (for Picard compatibility)
    5. SAM output: keep or purge the alignments in SAM format
  3. Test output
    1. G3_P_H3_chr8_BWA_bam
      1. G3_P_H3_rep1_chr8_R.sorted.bam
      2. G3_P_H3_rep1_chr8_R.sorted.bam.bai
      3. G3_P_H3_rep2_chr8_R.sorted.bam
      4. G3_P_H3_rep2_chr8_R.sorted.bam.bai
    2. G3_P_K4me3_chr8_BWA_bam
      1. G3_P_K4me3_rep1_chr8_R.sorted.bam
      2. G3_P_K4me3_rep1_chr8_R.sorted.bam.bai
      3. G3_P_K4me3_rep2_chr8_R.sorted.bam
      4. G3_P_K4me3_rep2_chr8_R.sorted.bam.bai
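
As a rough illustration of the naming convention for PE reads, the following Python sketch pairs files by their _R1/_R2 suffix and assembles a bwa mem command line without running it. The helper names (pair_reads, bwa_mem_command) and the file names are hypothetical, and the app's own pairing logic may differ; -T (minimum score) and -M (mark shorter split hits as secondary) are standard bwa mem options.

```python
import re
from collections import defaultdict


def pair_reads(filenames):
    """Group paired-end files by the prefix in front of _R1/_R2 (assumed convention)."""
    pattern = re.compile(r"(.+)_R([12])\.(?:fq|fastq|fa|fasta)(?:\.gz)?$")
    found = defaultdict(dict)
    for name in filenames:
        match = pattern.match(name)
        if match:
            found[match.group(1)]["R" + match.group(2)] = name
    # Keep only samples for which both read ends are present.
    return {
        sample: (ends["R1"], ends["R2"])
        for sample, ends in found.items()
        if len(ends) == 2
    }


def bwa_mem_command(reference, reads, min_score=30, mark_secondary=True):
    """Build (but do not execute) a bwa mem command as an argument list."""
    cmd = ["bwa", "mem", "-T", str(min_score)]
    if mark_secondary:
        cmd.append("-M")  # mark shorter split hits as secondary
    cmd.append(reference)
    cmd.extend(reads)
    return cmd


pairs = pair_reads(["test_R1.fq", "test_R2.fq", "other_R1.fq"])
command = bwa_mem_command("ref.fa", pairs["test"])
print(" ".join(command))
```

The list form can be passed to subprocess.run once BWA is installed; for SR data, reads would be a one-element sequence.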

Operation 2: Picard removing duplicates

  1. Mandatory arguments 
    1. Input directory: Directory of SAM/BAM files to analyze. The alignment files must be coordinate sorted. 
  2. Optional arguments:
    1. VALIDATION_STRINGENCY: Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded. Default value: LENIENT. This option can be set to 'null' to clear the default value. Possible values: {STRICT, LENIENT, SILENT}
    2. REMOVE_DUPLICATES: If true, do not write duplicates to the output file instead of writing them with appropriate flags set. Default value: true.
  3. Test output
     
    1. BAM files
      G3_P_H3_chr8_BWA_bam_rmDup_BAM:
      1. G3_P_H3_rep1_chr8_rmDup.sorted.bam
      2. G3_P_H3_rep1_chr8_rmDup.sorted.bam.bai
      3. G3_P_H3_rep2_chr8_rmDup.sorted.bam
      4. G3_P_H3_rep2_chr8_rmDup.sorted.bam.bai
      G3_P_K4me3_chr8_BWA_bam_rmDup_BAM:
      1. G3_P_K4me3_rep1_chr8_rmDup.sorted.bam
      2. G3_P_K4me3_rep1_chr8_rmDup.sorted.bam.bai
      3. G3_P_K4me3_rep2_chr8_rmDup.sorted.bam
      4. G3_P_K4me3_rep2_chr8_rmDup.sorted.bam.bai
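
The two options above map onto a Picard command line roughly as sketched below. This assumes the app wraps Picard's MarkDuplicates tool (REMOVE_DUPLICATES is one of its options); the function name, the picard.jar location, and the file names are hypothetical, and the defaults simply mirror the values listed in this tutorial.

```python
ALLOWED_STRINGENCY = {"STRICT", "LENIENT", "SILENT"}


def mark_duplicates_command(in_bam, out_bam, metrics_file,
                            remove_duplicates=True,
                            validation_stringency="LENIENT",
                            picard_jar="picard.jar"):
    """Build (but do not execute) a Picard MarkDuplicates command."""
    if validation_stringency not in ALLOWED_STRINGENCY:
        raise ValueError(f"unknown stringency: {validation_stringency}")
    return [
        "java", "-jar", picard_jar, "MarkDuplicates",
        f"I={in_bam}",        # coordinate-sorted input
        f"O={out_bam}",       # output with duplicates removed or flagged
        f"M={metrics_file}",  # duplication metrics report
        "REMOVE_DUPLICATES=" + ("true" if remove_duplicates else "false"),
        f"VALIDATION_STRINGENCY={validation_stringency}",
    ]


command = mark_duplicates_command("in.sorted.bam", "out.rmDup.bam", "metrics.txt")
print(" ".join(command))
```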

Operation 3: HOMER creating tag directory

  1. Mandatory arguments 
    1. Input alignment folder
    2. Output folder
  2. Optional arguments:
    1. -tbp: maximum tags per bp
    2. -format: format of the alignment files
    3. -single: setting this option will place all reads into a single tag file instead of separate tag files for each chromosome
  3. Test outputs
    1. G3_P_H3_rep1_chr8_R_rmDup_tagDir
    2. G3_P_H3_rep2_chr8_R_rmDup_tagDir
    3. G3_P_K4me3_rep1_chr8_R_rmDup_tagDir
    4. G3_P_K4me3_rep2_chr8_R_rmDup_tagDir
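
To make the effect of -tbp concrete, here is a simplified Python sketch that caps the number of tags kept at each position and builds a makeTagDirectory command. It is a toy model under stated assumptions: tags are represented as (chromosome, position, strand) tuples, the function names are hypothetical, and HOMER's internal handling is more involved.

```python
from collections import Counter


def cap_tags_per_bp(tags, tbp):
    """Keep at most `tbp` tags for each (chromosome, position, strand)."""
    seen = Counter()
    kept = []
    for tag in tags:
        if seen[tag] < tbp:
            seen[tag] += 1
            kept.append(tag)
    return kept


def make_tag_directory_command(out_dir, alignment_file, tbp=None,
                               fmt=None, single=False):
    """Build (but do not execute) a HOMER makeTagDirectory command."""
    cmd = ["makeTagDirectory", out_dir, alignment_file]
    if fmt is not None:
        cmd += ["-format", fmt]
    if tbp is not None:
        cmd += ["-tbp", str(tbp)]
    if single:
        cmd.append("-single")  # one tag file instead of one per chromosome
    return cmd


# Three identical tags, one on the opposite strand, one elsewhere.
tags = [("chr8", 100, "+")] * 3 + [("chr8", 100, "-"), ("chr8", 250, "+")]
print(len(cap_tags_per_bp(tags, 1)), len(cap_tags_per_bp(tags, 2)))
```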

Operation 4: HOMER finding peaks

  1. Mandatory arguments 
    1. Tag directory for mark of interest 
    2. Tag directory for input background (using an input or IgG sample as a control is highly recommended)
  2. Optional arguments:
    1. -STYLE: specify an analysis strategy
      1. factor
      2. histone
      3. groseq
      4. tss
      5. dnase
      6. super
      7. mC
    2. -Peak Size: peak size
    3. -minDist: minimum distance between adjacent peaks; when -region is set, the maximum distance used to stitch peaks together
    4. -gsize: effective mappable genome size
    5. -region: stitch adjacent enriched peaks into regions
    6. -rep: biological replicates. This option is defined by this app rather than by HOMER; the name of each tag directory should contain "rep" followed by a number (e.g., G3_P_H3_rep1_tagDir)
  3. Test outputs
     
    1. homerPeaks:
      1. G3_P_K4me3_rep1_chr8_R_rmDup.txt
      2. G3_P_K4me3_rep2_chr8_R_rmDup.txt
    2. homerBed:
      1. G3_P_K4me3_rep1_chr8_R_rmDup.bed
      2. G3_P_K4me3_rep2_chr8_R_rmDup.bed
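
The replicate naming rule and the peak-calling options can be sketched in Python as follows. This is only an illustration: group_replicates and find_peaks_command are hypothetical helpers, the assumption that the sample name is everything before "_rep" is ours, and the commands are built but not executed. The flags used (-style, -i, -o, -size, -minDist, -gsize, -region) are standard findPeaks options.

```python
import re
from collections import defaultdict


def group_replicates(tag_dirs):
    """Group tag directories by sample name, ordered by replicate number."""
    groups = defaultdict(dict)
    for name in tag_dirs:
        match = re.match(r"(.+?)_rep(\d+)", name)
        if match:
            groups[match.group(1)][int(match.group(2))] = name
    return {s: [reps[n] for n in sorted(reps)] for s, reps in groups.items()}


def find_peaks_command(tag_dir, control_dir, out_file, style="histone",
                       size=None, min_dist=None, gsize=None, region=False):
    """Build (but do not execute) a HOMER findPeaks command."""
    cmd = ["findPeaks", tag_dir, "-style", style, "-i", control_dir, "-o", out_file]
    if size is not None:
        cmd += ["-size", str(size)]
    if min_dist is not None:
        cmd += ["-minDist", str(min_dist)]
    if gsize is not None:
        cmd += ["-gsize", str(gsize)]
    if region:
        cmd.append("-region")
    return cmd


groups = group_replicates([
    "G3_P_K4me3_rep2_chr8_R_rmDup_tagDir",
    "G3_P_K4me3_rep1_chr8_R_rmDup_tagDir",
    "G3_P_H3_rep1_chr8_R_rmDup_tagDir",
    "G3_P_H3_rep2_chr8_R_rmDup_tagDir",
])
for mark_dir, control_dir in zip(groups["G3_P_K4me3"], groups["G3_P_H3"]):
    print(" ".join(find_peaks_command(mark_dir, control_dir, mark_dir + ".txt")))
```

Pairing replicate N of the mark with replicate N of the control, as the loop does, is one plausible reading of the -rep option rather than documented behavior.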

Operation 5: IntersectBed overlapping peaks between replicates

  1. Mandatory arguments 
    1. A bed/gff/vcf file for "B".
  2. Optional arguments:
    1. A bed/gff/vcf file or a bam file for "A" (Note: one of these has to be provided).
    2. -F: Minimum overlap required as a fraction of B.
    3. -f: Minimum overlap required as a fraction of A.
    4. Output format when using BAM as input.
    5. -wa: Write the original entry in A for each overlap.
    6. -wb: Write the original entry in B for each overlap. Useful for knowing what A overlaps. Restricted by -f and -r.
    7. -wo: Write the original A and B entries plus the number of base pairs of overlap between the two features. Only A features with overlap are reported. Restricted by -f and -r.
    8. -wao: Write the original A and B entries plus the number of base pairs of overlap between the two features. However, A features w/o overlap are also reported with a NULL B feature and overlap = 0. Restricted by -f and -r.
    9. -u: Write original A entry once if any overlaps found in B. In other words, just report the fact at least one overlap was found in B. Restricted by -f and -r.
    10. -c: For each entry in A, report the number of hits in B while restricting to -f. Reports 0 for A entries that have no overlap with B. Restricted by -f and -r.
    11. -v: Only report those entries in A that have no overlap in B. Restricted by -f and -r.
    12. -r: Require that the fraction of overlap be reciprocal for A and B. In other words, if -f is 0.90 and -r is used, this requires that B overlap at least 90% of A and that A also overlaps at least 90% of B.
    13. Strandedness: Force “strandedness”. That is, only report hits in B that overlap A on the same/opposite strand. By default, overlaps are reported without respect to strand.
  3. Test outputs

    1. commonPeaks/SmpName_interSect.bed (e.g. G1_P_K4me3_interSect.bed)
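
The interplay of -f, -F, -r, and -u is easier to see with numbers. The Python sketch below re-implements the overlap test for simple (chromosome, start, end) intervals; it is a simplified model of the option semantics described above, not the bedtools code, and the function names and coordinates are made up for illustration.

```python
def overlap_bp(a, b):
    """Number of overlapping base pairs between two half-open intervals."""
    if a[0] != b[0]:
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))


def passes(a, b, f=1e-9, F=1e-9, reciprocal=False):
    """True if the overlap meets the -f (fraction of A) and -F (fraction of B) cutoffs."""
    shared = overlap_bp(a, b)
    if shared == 0:
        return False
    if reciprocal:  # -r: apply the -f cutoff to B as well
        F = f
    return shared / (a[2] - a[1]) >= f and shared / (b[2] - b[1]) >= F


def report_unique(a_peaks, b_peaks, **cutoffs):
    """-u behaviour: report each A entry once if any B entry qualifies."""
    return [a for a in a_peaks if any(passes(a, b, **cutoffs) for b in b_peaks)]


rep1 = [("chr8", 100, 200), ("chr8", 1000, 1100)]
rep2 = [("chr8", 150, 400)]
# The first rep1 peak shares 50 bp with the rep2 peak: 50% of A, 20% of B.
print(report_unique(rep1, rep2, f=0.5))
print(report_unique(rep1, rep2, f=0.5, reciprocal=True))
```

With f=0.5 alone the first peak is reported, whereas adding the reciprocal requirement drops it because only 20% of the B peak is covered.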

Operation 6: HOMER annotating common peaks

  1. Mandatory arguments 
    1. Inputs: Peak files to be annotated, annotatePeaks.pl accepts HOMER peak files or BED files.
  2. Optional arguments:
    1. -Reference Fasta: For organisms with relatively incomplete genomes, annotatePeaks.pl can still provide some functionality. If the genome is not available as a pre-configured genome in HOMER, you can supply the path to the full genome FASTA file, or the path to a directory containing chromosome FASTA files, as the 2nd argument.
    2. -Genome Anno File (using custom annotations): 
      HOMER can process GTF (Gene Transfer Format) files and use them for annotation purposes ("-gtf <gtf filename>"). 
      You may also find a custom annotation file for the organism, such as banana_slug_genes.gtf, or banana_slug_genes.gff from the community website. 
    3. -Gene Expression Data: annotatePeaks.pl can add gene-specific information to peaks based on each peak's nearest annotated TSS. To add gene expression or other data types, first create a gene data file (tab-delimited text file) in which the first column contains gene identifiers and the first row is a header describing the contents of each column. In principle, the contents of the remaining columns do not matter.
    4. -GENOME: Pre-configured genome in HOMER, e.g. hg18.
  3. Test outputs
    1. Peak annotation file:
      e.g., homerPeakAnno/G3_P_K4me3_interSect_PeakAnno.txt
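
A gene data file of the kind described above can be produced with a few lines of Python. In this sketch the gene identifiers, the column name, and the values are made-up placeholders, and the helper names are hypothetical; the command is only assembled, not run. The -gtf and -gene flags are annotatePeaks.pl options, and the tool writes its annotation table to standard output.

```python
import csv
import io


def write_gene_data(rows, columns, handle):
    """Write a tab-delimited gene data file: header row, gene IDs in column 1."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["GeneID"] + list(columns))
    for gene_id, values in rows.items():
        writer.writerow([gene_id] + list(values))


def annotate_peaks_command(peak_file, genome, gtf=None, gene_data=None):
    """Build (but do not execute) an annotatePeaks.pl command."""
    cmd = ["annotatePeaks.pl", peak_file, genome]  # genome: name or FASTA path
    if gtf is not None:
        cmd += ["-gtf", gtf]
    if gene_data is not None:
        cmd += ["-gene", gene_data]
    return cmd


buffer = io.StringIO()
# Placeholder identifiers and values, for illustration only.
write_gene_data({"gene_A": [1.5], "gene_B": [0]}, ["leaf_expr"], buffer)
print(buffer.getvalue(), end="")
print(" ".join(annotate_peaks_command(
    "G3_P_K4me3_interSect.bed", "genome.fa",
    gtf="genes.gtf", gene_data="gene_data.txt")))
```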








