ChIP-seq Tutorial - BWA, Picard, HOMER, and IntersectBed
ChIP-seq workflow using the Discovery Environment
The DE Quick Start tutorial provides an introduction to basic DE functionality and navigation.
Rational and background
ChIP-Seq is a method used to analyze protein interactions with DNA. ChIP-seq combines chromatin immunoprecipitation (ChIP) with massively parallel DNA sequencing to identify the binding sites of DNA-associated proteins, such as transcription factor, covalently modified histone marks, or other nuclear protein (https://en.wikipedia.org/wiki/ChIP-sequencing). The current ecosystems of ChIP-Seq provide a varied ways to identify the binding sites of DNA-associated proteins. Among those, HOMER's routines cater to the analysis of ChIP-Seq data. Here, we describe a pipeline suitable for both proteins that are expected to bind in a punctate manner, such as specific DNA sequences or specific chromatin configurations, or associate with DNA over longer regions or domains.
Pre-Requisites
- An iPlant account. (Register for an iPlant account at user.iplantcollaborative.org.)
- The DE Quick Start tutorial provides an introduction to basic DE functionality and navigation.
Test Data
This tutorial uses the sequencing data stored /iplant/home/xiaofei_iplant/Sorghum_chr8/chr8_test/G3_P_K4me3_chr8.
Workflow
The tutorial will take users through the following operations:
Operation 1: Align reads to reference using BWA
- Mandatory arguments
- Sequences folder for protein of interest (Note: the files could be in FASTA or FASTQ format but should be named including reads end information for PE reads, e.g., test_R1.fq and test_R2.fq)
- Sequences folder for background control (Same as b)
- Reference genome sequence in FASTA format
- Read type: SR vs PE
- Optional arguments
- Minimum score: Don’t output alignments with score lower than INT
- Type of sequencing reads: Illumina, PacBio, Oxford Nanopore, Intra-species contains to ref
- Sort method for BAM: Sort alignments by leftmost coordinates, or by read name
- Mark shorter split: Mark shorter split hits as secondary (for Picard compatibility)
- Sam output: keep or purge the alignments in SAM
- Test output
- G3_P_H3_chr8_BWA_bam
- G3_P_H3_rep1_chr8_R.sorted.bam
- G3_P_H3_rep1_chr8_R.sorted.bam.bai
- G3_P_H3_rep2_chr8_R.sorted.bam
- G3_P_H3_rep2_chr8_R.sorted.bam.bai
- G3_P_K4me3_chr8_BWA_bam
- G3_P_K4me3_rep1_chr8_R.sorted.bam
- G3_P_K4me3_rep1_chr8_R.sorted.bam.bai
- G3_P_K4me3_rep2_chr8_R.sorted.bam
- G3_P_K4me3_rep2_chr8_R.sorted.bam.bai
- G3_P_H3_chr8_BWA_bam
Operation 2: Picard removing duplicates
- Mandatory arguments
- Input directory: Directory of SAM/BAM files to analyze. The alignment files must be coordinate sorted.
- Optional arguments:
- VALIDATION_STRINGENCY: Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded. Default value: LENIENT. This option can be set to 'null' to clear the default value. Possible values: {STRICT, LENIENT, SILENT}
- REMOVE_DUPLICATES: If true do not write duplicates to the output file instead of writing them with appropriate flags set. Default value: true.
- Test output
- BAM files
G3_P_H3_chr8_BWA_bam_rmDup_BAM:- G3_P_H3_rep1_chr8_rmDup.sorted.bam
- G3_P_H3_rep1_chr8_rmDup.sorted.bam.bai
- G3_P_H3_rep2_chr8_rmDup.sorted.bam
- G3_P_H3_rep2_chr8_rmDup.sorted.bam.bai
- G3_P_K4me3_rep1_chr8_rmDup.sorted.bam
- G3_P_K4me3_rep1_chr8_rmDup.sorted.bam.bai
- G3_P_K4me3_rep2_chr8_rmDup.sorted.bam
- G3_P_K4me3_rep2_chr8_rmDup.sorted.bam.bai
- G3_P_H3_rep1_chr8_rmDup.sorted.bam
- BAM files
Operation 3: HOMER creating tag directory
- Mandatory arguments
- Input alignment folder
- Output folder
- Optional arguments:
- -tbp: maxium tags per bp
- -format: format of the alignment files
- -single: setting this option will place all reads into a single tag file instead of separate tag files for each chromosome
- Test outputs
- G3_P_H3_rep1_chr8_R_rmDup_tagDir
- G3_P_H3_rep1_chr8_R_rmDup_tagDir
- G3_P_K4me3_rep1_chr8_R_rmDup_tagDir
- G3_P_K4me3_rep2_chr8_R_rmDup_tagDir
Operation 4: HOMER finding peaks
- Mandatory arguments
- Tag directory for mark of interest
- Tag directory for input background (use an input or IgG as a control is highly recommended)
- Optional arguments:
- -STYLE: specify an analysis strategy
- factor
- histone
- groseq
- gss
- dans
- super
- mC
- -Peak Size: peak size
- -minDist: maximum distance used to stitch peaks together
- -gsize: effective mappable genome size
- -region: stitch adjacent enriched peaks into regions
- -rep: biological replicates, self-defined rather than HOMER, the name of the tag dirctory should contain "rep" followed by a number (e.g.:G3_P_H3_rep1_tagDir)
- -STYLE: specify an analysis strategy
- Test outputs
- homerPeaks:
- G3_P_K4me3_rep1_chr8_R_rmDup.txt
- G3_P_K4me3_rep2_chr8_R_rmDup.txt
- homerBed:
- G3_P_K4me3_rep1_chr8_R_rmDup.bed
- G3_P_K4me3_rep2_chr8_R_rmDup.bed
- homerPeaks:
Operation 5: IntersectBed overlapping peaks between replicates
- Mandatory arguments
- A bed/gff/vcf file for "B".
- Optional arguments:
- A bed/gff/vcf file or a bam file for "A" (Note: one of these has to be provided).
- -F: Minimum overlap required as a fraction of B.
- -f: Minimum overlap required as a fraction of A.
- Output format when using BAM as input.
- -wa: Write the original entry in A for each overlap.
- -wb: Write the original entry in B for each overlap. Useful for knowing what A overlaps. Restricted by -f and -r.
- -wo: Write the original A and B entries plus the number of base pairs of overlap between the two features. Only A features with overlap are reported. Restricted by -f and -r.
- -wao: Write the original A and B entries plus the number of base pairs of overlap between the two features. However, A features w/o overlap are also reported with a NULL B feature and overlap = 0. Restricted by -f and -r.
- -u: Write original A entry once if any overlaps found in B. In other words, just report the fact at least one overlap was found in B. Restricted by -f and -r.
- -c: For each entry in A, report the number of hits in B while restricting to -f. Reports 0 for A entries that have no overlap with B. Restricted by -f and -r.
- -v: Only report those entries in A that have no overlap in B. Restricted by -f and -r.
- -r: Require that the fraction of overlap be reciprocal for A and B. In other words, if -f is 0.90 and -r is used, this requires that B overlap at least 90% of A and that A also overlaps at least 90% of B.
- Strandedness: Force “strandedness”. That is, only report hits in B that overlap A on the same/opposite strand. By default, overlaps are reported without respect to strand.
Test outputs
- commonPeaks/SmpName_interSect.bed (e.g. G1_P_K4me3_interSect.bed)
Operation 6: HOMER annotating common peaks
- Mandatory arguments
- Inputs: Peak files to be annotated, annotatePeaks.pl accepts HOMER peak files or BED files.
- Optional arguments:
- -Reference Fasta: For organisms with relatively incomplete genomes, annotatePeaks.pl can still provide some functionality. If the genome is not available as a pre-configured genome in HOMER, then you can supply the path to the full genome FASTA file or path to directory containing chromsome FASTA files as the 2nd argument.
- -Genome Anno File (using custom annotations):
HOMER can process GTF (Gene Transfer Format) files and use them for annotation purposes ("-gtf <gtf filename>").
You may also find a custom annotation file for the organism, such as banana_slug_genes.gtf, or banana_slug_genes.gff from the community website. - -Gene Expression Data: annotatePeaks.pl can add gene-specific information to peaks based on each peak's nearest annotated TSS. To add gene expression or other data types, first create a gene data file (tab delimited text file) where the first column contains gene identifiers, and the first row is a header describing the contents of each column. In principle, the contents of these columns doesn't mater.
- -GENOME: Pre-configured genome in HOMER, e.g. hg18.
- Test outputs
- Peak annotation file:
e.g., homerPeakAnno/G3_P_K4me3_interSect_PeakAnno.txt
- Peak annotation file: