Development of a Bioinformatics Pipeline for Identification of Hypermethylated Regions in Arabidopsis thaliana RdDM Mutants

Abstract

RNA-directed DNA methylation (RdDM) is a pathway that causes the methylation of DNA through the action of 24 nt siRNAs. This pathway normally results in transcriptional gene silencing. RdDM depends on two RNA polymerases; Pol IV which generates the siRNAs, and Pol V which transcribes a scaffold and facilitates interaction between elements of the methylation machinery (including the methyltransferase DRM 2). At most loci methylation is lost in Pol IV and Pol V mutants, however, there are a few loci that behave in the opposite fashion, becoming hypermethylated in RdDM-deficient genetic backgrounds. In order to ascertain the identity of these loci, and hypothesize about potential common functions, it is necessary to establish a bioinformatics-based analysis pipeline to identify differentially methylated regions, and hypermethylated regions in particular, from bisulfite sequencing datasets. Since published methylomics sequencing data is available for download this data was used to test and construct such an analysis pipeline. The pipeline was successful in processing data to be read by its constituent programs as well as for loading into a genome browser. With a few extra steps it would be possible to use this approach to identify differentially methylated regions of interest.

Introduction

RNA-directed DNA methylation is a pathway that results in the de novo methylation of genomic elements and is mediated by 24 nt siRNAs, RNA polymerases IV and V, argonaute 4, as well as many other accessory proteins (4). DNA methylation, the end result of RdDM, is a epigenetic silencing mark and leads to transcriptional gene silencing (TGS, repression of the ability for mRNA to be made from a locus). Known targets of RdDM include diverse categories of genomic features including genes and transposable elements (TEs) (4). The process has been implicated in pathogen response, development, as well as maintaining genome integrity against the aforementioned TEs. The ability of RdDM to regulate gene expression at the transcriptional level in an epigenetic fashion has lead to it being the focus of much study lately, however, many details about its function remain elusive. The importance of RdDM is highlighted by its ability to establish methylation in all three sequence contexts; CG, CHG, and CHH (H represents any nucleotide other than G). This ability is most important in the CHH context, where no maintenance methylation pathways exist, and DNA methylation must be established de novo following DNA replication (4).

In 2008 Mosher et al. unexpectedly discovered that DNA methylation is actually increased at some loci in RdDM-deficient A. thaliana mutants (1). Specifically, those which are null mutants for the largest subunits of Pol IV and Pol V. Not much has been done since to identify the loci exhibiting this atypical behavior as the currently accepted model of RdDM indicates that it is involved only in methylation of genomic DNA. Using published whole-genome bisulfite sequencing datasets it is possible to assess DNA methylation at single cytosine resolution and determine which regions are differentially methylated in different genetic backgrounds. Using this method to identify hypermethylation in different sequence contexts proved somewhat challenging as most available tools are only useful for analyzing methylation in the CG (also called CpG) context, which is common outside of plants. However, by leveraging currently available bioinformatics tools and writing custom python and shell scripts we were able to describe a bioinformatics analysis pipeline that can be used in the future to determine the hypermethylated regions and format methylation summary files for viewing in a genome browser to compare differentially methylated loci visually.

Results

The Sequencing Data is Comprised of Multiple Experiments

Upon running fastQC on the reads used for downstream analysis it is apparent that the dataset was comprised of multiple sequencing experiments (Fig 1)

Fig 1. Sequence Length Distribution from the wild-type Col-0 ecotype reads

While the presence of multiple sequence lengths within the file may pose problems for downstream analysis, it is not clear what effect this had. There were other findings from the quality check, such as the nucleotide skew and kmer content. Nucleotide skew is to be expected since bisulfite sequencing converts unmethylated cytosines to thymines. However, the kmer content of the reads changes quite a bit through the length of the read, which may indicate that the reads were not trimmed and sequencing adapters are still present on some subset of reads within the dataset (Fig 2).

Fig 2. Kmer Content Over the Read Length

Based on these results verification of read length and the potential presence of adapters should be checked as a first step in the analysis pipeline.

Bismark is a Flexible Tool for Analysis of BS-seq Data

Many of the readily available tools for aligning bisulfite sequencing reads ignore non-CG methylation as artifacts. However, the CHG and CHH methylation is equal importance for investigation into RdDM in plants. Luckily Bismark analyzes methylation in all three sequence contexts and can output a .sam file containing these. For downstream analyses this file must be sorted, which can be done by samtools (see Methods). An additional step using the supplied Bismark_Methylation_Extractor utility will also output sequence context specific methylation status summaries on a per read basis. This file can also be sorted and the header removed for later use. The custom developed Bismark_Sorter.sh script leverages the gnu tools sed and sort to accomplish this, outputting sorted and processed context-specific methylation summaries (Fig 3).

Fig 3. Example of a Sorted-Processed Bismark Methylation Extractor Output File

The sorting is likely not necessary for further processing, however, searching the file by regular expressions may prove easier. This would be helpful if there was a particular locus of interest, the reads could be screened for differential methylation directly.

CoGe-ifier for Bismark: A Tool for Visualization of Epigenomic Features Using iPlant Cyberinfrastructure and CoGe

To enable visualization of methylation status CoGe-ifier for Bismark, a python script, was created to further process the methylation summaries into a format that can be viewed by the genome browser in CoGe. This script uses a complex nested dictionary structure to enable the output of percent methylation on a per-cytosine basis. It also reformats the file into .csv. It doesn't depend upon the special tab character so it may be more compatible across systems. The source code for the program can be found in the github link in the methods section. Figure 4 illustrates the running of the program on the command line, figure 5 contains an example of the resulting file.

Figure 4. Running CoGe-ifier for Bismark Figure 5. Resulting CoGe-formatted Methylation Summary

Following this, the fully processed methylation summary can be loaded into CoGe from the iPlant data store with the LoadExperiment feature.

Finding Hypermethylated Regions With MethylKit

The necessary commands for finding HMRs with the R package Methylkit, using the sorted .sam files from Bismark and samtools are listed in the methods section.

Establishment of the Analysis Pipeline

Upon completion of the analyses which were completed it was determined that the first step in the BS-seq analysis pipeline should be checking the reads with fastQC. Following the quality check, the reads may need trimming to remove the presence of potential sequencing adapters. After ensuring that the reads are of sufficient quality Bismark may be run on each category of read lengths individually with optimal settings. The resulting .sam file output by Bismark then will be sorted by samtools and MethylKit will identify HMRs. Following identification of HMRs a window containing them can then be used with BLAST determine the putative function of genomic features they may be contained within.

Concurrently, the Bismark_Sorter and CoGe-ifier scripts can be used to process the Bismark Methylation Extractor outputs for upload into the genome browser in CoGe. This will enable visual verification of methylation at known RdDM target loci as well as the ability to quickly compare different sequencing experiments. The overall workflow is represented in figure 6.

Fig 6. Complete Workflow for Analysis of Bisulfite-sequencing Data

Discussion

As ever larger sets of data can now be generated through sequencing and associated computational processes it's becoming vital for biologists to add programming and bioinformatics skills to their repertoire, in addition to their laboratory work. There are many questions amenable to bioinformatic analysis that can be answered in a short time frame as well. However, one problem that often arises is the need to reformat files for analysis by different programs in your pipeline. In situations such as this gnu tools and programming languages that work well with text files are extremely helpful. Through working with a large bisulfite sequencing dataset I was able to develop an analysis pipeline for my future-use.

I also discovered how vital it is to check that experiments downloaded from NCBI's short read archive are indeed from one sequencing run. In quality-checking the data I found that there were multiple read lengths within the file. This is potentially problematic because when running Bismark it is not optional to inform the program of the read length. It is unclear what effect this had on downstream analyses, but if this was to be used in the future it may be necessary to run Bismark with the optimal settings from each category of read lengths.

At the completion of the course I was able to determine the necessary commands to identify hypermethylated regions using the R package methylKit. I was also successful in writing a python program which calculated percent methylation at each cytosine passed to it from a bisulfite-seq read alignment. This methylation summary is output in a format that is readable by the genome browser within CoGe enabling one to check known RdDM target loci in your dataset as a visual confirmation that the analysis pipeline worked as intended. Additionally, this could be used to compare a the methylation status of a few loci in different genetic backgrounds.

Downstream of the currently completed analyses it would be possible to extract the sequences identified as hypermethylated by methylKit and use BLAST to determine their identity. If this is done the question of whether loci which become hypermethylated in RdDM deficient genetic backgrounds have similar function could be assessed. Following the development of the analysis pipeline the next logical step would be to use it with that as the end-goal.

Methods

Availability of Sequencing Data

Bisulfite sequencing data for wild-type columbia-0, nrpd1-3, nrpe1-11, and the nrpd + nrpe double mutant was obtained from NCBI's short read archive, SRA054962 (2). The TAIR10 Arabidopsis thaliana genome assembly was obtained from TAIR (5).

Quality Check of Reads

The quality of the experimental reads was determined with fastQC in the iPlant (6) discovery environment using default parameters.

Bisulfite-Seq Read Alignment

Bismark (3) was used to align bisulfite sequencing reads to the reference genome. Before read alignment the reference genome was prepared with the Bismark_genome_preparation script using default parameters. After the reference genome was prepared the alignment was performed using a maximum mismatch of two nucleotides, read length of 50 nt, and phred33 quality score. All other parameters were default. Following the read alignment Bismark_methylation_extractor was used to determine the methylation status of each cytosine. The parameters for Bismark_methylation_extractor were defaults for single end reads.

Formatting Methylation Status For CoGe-upload

The output of Bismark_methylation_extractor was processed with the Bismark_Sorter bash script, using the gnu tools sed and sort. This script is available on my github repository for the course (https://github.com/groverj3/PLS599).

After sorting was performed, percent methylation was calculated on a per-cytosine basis and output as a CoGe-compliant .csv file using the Bismark-CoGe_Import python script (CoGe-ifier for Bismark Methylation Extractor, v2.1) and. This script is also available on the course github (https://github.com/groverj3/PLS599).

Bash scripting was done in gedit, and the PyCharm IDE was used to write the python script.

Viewing of Methylation Status Using CoGe Genome Browser

CoGe-formatted methylation status files were uploaded to the iPlant data store and loaded into CoGe (7) using the LoadExperiment function. After loading, methylation status can be viewed with CoGe GenomeView.

Finding Hypermethylated Regions

Samtools (8) was used to sort Bismark .sam files for downstream analysis. The command executed was as follows:

samtools sort -o bismark_output-sorted.sam -O sam -T .tmp -@ 3 bismark_output.sam

After sorting with samtools the sorted .sam file is loaded by the methylkit R package (9):

read.bismark(location, sample.id, assembly, save.folder = NULL, save.context = c("CpG"), read.context = "CpG", nolap = FALSE, mincov = 10, minqual = 20, phred64 = FALSE, treatment)

Once loaded by methylkit, run the following command to create a methbase object:

methyl.obj=unite(sample_list, destrand=FALSE)

Determination of hypermethylated regions can be done with the following command:

> myDiff = calculateDiffMeth(methyl.obj, num.cores = 8)
> myDiff25p.hyper = get.methylDiff(myDiff, difference = 25,qvalue = 0.01, type = "hyper")

References

Mosher RA, Schwach F, Studholme D, Baulcombe DC. PolIVb influences RNA-directed DNA methylation independently of its role in siRNA biogenesis. Proc Natl Acad Sci USA. 2008;105(8):3145-50
Wierzbicki AT, Cocklin R, Mayampurath A, et al. Spatial and functional relationships among Pol V-associated loci, Pol IV-dependent siRNAs, and cytosine methylation in the Arabidopsis epigenome. Genes Dev. 2012;26(16):1825-36.
Krueger F, Andrews SR. Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics. 2011;27(11):1571-2.
Matzke MA, Mosher RA. RNA-directed DNA methylation: an epigenetic pathway of increasing complexity. Nat Rev Genet. 2014;15(6):394-408.
Lamesch P, Berardini TZ, Li D, et al. The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Res. 2012;40(Database issue):D1202-10.
Goff SA, Vaughn M, Mckay S, et al. The iPlant Collaborative: Cyberinfrastructure for Plant Biology. Front Plant Sci. 2011;2:34.
Lyons E, Freeling M. How to usefully compare homologous plant genes and chromosomes as DNA sequences. Plant J. 2008;53(4):661-73.
Li H, Handsaker B, Wysoker A, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25(16):2078-9.
Akalin A, Kormaksson M, Li S, et al. methylKit: a comprehensive R package for the analysis of genome-wide DNA methylation profiles. Genome Biol. 2012;13(10):R87.

PLS 599 2014

Science