msgbsR: An R package for analysing methylation-sensitive restriction enzyme sequencing data

Genotyping-by-sequencing (GBS) or restriction-site associated DNA marker sequencing (RAD-seq) is a practical and cost-effective method for analysing large genomes from high diversity species. This method of sequencing, coupled with methylation-sensitive enzymes (often referred to as methylation-sensitive restriction enzyme sequencing or MRE-seq), is an effective tool to study DNA methylation in parts of the genome that are inaccessible in other sequencing techniques or are not annotated in microarray technologies. Current software tools do not fulfil all methylation-sensitive restriction sequencing assays for determining differences in DNA methylation between samples. To fill this computational need, we present msgbsR, an R package that contains tools for the analysis of methylation-sensitive restriction enzyme sequencing experiments. msgbsR can be used to identify and quantify read counts at methylated sites directly from alignment files (BAM files) and enables verification of restriction enzyme cut sites with the correct recognition sequence of the individual enzyme. In addition, msgbsR assesses DNA methylation based on read coverage, similar to RNA sequencing experiments, rather than methylation proportion and is a useful tool in analysing differential methylation on large populations. The package is fully documented and available freely online as a Bioconductor package (https://bioconductor.org/packages/release/bioc/html/msgbsR.html).


Supplementary Data 1:
A bash script containing how the publicly available data used in this paper was downloaded from SRA. After the bam files were generated the pipeline in Supplementary Data 2 was used on each data set.

Introduction
Current data analysis tools do not ful l all experimental designs. For example, GBS experiments using methylation sensitive restriction enzymes (REs), which is also known as methylation sensitive genotyping by sequencing (MS-GBS), is an e ective method to identify di erentially methylated sites that may not be accessible in other technologies such as microarrays and methyl capture sequencing. However, current data analysis tools do not satisfy the requirements for these types of experimental designs.
Here we present msgbsR, an R package for data analysis of MS-GBS experiments. Read counts and cut sites from a MS-GBS experiment can be read directly into the R environment from a sorted and indexed BAM le(s).
2 Reading data into R The analysis with the msgbsR pipeline begins with a directory which contains sorted and indexed BAM le(s). msgbsR contains an example data set containing 6 samples from a MS-GBS experiment using the restriction enzyme MspI. In this example the 6 samples are from the prostate of a rat and have been truncated for chromosome 20. 3 of the samples were fed a control diet and the other 3 were fed an experimental high fat diet.
To read in the data directly into the R environment can be done using the rawCounts() function, which requires the directory path to where the sorted and indexed les are located and the desired number of threads to be run (Default = 1).

Con rmation of correct cut sites
After the data has been generated into the R environment, the next step is to con rm that the cut sites were the correctly generated sites. In this example, the methylated sensitive restriction enzyme that has been used is MspI which recognizes a 4bp sequence (C/CGG). MspI cuts between the two cytosines when the outside cytosine is methylated.
The rst step is to extract the location of the cut sites from se and adjust the cut sites such that the region will cover the recognition sequence of MspI. It is important to note that in this example the user must adjust the region over the cut sites speci cally for each strand. In other words although the enzyme cuts at C/CGG on the minus strand this would appear as CCG/G. The code below shows how to adjust the postioining of the cut sites to cover the recginition site on each strand. > cutSites <-rowRanges(se) > # # Adjust the cut sites to overlap recognition site on each strand > start(cutSites) <-ifelse(test = strand(cutSites) == '+', + yes = start(cutSites) -1, no = start(cutSites) -2) > end(cutSites) <-ifelse(test = strand(cutSites) == '+', + yes = end(cutSites) + 2, no = end(cutSites) + 1) The object cutSites is a GRanges object that contains the start and end position of the MspI sequence length around the cut sites. These cut sites can now be checked if the sequence matches the MspI sequence.
msgbsR o er two approaches to checking the cut sites. The rst approach is to use a BSgenome which can be obtained from Bioconductor. In this example, BSgenome.Rnorvegicus.UCSC.rn6 will be used.
> library(BSgenome.Rnorvegicus.UCSC.rn6) > correctCuts <-checkCuts(cutSites = cutSites, genome = "rn6", seq = "CCGG") If a BSgenome is unavailable for a species of interest, another option to checking the cut sites is to use a fasta le. msgbsR comes with the fasta le for chromosome 20 from UCSC rn6. To use the checkCuts function with a fasta le simply change the genome input to the fasta le location and change the fasta option to TRUE. An example of this is shown below. > chr20 <-system.file("extdata", "chr20.fa.gz", package = "msgbsR") > correctCuts <-checkCuts(cutSites = cutSites, genome = chr20, fasta = TRUE, seq = "CCGG") [1] "Uncompressing fasta file" [1] "Compressing fasta file" > The correctCuts data object is in the format of a GRanges object and contains the correct sites that contained the recognition sequence. These sites can be kept within se by using the subsetByOverlaps function.
The incorrect MspI cut sites can be ltered out of datCounts: 3 > se <-subsetByOverlaps(se, correctCuts) > dim(assay(se)) [1] 13983 6 se now contains the correct cut sites and can now be used in downstream analyses.

Visualization of read counts
Before any further downstream analyses with the data, the user may want to lter out samples that did not generate a su cient number of read counts or cut sites. The msgbsR package contains a function which plots the total number of read counts against the total number of cut sites produced per sample. The user can also use the function to visulaise if di erent categories or groups produced varying amount of cut sites or total amount of reads.
To visualize the total number of read counts against the total number of cut sites produced per sample: > plotCounts(se = se, cateogory = "Group") This function generates a plot (Figure 1) where the x axis and y axis repre-sents the total number of reads and the total number of cut sites produced for each sample respectively. 5 Di erential methylation analysis msgbsR utilizes edgeR in order to determine which cut sites are di erentially methylated between groups. Since MS-GBS experiments can have multi-ple groups or conditions msgbsR o ers a wrapper function of edgeR (Zhou et al., 2014) tools to automate di erential methylation analyses.
To determine which cut sites are di erentiallly methylated between groups: > top <-diffMeth(se = se, cateogory = "Group", + condition1 = "Control", condition2 = "Experimental", + cpmThreshold = 1, thresholdSamples = 1) The top object now contains a data frame of the cut sites that had a CPM > 1 in at least 1 sample and which cut sites are di erentially methylated between the two groups.

Visualization of cut site locations
The msgbsR package contains a function to allow visualization of the location of the cut sites. Given the lengths of the chromosomes the cut sites can be visualized in a circos plot (Figure 2).
Firstly, de ne the length of the chromosome.
> ratChr <-seqlengths(BSgenome.Rnorvegicus.UCSC.rn6)["chr20"] Extract the di erentially methylated cut sites by selecting sites that had a false discovery rate (FDR) of less than 0.05. Below the code will extract the sites and return them in the form of GRanges object which can then be used to visualize the sites using functions below.