Detection of DNA base modifications by deep recurrent neural network on Oxford Nanopore sequencing data

DNA base modifications, such as C5-methylcytosine (5mC) and N6-methyldeoxyadenosine (6mA), are important types of epigenetic regulations. Short-read bisulfite sequencing and long-read PacBio sequencing have inherent limitations to detect DNA modifications. Here, using raw electric signals of Oxford Nanopore long-read sequencing data, we design DeepMod, a bidirectional recurrent neural network (RNN) with long short-term memory (LSTM) to detect DNA modifications. We sequence a human genome HX1 and a Chlamydomonas reinhardtii genome using Nanopore sequencing, and then evaluate DeepMod on three types of genomes (Escherichia coli, Chlamydomonas reinhardtii and human genomes). For 5mC detection, DeepMod achieves average precision up to 0.99 for both synthetically introduced and naturally occurring modifications. For 6mA detection, DeepMod achieves ~0.9 average precision on Escherichia coli data, and have improved performance than existing methods on Chlamydomonas reinhardtii data. In conclusion, DeepMod performs well for genome-scale detection of DNA modifications and will facilitate epigenetic analysis on diverse species.

Human Nanopore sequencing dataset of NA12878: The human genome NA12878 has been well studied with different types of sequencing data including bisulfite sequencing data 4 and Nanopore long-read data 3 . Nanopore long-read data were generated by Jain et al. mainly using Nanopore R9.4 with ~30X coverage 3 . There was no PCR amplification before sequencing, and thus this Nanopore data contains native modifications in NA12878.
To evaluate DeepMod on NA12878 Nanopore sequencing data, the ground truth of methylation was obtained from bisulfite sequencing 4 with two replicates. The analysis of bisulfite sequencing provided percentages of methylations for cytosines in several motifs (CpG, CHG and CHH) together with coverage information, and the majority of methylations are for cytosines in CpG sites. The heterogeneity of sequenced cells could not be evaluated here.

Supplementary Methods
In this section, how to extract precise modification labels were introduced for synthetically introduced modifications and modifications detected by bisulfite sequencing. These labels were used for evaluating of DeepMod only, and not required and unknown in real-world applications.
Precise modification labels of events are necessary to build a DeepMod model in training process and to evaluate prediction performance of DeepMod in testing process. Reliable modification labelling of events is critical to build a well-trained model: if many labels are incorrect, the learning process would be incorrect. Nanopore sequencing data with synthetically introduced modifications and native modifications have different modification information, and thus were treated separately. Given a set of FAST5 files of Nanopore sequencing as input of DeepMod, the following procedures are used to extract precise modification labels of events in Nanopore long reads. Please note that the modification labeling here was only for training/testing process of DeepMod.
On synthetically introduced methylation data: For Nanopore sequencing with synthetical treatment, the molecules to be sequenced usually contains completely modified or completely un-modified bases for a nucleotide in a motif of interest. Since the synthetically introduced methylations are usually motifbased (for example, in CG_MSssI, almost all cytosine in CpG sites were methylated, and none of other bases should be modified), each event in negative control Nanopore sequencing data were generated from un-methylated bases. In a synthetically introduced methylation dataset, methylation and unmethylation labeling was a little more complicated, because of error rate of Nanopore sequencing, the slight difference between sequenced reads and their reference genome and other factors. Thus the following procedures were used for methylation and un-methylation labeling for a Nanopore sequencing data with synthetically introduced methylations based on motifs.
First, a sequence of bases from events were extracted from each a long read, and the sequence was aligned with a reference genome (E. coli strand K-12 sub-strand MG1655 5 for E. coli datasets and hg38 for human Nanopore sequencing data). Then according to the alignment, we found out reliable motifbased methylations which were synthetically introduced by methylases. In detail, given a motif, for example, CpG sites, C in a CpG site was labelled to be methylated if (i) there were not more than 2 gaps in a 7-base window centered at that C, or (ii) there were not more than 3 gaps in a 13-base window centered at that C. The maximum number of gaps in a small region was used to eliminate poor alignment with more gaps, and CpG sites in poor alignment regions were not used. Meanwhile, 13-base window is used to avoid the incorrect small-region alignment due to methylations (methylation might results in wrong basecalling in a k-mer such as 5-mer) rather than due to alignment errors, while 7-base window is used to avoid other effect on the alignment. 7 is the minimum odd number larger than 5 and 5-mer is usually used for event in fast5 files, while 13 is almost twice of 7. These numbers is selected based on our understanding and not optimal.
After that, all other bases in a long read would have an un-methylation label, but one adjacent upstream and downstream events of the C in any CpG would be excluded from training and testing data without any methylation or un-methylation label. That is, long reads with synthetically introduced methylations contain methylated events, unmethylated events, and other events which were not used for training/testing. For other types of motifs, the similar procedures could be used to label events in long reads with synthetically introduced methylations.
Usually, a given reference genome contains a roughly known number of a certain motifs. For example, in a reference E. coli genome, there are 693,586 CpG sites among ~4.64Mb nucleotides. The information for other motifs could also be found in 2 .
On real native data with bisulfite sequencing: In Nanopore sequencing data with native DNA molecules, heterogeneous cells were usually sequenced, and then each strand-sensitive genome position was associated with a unknown methylation percentage rather than complete methylation (the methylation percent is 100%) or complete un-methylation (the methylation percent is 0%). Usually, bisulfite sequencing was used as a gold standard of DNA methylation. Thus, for a genome with both Nanopore sequencing data and bisulfite sequencing data, the criteria below was used to obtain high-quality methylation and un-methylation labels for events in long reads: on one hand, if a cytosine at a particular genomic position in a reference genome (hg38) has both >90% of methylations in two replicates of bisulfite sequencing, a cytosine at the corresponding position from long reads was considered to be completely methylated when that cytosine from long reads was just aligned with the cytosine at the particular genomic position in hg38; on other hand, if a nucleotide at a particular genomic position in hg38 has 0% methylations in both replicates of bisulfite sequencing, a cytosine at the corresponding position from long reads was considered to be completely un-methylated when that cytosine from long reads was just aligned with the cytosine at the particular genomic position in hg38. Nucleotides at certain positions with methylation percent between 0% and 90% were not considered in this testing process. The number of completely methylated/un-methylated cytosines from NA12878 (HX1) was given in Supplementary Table 4 (Supplementary Table 5).

Evaluation on Escherichia coli data sequenced by Simpson et. al. 1 :
This Nanopore dataset of E. coli has high-coverage, and thus it was used to see how different hyper-parameters affect DeepMod performance in modification prediction. To overcome certain overfitting issues, two independent validation strategies were used to evaluate DeepMod on this data. One strategy is read-based, where all long reads were divided into two groups: one group, with 90% long reads in CG_MSssI (positive control) and 90% long reads in UMR (negative control), was used to train DeepMod, and the remained 10% long reads (10% long reads in CG_MSssI and 10% long reads in UMR) were used for testing DeepMod. Under this validation strategy, long reads in testing groups and in training groups might be aligned with same reference positions.
Thus, the second validation strategy is region-based independent validation, where all long reads or some bases of long reads mapped to the genomic positions from 1,000,000 to 2,000,000 of E. coli were used for testing, no matter long reads were from CG_MSssI or UMR; the rest of long reads were used for training DeepMod.
To train DeepMod, there is a hyper-parameter of window size, . No prior knowledge can guarantee which is best. To select a better , we ranged from 7, to 11, to 15, to 21, to 31, and then to 51 with a step of 10. The per-call validation performance with different s and the two validation strategies were shown in Error! Reference source not found. together with 57-feature or 7-feature description and the number of times long reads in UMR were used for training. The per-call performance was evaluated by precision, recall, accuracy and F1-score. It can be seen from Error! Reference source not found. that 57-feature description provided similar performance to 7-feature description did under the two validation strategies. Thus, 7-feature description was used by default in DeepMod.
Meanwhile from Error! Reference source not found., when increases, F1-score also increases, although the increase of F1-score become smaller. However, with a larger , more resources (time, CPU and memory) were needed for both training and prediction. In real applications, =21 was selected as default setting. = 21 was used on other Nanopore datasets without further optimization. This value of was larger than 5 or 6, but this did not mean that all adjacent bases of a base or as far as bases of a base would affect signals of an event associated with the base. However, if users have enough resources and time, we recommended a larger with better performance.

Supplementary Tables
Supplementary Table 1. Per-call performance of DeepMod with independent validation on E. coli nanopore data of CG_MSssI (a nanopore data of PCR-amplified and enzyme-treated reads where almost CpG sites were methylated) and of UMR (a nanopore data of PCR-amplified reads where no modification would be available). is the window size. Read-based validation (in italic) means 10% reads were used for testing while 90% reads were used for training DeepMod, no matter reads were from CG_MSssI or UMR, while Region-based validation means reads or bases mapped to the genomic positions from 1,000,000 to 2,000,000 were used for testing and reads/bases mapped to other genomic positions were used to train DeepMod. 57-feature description represents 50 count values of 50 bins and mean, standard deviation and length for an event together with base information, while 7-feature description has only three summarization features together with base information. r1, r2, r3 and r4 indicates that how many times (1, 2, 3 and 4) long reads in UMR were used for DeepMod. The descriptions of different metrics were described in Methods. MCC is Matthews correlation coefficient.

Read-based validation
Region-based validation 57-feature description 7-feature description 57-feature description 7-feature description r1 r2 r3 r4 r1 r2 r3 r4 r1 r2 r3 r4 r1 r2 r3 r4 Supplementary Table 2. Cross-validation per-call performance of DeepMod with independent validation on E. coli nanopore data of CG_MSssI (a nanopore data of PCR-amplified and enzyme-treated reads where almost CpG sites were methylated) and of UMR (a nanopore data of PCR-amplified reads where no modification would be available). Tested regions means the region in the row were used for testing and other genomic positions were used to train DeepMod. r1, r2, r3 and r4 indicates that how many times (1, 2, 3 and 4) long reads in UMR were used for DeepMod. The descriptions of different metrics were described in Methods. MCC is Matthews correlation coefficient. Metrics  r1  r2  r3  r4  Supplementary Table 3. Confusion matrix for DeepMod on E. coli data using a cutoff of methylation percentage 0.1. Pos: methylation in ground truth or methylation prediction; Neg: non-methylation in ground truth or non-methylation prediction. The first column is about the motif followed by an enzyme to methylate bases whose name is capital in motifs.