Bacterial DNA methylation occurs at diverse sequence contexts and plays important functional roles in cellular defense and gene regulation. Existing methods for detecting DNA modification from nanopore sequencing data do not effectively support de novo study of unknown bacterial methylomes. In this work, we observed that a nanopore sequencing signal displays complex heterogeneity across methylation events of the same type. To enable nanopore sequencing for broadly applicable methylation discovery, we generated a training dataset from an assortment of bacterial species and developed a method, named nanodisco (https://github.com/fanglab/nanodisco), that couples the identification and fine mapping of the three forms of methylation into a multi-label classification framework. We applied it to individual bacteria and the mouse gut microbiome for reliable methylation discovery. In addition, we demonstrated the use of DNA methylation for binning metagenomic contigs, associating mobile genetic elements with their host genomes and identifying misassembled metagenomic contigs.
All sequencing data generated for this study are available at the Sequence Read Archive under the BioProjects PRJNA559199 for individual bacteria and PRJNA559386 for the mouse gut microbiome samples. NCBI reference sequences used for the individual bacteria analysis are available under the accession codes CP041693, CP041696, NC_008261.1, CP014225.1, CP023448.1, NC_007796.1, NC_002946.2, CP041695 and CP003732.1 (Supplementary Table 1). Information related to methylation motifs is available from the REBASE database (ref. 16; http://rebase.neb.com). Data from the SMRT sequencing metagenomic study can be found under the BioProject PRJNA404082.
The nanodisco software and a detailed tutorial with supporting data are available at http://github.com/fanglab/nanodisco.
Beaulaurier, J., Schadt, E. E. & Fang, G. Deciphering bacterial epigenomes using modern sequencing technologies. Nat. Rev. Genet. 20, 157–172 (2019).
Flusberg, B. A. et al. Direct detection of DNA methylation during single-molecule, real-time sequencing. Nat. Methods 7, 461–465 (2010).
Blow, M. J. et al. The epigenomic landscape of prokaryotes. PLoS Genet. 12, e1005854 (2016).
Laszlo, A. H. et al. Detection and mapping of 5-methylcytosine and 5-hydroxymethylcytosine with nanopore MspA. Proc. Natl Acad. Sci. USA 110, 18904–18909 (2013).
Schreiber, J. et al. Error rates for nanopore discrimination among cytosine, methylcytosine, and hydroxymethylcytosine along individual DNA strands. Proc. Natl Acad. Sci. USA 110, 18910–18915 (2013).
Simpson, J. T. et al. Detecting DNA cytosine methylation using nanopore sequencing. Nat. Methods 14, 407–410 (2017).
Rand, A. C. et al. Mapping DNA methylation with high-throughput nanopore sequencing. Nat. Methods 14, 411–413 (2017).
McIntyre, A. B. R. et al. Single-molecule sequencing detection of N6-methyladenine in microbial reference materials. Nat. Commun. 10, 579 (2019).
Ni, P. et al. DeepSignal: detecting DNA methylation state from Nanopore sequencing reads using deep-learning. Bioinformatics 35, 4586–4595 (2019).
Liu, Q. et al. Detection of DNA base modifications by deep recurrent neural network on Oxford Nanopore sequencing data. Nat. Commun. 10, 2449 (2019).
Liu, Q., Georgieva, D. C., Egli, D. & Wang, K. NanoMod: a computational tool to detect DNA modifications using Nanopore long-read sequencing data. BMC Genomics 20, 78 (2019).
Stoiber, M. et al. De novo identification of DNA modifications enabled by genome-guided nanopore signal processing. Preprint at bioRxiv https://doi.org/10.1101/094672 (2017).
Amarasinghe, S. L. et al. Opportunities and challenges in long-read sequencing data analysis. Genome Biol. 21, 30 (2020).
Wion, D. & Casadesus, J. N6-methyl-adenine: an epigenetic signal for DNA-protein interactions. Nat. Rev. Microbiol. 4, 183–192 (2006).
Casadesus, J. & Low, D. Epigenetic gene regulation in the bacterial world. Microbiol. Mol. Biol. Rev. 70, 830–856 (2006).
Roberts, R. J., Vincze, T., Posfai, J. & Macelis, D. REBASE–a database for DNA restriction and modification: enzymes, genes and genomes. Nucleic Acids Res. 43, D298–D299 (2015).
Van Der Maaten, L. Accelerating t-SNE using tree-based algorithms. J. Mach. Learn. Res. 15, 3221–3245 (2014).
Bailey, T. L., Johnson, J., Grant, C. E. & Noble, W. S. The MEME Suite. Nucleic Acids Res. 43, W39–W49 (2015).
Saeed, I., Tang, S. L. & Halgamuge, S. K. Unsupervised discovery of microbial population structure within metagenomes using nucleotide base composition. Nucleic Acids Res. 40, e34 (2012).
Iverson, V. et al. Untangling genomes from metagenomes: revealing an uncultured class of marine Euryarchaeota. Science 335, 587–590 (2012).
Laczny, C. C., Pinel, N., Vlassis, N. & Wilmes, P. Alignment-free visualization of metagenomic data by nonlinear dimension reduction. Sci. Rep. 4, 4516 (2014).
Laczny, C. C. et al. VizBin—an application for reference-independent visualization and human-augmented binning of metagenomic data. Microbiome 3, 1 (2015).
Albertsen, M. et al. Genome sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple metagenomes. Nat. Biotechnol. 31, 533–538 (2013).
Sharon, I. et al. Time series community genomics analysis reveals rapid shifts in bacterial species, strains, and phage during infant gut colonization. Genome Res. 23, 111–120 (2013).
Alneberg, J. et al. Binning metagenomic contigs by coverage and composition. Nat. Methods 11, 1144–1146 (2014).
Nielsen, H. B. et al. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nat. Biotechnol. 32, 822–828 (2014).
Marbouty, M. et al. Metagenomic chromosome conformation capture (meta3C) unveils the diversity of chromosome organization in microorganisms. eLife 3, e03318 (2014).
Burton, J. N., Liachko, I., Dunham, M. J. & Shendure, J. Species-level deconvolution of metagenome assemblies with Hi-C-based contact probability maps. G3 4, 1339–1346 (2014).
Marbouty, M., Baudry, L., Cournac, A. & Koszul, R. Scaffolding bacterial genomes and probing host-virus interactions in gut microbiome by proximity ligation (chromosome capture) assay. Sci. Adv. 3, e1602105 (2017).
Beaulaurier, J. et al. Metagenomic binning and association of plasmids with bacterial host genomes using DNA methylation. Nat. Biotechnol. 36, 61–69 (2018).
Fang, G. et al. Genome-wide mapping of methylated adenine residues in pathogenic Escherichia coli using single-molecule real-time sequencing. Nat. Biotechnol. 30, 1232–1239 (2012).
Murray, I. A. et al. The methylomes of six bacteria. Nucleic Acids Res. 40, 11450–11462 (2012).
Schadt, E. E. et al. Modeling kinetic rate variation in third generation DNA sequencing data to detect putative modifications to DNA bases. Genome Res. 23, 129–141 (2013).
Beaulaurier, J. et al. Single molecule-level detection and long read-based phasing of epigenetic variations in bacterial methylomes. Nat. Commun. 6, 7438 (2015).
Song, C. X., Yi, C. & He, C. Mapping recently identified nucleotide variants in the genome and transcriptome. Nat. Biotechnol. 30, 1107–1116 (2012).
Yoshihara, M., Jiang, L., Akatsuka, S., Suyama, M. & Toyokuni, S. Genome-wide profiling of 8-oxoguanine reveals its association with spatial positioning in nucleus. DNA Res. 21, 603–612 (2014).
Li, S. & Mason, C. E. The pivotal regulatory landscape of RNA modifications. Annu. Rev. Genomics Hum. Genet. 15, 127–150 (2014).
Roundtree, I. A., Evans, M. E., Pan, T. & He, C. Dynamic RNA modifications in gene expression regulation. Cell 169, 1187–1200 (2017).
Garalde, D. R. et al. Highly parallel direct RNA sequencing on an array of nanopores. Nat. Methods 15, 201–206 (2018).
Workman, R. E. et al. Nanopore native RNA sequencing of a human poly(A) transcriptome. Nat. Methods 16, 1297–1305 (2019).
Yang, C., Chu, J., Warren, R. L. & Birol, I. NanoSim: nanopore sequence read simulator based on statistical characterization. Gigascience 6, 1–6 (2017).
Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://arxiv.org/abs/1303.3997 (2013).
Morgan, M., Pagès, H., Obenchain, V. & Hayden, N. Rsamtools: binary alignment (BAM), FASTA, variant call (BCF), and tabix file import v.3.12 (Bioconductor, 2016).
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Koren, S. et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27, 722–736 (2017).
Vaser, R., Sovic, I., Nagarajan, N. & Sikic, M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 27, 737–746 (2017).
Kurtz, S. et al. Versatile and open software for comparing large genomes. Genome Biol. 5, R12 (2004).
Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540–546 (2019).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
We thank A. Fomenkov and R. J. Roberts from NEB for their help with the bacterial strain selection and for providing us with DNA samples (B. amyloliquefaciens, B. fusiformis and N. otitidiscaviarum). We also thank R. Gunsalus from the University of California, Los Angeles (M. hungatei), S. Logan from the National Research Council Canada (C. perfringens), L. Jackson from the University of Oklahoma Health Sciences Center (N. gonorrhoeae) and B. Schink, N. Müller and A. Keller from the University of Konstanz, Germany (T. phaeum) for providing us with DNA samples. We thank Y. Kong and M. Ni for providing helpful feedback for early versions of this paper. This work was supported by a seed fund from Icahn Institute for Genomics and Multiscale Biology (G.F.) and by grant nos. R01 GM128955 (G.F.), R35 GM139655 (G.F.) and R56 HG011095 (G.F.) from the National Institutes of Health. G.F. is a Hirschl Research Scholar by Irma T. Hirschl/Monique Weill-Caulier Trust, and a Nash Family Research Scholar. This work was also supported in part through the computational resources and staff expertise provided by the Department of Scientific Computing at the Icahn School of Medicine at Mount Sinai.
A.T. and G.F. are inventors of two US Provisional patent applications (62/860,952 and 62/851,205) that describe the methods in this paper.
Peer reviewer information Nature Methods thanks the anonymous reviewers for their contribution to the peer review of this work. Lei Tang was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
(a) Distributions of current differences are shown for all confident motifs together (n = 46 motifs), along with average absolute differences and associated standard deviations near methylated bases ([−10 bp, +11 bp]). The lower and upper hinges correspond to the 25th and 75th percentiles, while the lower and upper whiskers extend to the minima and maxima, respectively (capped at 1.5 times the interquartile range). (b) Same as a, with distinction between DNA methylation types (n = 28 6mA motifs, n = 7 4mC motifs, n = 11 5mC motifs). (c) Same as a, but for individual methylation motifs.
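The hinge and whisker convention stated in this caption can be illustrated with a minimal sketch. It assumes linear interpolation between observations for the quartiles (Python's "inclusive" method); the function name `boxplot_stats` and the example values are hypothetical and are not taken from the nanodisco code base.

```python
from statistics import quantiles


def boxplot_stats(values, whisker_factor=1.5):
    """Return hinges and capped whiskers for a one-dimensional sample."""
    data = sorted(values)
    # Quartiles by linear interpolation between observed points (assumption).
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker_factor * iqr
    high_fence = q3 + whisker_factor * iqr
    # Whiskers reach the most extreme observations that lie inside the fences.
    lower_whisker = min(v for v in data if v >= low_fence)
    upper_whisker = max(v for v in data if v <= high_fence)
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_whisker": lower_whisker,
        "upper_whisker": upper_whisker,
        "outliers": outliers,
    }


# Hypothetical current differences (pA) with one extreme value.
example = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```

With this convention, the value 100 falls beyond the upper fence, so the upper whisker stops at 9 and 100 is reported separately.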
Extended Data Fig. 2 Systematic examination of three main DNA methylation types with nanopore sequencing.
(a) t-SNE projection of isolated methylation motif occurrences separated per motif. The same dataset as Fig. 2b was used with occurrences colored per motif. Other motifs are colored in gray. (b) Same as a but grouped by methylation type.
(a) Comparison of current differences across methylation occurrences between datasets base called with Albacore 1.1.0, Albacore 2.3.4, and Guppy 3.2.4, illustrated by projection with t-SNE for 46 well-characterized motifs (Supplementary Table 2). Each dot represents one isolated motif occurrence colored by base caller version. 100,000 motif occurrences were randomly selected from each dataset to reduce the scatter plot density and ease the visualization. For each motif occurrence, current differences from 22 positions near methylated bases ([−10 bp, +11 bp]) were used. (b) Performance for de novo methylated site detection between datasets base called with Albacore 1.1.0, Albacore 2.3.4, and Guppy 3.2.4. We evaluated the detection of individual motif occurrences using Precision-Recall curves for H. pylori at 75x coverage. Precision-Recall curves and area under the curves (AUC) were computed as described in the Methods section. Only confident H. pylori motifs were considered for the evaluation. (c) Comparison of current differences across methylation occurrences (same as a) between datasets produced with or without the outlier removal step (Methods). (d) Performance for de novo methylated site detection (similar to b) with datasets produced with or without the outlier removal step. (e) Variation of current differences across methylation occurrences without the outlier removal step, as illustrated by motif signatures from three motifs, AG4mCT (n = 6550 occurrences), GGW5mCC (n = 1875 occurrences), and GCYYG6mAT (n = 954 occurrences). For each motif, current differences near methylated bases ([−6 bp, +7 bp]) from all isolated occurrences are plotted with conservation of relative distances to methylated bases. Distributions of current differences for each relative distance are displayed as a violin plot. The current differences axis is limited to the −8 to 8 pA range. (f) Performance for de novo methylated site detection across current difference datasets generated with different read alignment type filtering: remove alternative alignments (filtered out XA bam flags; named No Alt.), remove supplementary alignments (filtered out 2048 bam flags; named No Supp.), remove chimeric alignments (filtered out SA bam flags; named No Chim.), only conserve unique mapping (filtered out XA and SA bam flags; named Unique), and keep all alignments (named None). (g) Performance for de novo methylated site detection across datasets normalized with linear regression (lm function), robust regression (rlm function) or no additional normalization (annotated as none). (h) Performance for de novo methylated site detection across datasets generated using two-sided Mann-Whitney U-test or Student’s t-test. (i) Performance for de novo methylated site detection across datasets generated using different p-value smoothing window sizes: no smoothing (named None), 3 nt, 5 nt, and 7 nt. (j) Performance for de novo methylated site detection across datasets generated using different functions for combining consecutive p-values: Fisher’s method (named sumlog), logit method (named logitp), sum p method (named sump), and sum z method (named sumz). (k) Performance for de novo methylated site detection across peak datasets generated using different peak detection window sizes: 5 nt, 7 nt, and 9 nt. Plots f, g, h, i, j, and k show Precision-Recall curves and area under the curves (AUC) for various signal processing steps and were computed as described in the Methods section. (l) Comparison of current differences across methylation occurrences (same as a) with E. coli datasets (200x) produced using either the reference genome or the de novo assembly (Methods). (m) Performance for de novo methylated site detection in E. coli datasets (200x) using either the reference genome or the de novo assembly. (n) Performance of methylation motif typing and fine mapping on E. coli datasets (200x) produced using either the reference genome or the de novo assembly (motif occurrences: n = 458 for AACNNNNNNGTGC, n = 18451 for CCWGG, n = 28110 for GATC, n = 463 for GCACNNNNNNGTT). Only results for k-nearest neighbors, neural network, and random forest are displayed.
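Panels i and j refer to smoothing consecutive p-values with a sliding window and combining them with Fisher's method. The following is a minimal sketch of that general idea, not the nanodisco implementation: the function names, the centered window that is truncated at sequence ends, and the floor applied to very small p-values are all assumptions made for illustration. Fisher's statistic is X = −2 Σ ln p_i, which follows a chi-square distribution with 2k degrees of freedom under the null; for even degrees of freedom the tail probability has the closed form used here.

```python
from math import exp, log


def fisher_combine(pvalues, floor=1e-300):
    """Combine independent p-values with Fisher's method (chi-square, 2k df)."""
    k = len(pvalues)
    if k == 0:
        raise ValueError("at least one p-value is required")
    # half_stat is X/2, where X = -2 * sum(ln p_i); floor avoids log(0).
    half_stat = -sum(log(max(p, floor)) for p in pvalues)
    # Closed-form survival function for even df:
    # exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, exp(-half_stat) * total)


def smooth_pvalues(pvalues, window=5):
    """Combine p-values in a centered window at every position (hypothetical)."""
    half = window // 2
    n = len(pvalues)
    return [
        fisher_combine(pvalues[max(0, i - half):min(n, i + half + 1)])
        for i in range(n)
    ]


# Hypothetical per-position p-values along a short genomic stretch.
smoothed = smooth_pvalues([0.8, 0.04, 0.01, 0.03, 0.7, 0.9], window=3)
```

A single p-value is returned unchanged by `fisher_combine`, whereas neighbouring small p-values reinforce each other, which is the behaviour a smoothing step of this kind relies on.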
(a) Approximation of DNA methylation position in three motifs, AG4mCT (n = 6549 occurrences), GGW5mCC (n = 1875 occurrences), and GCYYG6mAT (n = 954 occurrences). Signal strength is computed using a sliding window along the motif signature to choose the best vector positioning to use for classification. (b) Flowchart description of the procedure for classifier training and novel motif dataset annotation. Training the classifier consists of gathering a set of bacteria with characterized methylomes. Confident motifs are selected to ensure the robustness of the final classifier, then all motif occurrences are localized in the genome (from the corresponding reference genome or the de novo assembled and polished genome). Current differences are then computed along the genome. Next, the training dataset is built from the offset vector of current differences labelled with the known methylation type and the offset combination. Finally, the classifier is trained using the chosen model(s). Analyzing a new bacterial sample consists of detecting methylated motifs de novo from processed current differences (see Methods). Then methylated motif occurrences are localized and the motif signatures are computed (that is, the distribution of current differences at relative distances from the methylated bases). Next, those signatures are leveraged to approximate the methylated position for each de novo detected motif (see Methods), which is used to define the classifier inputs (that is, the vector of current differences centered on the approximate methylated position). Finally, the trained classifier is used to predict the methylation type and fine map the DNA methylation for each motif. (c) Boxplot of overall prediction accuracy in LOOCV evaluation (n = 46 motifs) for each classifier. Classifiers are ordered by average accuracy. The lower and upper hinges correspond to the 25th and 75th percentiles, while the lower and upper whiskers extend to the minima and maxima, respectively (capped at 1.5 times the interquartile range). (d) Effect of hyperparameters on classification accuracy. Boxplot of overall prediction accuracy in LOOCV evaluation with classifiers trained on all motifs except the ones from H. pylori (n = 27 motifs). Hyperparameters were either tuned on H. pylori motifs only (“Alt. HP”) or on all motifs (“Main HP”). The lower and upper hinges correspond to the 25th and 75th percentiles, while the lower and upper whiskers extend to the minima and maxima, respectively (capped at 1.5 times the interquartile range). (e) Relationship between LOOCV accuracy and current difference signal similarities. Current difference signal near methylated bases is visualized by projection with t-SNE for the 46 well-characterized motifs, similar to Fig. 2b. Each dot represents one isolated motif occurrence colored by accuracy from LOOCV analysis.
Similar to Fig. 4d, with the full set of prediction results for a subset of methylation motifs for k-nearest neighbors, random forest, and neural network. Fill colors correspond to the percentage of occurrences classified to a specific class, ranging from blue (0%) to red (100%). Greyed-out predictions correspond to out-of-motif positions. Blank columns correspond to within-motif positions without prediction. Prediction percentages of expected classes are displayed in italic and selected predictions based on consensus are displayed in bold.
See Extended Data Fig. 5.
(a) Effect of coverage on de novo methylated site detection. We evaluated the detection of individual motif occurrences using Precision-Recall curves (PR curves) for H. pylori. Studied datasets with coverage ranging from 5x to 200x were generated by random subsampling of native and WGA datasets. Precision-Recall curves were generated as described in the Methods section. We considered only confident H. pylori motifs for evaluation. (b) Same as a, but using ROC curves for representation. Motif occurrences without data due to low coverage (<5x) were not considered. (c) Performance of methylation motif typing and fine mapping (n = 46 motifs) on datasets with genomic coverage subsampled at 10x, 15x, 20x, and 30x. Only results for k-nearest neighbors, neural network, and random forest are displayed. (d) Precision-Recall curves summarizing the detection performance at 75x coverage of individual methylation sites for each motif in H. pylori with adjusted frequency (Methods). (e) Same as d, but using ROC curves for representation. (f) Effect of motif frequency on de novo methylated site detection. For each methylation motif, in silico datasets with a wide range of motif frequencies were created using a random subsampling strategy (subsampling either the motif occurrences or the genomic regions without motifs, see Methods). The natural motif frequencies (that is, the original ratio of motif occurrences over all queried regions) are annotated by a point on each motif curve.
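For readers less familiar with the evaluation used throughout these panels, a Precision-Recall curve and its area can be sketched as follows. This is a generic illustration and not the procedure of the Methods section: the function names are hypothetical, scores are assumed to be distinct (ties are not grouped), and the area is summarized as average precision, which may differ from the integration used in the paper.

```python
def precision_recall_curve(scores, labels):
    """Return (recall, precision) points, ranking sites by decreasing score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(1 for _, is_methylated in ranked if is_methylated)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    tp = fp = 0
    curve = []
    for _, is_methylated in ranked:
        if is_methylated:
            tp += 1
        else:
            fp += 1
        curve.append((tp / n_pos, tp / (tp + fp)))
    return curve


def average_precision(curve):
    """Area under the PR curve as a recall-weighted sum of precisions."""
    area, prev_recall = 0.0, 0.0
    for recall, precision in curve:
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area


# Hypothetical detection scores (for example, -log10 p) and true site labels.
curve = precision_recall_curve([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
auc = average_precision(curve)
```

Unlike ROC curves, the precision axis depends on the ratio of true sites to queried positions, which is why motif frequency (panel f) affects these curves.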
Extended Data Fig. 8 Schematic representation of methylation feature vectors computation and methylation binning of contigs.
The computation of methylation features and the building of the methylation profile matrix are described in the Methods.
(a) Methylation binning using automated methylation feature selection (without precise methylation motif discovery; Methods). Methylation features are projected on two dimensions using t-SNE. Contigs are colored per bin defined using DBSCAN, with point sizes matching contig length according to the legend. Two bins with the same methylation motifs were manually merged into Bin 4. (b) Methylation binning using de novo discovered motifs on each bin found in a (Methods). Methylation features computed from de novo discovered motifs are projected on two dimensions using t-SNE. Contigs are colored per bin defined using DBSCAN except Bin 11, which was manually defined. (c) Methylation binning using de novo discovered motifs on each bin found in b. Contigs are colored per bin defined using DBSCAN except for Bin 13, which was manually defined. (d) Methylation binning of MGM1 metagenome contigs using de novo discovered motifs (after three rounds of motif discovery; same as Fig. 5a).
(a) Methylation binning using automated methylation feature selection (without precise methylation motif discovery; Methods). Methylation features are projected on two dimensions using t-SNE. Contigs are colored per defined bin, with point sizes matching contig length according to the legend. Bins 1, 3, 4, and 5 were defined using DBSCAN. The other bins are composed of one or two contigs and were manually defined after de novo methylation motif discovery. (b) Methylation binning using de novo discovered motifs on each bin found in a (Methods). Methylation features computed from de novo discovered motifs are projected on two dimensions using t-SNE. Contigs are colored per bin as described in a.
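The bin definition step named in these captions, density-based clustering of contigs in a two-dimensional projection, can be sketched with a small pure-Python DBSCAN. This is an illustration only: the coordinates below are hypothetical stand-ins for a t-SNE projection of methylation features, the `eps` and `min_samples` values are arbitrary, and the code is not the pipeline used in the study.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_samples):
    """Label each point with a cluster index, or NOISE if it is isolated."""
    labels = [None] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # core point: keep growing the bin
        cluster += 1
    return labels


# Hypothetical 2D coordinates for seven contigs: two groups and one outlier.
contigs = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
bins = dbscan(contigs, eps=1.5, min_samples=3)
```

Contigs labelled as noise by such a procedure correspond to the cases that, in the captions above, were assigned to bins manually.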
Cite this article
Tourancheau, A., Mead, E.A., Zhang, XS. et al. Discovering multiple types of DNA methylation from bacteria and microbiome using nanopore sequencing. Nat Methods (2021). https://doi.org/10.1038/s41592-021-01109-3