Introduction

DNA methylation is a major class of epigenetic modification that is found in diverse prokaryotes, in addition to eukaryotes1. For example, prokaryotic DNA methylation by sequence-specific restriction-modification (RM) systems that protect host cells from invasion by phages or extracellular DNA has been well characterized and is utilized as a key tool in biotechnology2,3,4. In addition, recent studies have revealed that prokaryotic DNA methylation plays additional roles, performing various biological functions, including regulation of gene expression, mismatch DNA repair, and cell cycle functions5,6,7,8,9. Research interest in the diversity of prokaryotic methylation systems is therefore growing due to their importance in microbial physiology, genetics, evolution, and disease pathogenicity7,10. However, our knowledge of the diversity of prokaryotic methylation systems has been severely limited thus far because most studies focus only on the rare prokaryotes that are cultivable in laboratories.

The recent development of single-molecule real-time (SMRT) sequencing technology provides us with another tool for observing DNA methylation. An array of DNA methylomes of cultivable prokaryotic strains, including N6-methyladenine (m6A), 5-methylcytosine (m5C), and N4-methylcytosine (m4C) modifications, have been revealed by this technology11,12,13,14. Despite its high rates of base-calling and modification detection errors per raw read15,16, SMRT sequencing technology can produce ultralong reads of up to 60 kbp with few context-specific biases (e.g., GC bias)17. This characteristic enables SMRT sequencing to achieve high accuracy by merging data from many erroneous raw reads originating from clonal DNA molecules, typically from cultivated prokaryotic populations18. Alternatively, in an approach referred to as circular consensus sequencing (CCS), a circular DNA library is prepared as a sequence template to allow the generation of a single ultralong raw read containing multiple sequences (‘subreads’) that correspond to the same stretch on the template19,20; therefore, a cultivated clonal population is not required21. However, CCS has thus far been applied in only a few shotgun metagenomics studies22 and, to the best of our knowledge, has not yet been applied to ‘metaepigenomics’ or direct methylome analysis of environmental microbial communities, which are usually constituted by uncultured prokaryotes.

Here, we applied CCS to shotgun metagenomic and metaepigenomic analyses of freshwater microbial communities in Lake Biwa, the largest lake in Japan, to reveal the genomic and epigenomic characteristics of the environmental microbial communities using the PacBio Sequel platform (Supplementary Fig. 1a). Freshwater lakes are of economic and social importance, and microbes constitute the bases of their ecosystems23. In addition, freshwater habitats are rich in phage–prokaryote interactions24,25,26,27, which can affect prokaryotic DNA methylation. We report that our CCS analyses of the environmental microbial samples allowed reconstruction of draft genomes and the identification of their methylated motifs, at least nine of which were novel. Furthermore, we computationally predicted and experimentally confirmed four methyltransferases (MTases) responsible for the detected methylated motifs. Importantly, two of the four MTases were revealed to recognize novel motif sequences.

Results and Discussion

Water sampling, SMRT sequencing, and circular consensus analysis

Water samples were collected at a pelagic site in Lake Biwa, Japan, at 5 m (biwa_5m) and 65 m depths (biwa_65m), from which PacBio Sequel produced a total of 2.6 million (9.6 Gbp) and 2.0 million (6.4 Gbp) subreads, respectively (Table 1). The circular consensus analysis produced 168,599 and 117,802 CCS reads, with lengths of 4474 ± 931 and 4394 ± 587 bp, respectively (Table 1 and Supplementary Fig. 2). In the shallow sample data, at least 90% of the CCS reads showed high quality (Phred quality scores > 20) at each base position, except for the 5′-terminal five bases and 3′-terminal bases after the 5638th base. In the deep sample data, the same was true, except for the 5′-terminal four bases and 3′-terminal bases after the 5356th base (Supplementary Fig. 3).

Table 1 Statistics of SMRT sequencing and CCS-read analysis

Taxonomic analysis

Taxonomic assignment of the CCS reads was performed using Kaiju28 and the National Center for Biotechnology Information non-redundant (NCBI nr) database29 (Fig. 1). The assignment ratios were >88% and >56% at the phylum and genus levels, respectively, which were higher than those for the Illumina-based shotgun metagenomic analysis of lake freshwater and other environments using the same computational method28. Kraken30 with complete prokaryotic and viral genomes in RefSeq31 (Supplementary Fig. 4a–c) provided similar results but resulted in much lower assignment ratios (30% and 27%, respectively), likely due to the lack of genomic data for freshwater microbes in RefSeq. The 16S ribosomal RNA (rRNA) sequence-based taxonomic assignment via blastn searches against the SILVA database32 also provided consistent results (Supplementary Fig. 4d–f). It should be noted that 16S rRNA-based and CDS-based taxonomic assignments can be affected by 16S rRNA gene copy numbers and genome sizes, respectively.

Fig. 1

Phylogenetic distribution of CCS reads. Estimated relative abundances at the a domain, b phylum, and c class levels are shown. Eukaryotic and viral reads are ignored, and groups with <1% abundance are grouped as ‘Other’ in b, c

At the phylum level, Proteobacteria dominated both samples, followed by Actinobacteria, Verrucomicrobia, and Bacteroidetes (Fig. 1). Chloroflexi and Thaumarchaeota were especially abundant in the deep water sample, consistent with previous findings33,34. The ratio of Archaea was particularly low in the shallow sample (0.6 and 6.9% in biwa_5m and biwa_65m, respectively). Although the filter pore-size range (5–0.2 μm) was not suitable for most viruses and eukaryotic cells, non-negligible proportions of reads assigned to them were observed in the shallow sample. The dominant eukaryotic phylum was Opisthokonta (2.68 and 0.92%), followed by Alveolata (1.67 and 0.45%) and Stramenopiles (1.45 and 0.15%). Among viruses, Caudovirales and Phycodnaviridae were the most abundant families in both samples. Caudovirales are known to act as bacteriophages, while Phycodnaviridae primarily infect eukaryotic algae. The third most abundant viral family was Mimiviridae, whose members are also known as ‘Megavirales’ due to their large genome size (0.6–1.3 Mbp)35,36. Viruses without double-stranded DNA (i.e., single-stranded DNA and RNA viruses) were not observed because of the experimental method employed. Overall, the taxonomic composition was consistent with those obtained in previous studies on microbial communities in freshwater lake environments, reflecting the fact that SMRT sequencing provides taxonomic compositions consistent with those obtained using short-read technologies, such as the Illumina MiSeq and HiSeq platforms37,38.

Metagenomic assembly and genome binning

The CCS reads from the shallow and deep samples were assembled into 599 and 429 contigs, respectively, using Canu18. After removing 45 (7.5%) and 84 (19.6%) repetitive contigs, we retrieved 554 and 345 contigs, respectively (Supplementary Table 3). The corresponding N50 values were 83 and 76 kbp, and the longest contigs had lengths of 481 and 740 kbp, respectively. Notably, the contigs were much longer than those obtained in a previous study that applied CCS for shotgun metagenomics analysis of an activated sludge microbial community22. We also used Mira39 for metagenomic assembly, but this resulted in shorter maximum contig lengths (148 and 151 kbp, respectively) and lower N50 values (19 and 18 kbp, respectively).
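For readers less familiar with the N50 statistic quoted above, the following minimal sketch shows one common way to compute it from a list of contig lengths. The function name `n50` and the toy lengths are illustrative assumptions; they are not taken from the assembly pipeline or the data of this study.

```python
def n50(lengths):
    """Return the N50 of a set of contigs.

    N50 is the length L such that contigs of length >= L together
    account for at least half of the total assembly size.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig downwards until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example with made-up lengths (bp); not data from this study.
toy_contigs = [100, 200, 300, 400, 500]
print(n50(toy_contigs))
```

Because N50 depends only on the length distribution, it is a measure of contiguity and says nothing about assembly correctness.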

The contigs were binned to genomes using MetaBAT40, which is a reference-independent binning tool, based on CCS-read coverage and tetranucleotide frequency (Fig. 2 and Table 2). Among a total of 554 and 345 contigs, 290 (52.3%) and 100 (29.0%) were assigned to 15 and 4 bins from the shallow and deep samples, respectively. In total, 46.9 and 44.8% of the CCS reads could be mapped to the draft genomes for the shallow and deep samples, respectively. We obtained a draft genome for each bin, where the completeness of the genome ranged from 17 to 99% (67% on average). Estimated contamination levels were low (<3% in each draft genome). Based on the total contig size and estimated genome completeness of each draft genome, the genome sizes were estimated to range from 1.0 to 5.6 Mbp. The GC content ranged from 29 to 68%, and the N50 was 24 kbp on average, with a maximum of 1.67 Mbp.
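The genome-size estimate described above can be illustrated with a short calculation. This is a minimal sketch that assumes the simplest possible relationship (total contig size divided by fractional completeness); the function name and the toy numbers are hypothetical, and the exact formula used in this study is not stated beyond the two inputs.

```python
def estimate_genome_size(total_contig_bp, completeness_pct):
    """Extrapolate a full genome size from a partial draft genome.

    Assumption: the recovered contigs represent `completeness_pct` percent
    of the genome, so the full size is total / (completeness / 100).
    """
    if not 0 < completeness_pct <= 100:
        raise ValueError("completeness must be in (0, 100]")
    return total_contig_bp / (completeness_pct / 100.0)


# Toy example: a 2.0 Mbp bin estimated to be 50% complete.
print(estimate_genome_size(2_000_000, 50))
```

Estimates for bins with low completeness are correspondingly uncertain, because small errors in the completeness estimate are amplified by the division.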

Fig. 2

Genome binning of the assembled contigs. Each circle represents a contig, where the color and size represent its assigned bin and total sequence length, respectively. Contigs not assigned to any bin are indicated in gray (named ‘NA’). The x-axis and y-axis represent GC% and genome coverage, respectively

Table 2 Statistics for draft genomes

The 19 draft genomes belonged to 7 phyla (Table 2 and Supplementary Fig. 5). Among these draft genomes, 10 contained 16S rRNA genes, and many of them showed top hits to uncultured clades; thus, our CCS-based approach appears to have captured multiple uncultured prokaryotes. Seven draft genomes were predicted to belong to the phylum Actinobacteria, including Candidatus Planktophila (BS7), one of the most dominant bacterioplankton lineages in freshwater systems23,41. The draft genomes affiliated with other dominant freshwater lineages were also recovered, including Candidatus Methylopumilus (BS12)42, the freshwater lineage (LD12) of Pelagibacterales (BS14)43,44, and Nitrospirae (BD2) and Candidatus Nitrosoarchaeum (BD3), the predominant nitrifying bacteria and archaea in the hypolimnion33,34. Four draft genomes were affiliated with the phylum Verrucomicrobia (BS6, BS8, BS10, and BD4), in line with a previous study45. The BS3 and BD1 draft genomes likely represent members of the CL500-11 group (class Anaerolineae) of the Chloroflexi phylum, where BD1 presented the highest coverage of >45×. This group is a dominant group in the hypolimnion of Lake Biwa and is frequently found in deep oligotrophic freshwater environments worldwide46. Although Proteobacteria was the dominant phylum, only two and zero Proteobacteria draft genomes were retrieved from the shallow and deep samples, respectively. Regarding the shallow sample, approximately one-fourth of the Proteobacteria CCS reads could be mapped to the two draft genomes, which means three-fourths of them likely originated from minor and diverse Proteobacteria clades. Overall, the phylogeny of the reconstructed genomes likely reflects the major lineages that are yet to be cultured but are dominantly present in the water of Lake Biwa.

Metaepigenomic analysis

A total of 29 candidate methylated motifs were detected in 10 draft genomes (Table 3). Their methylation ratios ranged from 19 to 99%, which can be affected by modification detection power, i.e., these ratios are likely lower than the true methylation levels. The mapped subread coverages of the methylated motifs ranged from 28.7 to 297.3×. Three motifs from the Proteobacteria BS12 genome contained similar sequences (HCAGCTKC, BGMAGCTGD, and GMAGCTKC; B: C/G/T, D: A/G/T, H: A/C/T, K: G/T, and M: A/C; the underlined bold face indicates methylation sites), likely due to incomplete detection of a single methylated motif or heterogeneous motif sequences between closely related lineages contained within that genome. A palindromic motif and five complementary motif pairs that likely reflect double-strand methylation were observed in the Bacteroidetes BS15 genome (e.g., a pair of AGCNNNNNNCAT and ATGNNNNNNGCT). Notably, three draft genomes from the Chloroflexi phylum (BS1, BS3, and BD1) shared the same motif sequence set (GANTC, TTAA, and GCWGC, where W: A/T), likely due to evolutionarily shared methylation systems. Contigs in each draft genome showed a similar methylation pattern in general, providing additional epigenomic support of the quality of the genome binning (Supplementary Fig. 6).
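The notions of palindromic and complementary motifs used above can be checked mechanically with the standard IUPAC nucleotide codes. The sketch below is illustrative only: the helper names are hypothetical, and it is not the motif-detection software used in this study.

```python
# Complement of each IUPAC nucleotide code (standard ambiguity codes).
IUPAC_COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "R": "Y", "Y": "R", "S": "S", "W": "W",
    "K": "M", "M": "K", "B": "V", "V": "B",
    "D": "H", "H": "D", "N": "N",
}


def reverse_complement(motif):
    """Reverse complement of a motif written with IUPAC codes."""
    return "".join(IUPAC_COMPLEMENT[base] for base in reversed(motif.upper()))


def is_palindromic(motif):
    """A motif is palindromic if it equals its own reverse complement."""
    return reverse_complement(motif) == motif.upper()


# The complementary pair quoted in the text for Bacteroidetes BS15.
print(reverse_complement("AGCNNNNNNCAT") == "ATGNNNNNNGCT")

# The three motifs shared by the Chloroflexi draft genomes.
for motif in ("GANTC", "TTAA", "GCWGC"):
    print(motif, is_palindromic(motif))
```

A motif and its reverse complement describe the same double-stranded site, which is why complementary pairs are interpreted as methylation on both strands.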

Table 3 Detected methylated motifs

Overall, even if such similar, complementary, and shared motif sequences are considered, at least 9 motifs among the identified 22 motifs still presented no match to existing recognition sequences in the REBASE repository. This result demonstrates the existence of unexplored diversity of DNA methylation systems in environmental prokaryotes, which include many uncultured strains.

Known MTases that correspond to detected methylated motifs

To identify MTases that can catalyze the methylation reactions of the detected methylated motifs, systematic annotation of MTase genes was performed. Sequence similarity searches against known genes identified 20 MTase genes in nine draft genomes (sequence identities ranged from 23 to 71%) (Table 4). The most abundant group was Type II MTases, followed by Type I and Type III MTases, a trend that is consistent with the general MTase distribution13,47. Several genes encoding REases and DNA sequence-recognition proteins were also detected, and 9 of the 20 MTases (45%) were estimated to constitute RM systems (Table 4). The known motifs of 7 of the 20 MTases were matched to those identified in our metaepigenomic analysis (Table 3). For example, the Thaumarchaeota BD3 genome contained two MTases that showed the best sequence similarities to those that recognize AGCT and GATC motif sequences, which were perfectly congruent with the two motifs detected in our metaepigenomic analysis. Notably, these two motifs were also reported in an enrichment-culture study of the closely related genus Candidatus Nitrosomarinus catalina48 and are therefore likely evolutionarily conserved within their group. In the Proteobacteria BS14 genome, a similar one-to-one perfect match was also observed. The two genomes Chloroflexi BS3 and Chloroflexi BD1 were characterized by the same set of three methylated motifs, and each genome contained three MTases. No MTase gene was found in the other Chloroflexi genome BS1, likely due to its low estimated genome completeness of 31% (Table 2). Among these MTases, two were most similar to those possessing methylation specificities that were congruent with two of the detected motifs, GANTC and TTAA (the other MTase and motif will be discussed in the next section). Collectively, these observations suggest that metaepigenomic analysis is an effective tool for identifying the methylation systems of environmental prokaryotes.

Table 4 Detected MTases, REases, and specificity subunit genes

Unexplored diversity of prokaryotic methylation systems

Among the 20 detected MTases, 13 MTases did not show sequence similarities to MTases that recognize the motifs identified in our metaepigenomic analysis (Tables 3 and 4). Although homology search-based MTase identification and recognition motif estimation are frequently conducted in genomic and metagenomic studies, this result suggests that these approaches are not sufficient, and direct observation of DNA methylation is needed to reveal the methylation systems of diverse environmental prokaryotes.

As noted earlier, the Chloroflexi BS3 and Chloroflexi BD1 genomes each had three MTase genes, two of which were congruent with two of the detected motifs. The other MTase from each genome (EMGBS3_12600 and EMGBD1_09320 in Chloroflexi BS3 and Chloroflexi BD1, respectively) showed the highest sequence similarity to an MTase that was reported to recognize ACGGC; however, the other methylated motif detected in the Chloroflexi BS3 and Chloroflexi BD1 genomes was GCWGC.

In the Bacteroidetes BS15 genome, 6 MTases and 11 methylated motifs were detected, but none of the MTases and motifs matched each other. At the methylation type level, five MTases and all of the methylated motifs were of the m6A type. We predicted that EMGBS15_03820, whose closest homolog was an MTase that exhibits nonspecific m6A methylation activity, is actually a sequence-specific enzyme that recognizes the GAANNNNTTC motif detected through metaepigenomic analysis, because the adjacent gene EMGBS15_03830 encodes an REase that targets the same GAANNNNTTC sequence.

In the Verrucomicrobia BS8 genome, one MTase and one methylated motif were detected; however, the reported recognition motif sequence of the closest MTase was incongruent with the detected motif (the reported and detected motifs were ACGANNNNNNGRTC and AGGNNNNNRTTT, respectively, where R: A/G). This MTase is predicted to function in an RM system because of the existence of the neighboring REase and DNA sequence-recognition protein genes.

In the Verrucomicrobia BS10 genome, one MTase and one methylated motif were detected, and their motifs were also incongruent (GCAAGG and ACGAG, respectively).

In the Nitrospirae BD2 genome, two MTases and one methylated motif were detected. The two MTases EMGBD2_08760 and EMGBD2_08790 showed the best sequence similarities to those with m5C and m6A methylation activities, respectively, while the detected motif contained an m6A site. Thus, the latter MTase (EMGBD2_08790) was predicted to catalyze the methylation reaction, although the reported motif of its closest homolog and the detected motif were again incongruent (GRGGAAG and TANGGAB, respectively). It should also be noted that these MTases appear to constitute a recently proposed system known as the Defense Island System Associated with Restriction-Modification (DISARM), which is a phage-infection defense system composed of MTase, helicase, phospholipase D, and DUF1998 genes49. To our knowledge, this is the first DISARM system identified in the phylum Nitrospirae.

In the Verrucomicrobia BS6 genome, one MTase gene was found, but we could not detect any methylated motif. We therefore suspect that either this MTase does not exhibit methylation activity or the corresponding methylated motif went undetected due to the low sensitivity of SMRT sequencing to m5C modification, as described previously13,14. Conversely, in the Proteobacteria BS12 genome, we detected methylated motifs but no MTase genes. We assume that the MTase genes of this genome were missed due to insufficient genome completeness (although the estimated completeness was 81%), because these MTase genes have diverged considerably from MTase genes found in cultivable strains, or because these MTases belong to a new group.

Experimental verification of MTases with new methylated motifs

Among the MTases whose sequences showed the best similarities to MTases that recognize motifs incongruent with our metaepigenomic results, we experimentally verified the methylation specificities of four MTases: EMGBS3_12600 in Chloroflexi BS3 (and EMGBD1_09320 in Chloroflexi BD1, which has exactly the same amino-acid sequence), EMGBS15_03820 in Bacteroidetes BS15, EMGBS10_10070 in Verrucomicrobia BS10, and EMGBD2_08790 in Nitrospirae BD2 (Table 4). We constructed plasmids that each carried one of the artificially synthesized MTase genes, transformed them into Escherichia coli cells, induced their expression, and examined the methylation status of the isolated plasmid DNA by REase digestion.

Although the EMGBS3_12600 showed the best sequence similarity to a sequence-diverged MTase that possesses the ACGGC specificity, the unaccounted-for motif sequence observed in Chloroflexi BS3 was GCWGC. Thus, we hypothesized that the true recognition sequence of EMGBS3_12600 is GCWGC. The REase digestion assay showed that TseI (GCWGC specificity) did not cleave the plasmids when EMGBS3_12600 was expressed in the cells, which clearly supports our hypothesis (Fig. 3a). Furthermore, we confirmed that BceAI (ACGGC specificity) cleaved plasmids regardless of whether EMGBS3_12600 was expressed, indicating that the EMGBS3_12600 protein does not show ACGGC sequence specificity (Fig. 3a). Accordingly, we named this protein M.AbaBS3I, as a novel MTase that possesses GCWGC specificity (Table 4).

Fig. 3

REase digestion assays. a Assay of the EMGBS3_12600 gene (and EMGBD1_09320, which has the same amino-acid sequence). BceAI and TseI were used, where the plasmid contained 12 (ACGGC) and 21 (GCWGC) target sites, respectively. Plasmid DNAs were linearized using SalI before the assay. An NEB 2-log DNA ladder was employed as a size marker. b Assay of the EMGBS15_03820 gene. DpnII and XmnI were used, where the plasmid contained 27 (GATC) and 2 (GAANNNNTTC) target sites, respectively

While the homology-based analysis showed that the closest homolog of EMGBS15_03820 was a non-sequence-specific MTase, its adjacency to an REase and the results of the metaepigenomic analysis suggested that this MTase presents GAANNNNTTC sequence specificity. The REase digestion assay showed that XmnI (GAANNNNTTC specificity) did not cleave the plasmids only when EMGBS15_03820 was expressed in the cells, which also supports our hypothesis (Fig. 3b). Furthermore, we confirmed that DpnII (GATC specificity) cleaved the plasmids regardless of whether EMGBS15_03820 was expressed, indicating that EMGBS15_03820 is not a nonspecific MTase. We named this protein M.FspBS15I, as a novel MTase that possesses GAANNNNTTC methylation specificity (Table 4).

For EMGBS10_10070 in Verrucomicrobia BS10 and EMGBD2_08790 in Nitrospirae BD2, we also conducted REase digestion assays to confirm the recognition motif sequences. Based on the results of the metaepigenomic analysis, their motifs were predicted to be ACGAG and TANGGAB, respectively. Expression of each gene altered the electrophoresis patterns of the digested plasmids to contain fragments that resulted from inhibition of REase cleavage at the estimated methylation sites (Supplementary Fig. 7). We additionally conducted SMRT sequencing analysis using the PacBio RSII platform to examine the methylation status of the chromosomal DNA of the E. coli transformed with each of the two MTase genes. The results were largely consistent (Supplementary Table 4): ACGAG was detected as the methylated motif in E. coli transformed with EMGBS10_10070, and we named the protein M.ObaBS10I. In the case of EMGBD2_08790, the detected TAHGGAB motif was a subset of the estimated TANGGAB motif (i.e., TAGGGAB was excluded), and this difference could be due to E. coli-specific conditions (e.g., cofactors and sequence biases), insufficient data, or inaccuracy of the methylated motif detection method. Regardless of this minor difference, we concluded that EMGBD2_08790 is a novel MTase gene responsible for methylation of the TAHGGAB motif, and we named the protein M.NbaBD2I accordingly.
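The relationship between the TAHGGAB and TANGGAB motifs can be made explicit by expanding each degenerate motif into the concrete sequences it covers. The following is a minimal sketch using the standard IUPAC codes; the `expand` helper is a hypothetical name introduced here for illustration.

```python
from itertools import product

# Bases represented by each IUPAC nucleotide code.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand(motif):
    """Return the set of all concrete sequences matched by a degenerate motif."""
    return {"".join(bases) for bases in product(*(IUPAC[b] for b in motif.upper()))}


detected = expand("TAHGGAB")   # motif detected in transformed E. coli
estimated = expand("TANGGAB")  # motif estimated from the metaepigenomic data

print(detected <= estimated)         # is the detected motif a subset?
print(sorted(estimated - detected))  # sequences covered only by TANGGAB
```

The sequences present only in the estimated motif are exactly those matching TAGGGAB, in line with the statement in the text.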

Metaepigenomics for exploring prokaryotic methylation systems in nature

The present study demonstrated the effectiveness of the metaepigenomic approach powered by SMRT sequencing and CCS, showing obvious advantages over sequence similarity-based and culture-based methylation system analyses and short-read metagenomics. The CCS reads facilitated metagenomic assembly, binning, and protein sequence-based taxonomic assignment from an environmental sample that contained dominant uncultured prokaryotes. Most importantly, this approach revealed several methylated motifs, including novel ones in environmental prokaryotes, and subsequent experiments identified four MTases responsible for those reactions.

The current throughput of SMRT sequencing may still be insufficient to apply the metaepigenomic approach to more diverse and complex samples. Because deep sequencing coverage is required for the reliable detection of DNA methylation (for example, >25× subread coverage per DNA strand is recommended according to the official instructions), it is still difficult to obtain sufficient sequencing reads to recover long contigs and detect methylated motifs for ‘rare’ species (typically those with <1% relative abundance). In addition to rapid and ongoing technological advances in SMRT sequencing, the emergence of Oxford Nanopore Technology may provide another long-read, single-molecule, and methylation-detectable technology50,51. Another problem is that the detectable types of DNA modifications are limited (i.e., m4C, m5C, and m6A) with the currently available SMRT sequencing technology, while many other DNA chemical modifications occur in nature52. In addition to advances in sequencing methods, novel bioinformatic tools will be critical for metaepigenomic analyses of environmental prokaryotes.
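A back-of-the-envelope calculation illustrates why rare species are hard to reach. The sketch below assumes that sequenced bases are distributed in proportion to relative abundance, that both strands need the recommended per-strand coverage, and that the genome is 3 Mbp; the function name, the genome size, and the proportionality assumption are all illustrative and not taken from this study.

```python
def required_yield_bp(genome_size_bp, relative_abundance, per_strand_coverage=25):
    """Approximate total subread yield needed to reach the given per-strand
    coverage for one genome in a mixed community.

    Assumptions (illustrative): bases are sampled in proportion to relative
    abundance, and each of the two strands needs `per_strand_coverage`.
    """
    if not 0 < relative_abundance <= 1:
        raise ValueError("relative_abundance must be in (0, 1]")
    bases_needed_for_genome = 2 * per_strand_coverage * genome_size_bp
    return bases_needed_for_genome / relative_abundance


# Hypothetical 3 Mbp genome at 1% relative abundance.
yield_bp = required_yield_bp(3_000_000, 0.01)
print(f"{yield_bp / 1e9:.1f} Gbp")
```

Under these assumptions, such a genome would need on the order of 15 Gbp of subreads, which is more than the 9.6 Gbp obtained for the shallow sample in this study.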

A recent study showed that sets of methylated motifs and MTases can vary widely, even between closely related strains53, and metaepigenomics is expected to enable differential methylation analyses between such populations. It should be noted that metaepigenomic data may be adopted for various bioinformatic applications. For example, because reads and contigs in the same genome are expected to have the same methylation patterns, metaepigenomic information may be used for improving metagenomic assembly and binning54. In addition, genus-level conservation of MTases that are not associated with REases is sometimes observed, which suggests that MTases play unexplored adaptive roles, in addition to their functions in combating phages13,55. Novel MTases may be adopted for biotechnological uses, such as DNA recombination and methylation analyses56. It is envisioned that metaepigenomics of environmental prokaryotes under different sampling conditions and environments will significantly deepen our understanding of the ecological impacts of DNA methylation on prokaryotes and of the enigmatic evolution of prokaryotic methylation systems, and will broaden their application potential.

Methods

Sample collection

Water samples were collected at a pelagic long-term survey station (Ie-1) (35°13′09.5″N, 135°59′44.7″E) of the Center for Ecological Research, Kyoto University in Lake Biwa, Japan, on 26 December 2016 (Supplementary Fig. 1a). The sampling site was located approximately 3 km from the nearest shore and had a depth of 73 m. The lake has a permanently oxygenated hypolimnion and was thermally stratified during sampling (Supplementary Fig. 1b). Water sampling into prewashed 5-L Niskin bottles was conducted at depths of 5 m and 65 m, above and below the thermally stratified layer, respectively, to collect prokaryotic communities with different structures34. The vertical profiles of temperature, dissolved oxygen concentrations, and chlorophyll a concentrations were measured using a conductivity, temperature, and depth probe in situ. Equipment that could come into direct contact with the water samples in the following steps was either sterilized by autoclaving or disinfected with a hypochlorous acid solution. The water samples were transferred to sterile bottles, kept cool by contact with ice packs in a dark cool box, and immediately transported to the laboratory. Water samples with a total volume of approximately 30 L were prefiltered through 5 μm membrane PC filters (Whatman). Microbial cells were collected using 0.22 μm Sterivex filters (Millipore) and immediately stored at −20 °C in a freezer until analysis.

DNA extraction and SMRT sequencing

The microbial DNA was retrieved using a PowerSoil DNA Isolation Kit (QIAGEN) according to the supplier’s protocol with slight modifications as described below. The filters were removed from the container, cut into 3 mm fragments, and directly suspended in the extraction solution from the kit for cell lysis. The bead-beating time was extended to 20 min to yield sufficient quantities of DNA for SMRT sequencing, with reference to Albertsen et al.57. SMRT sequencing was conducted using a PacBio Sequel system (Pacific Biosciences) in two independent runs according to the manufacturer’s standard protocols. SMRT libraries for CCS were prepared with a 4 kbp insert length, and two SMRT cells were used for each sample. Briefly, 3–5 kbp DNA fragments from each genomic DNA sample were extracted using the BluePippin size-selection system (Sage Science). Two sequencing libraries for CCS analysis were prepared using the SMRTbell Template Prep Kit 1.0-SPv3 according to the manufacturer’s protocol (Pacific Biosciences). The final libraries were sequenced using a PacBio Sequel sequencer with Sequel SMRT Cell 1M v2 and Sequel Binding/Sequencing Kits 2.0.

Bioinformatic analysis of CCS reads

Reads that contained at least three full-pass subreads on each polymerase read were retained to generate CCS reads using the standard PacBio SMRT software package with the default settings. Only CCS reads with >97% average base-call accuracy were retained. For taxonomic assignment of the CCS reads, Kaiju28 in Greedy-5 mode with the NCBI nr database29 and Kraken30 with the default parameters and complete prokaryotic genomes from RefSeq31 were used. CCS reads that potentially encoded 16S rRNA genes were extracted using SortMeRNA58 with the default settings, and the 16S rRNA sequences were predicted by RNAmmer59 with the default settings. The 16S rRNA sequences were taxonomically assigned using blastn60 searches against the SILVA database release 12861, where the top-hit sequences with e-values ≤ 1E−15 were retrieved.
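The two read-level criteria above (at least three full-pass subreads and >97% average base-call accuracy) can be expressed as a simple filter. The sketch below is illustrative only: the actual filtering was done by the PacBio SMRT software, and the function names and inputs here are hypothetical. The Phred conversion is the standard relation between quality score and error probability.

```python
def phred_to_accuracy(q):
    """Convert a Phred quality score to base-call accuracy.

    Standard relation: error probability = 10 ** (-Q / 10).
    """
    return 1.0 - 10 ** (-q / 10.0)


def keep_ccs_read(full_passes, mean_accuracy, min_passes=3, min_accuracy=0.97):
    """Apply the two criteria stated in the text to one read.

    `full_passes` is the number of full-pass subreads in the polymerase read;
    `mean_accuracy` is the average base-call accuracy of the CCS read (0-1).
    """
    return full_passes >= min_passes and mean_accuracy > min_accuracy


# A Phred score of 20 corresponds to 99% base-call accuracy.
print(round(phred_to_accuracy(20), 4))

# Toy reads (made-up values): (full passes, mean accuracy)
for passes, acc in [(3, 0.98), (2, 0.99), (5, 0.97)]:
    print(passes, acc, keep_ccs_read(passes, acc))
```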

CCS reads were de novo assembled using Canu18 with the -pacbio-corrected setting and Mira39 with the settings for PacBio CCS reads, according to the provided instructions. The Canu assembler provides information on repetitive contigs based on the graph topology and read-overlap analyses. Because such contigs tend to contain misassemblies, which can negatively affect the accuracy of downstream analyses, we removed them. The remaining contigs were binned into genomes using MetaBAT40 based on genome coverage and tetranucleotide frequencies as genomic signatures, where the genome coverage was calculated by mapping the CCS reads to the assembled contigs using BLASR62 with the settings for PacBio CCS reads. The quality of all genomes was assessed using CheckM63, which estimates completeness and contamination based on taxonomic collocation of prokaryotic marker genes, with the default settings. Sequence extraction and taxonomic assignment of 16S rRNA genes in each draft genome were conducted using RNAmmer59 with the default settings. Taxonomic assignment of the draft genomes was based on the 16S rRNA genes if found or on the taxonomic groups most frequently estimated by CAT64 otherwise (and Kaiju28 if CAT did not provide an estimation).
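To make the notion of a tetranucleotide-frequency signature concrete, the following minimal sketch computes the relative frequency of every 4-mer in a contig. It is not a reimplementation of MetaBAT, whose statistical model is more elaborate; the function name and the toy sequence are illustrative assumptions.

```python
from collections import Counter
from itertools import product

# All 256 possible tetranucleotides over the unambiguous alphabet.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_frequencies(seq):
    """Relative frequency of each 4-mer in `seq` (overlapping windows).

    Windows containing ambiguous bases (e.g., N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[k] for k in KMERS)
    if total == 0:
        return {k: 0.0 for k in KMERS}
    return {k: counts[k] / total for k in KMERS}


# Toy sequence (made up); real contigs are tens of kbp long.
freqs = tetranucleotide_frequencies("ACGTACGT")
print(freqs["ACGT"])
```

Contigs from the same genome tend to have similar signatures, so combining such a vector with read coverage gives two largely independent lines of evidence for binning.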

Coding sequences (CDSs) in each draft genome were predicted using Prodigal65 with the default settings. Functional annotations were achieved through GHOSTZ66 searches against the eggNOG67 and Swiss-Prot68 databases, with a cut-off e-value ≤ 1E−5, and HMMER69 searches against the Pfam database70, with a cut-off e-value ≤ 1E−5. A maximum-likelihood tree of the draft genomes was constructed on the basis of the set of 400 conserved prokaryotic marker genes using PhyloPhlAn71 with the default settings.

Metaepigenomic and RM system analyses

DNA modification detection and motif analysis were performed according to BaseMod (https://github.com/ben-lerch/BaseMod-3.0). Briefly, the subreads were mapped to the assembled contigs using BLASR62, and interpulse duration ratios were calculated. Candidate motifs with scores higher than the default threshold value were retrieved as methylated motifs. Those with infrequent occurrences (<50) or very low methylation fractions (<1%) in each draft genome were excluded from further analysis. The methylated ratios of all detected motifs on each contig were calculated using Seqkit72. The sequence divergences of target recognition domains (TRDs) from those of the closest-match MTases were investigated using amino-acid alignments of BLASTP60.
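The motif-level filtering described above can be sketched as follows. This is an illustrative example, not the BaseMod or Seqkit code: the helper names are hypothetical, only the forward strand is scanned, and the thresholds are the ones stated in the text (fewer than 50 occurrences or a methylation fraction below 1% leads to exclusion).

```python
import re

# Regular-expression character class for each IUPAC nucleotide code.
IUPAC_REGEX = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def count_occurrences(motif, seq):
    """Count (possibly overlapping) forward-strand matches of a degenerate motif."""
    body = "".join(IUPAC_REGEX[b] for b in motif.upper())
    # A lookahead makes the match zero-width, so overlapping sites are counted.
    pattern = re.compile("(?=" + body + ")")
    return sum(1 for _ in pattern.finditer(seq.upper()))


def keep_motif(n_occurrences, n_methylated, min_occurrences=50, min_fraction=0.01):
    """Keep a motif unless it is infrequent or has a very low methylation fraction."""
    if n_occurrences < min_occurrences:
        return False
    return n_methylated / n_occurrences >= min_fraction


# Toy sequence (made up) containing three GANTC sites.
print(count_occurrences("GANTC", "GAATCGACTCGATTC"))
print(keep_motif(n_occurrences=100, n_methylated=50))
```

The methylation ratio of a motif on a contig is then simply the number of sites called as methylated divided by the number of motif occurrences.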

Genes encoding MTases, restriction endonucleases (REases), and DNA sequence-recognition proteins were detected by BLASTP60 searches against an experimentally confirmed gold-standard dataset from the Restriction Enzyme Database (REBASE)73 (downloaded on 2 October 2017), with a cut-off e-value of ≤ 1E−15. Sequence specificity information for each hit MTase gene was also retrieved from REBASE. The flanking regions of the MTase genes were investigated to search for REase genes and examine whether they constitute RM systems.
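The e-value filtering of similarity-search hits can be sketched as below. The example assumes that the BLASTP results are available in the standard 12-column tabular format (BLAST+ output format 6), which is an assumption of this sketch and is not stated in the text; the gene and enzyme names in the toy input are invented.

```python
def best_hits(lines, max_evalue=1e-15):
    """Return the best-scoring hit per query among hits passing the e-value cut-off.

    Expects tab-separated lines with the 12 standard columns:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        identity = float(fields[2])
        evalue = float(fields[10])
        bitscore = float(fields[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query]["bitscore"]:
            best[query] = {
                "subject": subject,
                "identity": identity,
                "evalue": evalue,
                "bitscore": bitscore,
            }
    return best


# Toy input with invented identifiers; not results from this study.
toy_lines = [
    "geneA\tM.Hyp1\t45.0\t300\t165\t0\t1\t300\t1\t300\t1e-50\t200",
    "geneA\tM.Hyp2\t60.0\t300\t120\t0\t1\t300\t1\t300\t1e-80\t310",
    "geneB\tM.Hyp3\t25.0\t100\t75\t0\t1\t100\t1\t100\t1e-5\t40",
]
for query, hit in best_hits(toy_lines).items():
    print(query, hit["subject"], hit["identity"])
```

As the Results illustrate, the recognition sequence of the best hit is only a hypothesis about specificity, so such a table is best read alongside the directly observed motifs.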

Experimental verification of MTase activities

For verification of the estimated methylation specificities, all four estimated Type II MTase genes (EMGBS3_12600, EMGBS15_03820, EMGBS10_10070, and EMGBD2_08790) that satisfied the following two criteria were selected: (1) their novel methylation motifs were uniquely predicted and (2) additional proteins were not required in evaluating their enzyme activities. The four MTase genes were artificially synthesized with codon optimization and cloned into the pUC57 cloning vector by Genewiz (Supplementary Data 1). The genes were subcloned into the pCold III expression vector (Takara Bio) using an In-Fusion HD Cloning Kit (Takara Bio). The gene-specific oligonucleotide primers used for polymerase chain reaction and recombination are described in Supplementary Table 1. For verification of the EMGBS10_10070 gene function, the 5′-ACGAGTC-3′ sequence was inserted downstream of the termination codon for the methylation assay (the first five-base ACGAG sequence was the estimated methylated motif, and the last five-base GAGTC is recognized by the restriction enzyme PleI) (Supplementary Data 1).

The constructs were transformed into E. coli HST04 dam−/dcm− (Takara Bio), which lacks dam and dcm MTase genes. The E. coli strains were cultured in LB broth medium supplemented with ampicillin. MTase expression was induced according to the supplier’s protocol. Plasmid DNAs were isolated using the FastGene Xpress Plasmid PLUS Kit (Nippon Genetics). SalI was employed to linearize the plasmid DNAs encoding EMGBS3_12600 and EMGBS15_03820 and then inactivated by heat. Methylation statuses were assayed by enzymatic digestion using the following restriction enzymes: BceAI and TseI for EMGBS3_12600, DpnII and XmnI for EMGBS15_03820, PleI for EMGBS10_10070, and FokI for EMGBD2_08790. All restriction enzymes were purchased from New England BioLabs. All digestion reactions were performed at 37 °C for 1 h, except for those involving TseI (8 h) and FokI (20 min). Notably, although TseI digestion is conducted at 65 °C in the manufacturer’s protocol, we adopted a temperature of 37 °C to avoid cleavage of methylated DNA.
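The logic of reading such a digestion assay can be summarized with a small calculation of the expected number of fragments. The sketch assumes complete digestion, complete protection of every site when the MTase is active, and no contribution from other enzymes; the function name is hypothetical, and the site counts in the usage lines are those given in the Fig. 3 legend.

```python
def expected_fragments(n_sites, methylated, linear=True):
    """Idealized number of DNA fragments after a restriction digest.

    Assumptions (illustrative): digestion goes to completion, and
    methylation, when present, protects every target site.
    """
    if n_sites < 0:
        raise ValueError("n_sites must be non-negative")
    cuts = 0 if methylated else n_sites
    if linear:
        # Each cut in a linear molecule adds one fragment.
        return cuts + 1
    # A circular molecule stays in one piece until the second cut.
    return max(cuts, 1)


# SalI-linearized plasmid with 21 GCWGC (TseI) and 12 ACGGC (BceAI) sites.
print(expected_fragments(21, methylated=False))  # TseI, sites unprotected
print(expected_fragments(21, methylated=True))   # TseI, sites protected
print(expected_fragments(12, methylated=False))  # BceAI, sites unprotected
```

In practice, partial digestion and partial methylation produce intermediate band patterns, so the gel should be compared against the unmethylated control rather than against these idealized counts.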

We further verified the methylated motifs that were newly estimated in this study, i.e., those of EMGBS10_10070 and EMGBD2_08790. Chromosomal DNA was extracted from cultures of the transformed E. coli strains using a PowerSoil DNA Isolation Kit (QIAGEN) according to the supplier’s protocol. SMRT sequencing was conducted using PacBio RSII (Pacific Biosciences), and methylated motifs were detected via the same method described above.