Metaepigenomic analysis reveals the unexplored diversity of DNA methylation in an environmental prokaryotic community

DNA methylation plays important roles in prokaryotes, and their genomic landscapes—prokaryotic epigenomes—have recently begun to be disclosed. However, our knowledge of prokaryotic methylation systems is focused on those of culturable microbes, which are rare in nature. Here, we used single-molecule real-time and circular consensus sequencing techniques to reveal the ‘metaepigenomes’ of a microbial community in the largest lake in Japan, Lake Biwa. We reconstructed 19 draft genomes from diverse bacterial and archaeal groups, most of which are yet to be cultured. The analysis of DNA chemical modifications in those genomes revealed 22 methylated motifs, nine of which were novel. We identified methyltransferase genes likely responsible for methylation of the novel motifs, and confirmed the catalytic specificities of four of them via transformation experiments using synthetic genes. Our study highlights metaepigenomics as a powerful approach for identification of the vast unexplored variety of prokaryotic DNA methylation systems in nature.

detected methylated motifs. Importantly, two of the four MTases were revealed to recognize novel motif 66 sequences. 67

Taxonomic analysis 177
Taxonomic assignment of the CCS reads was performed using Kaiju 28 and the NCBI NR database 29 178 (Fig. 1). The assignment ratios were >88% and >56% at the phylum and genus levels, respectively, which 179 were higher than those for the Illumina-based shotgun metagenomic analysis of lake freshwater and other 180 environments using the same computational method 28 . Kraken 30 with complete prokaryotic and viral genomes 181 in RefSeq 31 (Fig. S4a-c) provided similar results but resulted in much lower assignment ratios (30% and 27%, 182 respectively), likely due to the lack of genomic data for freshwater microbes in RefSeq. 16S rRNA 183 sequence-based taxonomic assignment via BLASTN searches against the SILVA database 53 also provided 184 consistent results ( Fig. S4d-f). It should be noted that 16S rRNA-based and CDS-based taxonomic 185 assignments can be affected by 16S rRNA gene copy numbers and genome sizes, respectively. 186 At the phylum level, Proteobacteria dominated both samples, followed by Actinobacteria, 187 Verrucomicrobia, and Bacteroidetes (Fig. 1). Chloroflexi and Thaumarchaeota were especially abundant in the 188 deep water sample, consistent with previous findings 54, 55 . The ratio of Archaea was particularly low in the 189 shallow sample (0.6 and 6.9% in biwa_5m and biwa_65m, respectively). Although the filter pore-size range 190 The CCS reads from the shallow and deep samples were assembled into 554 and 345 contigs, 204 respectively, using Canu 18 (Table S4). The corresponding N50 values were 83 and 76 kbp, and the longest 205 contigs had lengths of 481 and 740 kbp, respectively. Notably, the contigs were much longer than those 206 obtained in a previous study that applied CCS for shotgun metagenomics analysis of an active sludge 207 microbial community 22 . We also used Mira 36 for metagenomic assembly, but this resulted in shorter longest 208 contigs (148 and 151 kbp, respectively) and N50 values (19 and 18 kbp, respectively). 209 The contigs were binned to genomes using MetaBAT 37 , which is a reference-independent binning 210 tool, based on CCS-read coverage and tetranucleotide frequency ( Fig. 2 and Table 2). Among a total of 899 211 contigs, 390 (43.3%) were assigned to fifteen and four bins from the shallow and deep samples, respectively. 212 We obtained a draft genome for each bin, where the completeness of the genome ranged from 17-99% (67% 213 on average). Estimated contamination levels were low (<3% in each bin). Based on the total contig size and 214 estimated genome completeness of each bin, the genome sizes were estimated to range from 1.0-5.6 Mbp. 215 The GC content ranged from 29-68%, and the average N50 was 24 kbp, with a maximum of 1.67 Mbp. 216 The nineteen genome bins belonged to seven phyla (Table 2 and Fig. S5). Among these genome bins, 217 ten contained 16S rRNA genes, and many of them showed top hits to uncultured clades; thus, our CCS-based 218 approach was estimated to have truly targeted multiple uncultured prokaryotes. Seven genome bins were 219 predicted to belong to the phylum Actinobacteria, including Candidatus Planktophila (BS7), one of the most 220 dominant bacterioplankton lineages in freshwater systems 60 A total of 29 methylated motifs were detected in ten genome bins (Table 3). Their methylation 233 ratios ranged from 19-99%, which can be affected by modification detection power, i.e., these ratios are likely 234 lower than the true methylation levels. Three motifs from the BS12 genome bin contained overlapping prokaryotes, which include many uncultured strains. 247

Known MTases that correspond to detected methylated motifs 249
To identify MTases that can catalyze the methylation reactions of the detected methylated motifs, 250 systematic annotation of MTase genes was performed. Sequence similarity searches against known genes 251 identified 20 MTase genes in nine genome bins (sequence identities ranged from 23-71%) ( Table 4). The 252 most abundant group was Type II MTases, followed by Type I and Type III MTases, a trend that is consistent 253 with the general MTase distribution 13,67 . Several genes encoding REases and DNA sequence-recognition 254 proteins were also detected ( Table 4). The known motifs of seven of the 20 MTases were matched to those 255 identified in our metaepigenomic analysis (Table 3). For example, the genome bin BD3 contained two 256 MTases, whose recognition motif sequences were AGCT and GATC according to the sequence 257 homology-based prediction, which were perfectly congruent with the two motifs detected in our 258 metaepigenomic analysis. It may be notable that these two motifs were also reported in an enrichment-culture 259 study of the closely related genus Candidatus Nitrosomarinus catalina 68 and are therefore likely evolutionarily 260 conserved within their group. In the BS14 bin, a similar one-to-one perfect match was also observed. The two 261 Chloroflexi genome bins BS3 and BD1 were characterized by the same set of three methylated motifs, each of 262 which contained three MTases. No MTase gene was found in the other Chloroflexi bin BS1, likely due to its 263 low estimated genome completeness of 31% (Table 2). Among these MTases, two were predicted to show 264 methylation specificities that were congruent with two of the detected motifs, GANTC and TTAA (the other 265 MTase and motif will be discussed in the next section). Collectively, these observations suggest that 266 metaepigenomic analysis is an effective tool for identifying the methylation systems of environmental 267 prokaryotes. 268 269

Unexplored diversity of prokaryotic methylation systems 270
Among the 20 detected MTases, 13 MTases did not present known recognition motifs that matched 271 those identified in our metaepigenomic analysis (Tables 3 and 4). Although homology search-based MTase 272 identification and recognition motif estimation are frequently conducted in genomic and metagenomic studies, 273 this result suggests that these approaches are not sufficient, and direct observation of DNA methylation is 274 needed to reveal the methylation systems of diverse environmental prokaryotes. 275 As noted earlier, each of the BS3 and BD1 bins had three MTase genes, two of which were 276 congruent to two of the detected motifs. The other MTase from each bin (EMGBS3_12600 and 277 EMGBD1_09320 in BS3 and BD1, respectively) showed the highest sequence similarity to an MTase that 278 was reported to recognize ACGGC; however, the other methylated motif detected in the BS3 and BD1 bins 279 was GCWGC. 280 In the BS15 genome bin, six MTases and eleven methylated motifs were detected, but none of the 281 MTases and motifs matched each other. At the methylation type level, five MTases and all of the methylated 282 motifs were of the m6A type. We predicted that the EMGBS15_03820 MTase, which is estimated to exhibit 283 non-specific m6A methylation activity, is actually a sequence-specific enzyme that recognizes a 284 GAANNNNTTC motif that was detected through metaepigenomic analysis, because the adjacent gene 285 EMGBS15_03830 encodes an REase that targets the same GAANNNNTTC sequence. 286 In the BS8 genome bin, one MTase and one methylated motif were detected; however, the 287 estimated motif of this MTase was incongruent with the detected motif (the estimated and detected motifs 12 to function in an RM system because of the existence of the neighboring REase and DNA-sequence 290 recognition protein genes. 291 In the BS10 genome bin, one MTase and one methylated motif were detected, and their motifs were 292 also incongruent (GCAAGG and ACGAG, respectively). 293 In the BD2 genome bin, two MTases and one methylated motif were detected. The two MTases 294 were predicted to display m6A and m5C methylation activities, while the detected motif contained an m6A 295 site. Thus, the former MTase was predicted to catalyze the methylation reaction, although their motifs were 296 again incongruent (GRGGAAG and TANGGAB, respectively). It should also be noted that these MTases 297 appear to constitute a recently proposed system known as the Defense Island System Associated with 298 Restriction-Modification (DISARM), which is a phage-infection defense system composed of MTase, helicase, 299 phospholipase D, and DUF1998 genes 69 . To our knowledge, this is the first DISARM system identified in the 300 phylum Nitrospirae. 301 In the BS6 genome bin, one MTase gene was found, but we could not detect any methylated motif, 302 and we therefore anticipate that this MTase gene does not exhibit methylation activity or the corresponding 303 methylation motif was undetected due to the low sensitivity of SMTR sequencing to m5C modification as 304 described previously 13,14 . However, in the BS12 genome bin, we detected methylated motifs but no MTase 305 genes. We assume that the MTase genes corresponding to this bin were missed due to insufficient genome 306 completeness (although the estimated completeness was 81%), or because these MTase genes have diverged 307 considerably from MTase genes found in cultivable strains, or because thee MTases belong to a new group. 308 309

Experimental verification of MTases with new methylated motifs 310
Among the MTases whose estimated methylated motifs were not congruent with our 311 metaepigenomic results, we experimentally verified the methylation specificities of the four MTases: 312 EMGBS3_12600 in BS3 (and EMGBD1_09320 in BD1, which has exactly the same amino acid sequence), 313 EMGBS15_03820 in BS15, EMGBS10_10070 in BS10, and EMGBD2_08790 in BD2 (Table 4). We 314 constructed plasmids that each carried one of the artificially synthesized MTase genes, which we then 315 transformed E. coli cells that lacked endogenous MTases, forced their expression, and observed the 316 methylation status of the isolated plasmid DNA by REase digestion.
unaccounted-for motif sequence observed in BS3 was GCWGC. Thus, we hypothesized that the true 319 recognition sequence of EMGBS3_12600 is GCWGC. The REase digestion assay showed that TseI (GCWGC 320 specificity) did not cleave the plasmids when EMGBS3_12600 was expressed in the cells, which clearly 321 supports our hypothesis (Fig. 3a). Furthermore, we confirmed that BceAI (ACGGC specificity) cleaved 322 plasmids regardless of whether EMGBS3_12600 was expressed, indicating that the EMGBS3_12600 protein 323 does not show ACGGC sequence specificity (Fig. 3a). Accordingly, we named this protein M.AspBS3I, as a 324 novel MTase that possesses GCWGC specificity (Table 4). 325 While the homology-based analysis predicted EMGBS15_03820 as a non-sequence specific MTase, 326 its adjacency to an REase and the results of the metaepigenomic analysis suggested that this MTase presents 327 GAANNNNTTC sequence specificity. The REase digestion assay showed that XmnI (GAANNNNTTC 328 specificity) did not cleave the plasmids only when EMGBS15_03820 was expressed in the cells, which also 329 supports our hypothesis (Fig. 3b). Furthermore, we confirmed that DpnII (GATC specificity) cleaved the 330 plasmids regardless of whether EMGBS15_03820 was expressed, indicating that EMGBS15_03820 is not a 331 nonspecific MTase. We named this protein M.FspBS15I, as a novel MTase that possesses GAANNNNTTC 332 methylation specificity (Table 4). 333 For EMGBS10_10070 in BS10 and EMGBD2_08790 in BD2, we also conducted REase digestion 334 assays to confirm the recognition motif sequences. Based on the results of the metaepigenomic analysis, their 335 motifs were predicted to be ACGAG and TANGGAB, respectively. Expression of each gene altered the 336 electrophoresis patterns of the digested plasmids to contain fragments that resulted from inhibition of REase 337 cleavage at the estimated methylation sites (Fig. S6). Furthermore, we additionally conducted SMRT 338 sequencing analysis using the PacBio RSII platform to examine the methylation status of the chromosomal 339 DNA of the E. coli transformed with each of the two MTase genes. The results were basically consistent 340 (Table S5): ACGAG was actually detected as the methylated motif in E. coli transformed with 341 EMGBS10_10070, and we named the protein M.OspBS10I. In the case of EMGBD2_08790, the detected 342 TAHGGAB motif was almost the same, but a subset of the estimated TANGGAB motif (i.e., TAGGGAB was 343 excluded), and this difference could be due to E. coli-specific conditions (e.g., cofactors and sequence biases), 344 insufficient data, or inaccuracy of the methylated motif detection method. Regardless of this minor difference, 345 we concluded that EMGBD2_08790 is a novel MTase gene responsible for methylation of the TAHGGAB Among the nineteen genome bins, no methylated motifs were detected in nine genome bins (MTase 350 genes were also not detected, except in the BS6 genome bin). This high ratio of methylation-lacking 351 organisms contrasts remarkably with a previous report in which prokaryotic genomes were found to rarely 352 lack DNA methylation systems (<7%) 13 . Notably, those nine genome bins contained seven Actinobacteria 353 bins, indicating that the dominant Actinobacteria in Lake Biwa lack methylation systems, although a number 354 of methylated motifs and corresponding MTases have been reported in Actinobacteria 13 . 355 Because DNA methylation is known to play a role in opposing phage infection 2-4 , we conducted in 356 silico prophage detection to evaluate whether prokaryotes in Lake Biwa tend to be infected by phages. Within 357 the nineteen genome bins, more than one prophage was found in ten genome bins (Table 2 and S6). Among 358 these ten bins, six overlapped with the nine genome bins in which no methylated motifs were identified. The 359 prophages showed little sequence similarity to each other except for two pairs and likely resulted from 360 independent and repetitive infections (Fig. S7). Thus, phage infection and prophage integration appear to 361 frequently occur in prokaryotes that lack DNA methylation systems. We also investigated the presence of 362 CRISPR/Cas systems as another major prokaryotic mechanism against phage infections 70-73 . We identified 363 possible CRISPR arrays in three genome bins, BS3, BS8, and BD3, which exhibit methylation systems but no 364 prophages, although the first two genome bins contained no associated Cas genes. 365 Based on these results, we assume that the possession of prophages is tolerable in lake freshwater 366 environments, and thus, the evolutionary pressure to develop or retain methylation systems is low. These 367 results also suggest that uncultured and cultivable strains may be under different selection pressures regarding 368 DNA methylation systems, and the true diversity of microbial methylation systems must be examined in the 369 future using metaepigenomic approaches. The current throughput of SMRT sequencing may be still insufficient to apply the metaepigenomic 383 approach to more diverse and complex samples. Because deep sequencing coverage (>25× subreads for each 384 DNA strand) is required for the reliable detection of DNA methylation, it is still difficult to obtain sufficient 385 sequencing reads to recover long contigs and detect methylated motifs for 'rare' species (typically those with 386 <1% relative abundance). In addition to rapid and ongoing technological advances in SMRT sequencing, the 387 emergence of Oxford Nanopore Technology may provide as another long-read, single-molecule, and 388 methylation-detectable technology 74,75 . Another problem is that the detectable types of DNA modifications are 389 limited (i.e., m4C, m5C, and m6A) with the currently available SMRT sequencing technology, while many 390 other DNA chemical modifications occur in nature 76 . In addition to advances in sequencing methods, novel 391 bioinformatic tools will be critical for metaepigenomic analyses of environmental prokaryotes. 392 A recent study showed that sets of methylated motifs and MTases can vary widely, even between 393 closely related strains 77 , where metaepigenomics is expected to enable differential methylation analyses 394 between populations. In addition, genus-level conservation of MTases that are not associated with REases is 395 sometimes observed, which suggests that MTases play unexplored adaptive roles, in addition to their 396 functions in combating phages 13,78 . Novel MTases may be adopted for biotechnological uses, such as DNA 397 recombination and methylation analyses 79 . It is envisioned that metaepigenomics of environmental 398 prokaryotes under different sampling conditions and environments will significantly deepen our understanding 399 of the enigmatic evolution of prokaryotic methylation systems and broaden their application potential.  represent its assigned bin and total sequence length, respectively. Contigs not assigned to any bin are indicated 596 in gray (named 'NA'). The x-axis and y-axis represent GC% and genome coverage, respectively. 597 Figure 3. REase digestion assays. a Assay of the EMGBS3_12600 gene (and EMGBD1_09320, which has 599 the same amino-acid sequence). BceAI and TseI were used, where the plasmid contained 12 (ACGGC) and 21 600 (GCWGC) target sites, respectively. Plasmid DNAs were linearized using SalI before the assay. An NEB 601 2-log DNA ladder was employed as a size marker. b Assay of the EMGBS15_03820 gene. DpnII and XmnI 602 were used, where the plasmid contained 27 (GATC) and two (GAANNNNTTC) target sites, respectively. 603