Main

Microorganisms drive global biogeochemical cycles, support food webs, and underpin the health of animals and plants5. Their immense phylogenetic, metabolic and functional diversity represents a rich discovery potential for new taxa1, enzymes and biochemical compounds, including natural products6. In environmental communities, such molecules confer microorganisms with diverse physiological and ecological functions ranging from communication to competition2,7. In addition to their original functions, these natural products and their genetically encoded production pathways include examples used for biotechnological and therapeutic applications2,3. The identification of such pathways and compounds has largely been facilitated by studying cultivable microorganisms. However, taxonomic surveys of natural environments have revealed that the vast majority of microbial life has not yet been cultivated8. This cultivation bias has limited our ability to tap into much of the microbially encoded functional diversity4,9.

To overcome these limitations, technological advances over the past decade have enabled researchers to directly (that is, without previous cultivation) sequence pieces of microbial DNA from whole communities (metagenomics) or single cells. The possibility to assemble such pieces into larger genomic fragments and to reconstruct several metagenome assembled genomes (MAGs) or single amplified genomes (SAGs), respectively, has opened new paths to the previously taxon-centric investigation of microbiomes (that is, microbial communities and their genetic material in a given environment)10,11,12. Indeed, recent surveys have vastly extended the phylogenomic representation of microbial diversity on Earth1,13 and revealed that most of the functional diversity in different microbiomes had previously not been captured by reference genome sequences (REFs) from cultivated microorganisms14. The ability to place uncovered functional diversity into the host genomic (that is, genome-resolved) context has been critical to predict yet uncharacterized microbial lineages that putatively encode new natural products15,16 or to trace such compounds to their original producers17. A combinatorial approach of metagenomic and single-cell genomic analyses, for example, led to the recognition of ‘Candidatus Entotheonella’, a group of metabolically rich, sponge-associated bacteria, as producers of multiple classes of candidate drugs18. However, despite recent attempts to establish genome-resolved explorations of various microbiomes16,19, for the ocean—the largest ecosystem on Earth—over two-thirds of global metagenomic data still remain unaccounted for16,20. Thus, the biosynthetic potential of the ocean microbiome in general and its potential as a reservoir of new enzymology and natural products specifically remain largely underexplored.

To examine the biosynthetic potential of the ocean microbiome at the global scale, we first integrated ocean microbial genomes obtained using cultivation-dependent and cultivation-independent methods to establish an extensive phylogenomic and gene functional database. By mining this database, we uncovered a diverse array of biosynthetic gene clusters (BGCs), the majority of them from yet uncharacterized gene cluster families (GCFs). We further identified an uncharted family of bacteria that display the highest known diversity of BGCs in the open oceans to date. We selected two ribosomally synthesized and post-translationally modified peptide (RiPP) pathways on the basis of their genetic dissimilarity to currently known ones for experimental validation. Functional characterization of these pathways revealed examples of unexpected enzymology as well as a structurally unusual compound with protease inhibitory activity.

Phylogenomic representation of the ocean microbiome

We first sought to establish a global genome-resolved data resource focusing on its bacterial and archaeal constituents. To this end, we aggregated metagenomic data along with contextual information from 1,038 ocean water samples from 215 globally distributed sampling sites (latitudinal range = 141.6°) and several depth layers (from 1 to 5,600 m deep, covering epipelagic, mesopelagic and bathypelagic zones)21,22,23 (Fig. 1a, Extended Data Fig. 1a and Supplementary Table 1). In addition to providing broad geographical coverage, these size-selectively filtered samples enabled us to compare different components of the ocean microbiome, including virus-enriched (<0.2 μm), prokaryote-enriched (0.2–3 μm), particle-enriched (0.8–20 μm) and virus-depleted (>0.2 μm) communities.

Fig. 1: Reconstruction of MAGs at the global scale fills gaps in ocean phylogenomic diversity.

a, A total of 1,038 publicly available ocean microbial community genomes (metagenomes) were collected at 215 globally distributed sites (from 62° S to 79° N and from 179° W to 179° E). Map tiles © Esri. Sources: GEBCO, NOAA, CHS, OSU, UNH, CSUMB, National Geographic, DeLorme, NAVTEQ and Esri. b, These metagenomes were used to reconstruct MAGs (Methods and Supplementary Information), which varied in numbers and quality (Methods) across different datasets (colour coded). Reconstructed MAGs were complemented with publicly available (external) genomes, including manually curated MAGs26, SAGs27 and REFs27 to compile the OMD. c, The OMD improves the genomic representation (mapping rates of metagenomic reads; Methods) of ocean microbial communities by a factor of two to three compared with previous reports based solely on SAGs (GORG)20 or MAGs (GEM)16, with a more consistent representation across depth and latitudes. <0.2, n = 151; 0.2–0.8, n = 67; 0.2–3, n = 180; 0.8–20, n = 30; >0.2, n = 610; <30°, n = 132; 30–60°, n = 73; >60°, n = 42; EPI, n = 174; MES, n = 45; BAT, n = 28. d, Grouping the OMD into species-level (95% average nucleotide identity) clusters identified a total of around 8,300 species, over half of which were previously uncharacterized based on taxonomic annotations using the GTDB (release 89)13. e, A breakdown of the species by genome type reveals a high complementarity of MAGs, SAGs and REFs in capturing the phylogenomic diversity of the ocean microbiome. Specifically, 55%, 26% and 11% of the species were specific to MAGs, SAGs and REFs, respectively. BATS, Bermuda Atlantic Time-series; GEM, Genomes from Earth’s Microbiomes; GORG, Global Ocean Reference Genomes; HOT, Hawaiian Ocean Time-series.

Using this dataset, we reconstructed a total of 26,293 predominantly bacterial and archaeal MAGs (Fig. 1b and Extended Data Fig. 1b). We generated these MAGs on the basis of assemblies from individual, rather than pooled, metagenomic samples to prevent the collapsing of natural sequence variations across samples from different locations or time points (Methods). Furthermore, we grouped genomic fragments on the basis of their abundance correlation across large numbers of samples (between 58 and 610 samples, depending on the survey; Methods). We found this to be a computationally intensive, yet important step24, that was omitted in several large-scale MAG reconstruction efforts16,19,25, and substantially improved both the number (mean, 2.7 times) and quality score (mean, +20%) of genomes reconstructed from the ocean metagenomes studied here (Extended Data Fig. 2a and Supplementary Information). Overall, these efforts have increased the number of ocean water microbial MAGs by a factor of 4.5 (6 when counting high-quality MAGs only) compared with the most comprehensive MAG resource available to date16 (Methods). This set of newly created MAGs was then combined with 830 manually curated MAGs26, 5,969 SAGs27 and 1,707 REFs27 of marine bacteria and archaea into a combined collection of 34,799 genomes (Fig. 1b).

We next evaluated the newly established resource for its improved ability to represent ocean microbial communities and to assess the impact of integrating different genome types. On average, we found that it captured about 40–60% of ocean metagenomic data (Fig. 1c), corresponding to a two- to threefold increase in coverage with a more consistent representation across depths and latitudes compared with previous reports based solely on MAGs16 or SAGs20. Furthermore, to obtain a systematic measure of the taxonomic diversity within the established collection, we annotated all genomes using the Genome Taxonomy Database (GTDB) Toolkit (Methods) and clustered them using a 95% whole-genome average nucleotide identity cut-off28 to define 8,304 species-level clusters (species). Two thirds of these species (including new clades) were previously not represented in the GTDB and 2,790 of them were uncovered by MAGs reconstructed in this study (Fig. 1d). Moreover, we found that the different genome types were highly complementary, with 55%, 26% and 11% of the species being exclusively composed of MAGs, SAGs and REFs, respectively (Fig. 1e). Furthermore, MAGs covered all 49 phyla detected in the water column, whereas SAGs and REFs represented only 18 and 11 of them, respectively. However, SAGs better represented the diversity of the most abundant clades (Extended Data Fig. 3a), such as the order Pelagibacterales (SAR11), with nearly 1,300 species covered by SAGs as opposed to only 390 by MAGs. Notably, REFs rarely overlapped with either MAGs or SAGs at the species level and represented >95% of the approximately 1,000 genomes that were not detected in the set of open ocean metagenomes studied here (Methods), mostly owing to representatives that were isolated from other types of marine samples (such as sediments or host-associated). 
To enable its broad use by the scientific community, this ocean genomic resource—which also includes unbinned fragments (for example, from predicted phages, genomic islands and fragments of genomes with insufficient data for MAG reconstruction)—can be accessed alongside taxonomic and gene functional annotations as well as contextual environmental parameters at the Ocean Microbiomics Database (OMD; https://microbiomics.io/ocean/).
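
The species-level grouping described above (a 95% whole-genome average nucleotide identity cut-off) can be illustrated with a minimal sketch. It assumes that pairwise ANI values are already available and uses single-linkage clustering via union-find; the clustering strategy, the function name `cluster_by_ani` and the genome identifiers are illustrative assumptions, not the procedure used to build the OMD.

```python
def cluster_by_ani(genomes, ani_pairs, cutoff=95.0):
    """Group genomes into species-level clusters (single linkage, assumed).

    genomes   -- list of genome identifiers (hypothetical names)
    ani_pairs -- dict mapping (genome_a, genome_b) to ANI in percent
    cutoff    -- ANI threshold; 95% as stated in the text
    """
    parent = {g: g for g in genomes}

    def find(g):
        # Follow parent links to the cluster root, compressing the path.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    # Merge the clusters of every pair at or above the cut-off.
    for (a, b), ani in ani_pairs.items():
        if ani >= cutoff:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for g in genomes:
        clusters.setdefault(find(g), []).append(g)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical genomes and ANI values, for illustration only.
genomes = ["MAG_1", "MAG_2", "SAG_1", "REF_1"]
ani_pairs = {
    ("MAG_1", "MAG_2"): 98.7,
    ("MAG_2", "SAG_1"): 96.1,
    ("MAG_1", "REF_1"): 82.4,
}
species = cluster_by_ani(genomes, ani_pairs)
print(species)
```

In this toy input, the two MAGs and the SAG fall into one species-level cluster, whereas the reference genome forms its own cluster. Counting clusters whose members all share one genome type would give the kind of breakdown reported in Fig. 1e.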

Biosynthetic potential of the global ocean microbiome

Next, we set out to investigate the richness and the degree of novelty of the biosynthetic potential in the open ocean microbiome. To this end, we first used antiSMASH on all of the MAGs, SAGs and REFs detected in the set of 1,038 ocean metagenomes (Methods) to predict a total of 39,055 BGCs. We then clustered them into 6,907 non-redundant GCFs and 151 gene cluster clans (GCCs; Supplementary Table 2 and Methods) to account for inherent redundancy (that is, the same BGC can be encoded in several genomes) and fragmentation of BGCs in metagenomic datasets. Incomplete BGCs did not significantly inflate, if at all (Supplementary Information), the number of GCFs and GCCs, which contained at least one complete member BGC in 44% and 86% of the cases, respectively.

At the GCC level, we found a high diversity of predicted RiPPs and other natural products (Fig. 2a). Among these, aryl polyenes, carotenoids, ectoines and siderophores, for example, belonged to GCCs with wide phylogenomic distributions and high prevalence across ocean metagenomes, possibly indicative of widespread microbial adaptations to the ocean environment, including resistance to reactive oxygen species, oxidative and osmotic stress or uptake of iron (Supplementary Information). This functional diversity contrasted with recent analyses of ~1.2 million BGCs from any of the ~190,000 genomes deposited in the NCBI RefSeq database (BiG-FAM/RefSeq, hereafter RefSeq)29 that showed a dominance of non-ribosomal peptide synthetase (NRPS) and polyketide synthase (PKS) BGCs (Supplementary Information). We also found that 44 (29%) GCCs were only remotely related to any RefSeq BGCs (\(\bar{d}_{\mathrm{RefSeq}}\) > 0.4; Fig. 2a and Methods), and 53 (35%) GCCs were encoded only in MAGs, highlighting the potential for discovery of previously undescribed chemistry within the OMD. Given that each of these GCCs is likely to represent highly diverse biosynthetic functions, we further analysed the data at the level of GCFs, which aim to provide a more fine-grained grouping of BGCs that are predicted to encode similar natural products29. A total of 3,861 (56%) of the identified GCFs did not overlap with RefSeq and >97% of the GCFs were not represented in MIBiG, one of the most extensive databases of experimentally validated BGCs30 (Fig. 2b). Although finding many potentially new pathways in an environment that is not well represented by reference genomes is not unexpected, our approach of dereplicating BGCs into GCFs before comparative analyses, which differs from previous reports16, enabled us to provide unbiased novelty estimates.
The majority of that novel diversity (3,012 GCFs, that is, 78%) corresponded to predicted terpenes, RiPPs or other natural products, and a large fraction (1,815 GCFs, that is, 47%) was encoded in phyla that are not generally known for their biosynthetic potential. As opposed to PKS and NRPS clusters, these compact BGCs are less likely to be fragmented during metagenomic assembly31 and can make easier targets for time- and resource-consuming functional characterization of their products.

Fig. 2: Novelty and phylogenomic distribution of the ocean microbiome biosynthetic potential.

A total of 39,055 BGCs were clustered into 6,907 GCFs and 151 GCCs. a, Representation of the data (inner to outer layers). Hierarchical clustering based on BGC distances of the GCCs, 53 of which were captured only by MAGs. GCCs comprise BGCs from different taxa (ln-transformed phylum frequencies) and different BGC classes (circle sizes correspond to their frequencies). The outer layers indicate, for each GCC, the number of BGCs, the prevalence (percentage of samples) and the distance (minimum cosine distance of BGCs (min(dMIBiG))) to BGCs from BiG-FAM. GCCs with BGCs closely related to experimentally validated BGCs (MIBiG) are highlighted by arrows. b, Comparing GCFs to computationally predicted (BiG-FAM) and experimentally validated (MIBiG) BGCs uncovered 3,861 new (\(\bar{d}\) > 0.2) GCFs. Most of them (78%) encode RiPPs, terpenes and other putative natural products. c, All genomes in the OMD detected across 1,038 ocean metagenomes were placed onto the GTDB backbone trees to reveal the extent of the phylogenomic coverage of the OMD. Clades without any genome in the OMD are coloured grey. The number of BGCs corresponds to the highest number of predicted BGCs per genome in a given clade. For visualization, the last 15% of the nodes were collapsed. The arrows denote BGC-rich clades (>15 BGCs) with the exception of Mycobacteroides, Gordonia (next to Rhodococcus) and Crocosphaera (next to Synechococcus). d, An unknown species of ‘Ca. Eremiobacterota’ displayed the highest biosynthetic diversity (Shannon index based on natural product types). Each bar represents the genome with the highest number of BGCs within a species. T1PKS, type I PKS; T2/3PKS, type II and III PKS.

Beyond richness and novelty, we also investigated the biogeographical structuring of the ocean microbiome’s biosynthetic potential. Grouping samples by the mean metagenomic copy number distribution of GCFs (Methods) revealed that low-latitude, epipelagic, prokaryote-enriched and virus-depleted communities, mostly from surface or deeper sunlit waters, were enriched in RiPP and terpene BGCs. By contrast, polar, deep-ocean, virus-enriched and particle-enriched communities were associated with higher abundances of NRPS and PKS BGCs (Extended Data Fig. 4 and Supplementary Information). Finally, we found that well-studied tropical and epipelagic communities were the most promising sources of new terpenes (Extended Data Fig. 5a,b), and that the least explored communities (polar, deep, virus- and particle-enriched) had the highest potential for the discovery of NRPS, PKS, RiPPs and other natural products (Extended Data Fig. 5a).

Identification of undescribed BGC-rich lineages

To complement the survey of the biosynthetic potential of the ocean microbiome, we sought to map its phylogenomic distribution and identify new BGC-rich clades. To this end, we placed the ocean microbial genomes in the standardized bacterial and archaeal phylogenomic trees of the GTDB13, and overlaid the putative biosynthetic pathways that they encode (Fig. 2c). We readily detected in ocean water samples (Methods) several BGC-rich clades (representatives with >15 BGCs) that are either well known for their biosynthetic potential, such as Cyanobacteria (Synechococcus) and Proteobacteria (such as Tistrella)32,33, or that have recently garnered attention for their natural products, such as Myxococcota (Sandaracinaceae), Rhodococcus and Planctomycetota34,35,36. Interestingly, we found within these clades several previously unexplored lineages. For example, those species with the richest biosynthetic potential within the phyla Planctomycetota and Myxococcota belonged to an uncharacterized candidate order and genus, respectively (Supplementary Table 3). Overall, this shows that the OMD provides access to previously uncharted phylogenomic information, including for microorganisms that may represent new targets for the discovery of enzymes and natural products.

We further characterized BGC-rich clades not only by counting the maximum number of BGCs encoded by their members, but also by assessing the diversity of these BGCs, which accounts for the frequency of different candidate natural product types (Fig. 2d and Methods). We found that the most biosynthetically diverse species were represented by bacterial MAGs exclusively reconstructed in this study. These bacteria belong to the uncultivated phylum ‘Candidatus Eremiobacterota’, which has remained largely unexplored except in a few genomic studies37,38. Notably, ‘Ca. Eremiobacterota’ spp. have been analysed only from terrestrial environments39 and have not been known to include any BGC-rich representatives. Here we initially reconstructed eight MAGs from the same species (with a nucleotide identity of >99%) from deep (between 2,000 m and 4,000 m) and particle-enriched (0.8–20 µm) ocean metagenomes collected by the Malaspina expedition23. Accordingly, we propose that this species is named ‘Candidatus Eudoremicrobium malaspinii’, after the nereid (sea nymph) of fine gifts in Greek mythology and the expedition. ‘Ca. E. malaspinii’ had no previously known relatives below the order level based on phylogenomic annotation13 and therefore belongs to a new bacterial family for which we propose ‘Ca. E. malaspinii’ as the type species and ‘Ca. Eudoremicrobiaceae’ as its official name (Supplementary Information). The short-read metagenomic reconstruction of ‘Ca. E. malaspinii’ draft genomes was corroborated by ultra-low input, long-read metagenomic sequencing of one sample and targeted assembly (Methods) into a single 9.63 Mb linear chromosome with a 75 kb repeat as the only remaining ambiguity.
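
The diversity measure mentioned above, which accounts for the frequency of different candidate natural product types and is reported as a Shannon index in Fig. 2, can be sketched as follows. This is a minimal illustration that assumes the natural logarithm and one type label per BGC; the function name `shannon_index` and the example genomes are hypothetical.

```python
import math
from collections import Counter


def shannon_index(bgc_types):
    """Shannon index H = -sum(p_i * ln p_i) over natural product types.

    bgc_types -- one label per BGC in a genome (for example, "RiPP").
    The natural logarithm is an assumption; the text does not state the base.
    """
    counts = Counter(bgc_types)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())


# Hypothetical genomes with the same number of BGCs but different diversity.
uniform_genome = ["RiPP", "terpene", "NRPS", "T1PKS"]
skewed_genome = ["RiPP", "RiPP", "RiPP", "terpene"]

print(shannon_index(uniform_genome))  # four equally frequent types
print(shannon_index(skewed_genome))   # dominated by a single type
```

Two genomes with the same BGC count can therefore differ in diversity: the index is highest when the BGCs are spread evenly over many product types and zero when all BGCs belong to one type.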

To establish a phylogenomic context for this species, we searched for closely related species through targeted genome reconstructions in additional eukaryote-enriched metagenomic samples from the Tara Oceans expedition40. In brief, we aligned metagenomic reads to ‘Ca. E. malaspinii’-related genomic fragments and assumed increased recruitment rates in a given sample to be indicative of the presence of additional relatives (Methods). As a result, we recovered 10 MAGs, and the combined set of 18 MAGs represents five species across three genera within the newly defined family (that is, ‘Ca. Eudoremicrobiaceae’). After manual inspection and quality control (Extended Data Fig. 6 and Supplementary Information), we found that ‘Ca. Eudoremicrobiaceae’ spp. representatives have larger genomes (8 Mb) and a richer biosynthetic potential (ranging from 14 to 22 BGCs per species) compared with members of other ‘Ca. Eremiobacterota’ clades (up to 7 BGCs) (Fig. 3a–c).

Fig. 3: Phylogeny, biosynthetic potential and distribution of the BGC-rich family ‘Ca. Eudoremicrobiaceae’.

a, Phylogenomic placement of five ‘Ca. Eudoremicrobiaceae’ spp. revealed a BGC richness specific to the ocean lineage discovered in this study. The phylogenomic tree includes all ‘Ca. Eremiobacterota’ MAGs available in the GTDB (release 89) and representatives from additional phyla (the number of genomes is indicated in parentheses) for evolutionary context (Methods). The outermost layer indicates family-level (‘Ca. Eudoremicrobiaceae’ and ‘Ca. Xenobiaceae’) and class-level (‘Ca. Eremiobacteria’) taxonomy. The five species described in this study are denoted by an alphanumeric code and a proposed binomial name (Supplementary Information). b, ‘Ca. Eudoremicrobiaceae’ spp. share a core of seven BGCs. The missing BGC from clade A2 was attributed to incompleteness of the representative MAG (Supplementary Table 3). BGCs specific to ‘Ca. Amphithomicrobium’ and ‘Ca. Autonoemicrobium’ (clades A and B) are not displayed. c, All BGCs encoded by ‘Ca. Eudoremicrobium taraoceanii’ were found to be expressed across the set of 623 metatranscriptomes sampled by Tara Oceans. The filled circles indicate active transcription. The orange circles indicate values below or above a log2-transformed fold change from the expression rate of housekeeping genes (Methods). d, Relative abundance profiles (Methods) showed that ‘Ca. Eudoremicrobiaceae’ spp. are abundant and prevalent in most ocean basins and throughout the water column (from the surface to a depth of at least 4,000 m). On the basis of these estimations, we found that ‘Ca. E. malaspinii’ comprises up to 6% of the prokaryotic cells in bathypelagic particle-associated communities. We considered a species to be present at a station if it was detected in any of the size fractions of a given depth layer. IO, Indian Ocean; NAO, North Atlantic Ocean; NPO, North Pacific Ocean; RS, Red Sea; SAO, South Atlantic Ocean; SO, Southern Ocean; SPO, South Pacific Ocean.

Exploring the abundance and distribution of ‘Ca. Eudoremicrobiaceae’, we found that its members are prevalent in most oceanic basins as well as throughout the water column (Fig. 3d). Locally, they account for up to 6% of ocean microbial communities, making them a numerically substantial component of the global ocean microbiome. Furthermore, we found the relative abundances of ‘Ca. Eudoremicrobiaceae’ spp. and their BGC expression levels to be the highest in eukaryote-enriched fractions (Fig. 3c and Extended Data Fig. 7), suggesting possible interactions with particulate matter, including planktonic organisms. This observation, together with the homology of some ‘Ca. Eudoremicrobium’ BGCs to known pathways producing cytotoxic natural products, could suggest a predatory behaviour (Supplementary Information and Extended Data Fig. 8), akin to other specialized metabolite-producing predators, such as Myxococcus41. The detection of ‘Ca. Eudoremicrobiaceae’ in less accessible (deep ocean) or eukaryote-enriched, rather than prokaryote-enriched, samples probably explains why these bacteria and their unsuspected BGC diversity had remained unrecognized in the context of natural-products research.

New enzymes and natural products

We finally sought to experimentally validate the promising prospects of our microbiomics-driven work for the discovery of new pathways, enzymes and natural products. Among the different BGC classes, RiPP pathways are known to encode a wealth of chemical and functional diversity owing to the various modifications installed post-translationally on a core peptide by maturase enzymes42. We therefore selected two ‘Ca. Eudoremicrobium’ RiPP BGCs (Figs. 3b and 4a–e) that were predicted to produce novel metabolites on the basis of their dissimilarity to any known BGC (\(\bar{d}_{\mathrm{MIBiG}}\) and \(\bar{d}_{\mathrm{RefSeq}}\) above 0.2).

Fig. 4: ‘Ca. Eudoremicrobiaceae’ spp. are a source of unusual enzymology and natural product structure.

a–c, In vitro heterologous expression and in vitro enzyme assays of a novel (\(\bar{d}_{\mathrm{RefSeq}}\) = 0.29) RiPP biosynthetic cluster specific to the deep ocean species ‘Ca. E. malaspinii’ led to the production of a di-phosphorylated product. c, Modifications were identified using high-resolution (HR) MS/MS (fragmentation is indicated by the b and y ions on the chemical structure) and NMR (Extended Data Fig. 9). d, This phosphorylated peptide displayed low-micromolar mammalian neutrophil elastase inhibition, which was not found for the control and dehydrated peptides (dehydration induced by chemical elimination). The experiment was repeated three times, leading to similar outcomes. e–g, Heterologous expression of a second novel (\(\bar{d}_{\mathrm{RefSeq}}\) = 0.33) proteusin biosynthetic cluster sheds light on the functionality of four maturases modifying a 46-amino-acid core peptide. Residues are coloured on the basis of predicted modification sites from HR-MS/MS, isotope labelling and NMR analyses (Supplementary Information). Dashed colouring indicates that the modification occurs on either of the two residues. The figure represents a compilation of numerous heterologous constructs to display the activity of all maturases on the same core. h, Inset of the NMR data of the backbone amide N-methylation. The complete results are shown in Extended Data Fig. 10. i, Phylogenetic placement of the FkbM maturase of the proteusin cluster among all FkbM domains found in the MIBiG 2.0 database revealed an enzyme of this family with N-methyltransferase activity (Supplementary Information). Schematic representations of the BGCs (a,e), the structure of the precursor peptides (b,f) and the proposed chemical structures of the natural products (c,g) are shown.

The first RiPP pathway (\(\bar{d}_{\mathrm{MIBiG}}\) = 0.41, \(\bar{d}_{\mathrm{RefSeq}}\) = 0.29) was found only in the deep-ocean species ‘Ca. E. malaspinii’ and encodes a precursor peptide modified by a sole maturase (Fig. 4a,b). In this maturase, we found a single functional domain homologous to the dehydration domain of lanthipeptide synthetases, which normally catalyses phosphorylation and subsequent elimination43 (Supplementary Information). We therefore predicted the modifications of the precursor peptide to include such a two-step dehydration. However, using tandem mass spectrometry (MS/MS) and nuclear magnetic resonance spectroscopy (NMR), we identified a poly-phosphorylated linear peptide (Fig. 4c). Although this was unexpected, we found several lines of evidence supporting that it is the final product: the absence of dehydration in two different heterologous hosts as well as in vitro assays, the identification of mutated key residues in the dehydration catalytic site of the maturase, which were consistently found in all reconstructed ‘Ca. E. malaspinii’ genomes (Extended Data Fig. 9 and Supplementary Information), and finally the bioactivity of the phosphorylated product rather than the chemically synthesized, dehydrated form (Fig. 4d). Indeed, we found that it displayed low-micromolar protease inhibitory activity against neutrophil elastase within a concentration range (IC50 = 14.3 µM) comparable to other relevant natural products44, although the ecological role of this unusual natural product remains to be elucidated. On the basis of these results, we propose naming this pathway ‘phospeptin’.

The second case represents a complex RiPP pathway specific to ‘Ca. Eudoremicrobium’ spp. (\(\bar{d}_{\mathrm{MIBiG}}\) = 0.46, \(\bar{d}_{\mathrm{RefSeq}}\) = 0.33) that is predicted to encode a proteusin natural product (Fig. 4e). These pathways are of particular biotechnological interest owing to the expected density and diversity of unusual chemical modifications installed by enzymes encoded in relatively short BGCs45. We found that this proteusin differs from previously characterized ones as it lacks both the NX5N core motif of polytheonamides and the lanthionine rings of landornamides46. To overcome the limitations of common heterologous expression models, we used them alongside a non-standard Microvirgula aerodenitrificans system to characterize the four maturase enzymes of the pathway (Methods). Using a combination of MS/MS, isotope labelling and NMR, we found that these maturases install up to 21 modifications, including L- to D-amino acid epimerization, hydroxylation as well as C- and backbone amide N-methylation, on a 46-amino-acid core peptide (Fig. 4f,g, Extended Data Figs. 10–12 and Supplementary Information). Among the maturases, we characterized the first occurrence of a FkbM O-methyltransferase family member47 in a RiPP pathway and, unexpectedly, found that this maturase introduces backbone N-methylations (Fig. 4h,i and Supplementary Information). Although this modification is known in NRP natural products48, enzymatic N-methylation of the amide bond is a challenging yet biotechnologically relevant reaction49 that has to date been specific to the borosin RiPP family50,51. The identification of this activity in other enzyme and RiPP families may provide opportunities for new applications and expand the functional diversity52 of proteusins along with their chemical diversity. On the basis of the identified modifications and the unusual length of the proposed structure for the product, we suggest naming the pathway ‘pythonamide’.

The finding of unexpected enzymology within a functionally characterized enzyme family47 illustrates both the promise of environmental genomics for new discoveries and, at the same time, the limited power of functional inferences made solely by sequence homology. Thus, together with the report of a non-canonical bioactive poly-phosphorylated RiPP, our results demonstrate the resource-intensive, yet critical value of synthetic biology efforts to fully uncover the functional richness, diversity and unusual architectures of biochemical compounds.

Conclusions

Here we have demonstrated the extent of microbially encoded biosynthetic potential and its genomic context in the global ocean microbiome, facilitating future research by making the generated resources available to the scientific community (https://microbiomics.io/ocean/). We found that the majority of both its phylogenomic and functional novelty is accessible only through the reconstruction of MAGs and SAGs, particularly in underexplored microbial communities, which could direct future bioprospecting efforts. Although we focused here on ‘Ca. Eudoremicrobiaceae’ as a particularly biosynthetically ‘talented’ lineage, many of the predicted BGCs within other uncovered microbial groups are likely to encode previously undescribed enzymology that produces compounds with ecologically and/or biotechnologically relevant activities.

Methods

Metagenomic data selection, assembly and binning

Metagenomic datasets from major oceanographical surveys and time-series studies with sufficient sequencing depth were included to maximize the coverage of global ocean microbial communities across ocean basins, depth layers and time. These datasets (Supplementary Table 1 and Fig. 1) included metagenomes from samples collected by Tara Oceans (virus-enriched, n = 190; and prokaryote-enriched, n = 180)12,22, BioGEOTRACES expeditions (n = 480), the Hawaiian Ocean Time-series (HOT, n = 68), the Bermuda-Atlantic Time-series Study (BATS, n = 62)21 and the Malaspina expedition (n = 58)23. Sequencing reads from all metagenomes were quality filtered using BBMap (v.38.71) by removing sequencing adapters from the reads, removing reads that mapped to quality control sequences (PhiX genome) and discarding low quality reads using the parameters trimq = 14, maq = 20, maxns = 0 and minlength = 45. Downstream analyses were performed on quality-controlled reads or if specified, merged quality-controlled reads (bbmerge.sh minoverlap = 16). Quality-controlled reads were normalized (bbnorm.sh target = 40, mindepth = 0) before they were assembled with metaSPAdes (v.3.11.1 or v.3.12 if required)53. The resulting scaffolded contigs (hereafter scaffolds) were finally filtered by length (≥1 kb).

The 1,038 metagenomic samples were grouped into several sets and, for each sample set, the quality-controlled metagenomic reads from all samples were individually mapped against the scaffolds of each sample, resulting in the following numbers of pairwise readset to scaffold mappings: Tara Oceans virus-enriched (190 × 190), prokaryote-enriched (180 × 180), BioGEOTRACES, HOT and BATS (610 × 610) and Malaspina (58 × 58). Mapping was performed using the Burrows–Wheeler-Aligner (BWA) (v.0.7.17-r1188)54, allowing the reads to map at secondary sites (with the -a flag). Alignments were filtered to be at least 45 bases in length, with an identity of ≥97% and covering ≥80% of the read sequence. The resulting BAM files were processed using the jgi_summarize_bam_contig_depths script of MetaBAT2 (v.2.12.1)55 to provide within- and between-sample coverages for each scaffold. The scaffolds were finally binned by running MetaBAT2 on all samples individually with parameters --minContig 2000 and --maxEdges 500 for increased sensitivity. We used MetaBAT2 in lieu of ensemble binning approaches as it was shown to be the best performing single binner in independent benchmarks56 and was found to be 10 to 50 times faster than other usual binners57. To test for the effect of abundance correlation, a randomly selected subsample of the metagenomes (10 from each of the two Tara Ocean datasets, 10 from BioGEOTRACES, five from each time-series and five from Malaspina) was additionally binned using only within-sample coverage information (Supplementary Information).
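The alignment filter described above (at least 45 aligned bases, identity of ≥97% and ≥80% of the read covered) can be illustrated with a minimal Python sketch. It assumes that each alignment has already been summarized into three numbers; the Alignment fields and the function name are hypothetical and are not part of BWA or MetaBAT2, and identity is assumed here to be the number of matching bases divided by the aligned length.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    read_length: int     # total length of the read (bases)
    aligned_length: int  # number of read bases included in the alignment
    matches: int         # aligned bases identical to the scaffold


def passes_filter(aln, min_len=45, min_identity=0.97, min_cover=0.80):
    """Return True if the alignment meets the three thresholds in the text."""
    if aln.aligned_length < min_len:
        return False
    identity = aln.matches / aln.aligned_length      # assumed definition
    coverage = aln.aligned_length / aln.read_length  # fraction of read covered
    return identity >= min_identity and coverage >= min_cover
```

The same function could be reused with min_identity=0.95 for the steps that apply a 95% identity threshold.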

Selection of additional genomes

Additional (external) genomes were included in downstream analyses, namely 830 manually curated MAGs from a subset of the Tara Oceans dataset26, 5,287 SAGs from the GORG dataset20, as well as 1,707 isolate REFs and 682 SAGs from the MAR databases (MarDB v.4)27. For the MarDB dataset, genomes were selected on the basis of the available metadata if the sample type matched the following regular expression: ‘[S|s]ingle.?[C|c]ell|[C|c]ulture|[I|i]solate’.
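As a small illustration of the metadata selection, the regular expression quoted above can be applied with the Python re module. The sample-type strings used in the usage note are hypothetical examples and are not taken from MarDB, and it is assumed that the pattern may match anywhere in the metadata field.

```python
import re

# Pattern quoted verbatim from the text. Inside [...] the '|' is a literal
# character, so '[S|s]' matches 'S', 's' or '|'; this does not affect the
# intended matches.
SAMPLE_TYPE = re.compile(r"[S|s]ingle.?[C|c]ell|[C|c]ulture|[I|i]solate")


def is_selected(sample_type):
    """True if the pattern matches anywhere in the metadata string (assumed)."""
    return SAMPLE_TYPE.search(sample_type) is not None
```

With this sketch, strings such as 'Single cell', 'single-cell' or 'Isolate' would be selected, whereas 'Metagenome' would not.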

Quality evaluation of metagenomic bins and external genomes

The quality of each metagenomic bin and external genome was evaluated using both the ‘lineage workflow’ of CheckM (v.1.0.13) and Anvi’o (v.5.5.0)58,59. Metagenomic bins and external genomes were retained for downstream analyses if either CheckM or Anvi’o reported a completeness/completion of ≥50% and a contamination/redundancy of ≤10%. These metrics were then aggregated into a mean completeness (mcpl) and a mean contamination (mctn) value to classify genome quality according to community standards60 as follows: high quality: mcpl ≥ 90% and mctn ≤ 5%; good quality: mcpl ≥ 70% and mctn ≤ 10%; medium quality: mcpl ≥ 50% and mctn ≤ 10%; fair quality: mcpl ≤ 90% or mctn ≥ 10%. Filtered genomes were further attributed quality scores (Q and Q′) as follows: Q = mcpl − 5 × mctn; and Q′ = mcpl − 5 × mctn + mctn × (strain heterogeneity)/100 + 0.5 × log[N50] (as implemented in dRep61).
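The classification and scoring rules above can be sketched in Python as follows. This is an illustration under two assumptions that are not stated in the text: the quality tiers are evaluated in order, with 'fair' treated as the fallback for retained genomes that meet none of the other tiers, and the logarithm of N50 is taken in base 10. The function names are hypothetical.

```python
import math


def quality_class(mcpl, mctn):
    """Assign a quality tier from mean completeness and contamination (%)."""
    if mcpl >= 90 and mctn <= 5:
        return "high"
    if mcpl >= 70 and mctn <= 10:
        return "good"
    if mcpl >= 50 and mctn <= 10:
        return "medium"
    return "fair"  # assumption: fallback tier for all remaining genomes


def q_score(mcpl, mctn):
    """Q = mcpl - 5 x mctn."""
    return mcpl - 5 * mctn


def q_prime(mcpl, mctn, strain_heterogeneity, n50):
    """Q' as written in the text; base-10 logarithm is an assumption."""
    return (mcpl - 5 * mctn
            + mctn * strain_heterogeneity / 100
            + 0.5 * math.log10(n50))
```

For example, a genome with mcpl = 90, mctn = 2, a strain heterogeneity of 50 and an N50 of 100 kb would obtain Q = 80 and, under the base-10 assumption, Q′ = 83.5.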

Species-level clustering of the genome collection and comparison to other resources

To allow for comparative analyses between different data resources and genome types (MAGs, SAGs and REFs), the full set of 34,799 genomes was dereplicated on the basis of both whole-genome average nucleotide identity (ANI) using dRep (v.2.5.4)61 with a 95% ANI threshold28,62 (-comp 0 -con 1000 -sa 0.95 -nc 0.2) and single-copy marker genes using SpecI63, providing species-level clustering of the genomes. A representative genome was selected based on the maximum quality score defined above (Q′) for each of the dRep clusters, which were considered to be a proxy for species membership.

To estimate mapping rates, all 1,038 metagenomic readsets were mapped against the 34,799 genomes included in the OMD using BWA (v.0.7.17-r1188, -a). Quality-controlled reads were mapped in single-end mode and the resulting alignments were filtered to keep only alignments of ≥45 bp in length and with an identity of ≥95%. The per-sample mapping rate is the number of reads that remained after filtering divided by the total number of quality-controlled reads, expressed as a percentage. Using the same method, each of the 1,038 metagenomes was downsampled to 5 million inserts (Extended Data Fig. 1c) and mapped to the GORG SAGs within the OMD and all of the GEM16. The number of MAGs in the GEM catalogue16 that were recovered from ocean waters was determined on the basis of a keyword query on the source of metagenomes, selecting for ocean water samples (as opposed to marine sediments, for example). Specifically, we selected ‘aquatic’ as the ‘ecosystem_category’, ‘marine’ as the ‘ecosystem_type’ and filtered ‘habitat’ for 'deep ocean', 'marine', 'marine oceanic', 'pelagic marine', 'seawater', 'marine', 'seawater', 'surface seawater', 'surface seawater'. This resulted in 5,903 MAGs (734 high quality), distributed across 1,823 OTUs (here, species).

Taxonomic and functional genome annotation

Prokaryotic genomes were taxonomically annotated using GTDB-Tk (v.1.0.2)64 with the default parameters against the GTDB r89 release13. Anvi’o was used to identify eukaryotic genomes on the basis of domain prediction and completion of ≥50% and redundancy of ≤10%. The taxonomic annotation of a species is defined as that of its representative genome. Excluding eukaryotes (148 MAGs), each genome was functionally annotated by first calling complete genes using prokka (v.1.14.5)65 with the ‘Archaea’ or ‘Bacteria’ parameter specified as appropriate, which also reported non-coding genes and CRISPR regions, among other genomic features. The predicted genes were annotated by identifying universal single-copy marker genes (uscMGs) with fetchMGs (v.1.2)66, assigning orthologous groups with emapper (v.2.0.1)67 based on eggNOG (v.5.0)68 and performing queries against the KEGG database (release 2020-02-10)69. This last step was performed by aligning the proteins to the KEGG database using DIAMOND (v.0.9.30)70 with a query and subject coverage of ≥70%. The results were further filtered on the basis of the bitscore being ≥50% of the maximum expected bitscore (reference against itself) according to the NCBI Prokaryotic Genome Annotation Pipeline71. The gene sequences were additionally used as input to identify BGCs in the genomes using antiSMASH (v.5.1.0)72 with the default parameters and the different cluster blasts turned on. All genomes and annotations have been compiled along with contextual metadata into the OMD, which is available online (https://microbiomics.io/ocean/).

Gene-level profiling

Similar to the methods described previously12,22, we clustered the >56.6 million protein-coding genes from the bacterial and archaeal genomes of the OMD at 95% identity and 90% coverage of the shorter gene using CD-HIT (v.4.8.1)73 into >17.7 million gene clusters. The longest sequence was selected as the representative gene of each gene cluster. The 1,038 metagenomes were then mapped to the >17.7 million cluster representatives with BWA (-a) and the resulting BAM files were filtered to retain only alignments with a percentage identity of ≥95% and ≥45 bases aligned. Length-normalized gene abundance was calculated by first counting inserts from best unique alignments and then, for ambiguously mapped inserts, adding fractional counts to the respective target genes in proportion to their unique insert abundances.
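The counting scheme described above, in which ambiguously mapped inserts are distributed in proportion to the unique insert counts of their target genes, can be sketched in Python as follows. This is an illustration, not the implementation used for the OMD; the function and argument names are hypothetical, and the even split applied when none of the targets has unique inserts is an assumption that the text does not specify.

```python
def gene_abundance(unique_counts, ambiguous_inserts, gene_lengths):
    """Length-normalized abundance with fractional counts.

    unique_counts: gene -> number of uniquely mapped inserts
    ambiguous_inserts: one list of candidate genes per ambiguous insert
    gene_lengths: gene -> length in bases (must include every gene used)
    """
    counts = {g: float(unique_counts.get(g, 0)) for g in gene_lengths}
    for targets in ambiguous_inserts:
        weights = [unique_counts.get(g, 0) for g in targets]
        total = sum(weights)
        for gene, weight in zip(targets, weights):
            # Assumption: split evenly when no target has unique inserts.
            share = weight / total if total > 0 else 1 / len(targets)
            counts[gene] += share
    return {g: counts[g] / gene_lengths[g] for g in gene_lengths}
```

For instance, one insert that maps equally well to genes with three and one unique inserts would contribute 0.75 and 0.25 counts, respectively.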

Species-level profiling with mOTUs

The genomes in the extended OMD (augmented with additional MAGs from ‘Ca. Eudoremicrobiaceae’, see below) were added to the database (v.2.5.1) of the metagenomic profiling tool mOTUs74 to generate an extended mOTUs reference database. Only genomes with at least six out of the ten uscMGs in single copy were kept (23,528 genomes). The extension of the database resulted in 4,494 additional species-level clusters. Profiling of the 1,038 metagenomes was performed using the default parameters of mOTUs (v.2). On the basis of the mOTU profiles, a total of 989 genomes (95% REFs, 5% SAGs and 99.9% belonging to MarDB) contained within 644 mOTU clusters were not detected. This reflects the various additional marine isolation sources (most of the genomes not detected are associated with organisms isolated from, for example, sediments, marine hosts) of the MarDB genomes. To remain focused on the open ocean environment in this study, we excluded them from downstream analyses if they were not detected or not included in the extended mOTU database established in this study.

Clustering and selection of BGCs

All BGCs from MAGs, SAGs and REFs in the OMD (see above) were combined with the ones identified across all the metagenomic scaffolds (antiSMASH v.5.0, default parameters) and processed with BiG-SLICE (v.1.1) for feature (PFAM domains) extraction75. On the basis of these features, we computed all-versus-all cosine distances between BGCs and clustered them (average linkage) into GCFs and GCCs, using a 0.2 and a 0.8 distance threshold, respectively. These thresholds are an adaptation of those previously used with Euclidean distances75 to cosine distances, which alleviate some of the biases of the original BiG-SLICE clustering strategy (Supplementary Information).
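To make the clustering step more concrete, the following Python sketch computes cosine distances between feature vectors and groups them by average linkage up to a distance threshold. It is a naive, standard-library-only illustration for small inputs and is not the code used in this study; the function names are hypothetical, the feature vectors are assumed to be non-zero, and clusters are merged greedily until the closest pair exceeds the threshold.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def average_linkage(vectors, threshold):
    """Group vector indices by average linkage with a distance cut-off."""
    n = len(vectors)
    dist = {(i, j): cosine_distance(vectors[i], vectors[j])
            for i, j in combinations(range(n), 2)}
    clusters = [[i] for i in range(n)]

    def mean_dist(a, b):
        # Average of all pairwise distances between members of a and b.
        total = sum(dist[min(i, j), max(i, j)] for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        x, y = min(combinations(range(len(clusters)), 2),
                   key=lambda p: mean_dist(clusters[p[0]], clusters[p[1]]))
        if mean_dist(clusters[x], clusters[y]) > threshold:
            break  # closest pair is already too distant
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # y > x, so index x is unaffected
    return [sorted(c) for c in clusters]
```

Calling average_linkage with a threshold of 0.2 or 0.8 would correspond to the GCF and GCC levels described above, respectively.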

BGCs were subsequently filtered, retaining only the ones encoded on scaffolds ≥5 kb to reduce the risk of fragmentation, as done previously16, and excluding MarDB REFs and SAGs that were not detected in the 1,038 metagenomes (see above). This resulted in a total of 39,055 BGCs encoded by OMD genomes and an additional 14,106 identified on metagenomic fragments (that is, that were not binned into MAGs). These ‘metagenomic’ BGCs were used to estimate the proportion of the ocean microbiome biosynthetic potential not captured by the database (Supplementary Information). Each BGC was functionally characterized on the basis of predicted product types as defined by antiSMASH or coarser product classes, as defined in BiG-SCAPE76. To prevent sampling biases in quantitative analyses (taxonomic and functional compositions of GCCs/GCFs, GCF and GCC distances to reference databases as well as GCF metagenomic abundances), the 39,055 BGCs were further dereplicated by retaining only the longest BGC per GCF per species, leading to a total of 17,689 BGCs.

Novelty of GCFs and GCCs

The novelty of GCCs and GCFs was estimated on the basis of distances to databases of computationally predicted (the RefSeq database within BiG-FAM)29 and experimentally validated (MIBIG 2.0)30 BGCs. For each of the 17,689 representative BGCs, we selected the minimum cosine distance to the respective database. These minimum distances were then averaged (mean) per GCF or GCC as appropriate. A GCF was considered to be novel if the distance to the database was above 0.2, which corresponds to (on average) the complete separation between the GCF and the reference. For GCCs, we selected 0.4, twice the GCF-defining threshold, to capture remote relationships with the reference.
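The novelty criterion can be summarized with a short Python sketch: the per-BGC minimum distances to a reference database are averaged per family and compared with the threshold. The input structure and function name are hypothetical; only the thresholds (0.2 for GCFs and 0.4 for GCCs) are taken from the text.

```python
from statistics import mean


def family_novelty(min_distances, threshold=0.2):
    """min_distances: family id -> list of per-BGC minimum cosine distances.

    Returns family id -> (mean distance, True if considered novel).
    """
    result = {}
    for family, distances in min_distances.items():
        avg = mean(distances)
        result[family] = (avg, avg > threshold)
    return result
```

For GCCs, the same function would be called with threshold=0.4.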

Abundance and prevalence of GCFs and GCCs

The metagenomic abundance of a BGC was estimated as the median abundance of its biosynthetic genes (as defined by antiSMASH), which were available from the gene-level profiles. The metagenomic abundance of each GCF or GCC was subsequently computed as the sum of its representative BGCs (out of the 17,689). These abundance profiles were subsequently cell-normalized using the mOTU count per sample, which also accounts for the sequencing effort22,74 (Extended Data Fig. 1d). The prevalence of a GCF or GCC was computed as the percentage of samples with an abundance of >0.
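The abundance and prevalence definitions above translate into a few lines of Python. The sketch below is illustrative only; the function names are hypothetical, and the cell normalization is assumed to be a simple division by the per-sample mOTU count.

```python
from statistics import median


def bgc_abundance(biosynthetic_gene_abundances):
    """Median abundance of the biosynthetic genes of one BGC in one sample."""
    return median(biosynthetic_gene_abundances)


def family_abundance(representative_bgc_abundances):
    """Sum over the representative BGCs of a GCF or GCC."""
    return sum(representative_bgc_abundances)


def cell_normalize(abundance, motu_count):
    """Assumed form of the normalization: divide by the mOTU count."""
    return abundance / motu_count


def prevalence(abundances_across_samples):
    """Percentage of samples in which the abundance is greater than zero."""
    detected = sum(1 for a in abundances_across_samples if a > 0)
    return 100.0 * detected / len(abundances_across_samples)
```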

Structure of the ocean microbiome biosynthetic potential

Euclidean distances between samples were computed on the basis of the normalized GCF profiles. These distances were dimensionally reduced using UMAP77 and the resulting embedding was used for unsupervised density-based clustering with HDBSCAN78. The optimal minimum number of points of a cluster (and therefore the number of clusters) used by HDBSCAN was determined by maximizing the cumulative cluster membership probability. The identified clusters (as well as random balanced subsamples of these clusters, to account for biases in permutational multivariate analysis of variance (PERMANOVA)) were tested for significance using PERMANOVA against the non-reduced Euclidean distances. The average genome size of a sample was computed on the basis of relative abundances of mOTUs and the estimated genome sizes of the member genomes. Specifically, the mean genome size for each mOTU was estimated as the mean of the completeness-corrected sizes (for example, the corrected size of a 75% complete genome with a length of 3 Mb is 4 Mb) of its member genomes (after filtering for genomes with a mean completeness of ≥70%). Then, per sample, the average genome size was computed as the sum of the relative abundance-weighted mOTU genome sizes.
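The genome size estimate can be followed step by step in the Python sketch below, which reproduces the worked example from the text (a 75% complete genome of 3 Mb has a corrected size of 4 Mb). The function names and input structures are hypothetical, and relative abundances are assumed to sum to 1 over the mOTUs that have a size estimate.

```python
from statistics import mean


def corrected_size(length_bp, completeness_pct):
    """Completeness-corrected genome size."""
    return length_bp / (completeness_pct / 100.0)


def motu_genome_size(genomes, min_completeness=70.0):
    """Mean corrected size of member genomes: list of (length, completeness %)."""
    sizes = [corrected_size(length, cpl)
             for length, cpl in genomes if cpl >= min_completeness]
    return mean(sizes) if sizes else None


def sample_average_genome_size(relative_abundance, motu_sizes):
    """Sum of relative abundance-weighted mOTU genome sizes."""
    return sum(relative_abundance[m] * motu_sizes[m]
               for m in relative_abundance)
```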

Phylogenomic distribution of BGCs

The filtered set of BGCs encoded by genomes in the OMD (in scaffolds ≥5 kb and excluding MarDB REFs and SAGs that were not detected in the 1,038 metagenomes, see above) along with their predicted product classes were displayed on the GTDB bacterial and archaeal trees on the basis of the GTDB-Tk phylogenomic placement of the genomes (see above). We first reduced the data on a per-species basis, using the genome with the most BGCs in that species as a representative. For visualization, the representatives were further binned along the tree and, similarly, for each binned clade, the genome containing the most BGCs was selected as representative. BGC-rich species (at least one genome with >15 BGCs) were further analysed by computing the Shannon diversity index of the product types encoded in these BGCs. Chemical hybrids and other complex BGCs (as predicted by antiSMASH) were considered to be from the same product type if all of the predicted product types were identical, irrespective of their order within the cluster (for example, a proteusin–bacteriocin hybrid is identical to a bacteriocin–proteusin hybrid).
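The product-type diversity calculation can be sketched in Python as follows. Hybrid BGCs are made order-insensitive by sorting their product types, as described above; the use of the natural logarithm for the Shannon index is an assumption, and the function names are hypothetical.

```python
import math
from collections import Counter


def normalize_product_type(types):
    """Order-insensitive key for a (possibly hybrid) BGC product type."""
    return tuple(sorted(types))


def shannon_index(bgc_product_types):
    """Shannon diversity (natural log, assumed) over normalized product types."""
    counts = Counter(normalize_product_type(t) for t in bgc_product_types)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```

With this normalization, a proteusin–bacteriocin hybrid and a bacteriocin–proteusin hybrid are counted as the same product type.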

Long-read sequencing of ‘Ca. Eudoremicrobium’

Leftover DNA (an estimated 6 ng) from the sample Malaspina MP1648, corresponding to the biosample SAMN05421555 and matching the short-read Illumina metagenomic readset SRR3962772, was processed using an ultralow-input PacBio sequencing protocol to produce a >20 Gb HiFi PacBio metagenome using the PacBio kits SMRTbell gDNA Sample amplification kit (100-980-000) and the SMRTbell Express Template Prep kit 2.0 (100-938-900). In brief, the remaining DNA was sheared using a Covaris (g-TUBE, 52104), repaired and purified (ProNex beads). The purified DNA was then subjected to library preparation, amplified, purified (ProNex beads) and size-selected (>6 kb, Blue Pippin) before a final purification step (ProNex beads) and sequencing on the Sequel II platform.

Targeted binning of ‘Ca. Eremiobacterota’

After the reconstruction of the first two ‘Ca. Eremiobacterota’ MAGs, we identified six additional ones with ANI > 99% (these are included in Fig. 3) that were initially filtered out on the basis of contamination estimates (later identified as gene duplications, see below). We additionally recovered bins identified as ‘Ca. Eremiobacterota’ from a different study23 and used them along with the eight MAGs from our study as a reference for subsampled mapping (5 million reads) of metagenomic reads from 633 eukaryote-enriched (>0.8 μm) samples using BWA (v.0.7.17-r1188, -a flag). On the basis of enriched specific mappings (after 95% alignment identity and 80% read coverage filtering), 10 metagenomes (expected coverage, ≥5×) were selected for assembly and 49 additional metagenomes (expected coverage, ≥1×) for abundance correlation. Using the same parameters as described above, these samples were binned and 10 additional ‘Ca. Eremiobacterota’ MAGs were recovered. These 16 MAGs (which excludes the two that were already in the database) bring the total number of genomes in the extended OMD to 34,815. The MAGs were assigned to taxonomic ranks on the basis of their genomic similarity and GTDB placement. The 18 MAGs were dereplicated using dRep into 5 species (within-species ANIs were >99%) and 3 genera (within-genus ANIs ranged between 85% and 94%)79 within the same family. Species representatives were manually selected on the basis of completeness, contamination and N50. Proposed naming is available in Supplementary Information.

Manual evaluation of ‘Ca. Eremiobacterota’ MAGs

To evaluate the completeness and contamination of ‘Ca. Eremiobacterota’ MAGs, we assessed the presence of uscMGs, in addition to lineage- and domain-specific single-copy marker gene sets used by CheckM and Anvi’o. The identification of duplications among 2 out of the 40 uscMGs was confirmed by phylogenetic reconstruction (see below) to rule out any potential contamination (which would have corresponded to 5% on the basis of these 40 marker genes). Additional inspection of the representative MAGs of the five ‘Ca. Eremiobacterota’ species confirmed low rates of contaminants in these reconstructed genomes on the basis of abundance correlation and sequence composition (Supplementary Information) using the Anvi'o interactive interface59.

Phylogenomics of ‘Ca. Eudoremicrobiaceae’

For phylogenomic analyses, we selected the representative MAGs of the five ‘Ca. Eudoremicrobiaceae’ species, all ‘Ca. Eremiobacterota’ genomes available in GTDB (r89)13 and representatives of additional phyla (including UBP13, Armatimonadota, Patescibacteria, Dormibacterota, Chloroflexota, Cyanobacteria, Actinobacteria and Planctomycetota). All of these genomes were annotated as previously described to extract single-copy marker genes and to annotate BGCs. GTDB genomes were retained on the basis of the completeness and contamination criteria mentioned above. The phylogenomic analysis was performed using the Anvi’o phylogenomics workflow59. The tree was constructed with IQ-TREE (v.2.0.3) (default parameters and -bb 1000)80 on an alignment (MUSCLE, v.3.8.1551)81 of 39 concatenated ribosomal proteins identified by Anvi’o, with positions trimmed for coverage in at least 50% of the genomes82 and using Planctomycetota as the outgroup based on the GTDB tree topology. Individual trees for the 40 uscMGs were built using the same tools and parameters.

Trait and lifestyle prediction of ‘Ca. Eudoremicrobiaceae’

We used Traitar (v.1.1.2) with the default parameters (phenotype, from nucleotides)83 to predict general microbial traits. We investigated the potential predatory lifestyle on the basis of a previously developed predatory index84, which relies on the protein-coding gene content of a genome. Specifically, we used DIAMOND to compare the proteins from a genome to the OrthoMCL database (v.4)85 using the parameters --more-sensitive --id 25 --query-cover 70 --subject-cover 70 --top 20 and counted genes that matched predatory and non-predatory marker genes. The index is the difference between the number of predatory and non-predatory markers. As an additional control, we also analysed the genome of ‘Ca. Entotheonella’ factor TSY118 on the basis of its similar characteristics to ‘Ca. Eudoremicrobium’ (large genome size and biosynthetic potential). We further tested a potential link between predatory and non-predatory marker genes with the biosynthetic potential of ‘Ca. Eudoremicrobiaceae’ and found at most one gene (from either type, that is, predatory/non-predatory, of marker genes) overlapping with BGCs, suggesting that BGCs do not confound the predatory signal. Additional annotations of the genomes to specifically investigate secretion systems, pili and flagella were performed using TXSSCAN (v.1.0.2) for unordered replicons86.

Transcriptomic profiling of ‘Ca. E. taraoceanii’

Transcriptomic profiling was performed by mapping 623 metatranscriptomes from Tara Oceans prokaryote- and eukaryote-enriched fractions22,40,87 (using BWA, v.0.7.17-r1188, -a flag) to the five representative ‘Ca. Eudoremicrobiaceae’ genomes. After 80% read coverage and 95% identity filtering, the BAM files were processed using FeatureCounts (v.2.0.1)88 (using the parameters featureCounts --primary -O --fraction -t CDS,tRNA -F GTF -g ID -p) to compute the number of inserts per gene. The resulting profiles were normalized to gene length and mOTU marker gene abundances (median length-normalized insert count of genes with insert count of >0) and log2-transformed22,74 to obtain relative per-cell expression levels of each gene, which also accounts for between-samples differences in sequencing effort. Such ratios allow for comparative analyses by mitigating the issues of compositionality when working with relative abundance data. Only samples with >5 out of the 10 mOTU marker genes were considered for further analyses, ensuring that a large enough fraction of the genome was detected.
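The normalization described above can be illustrated with the following Python sketch, which converts insert counts into log2-transformed, per-cell expression levels relative to the median of the detected marker genes. The function and argument names are hypothetical, and genes with zero inserts are simply omitted from the output, which is an assumption because the text does not state how zero counts were handled.

```python
import math
from statistics import median


def per_cell_expression(gene_counts, gene_lengths, marker_genes, min_markers=6):
    """Return gene -> log2(length-normalized count / marker median), or None.

    None is returned when fewer than min_markers marker genes are detected
    (the text requires >5 out of the 10 mOTU marker genes).
    """
    norm = {g: gene_counts[g] / gene_lengths[g] for g in gene_counts}
    detected = [norm[m] for m in marker_genes if norm.get(m, 0) > 0]
    if len(detected) < min_markers:
        return None
    reference = median(detected)
    # Assumption: genes without inserts are left out rather than given a
    # pseudocount.
    return {g: math.log2(v / reference) for g, v in norm.items() if v > 0}
```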

The normalized transcriptomic profiles of ‘Ca. E. taraoceanii’ were dimension-reduced using UMAP and the resulting representation was used for unsupervised clustering using HDBSCAN (see above) to identify expression states. The significance of differences between the identified clusters was tested by PERMANOVA in the original (non-reduced) distance space. Differential expression between these states was tested across 201 KEGG pathways identified in the genome (see above) and 6 functional groups, namely, BGCs, secretion systems and flagellar genes from TXSSCAN, degradative enzymes (proteases and peptidases) from prokka and predatory and non-predatory markers from the predatory index. For each sample, we computed the median normalized expression for each category (note that BGC expression was itself computed as the median expression of the biosynthetic genes of that BGC) and tested for significance (FDR-corrected Kruskal–Wallis test) across the different states.

Experimental validation of a novel phosphorylated RiPP pathway (‘Ca. E. malaspinii’, HLLJDLBE BGC 75.1)

Materials for heterologous expression

Synthetic genes were purchased from GenScript and PCR primers were ordered from Microsynth. Phusion polymerase from Thermo Fisher Scientific was used for DNA amplification. NucleoSpin plasmid and NucleoSpin Gel and PCR Clean-up kits from Macherey–Nagel were used to purify DNA. Restriction enzymes and T4 DNA ligase were purchased from New England Biolabs. Chemicals were purchased from Sigma-Aldrich, with the exception of isopropyl-β-d-1-thiogalactopyranoside (IPTG) (Biosynth) and 1,4-dithiothreitol (DTT, AppliChem) and used without further purification. The antibiotics chloramphenicol (Cm), spectinomycin dihydrochloride (Sm), ampicillin (Amp), gentamicin (Gt) and carbenicillin (Cbn) were purchased from AppliChem. The medium components Bacto Tryptone and Bacto Yeast Extract were purchased from BD Biosciences. Sequencing-grade trypsin was purchased from Promega.

Cloning of embA, embM and orf3 (embI) for protein expression

Gene sequences were extracted from BGC 75.1 predicted by antiSMASH on the type material of ‘Ca. E. malaspinii’ (Supplementary Information).

The genes embA (locus, MALA_SAMN05422137_METAG-scaffold_127-gene_5), embM (locus, MALA_SAMN05422137_METAG-scaffold_127-gene_4) and embAM (including intergenic region) were ordered as synthetic constructs in pUC57(AmpR), with and without codon-optimization for expression in Escherichia coli. The gene embA was subcloned into the first multiple cloning site (MCS1) of pACYCDuet-1(CmR) and pCDFDuet-1(SmR) with BamHI and HindIII cut sites. The genes embM and embMopt (codon optimized) were subcloned into MCS1 of pCDFDuet-1(SmR) with BamHI and HindIII and in the second multiple cloning site (MCS2) of pCDFDuet-1(SmR) and of pRSFDuet-1(KanR) with NdeI/XhoI. The embAM cassette was subcloned into pCDFDuet1(SmR) with the BamHI and HindIII cut sites. The gene orf3/embI (locus, MALA_SAMN05422137_METAG-scaffold_127-gene_3) was constructed by overlap extension PCR with primers EmbI_OE_F_NdeI and EmbI_OE_R_XhoI, digested with NdeI/XhoI and ligated to pCDFDuet-1-EmbM(MCS1), which was digested using the same restriction enzymes (Supplementary Table 6). Restriction enzyme digestions and ligations were performed according to the manufacturer’s (New England Biolabs) procedures.

All constructs generated above were introduced into chemically competent E. coli DH5α and plated onto LB agar with appropriate antibiotic selection. Plasmids were purified from single colonies and sequenced using sequencing primers to verify proper insertion of genes (Supplementary Table 6). The genes embA and embAM were additionally subcloned in a modified pLMB509m(GtR) vector for M. aerodenitrificans expression through Gibson assembly, with the inclusion of an N-terminal His6 purification tag in the final EmbA protein product89. A list of the Gibson assembly primers EmbA_F_plmb, EmbA_R_plmb, Plmb_F_EmbA and Plmb_R_EmbA is provided in Supplementary Table 6. Transformation of E. coli DH5α, isolation of plasmids and validation of the correct clones by sequencing was followed by introduction of NHis6-EmbA-pLMB509m and NHis6-EmbAM-pLMB509m into E. coli SM10 for conjugation.

Heterologous expression and purification for protein isolation for in vitro assays

Chemically competent E. coli BL21(DE3) was transformed with pCDFDuet-1-EmbA(MCS1) or pCDFDuet-1-EmbM(MCS1). The same conditions were used for expression of both N-terminally His6-tagged proteins. Overnight cultures were prepared from single colonies and used to inoculate (1%, v/v) TB medium (2 × 200 ml) in 1 l baffled Erlenmeyer flasks supplemented with spectinomycin (50 mg ml−1). Cells were grown at 37 °C, 200 rpm, until an optical density at 600 nm (OD600) of around 1.0, cooled on an ice bath for 20 min and induced with a final concentration of 0.5 mM IPTG. The cultures were further incubated at 16 °C, 180 rpm for 18–20 h, and subsequently collected by centrifugation (8,000g for 20 min) and frozen.

Purifications of NHis6-EmbA and NHis6-EmbM were carried out at 4 °C, using the same procedure for both proteins. Cell pellets were resuspended in 5 ml g−1 of lysis buffer (50 mM Tris, 300 mM NaCl, 5 mM imidazole, 10% glycerol, pH 7.8). The suspension was supplemented with lysozyme (1 mg ml−1), DNase I (10 U ml−1) and protease inhibitor cocktail (0.2%, v/v) and stirred at 37 °C for 30 min. After cooling the suspension for 15 min on an ice bath, cells were lysed by sonication (30% amplitude, 10 s on/off cycles, for a total of 3 min), and the clarified lysate was obtained by centrifugation (27,000g for 30 min). The supernatant was loaded onto 4 ml of Ni-NTA agarose resin that had been equilibrated with lysis buffer in a fritted purification column. The resin was washed with 10 column volumes (CV) of lysis buffer and 3 CV of wash buffer (50 mM Tris, 300 mM NaCl, 40 mM imidazole, 10% glycerol, pH 7.8), and the protein was finally eluted with 3 CV of elution buffer (50 mM Tris, 300 mM NaCl, 250 mM imidazole, 10% glycerol, pH 7.8) in 1.5 ml fractions. Elution fractions were analysed by SDS–PAGE, pooled and concentrated in spin concentrators with the appropriate molecular weight cut-off. NHis6-EmbA and NHis6-EmbM were buffer-exchanged using a PD MiniTrap G25 column pre-equilibrated with G25 buffer (50 mM Tris, 300 mM NaCl, 10% glycerol, pH 8.0). The concentration of buffer-exchanged proteins was determined by measuring the absorbance of purified proteins at 280 nm and using the calculated values for molecular mass and extinction coefficient for each protein.
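The conversion from absorbance at 280 nm to protein concentration follows the Beer–Lambert law (A = ε × c × l) and can be sketched in Python as follows. The function names are hypothetical, a path length of 1 cm is assumed by default, and the numerical values in the usage note are illustrative and are not the extinction coefficients or molecular masses of EmbA or EmbM.

```python
def protein_concentration(a280, extinction_coefficient, path_cm=1.0):
    """Molar concentration (mol per litre) from the Beer-Lambert law.

    extinction_coefficient is given in M^-1 cm^-1.
    """
    return a280 / (extinction_coefficient * path_cm)


def to_mg_per_ml(molar_concentration, molecular_mass):
    """Convert mol per litre to mg per ml (g per litre) using mass in g/mol."""
    return molar_concentration * molecular_mass
```

For example, an absorbance of 0.5 with a hypothetical extinction coefficient of 10,000 M−1 cm−1 corresponds to 50 μM, or 0.6 mg ml−1 for a hypothetical molecular mass of 12 kDa.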

In vitro enzymatic activity assays with EmbA and EmbM

Extensive screening of enzymatic reaction parameters, including temperature, time, enzyme and substrate concentration, buffer pH and salinity resulted in the following best condition set for EmbM turnover: EmbA was added to a glass vial to a final concentration of 200 μM. Final concentrations of 2 mM of MgCl2 and 2 mM of adenosine 5′-triphosphate (ATP) were added to the reactions. EmbM was not added to control reactions and added to a final concentration of 10 μM in turnover experiments. The enzymatic reaction was stirred at 37 °C for 72 h, and the reaction mixture was supplemented with 2 mM of ATP every 24 h. Reaction scales ranged from 100 μl, for analytical purposes, to 3 ml for product isolation.

Modified EmbA was proteolysed with trypsin for MS analysis and with LahT150 for MS analysis and product isolation90. Trypsin cleavage was performed by diluting the reaction mixture with 2× trypsin buffer (50 mM Tris, 5 mM CaCl2, pH 8.0), adding 1:20 trypsin:EmbA and incubating overnight at 37 °C. LahT150 cleavage was performed by adding LahT150 at a 1:10 ratio to EmbA, and incubating at room temperature overnight for small-scale reactions, or for 24 h, for reactions larger than 1 ml.

Large-scale enzymatic reactions were purified using solid-phase extraction (SPE) with a Phenomenex Strata C18-E reverse phase column (2 g sorbent). The sorbent was first washed with 24 ml MeOH and equilibrated with 24 ml H2O (+0.1% formic acid). The proteolysis reactions were loaded onto the sorbent, which was then washed with 24 ml H2O (+0.1% formic acid). The peptide products were eluted with 24 ml of 1:1 MeCN:H2O (+0.1% formic acid) and 12 ml MeCN (+0.1% formic acid). Elution fractions were pooled, dried on a Genevac concentrator and stored at −20 °C.

Co-expression of EmbA, EmbM and Orf3 (EmbI) in heterologous hosts and purification

A wide variety of E. coli co-expression conditions of EmbA, EmbM and Orf3 were screened (Supplementary Table 6). In general, chemically competent BL21(DE3) or Tuner(DE3) were transformed with different combinations of the constructs described above, and selected with appropriate antibiotics on LB plates. Overnight cultures were prepared from single colonies and 1% (v/v) of culture was used to inoculate 200 ml of medium (TB, LB, XPPM)91 supplemented with appropriate antibiotic selection in 1 l baffled Erlenmeyer flasks. Cells were grown at 37 °C, 200 rpm until an OD600 of around 1.2. For low-temperature growths, cultures were subsequently cooled in ice baths for 20 min, induced with 0.5 mM IPTG and incubated at 16 °C at 180 rpm. Incubation times varied from 24 h to 7 days. High temperature growths were induced with a final concentration of 0.5 mM IPTG without cooling and incubated at 37 °C, 200 rpm from 6 h to 16 h. Cultures were collected by centrifugation (8,000g for 20 min), frozen and purified as described above. Proteolysis reactions with trypsin or LahT150 for MS analysis were performed as described in the previous section.

Transformation of M. aerodenitrificans DSMZ 15089 with NHis6-EmbA-pLMB509m and NHis6-EmbAM-pLMB509m was achieved according to a published procedure89. Culturing conditions followed an adaptation of a previously described method89. Cultures were collected, stored and purified as described above. Trypsin digestion was used for MS analysis.

β-Elimination of phosphorylated peptides

Phosphorylated peptide intermediates obtained by co-expression of EmbA and EmbM were submitted to β-elimination conditions at the 100 μl scale. EmbA (200 μM) in G25 buffer was used either as an intact protein or as a trypsin-digest product. The pH of the solution was adjusted to pH 13 with 1 M NaOH, and the elimination reaction proceeded at 37 °C for 4 h to afford dehydrobutyrine (Dhb)-containing products. Derivatization of Dhb was performed by adding a final concentration of 50 mM DTT to the reaction. The pH of the solution was adjusted to 7 with HCl (aq.) before MS analysis.

β-Elimination of phosphorylated peptide intermediates obtained through in vitro enzymatic reactions was performed using EmbA that had been proteolysed with LahT150 and purified using SPE as described above. EmbA (0.6 μmol) was resuspended in 3 ml of H2O, and the pH of the solution was adjusted to 14 with 1 M NaOH. The reaction was stirred at 37 °C for 48 h, and was subsequently neutralized to pH 7 with HCl (aq.). SPE purification was performed as described above.

HPLC–HR-MS and MS/MS analysis

HPLC–HR-MS and MS/MS analyses were performed on a Thermo Scientific Dionex UltiMate 3000 UHPLC coupled to a Thermo Scientific Q Exactive mass spectrometer using heated electrospray ionization in positive ion mode with a Phenomenex Kinetex 2.6 μm XB-C18 100 Å (150 × 4.6 mm) column. The column temperature was set to 50 °C and the flow rate to 0.5 ml min−1. Samples were centrifuged before injections and target peptides were eluted with a gradient of 15–55% MeCN (+0.1% formic acid) over 15 min. Full MS was performed at a resolution of 70,000 (AGC target, 1 × 10^6; maximum IT, 100 ms) and parallel reaction monitoring was performed at a resolution of 17,500 (AGC target, 2 × 10^5; maximum IT, 100 ms; isolation windows in the range of 2.0 m/z).

NMR analysis

2D [13C,1H] HSQC spectra with multiplicity editing were recorded at natural 13C abundance on ~4 mM solutions of full-length EmbA in unmodified and modified form. Spectra were recorded at 25 °C on a Bruker AVNEO 600 MHz spectrometer equipped with a TCI CryoProbe. The following spectral parameters were used: 2,048 complex points at a spectral width of 16 ppm, centred at 4.7 ppm in the direct 1H dimension and 512 complex points at a spectral width of 80 ppm centred at 42 ppm in the indirect 13C dimension. Each spectrum was acquired with 40 scans, which resulted in a measurement time of 24 h per spectrum.

Antibiotic activity assays

E. coli DSM 1103, Staphylococcus aureus ssp. aureus ATCC 29213, Pseudomonas aeruginosa DSM 1117, Acinetobacter baumannii DSM 30007, Enterococcus faecalis DSM 2570, Rhodococcus sp. L233, Aquimarina sp. Aq135, Rheinheimera aquimaris B26, Vibrio spartinae (salt marsh isolate), Pseudoalteromonas rubra DSM 6842, Saccharomyces cerevisiae W301-1A and Pichia pastoris (Komagataella phaffii) NRRL Y-11430 were tested for susceptibility to the dehydrated peptide from BGC 75.1. Bioactivity assays were carried out in accordance with the 2003 guidelines of the Clinical and Laboratory Standards Institute (CLSI) using the microtitre method.

E. coli, S. aureus, A. baumannii and P. aeruginosa overnight cultures were grown in LB at 37 °C. R. aquimaris (37 °C), V. spartinae (30 °C), P. rubra and Aquimarina sp. Aq135 (24 °C) were cultured in marine broth at their respective growth temperature optima indicated in parentheses. Rhodococcus sp. L233 was cultured in R2A medium at 24 °C. S. cerevisiae and P. pastoris were cultured in YPD medium at 28 °C. An overview of strains, growth conditions and taxonomy is provided in Supplementary Table 6.

Microbial seed cultures were initiated by inoculating 5 ml of medium for each strain and incubating overnight with shaking at 200 rpm. Each culture was then diluted with its respective growth medium to an initial OD600 of 0.02 in 80 μl volume per well in sterile 96-well plates (one per strain tested). Assays were set up in duplicate, with appropriate controls including solvent (water) controls and positive controls consisting of two different broad-spectrum antibiotics (chloramphenicol and ampicillin) at final concentrations of 50 µM. Cycloheximide and benomyl were used as positive controls for S. cerevisiae and P. pastoris.
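The dilution to an initial OD600 of 0.02 follows the usual C1V1 = C2V2 relation. The following is a minimal Python sketch of that arithmetic, assuming that OD600 scales linearly with cell density over the measured range; the function name, the seed-culture OD600 of 2.0 and the pooled volume for 96 wells are hypothetical values chosen for illustration and are not taken from the protocol.

```python
def dilution_volumes(measured_od, target_od=0.02, final_volume_ul=80.0):
    """Return (culture_ul, medium_ul) needed to reach target_od in final_volume_ul.

    Uses C1 * V1 = C2 * V2 and assumes OD600 is proportional to cell density.
    """
    if measured_od <= target_od:
        raise ValueError("seed culture must be denser than the target OD600")
    culture_ul = target_od * final_volume_ul / measured_od
    return culture_ul, final_volume_ul - culture_ul


# Hypothetical example: seed culture at OD600 = 2.0, inoculum for 96 wells of 80 µl each
culture_ul, medium_ul = dilution_volumes(2.0, final_volume_ul=96 * 80.0)
print(f"{culture_ul:.1f} µl culture + {medium_ul:.1f} µl medium")
```

In practice, dense overnight cultures would typically be pre-diluted before measuring OD600 so that the reading stays within the linear range of the photometer.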

Two final concentrations of the peptide, using water as a solvent, were tested: 50 μM and 25 μM. Then, 96-well plates were parafilmed and incubated at room temperature without shaking. OD600 was determined after 10 s of plate agitation at the following time points: 2 h, 4 h, 6 h, 8 h, 18 h, 21 h, 24 h and 48 h.

Protease inhibition assays

Inhibition assays against neutrophil elastase and cathepsin B were performed using the Neutrophil Elastase Inhibitor Screening Kit (MAK213, Sigma-Aldrich) and the Cathepsin B Inhibitor Screening Kit (MAK200, Sigma-Aldrich). Assays were performed according to the manufacturer’s protocol. A microplate reader spectrofluorometer (Tecan Infinite M200 Pro) was used to measure fluorescence and data were processed in Prism 9 to calculate IC50 values.

Inhibition assays with trypsin and chymotrypsin were set up in 96-well microtitre plates (black, clear bottom). To each well, 25 µl of stock solutions of different concentrations of peptides in 50 mM Tris, pH 8 buffer (assay buffer); 2 µl of enzyme stock solution (chymotrypsin, V1062, Promega, 1 nM final concentration; trypsin, V5111, Promega, 3 nM final concentration); and 48 µl of protease buffer (40 mM Tris, 10 mM CaCl2, pH 8) were added. Phenylmethylsulfonyl fluoride was used as a positive control. The plates were incubated at room temperature for 10 min. Subsequently, 23 µl of assay buffer and 2 µl of a 500 µM solution of substrate (chymotrypsin: Suc-Ala-Ala-Pro-Phe-AMC, 3114-v, Peptanova; trypsin: Boc-Ile-Glu-Gly-Arg-AMC, 3094-v, Peptanova) in DMSO were added to the wells. Enzyme activity was measured at 37 °C for 1 h, by measuring the fluorescence emission of the hydrolysed product (λex = 342 nm, λem = 440 nm) in a microplate reader spectrofluorimeter (Tecan Infinite M200 Pro). Data were processed in Prism 9 to calculate IC50 values.

MTT assays

Inhibition against HeLa cells was tested for the phosphorylated, chemically dehydrated and control (unmodified) forms of peptide 75.1. Stock HeLa cells were resuspended in 10 ml HEPES-buffered high-glucose Dulbecco's modified Eagle medium (DMEM) supplemented with GlutaMAX (Gibco). The medium also contained 10% fetal calf serum and 50 mg ml−1 gentamicin. The cells were centrifuged for 5 min at 1,000g and room temperature. The medium was discarded, and the cells were resuspended in 10 ml fresh medium. The cells were put in a culture dish and incubated for 3–4 days at 37 °C. The cells were checked under the microscope and treated further only when 60–80% of the surface was covered with cells. The medium was removed from the culture flask and the cells were washed with 10 ml phosphate-buffered saline (PBS). The PBS was discarded and the cells were treated with 2 ml trypsin-EDTA solution. When the cells were detached, 10 ml of medium was added and the suspension was centrifuged for 5 min at 1,000g and room temperature. The supernatant was discarded and 10 ml fresh medium was added. Then, 2 ml of the cell suspension was put into a fresh culture flask containing 10 ml medium. Cells healthy enough for cytotoxicity assays were counted and diluted to 10,000 cells per ml. Then, 96-well plates were filled with 200 μl cell suspension per well. All of the plates were incubated overnight at 37 °C. The outer wells were not used for the assay. The phosphorylated, chemically dehydrated and control (unmodified) forms of peptide 75.1 were added to the B lane wells at a starting concentration of 100 μM (2 μl of 1.25 mM stock solutions in DMSO). Doxorubicin was used as a positive control at 1 mg ml−1, and DMSO was used as a negative control. A total of 50 μl of lane B was transferred into lane C and mixed, and this transfer to the adjacent lane was repeated until lane G. The plates were then incubated for 3 days. 
Then, 50 μl of 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyltetrazolium bromide (MTT) (1 mg ml−1 in water) was added to all of the wells and incubated for 3 h at 37 °C. The supernatant was discarded and 150 μl of dimethyl sulfoxide (DMSO) was added to all of the wells. Absorbance was measured at 570 nm and the IC50 was calculated using the ‘drc’ package in R.

Experimental validation of a novel proteusin pathway (‘Ca. E. malaspinii’, HLLJDLBE BGC 3.1)

Cloning (E. coli and M. aerodenitrificans)

Genes of the type material of ‘Ca. E. malaspinii’ (Supplementary Information) in the antiSMASH-predicted BGC 3.1 (region 1) on MALA_SAMN05422137_METAG-scaffold_4, including Nhis-ereA (gene_139), ereI (gene_140), ereM (gene_141), ereD (gene_142) and ereB (gene_143) were codon-optimized and ordered as synthetic constructs as described in Supplementary Table 5 for expression in E. coli. Genes were further subcloned by Gibson assembly using the primers listed in Supplementary Table 5. All generated constructs were introduced into chemically competent E. coli DH5α and plated on LB agar with appropriate antibiotic selection. Plasmids were purified from single colonies and Sanger sequencing was used to verify proper insertion of genes (Supplementary Table 5).

Nhis-ereAD, Nhis-ereAIMD, Nhis-ereADB and Nhis-ereAIMDB with intergenic regions substituted with the M. aerodenitrificans ribosome-binding site (TTAGGAGAGTGCGGG) were subcloned by Gibson assembly in a modified pLMB509m(GtR) vector89. Introduction into E. coli DH5α, isolation of plasmids and validation of correct clones by sequencing were followed by introduction of all constructs into E. coli SM10 for conjugation into M. aerodenitrificans. Conjugation was performed as described previously89.

Heterologous expression and protein purification

Chemically competent E. coli BL21(DE3) cells were transformed with all of the constructs listed in Supplementary Table 5. Overnight cultures were prepared from single colonies and used to inoculate (1%, v/v) TB medium (1 × 50 ml) in 250 ml baffled Erlenmeyer flasks supplemented with appropriate antibiotics. Cells were grown at 37 °C, 200 rpm, until an OD600 of around 0.8–1.2 and induced with a final concentration of 1 mM IPTG. The cultures were further incubated at 16 °C, 180 rpm for 16 h, 24 h or 48 h as specified per experiment and subsequently collected by centrifugation (6,000g for 15 min) and frozen.

For purification from M. aerodenitrificans, 20 ml precultures in nutrient broth with the appropriate antibiotics were inoculated from freezer stocks and grown overnight at 30 °C. Then, 1% (v/v) of the precultures was used to inoculate 20 ml TB starter cultures with the appropriate antibiotics. Subsequently, 1% (v/v) of the starter cultures was used to inoculate 200 ml cultures in 1 l non-baffled Erlenmeyer flasks. Cells were grown at 30 °C, 180 rpm, until an OD600 of around 0.8–1.2 and induced with a final concentration of 0.2% l-arabinose. The cultures were further incubated at 30 °C, 180 rpm for 16 h, 24 h, 48 h or 72 h as specified for each experiment and subsequently collected by centrifugation (6,000g for 15 min) and frozen.

For protein purification, cell pellets were resuspended in 5 ml g−1 of lysis buffer (50 mM Tris, 300 mM NaCl, 5 mM imidazole, 10% glycerol, pH 8.0). Cells were lysed by sonication (30% amplitude, 10 s on/off cycles, for a total of 3 min), and the clarified lysate was obtained by centrifugation (27,000g for 45 min). The supernatant was loaded onto 1 ml of Ni-NTA agarose resin that had been equilibrated with a lysis buffer in a fritted purification column. The resin was washed with 10 column volumes (CV) of lysis buffer, 3 CV of wash buffer (50 mM Tris, 300 mM NaCl, 40 mM imidazole, 10% glycerol, pH 7.8) and finally eluted with 3 CV of elution buffer (50 mM Tris, 300 mM NaCl, 250 mM imidazole, 10% glycerol, pH 7.8). Elution fractions were analysed by SDS–PAGE, pooled and buffer-exchanged into SGT buffer (50 mM Tris, 300 mM NaCl, 10% glycerol, pH 8.0) in Amicon Ultra-4 10K spin filter devices (Millipore Sigma). The concentration of spun, buffer-exchanged proteins was determined using the Roti-Quant Bradford reagent (Carl Roth).

Proteolytic cleavage for analysis of core peptides and generation of the core region

For the LahT150 digest, 8 µl of purified LahT150 (ref. 90) at a concentration of 20 µM was added at a 1:10 ratio to 72 µl of approximately 200 µM of spun-concentrated protein in glass MS vial inserts. Digests were incubated overnight at room temperature and analysed by HPLC–HR-MS and MS/MS after approximately 16 h.

For the proteinase K digest, 16 μl of the elution was mixed with 20 μl of proteinase K buffer (100 mM Tris, 4 mM CaCl2, pH 8.0) and 4 μl of proteinase K (2 mg ml−1). Proteolytic cleavage was carried out in PCR tubes (12 h at 50 °C) and analysed by HPLC–HR-MS and MS/MS after approximately 16 h.

Marfey’s analysis of amino acids contributing to the matured precursor peptides

Advanced Marfey’s analysis was performed as described previously92,93. Before Marfey's analysis, 500 µg of matured Nhis-EreA precursor peptide was hydrolysed in 6 M HCl at 110 °C for 16 h. The mixture was dried in a speedvac concentrator and washed twice using 400 µl water. The dried samples were solubilized in 75 µl water, 25 µl 1 N NaHCO3 and 125 µl Nα-(2,4-Dinitro-5-fluorophenyl)-l-valinamide (4 mg ml−1 in acetone) and the mixture was heated for 1 h at 45 °C. The reaction was neutralized with 25 µl 1 N HCl and diluted with 250 µl acetonitrile/water (1:1). The samples were transferred to an HPLC vial and subsequently analysed using HPLC–HR-MS. Amino acids were verified by mass and retention time in comparison with authentic amino acid standards treated in the same way as described above.

HPLC–HR-MS method for Marfey’s analysis

Analytical HPLC–HR-MS samples (5–20 µl injections) were separated on a Dionex Ultimate 3000 RS UHPLC equipped with the Phenomenex Kinetex C18 (2.6 μm, 100 Å, 150 × 4.6 mm) column heated to 50 °C. HPLC separation was performed according to the following method (solvent A, H2O + 0.1% FA; solvent B, ACN + 0.1% FA; flow rate: 1 ml min−1): 2 min at 30% B; 2–9 min from 30% to 100% B; 9–10 min at 100% B; 10–10.1 min from 100% to 30% B; 10.1–14 min at 30% B.
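The gradient for Marfey's analysis can be written as a set of time and %B breakpoints joined by linear segments. The Python sketch below encodes the breakpoints listed above and interpolates the solvent B fraction at any time point; the function and variable names are illustrative, and pump dwell volume and mixing delays are ignored.

```python
from bisect import bisect_right

# (time in min, % solvent B) breakpoints of the Marfey's analysis gradient
GRADIENT = [(0.0, 30.0), (2.0, 30.0), (9.0, 100.0),
            (10.0, 100.0), (10.1, 30.0), (14.0, 30.0)]


def percent_b(t_min, gradient=GRADIENT):
    """Linearly interpolate the programmed % solvent B at time t_min."""
    times = [t for t, _ in gradient]
    if not times[0] <= t_min <= times[-1]:
        raise ValueError("time lies outside the programmed gradient")
    i = bisect_right(times, t_min) - 1
    if i == len(gradient) - 1:  # exactly at the final breakpoint
        return gradient[-1][1]
    (t0, b0), (t1, b1) = gradient[i], gradient[i + 1]
    return b0 + (b1 - b0) * (t_min - t0) / (t1 - t0)


for t in (1.0, 5.5, 9.5, 12.0):
    print(f"{t:4.1f} min: {percent_b(t):5.1f}% B")
```

The same breakpoint representation can be reused for the other gradients in this section by swapping in their time and %B pairs.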

The HPLC was coupled to a Thermo Scientific Q Exactive Hybrid Quadrupole-Orbitrap Mass Spectrometer using heated electrospray ionization in positive-ion mode (spray voltage, 3,500 V; capillary temperature, 268.75 °C; probe heater temperature, 350 °C; S-lens level, 70). Full MS was detected at a resolution of 70,000 (AGC target, 1 × 10⁶; maximum IT, 100 ms).

Purification of SAM for in vitro EreM assays

Before the enzymatic assays, 25 mg of commercially available SAM (Sigma-Aldrich) was dissolved in 1 ml and injected in five portions (200 µl each) to conduct additional purification on an Agilent 1260 HPLC equipped with a SemiPrep Hydro RP (4 μm, 80 Å, 250 × 10 mm) column at ambient temperature. HPLC separation was performed using the following method (solvent A, 2.5 mM ammonium acetate; solvent B, 75% MeOH; flow rate, 2.5 ml min−1): 2 min at 15% B; 2–7 min from 15% to 30% B; 7–10 min from 30% to 100% B; 10–16 min at 100% B; 16–17 min from 100% to 15% B; 17–22 min at 15% B.

The purification process was monitored at 260 nm. SAM eluted in a broad peak at 6–8 min and was collected in a Falcon tube on ice, immediately flash-frozen after collection in liquid nitrogen and subsequently lyophilized. The related impurities eluted later from the column and were discarded. The lyophilized pure SAM was dissolved in 20 mM HCl and aliquoted as 100 mM stock solutions that were stored at −80 °C until further use.

In vitro EreM assays

SAM (0.5 mM), 1 µM 5′-methylthioadenosine nucleosidase (MTAN), 50 µM epimerized Nhis-EreA precursor, 30 mM MgCl2 and 5 µM EreM were dissolved in SGT buffer (50 mM Tris, 300 mM NaCl, 10% glycerol, pH 8.0) to a total volume of 100 µl and incubated in glass MS vial inserts overnight at 37 °C. Then, 5 µM LahT was added to the in vitro assays and incubated for 2 h at room temperature before analysis by HPLC–HR-MS/MS.

In vitro EreM assays with 13C-labelled SAM

13C-SAM (0.5 mM), 1 µM MTAN, 50 µM epimerized Nhis-EreA precursor, 30 mM MgCl2 and 5 µM EreM were dissolved in SGT buffer (50 mM Tris, 300 mM NaCl, 10% glycerol, pH 8.0) to a total volume of 100 µl and incubated in glass MS vial inserts overnight at 37 °C. Then, 5 µM LahT was added to the in vitro assays the next day and incubated for 2 h at room temperature before analysis by HPLC–HR-MS/MS.

In vitro EreM assays with 13C-labelled 13CH3-SAM using an enzyme cascade starting with 13C-labelled 13CH3-l-methionine for NMR spectroscopy

13CH3-l-methionine (5 mM), 10 mM adenosine-triphosphate, 100 mM KCl, 30 mM MgCl2, 1 µM MTAN, 11 µM SAM synthase, 40 µM epimerized Nhis-EreA precursor and 40 µM EreM were dissolved in 50 mM Tris at pH 8.0 to a total volume of 500 µl and incubated in an Eppendorf tube overnight at 37 °C. Then, 50 µl of D2O was added to the assay mixture and transferred into an NMR tube. Proton and HSQC spectroscopy was performed on the Bruker 600 MHz NMR spectrometer equipped with a cryoprobe. To determine the location of protons attached to 13C-labelled carbons, an additional 1H NMR was recorded with parameters that enable decoupling of carbons and protons. These parameters cause a splitting of the respective protons attached to 13C-labelled carbons.

In vitro EreI assays

(NH4)2Fe(SO4)2 (800 µM) was incubated with 50 µM EreI in SGT buffer (50 mM Tris, 300 mM NaCl, 10% glycerol, pH 8.0) for 20 min on ice. Then, 1 mM 2-oxoglutarate and 1 mM dithiothreitol were added and incubated for another 20 min on ice. Epimerized Nhis-EreA precursor (5 µM) was added to yield a total volume of 100 µl and the reaction mixture was incubated in Eppendorf tubes for 20 min at 30 °C. The reaction was quenched by heating the assay mixture at 95 °C for 10 min. LahT (5 µM) was added to the in vitro assays and incubated for 2 h at room temperature before analysis using HPLC–HR-MS/MS.

Site-directed mutagenesis to generate core variants

Primers for site-directed mutagenesis (Supplementary Table 5) were synthesized and used to amplify template DNA from EreAD-pET Duet (AmpR). Mutagenesis was accomplished using PCR amplification, KLD treatment and enrichment, and transformation into E. coli DH5α for isolation of plasmids and validation of the correct clones by sequencing. Owing to the highly repetitive nature of the core sequence, truncated core variants were also generated during site-directed mutagenesis; these were likewise tested for EreM modification (Supplementary Table 5).

Orthogonal D2O-based induction system for labelling epimerized residues

E. coli BL21 (DE3) cells were co-transformed with Nhis-ereA in pACYCDuet-1 and ereD in pCDFBAD/Myc-His A and plated on LB agar containing chloramphenicol (25 μg ml−1) and ampicillin (100 μg ml−1) and grown for 20 h at 37 °C or until colonies appeared. These colonies were used to inoculate 20 ml LB with chloramphenicol (25 μg ml−1) and ampicillin (100 μg ml−1), and the cultures were grown overnight. The next day, four separate 50 ml Falcon tubes containing TB medium (15 ml), chloramphenicol (25 μg ml−1) and ampicillin (100 μg ml−1) were inoculated with 150 μl of the overnight culture and shaken at 37 °C, 250 rpm to an OD600 of 1.6–2. Cultures were cooled on ice for 30 min, induced with IPTG (0.1 mM final concentration) and shaken (180 rpm at 16 °C) for 18 h. The cultures were centrifuged (10 min at 10,000g) and the supernatant was removed. The cell pellets were then washed with TB medium (1 × 15 ml) to remove any residual IPTG. This was followed by two washes with 1 ml TB in D2O. The washed cell pellets were resuspended in 15 ml TB medium in D2O containing ampicillin (100 μg ml−1 in D2O) and l-arabinose (0.2% w/v in D2O) and shaken (180 rpm at 16 °C) for 18 h. The cultures were combined and centrifuged (30 min at 15,000g) and the pellets were resuspended in 4 ml lysis buffer per gram of cell pellet and purified as described above. For the control, the same procedure was followed with normal TB medium.

Generation of the M. aerodenitrificans Δaer mutant

The suicide vector pSW8197 (ref. 94) was used as a basis to create a stable and markerless deletion of the aeroneamide (aer) cluster in M. aerodenitrificans. The primer pairs Aer1f/r and Aer2f/r were used to amplify 500 bp homologous regions up- and downstream of the aer cluster. The resulting DNA products were assembled into PCR-amplified pSW8197 (pSWAerKO-f/r) using Gibson assembly and transformed into E. coli SY327 electrocompetent cells and sequence-verified after plasmid extraction from the resulting colonies.

The resulting plasmid (pSW8197_aerKO) was then transformed into chemically competent E. coli ST18 donor cells95 and selected with 50 µg ml−1 kanamycin and 5-aminolevulinic acid (required by E. coli ST18). Conjugation with the M. aerodenitrificans wild type was performed as previously described, plated on LB agar plates with 50 µg ml−1 5-aminolevulinic acid and incubated at 37 °C for 24 h. Integrants were selected by plating on nutrient agar plates containing 25 µg ml−1 kanamycin and 400 µg ml−1 ampicillin at 30 °C and confirmed by PCR (aerKO seqF/pswAraC-R). Positive integrants were grown non-selectively in 5 ml nutrient broth overnight at 30 °C and plated on nutrient agar with 0.5% (w/v) l-arabinose (to induce the counter-selectable ccdB toxin) and 100 µg ml−1 ampicillin, resulting in colonies with successful deletions or wild-type revertants. Successful deletion mutants of the aer cluster were identified using PCR with the primers aerKO seqF/R, which anneal to regions on the genome outside the flanking regions used to construct the vector. The resulting PCR product was verified by sequencing.

Phylogenetic analysis of FkbM-family proteins

PfamScan classified EreM from ‘Ca. E. malaspinii’ as belonging solely to the FkbM methyltransferase (PF05050) family. To identify other FkbM-family proteins involved in natural product biosynthesis, the FkbM-family methyltransferase HMM (Methyltransf_21.HMM in Pfam_A) was used to query all protein-coding sequences in MIBiG (v.2.0)30 using hmmsearch in HMMER v.3.1b2 (http://hmmer.org/) with the default parameters and the --cut_nc PFAM noise cut-off. Hits within 37 characterized BGCs in MIBiG were identified (Supplementary Table 5) and associated literature was manually assessed for experimental evidence of FkbM-family methyltransferase activity. Eight proteins were excluded on the basis of the FkbM hit falling outside the defined BGC cluster boundaries and having no apparent role in biosynthesis based on the final natural product structure (Supplementary Table 5). Four FkbM family members had experimental evidence for O-methyltransferase activity in the form of heterologous expression or genetic studies. A total of 25 FkbM-family proteins were reported by the authors of the respective publications to have likely O-methyltransferase activity on the basis of the final natural product structure, biosynthetic logic and bioinformatic evidence. The resulting 29 FkbM-family proteins were aligned using MUSCLE (v.3.8.1551)81 with two outgroups involved in proteusin biosynthesis, PoyE (AFS60641.1) and AerE (AFS60641.1) from a different methyltransferase protein family (PF05175). The protein alignment was assessed and all columns containing 50% or more gaps were removed using trimAl v.1.2. The trimmed alignment was used for phylogenetic model selection using IQ-TREE (v.2.0.3)96 and the V5+F+R5 model was selected based on best-fit using the Bayesian and Akaike information criteria. IQ-TREE was then used to estimate a maximum-likelihood phylogeny with 5,000 resamplings using the ultrafast bootstrap approximation97. 
The scripts used for bioinformatic analysis are available at GitHub (https://github.com/serina-robinson/fkbm-bioinformatics).
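The column-trimming criterion used on the alignment (removal of all columns containing 50% or more gaps) can be illustrated with a short Python sketch. It mirrors the stated criterion only and is not the trimAl implementation or the authors' script; the function name and the toy alignment are hypothetical.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.5, gap_chars="-."):
    """Drop alignment columns whose gap fraction is >= max_gap_fraction.

    alignment maps sequence names to aligned sequences of equal length.
    """
    names = list(alignment)
    rows = [alignment[name] for name in names]
    if len({len(row) for row in rows}) != 1:
        raise ValueError("expected a non-empty alignment with equal-length rows")
    keep = [
        col for col in range(len(rows[0]))
        if sum(row[col] in gap_chars for row in rows) / len(rows) < max_gap_fraction
    ]
    return {name: "".join(row[col] for col in keep) for name, row in zip(names, rows)}


# Toy alignment: column 2 has 50% gaps and column 3 has 75% gaps, so both are removed
toy = {"seq_a": "MK-LV", "seq_b": "M--LV", "seq_c": "MKALV", "seq_d": "M---V"}
for name, seq in trim_gappy_columns(toy).items():
    print(name, seq)
```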

HPLC–HR-MS and MS/MS analysis

HPLC–HR-MS and MS/MS analyses were performed on the Thermo Scientific Dionex UltiMate 3000 UHPLC coupled to a Thermo Scientific Q Exactive mass spectrometer using heated electrospray ionization in positive ion mode with the Phenomenex Kinetex 2.6 μm XB-C18 100 Å (150 × 4.6 mm) column. Analytical HPLC–HR-MS samples (5–20 µl injections) were separated on the Dionex Ultimate 3000 RS UHPLC equipped with a Phenomenex Kinetex C18 (2.6 μm, 100 Å, 150 × 4.6 mm) column heated to 50 °C. HPLC separation was performed using the following standard methods:

Method A: solvent A: H2O + 0.5% FA; solvent B: 1-propanol + 0.5% FA; flow rate: 0.5 ml min−1; 2 min at 20% B; 2–14 min from 20% to 80% B; 14–17 min at 80% B; 17–17.1 min from 80% to 20% B; 17.1–20 min at 20% B.

Method B: solvent A: H2O + 0.1% FA; solvent B: acetonitrile + 0.1% FA; flow rate: 1.0 ml min−1; 2 min at 2% B; 2–12 min from 2% to 100% B; 12–16 min at 100% B; 16–17 min from 100% to 2% B; 17.1–20 min at 2% B.

The HPLC was coupled to a Thermo Scientific Q Exactive Hybrid Quadrupole-Orbitrap Mass Spectrometer using heated electrospray ionization in positive-ion mode (spray voltage, 3,500 V; capillary temperature, 268.75 °C; probe heater temperature, 350 °C; S-lens level, 70). Full MS was detected at a resolution of 70,000 (AGC target, 1 × 10⁶; maximum IT, 100 ms). MS2 fragmentation was performed at a resolution of 35,000 (AGC target, 2 × 10⁵; maximum IT, 100 ms; isolation window, 4.0 m/z). Normalized collision energy was 20, 25 and 30 for +3 charge states. Parallel reaction monitoring was performed at a resolution of 17,500 (AGC target, 2 × 10⁵; maximum IT, 200 ms; isolation windows, 1.4 m/z) and a normalized collision energy of 18, 20 and 24 for +2 and +3 charge states.
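For targeted acquisition of +2 and +3 charge states, the precursor m/z follows from the neutral monoisotopic mass M and the charge z as (M + z × 1.007276)/z in positive-ion mode. The Python sketch below computes this value and the bounds of an isolation window centred on it; the peptide mass of 3,000 Da is a hypothetical placeholder, and a symmetric window without offset is assumed.

```python
PROTON_MASS = 1.007276466  # mass of a proton in Da


def precursor_mz(monoisotopic_mass, charge):
    """m/z of the [M + zH]z+ ion for a neutral monoisotopic mass in Da."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (monoisotopic_mass + charge * PROTON_MASS) / charge


def isolation_window(center_mz, width_mz):
    """Lower and upper bounds of a symmetric isolation window."""
    half = width_mz / 2.0
    return center_mz - half, center_mz + half


# Hypothetical peptide of 3,000 Da, with the 1.4 m/z window used for parallel reaction monitoring
for z in (2, 3):
    mz = precursor_mz(3000.0, z)
    low, high = isolation_window(mz, 1.4)
    print(f"z = +{z}: m/z {mz:.4f} (window {low:.4f} to {high:.4f})")
```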

Statistics and reproducibility

Wherever appropriate, correction for multiple testing was performed using false-discovery rate correction. Wherever appropriate and if not specified otherwise, statistical tests performed were two-sided.
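The specific false-discovery rate procedure is not named here. As one common choice, the Benjamini–Hochberg adjustment can be sketched in Python as follows; treating it as the method used in this study is an assumption, and the function name and example P values are illustrative.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices by ascending P
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonic adjusted values
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical P values from four tests
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

A test is then called significant at a chosen false-discovery rate when its adjusted P value falls below that threshold.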

The box plots were plotted in R (v.4.0.0–v.4.1.2) using ggplot2 (v.3.3.0–v.3.3.5) and are defined as follows: the bottom and top hinges correspond to the first and third quartiles (the 25th and 75th percentiles); the top whisker extends from the top hinge to the largest value no further than 1.5 × IQR from the hinge (where the IQR is the interquartile range, or distance between the first and third quartiles); and the bottom whisker extends from the bottom hinge to the smallest value no further than 1.5 × IQR from the hinge. Data points beyond the ends of the whiskers are outliers and are plotted individually, except in Fig. 1c, owing to the large number of points and space constraints.
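The box plot definition above can be restated as a short calculation. The Python sketch below computes hinges, whiskers and outliers for a list of values; it assumes linear-interpolation quartiles (the 'inclusive' method of the Python statistics module, which corresponds to the default quantile type in R), and the function name and example data are hypothetical.

```python
from statistics import quantiles


def boxplot_stats(values):
    """Hinges, whisker ends and outliers following the 1.5 x IQR rule."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "lower_hinge": q1,
        "median": median,
        "upper_hinge": q3,
        "lower_whisker": min(inside),  # smallest value within 1.5 x IQR of the hinge
        "upper_whisker": max(inside),  # largest value within 1.5 x IQR of the hinge
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


# Hypothetical data with one high outlier
print(boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30]))
```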

Fig. 1e was plotted using the R package UpSetR (v.1.4.0)98. The trees in Fig. 2 were plotted using the R package ggtree (v.3.3.0.901)99.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.