Introduction

Post-translational modifications modulate protein properties in response to the environment at very short time scales [1]. Among these, protein glycosylation is fundamental to all three domains of life and is one of the most abundant modifications reported to date [2, 3]. Moreover, glycoproteins are also found in the protective envelopes of most pathogenic viruses [4]. Carbohydrates, the building blocks of glycans, are the least conserved class of molecules and, when linked to proteins, considerably increase proteome diversity. Consequently, due to glycosylation and the many other types of modifications, the number of proteoforms is orders of magnitude larger than what can be translated from the genomic sequence of an organism alone [4, 5].

Apart from well-investigated pathogens or easily culturable model archaea, such as C. jejuni or H. volcanii, our understanding of the biological significance and evolutionary forces shaping glycan structures is highly limited. This lack of understanding is largely due to the general paucity of large-scale approaches capable of resolving the chemical diversity expressed by prokaryotes. In eukaryotes, protein glycosylation utilizes a well-defined set of some 10 monosaccharides (SI-DOC-Table-S1). There, distinct glycosylation systems and oligosaccharide structures have become ubiquitous in many molecular processes, such as protein folding, stability and immunity [5,6,7]. In contrast, biosynthetic routes in prokaryotes are far more diverse [5, 8]. Oligosaccharide chains show large variations in sugar and linkage chemistry across species and strains [8, 9]. Interestingly, many functions that protein glycosylation serves in eukaryotes, such as protein folding, quality control or intracellular trafficking, do not apply to prokaryotes [10, 11]. Moreover, glycan biosynthetic routes involve many enzymatic steps and are, particularly for prokaryotes, an energetically costly process. Therefore, those structures must serve different, but likely equally important roles [12, 13].

Prokaryotes account for a substantial proportion of Earth’s biomass, drive many global processes such as the nitrogen cycle or methane production, and directly impact human health (e.g., via the gut microbiome) [14, 15]. One group of these globally widespread microorganisms is the anaerobic ammonium-oxidizing (anammox) bacteria, a phylogenetically deep-branching group within the Planctomycetes phylum. Anammox bacteria play a key role in the biogeochemical nitrogen cycle (e.g., producing over two thirds of atmospheric N2) and are the cornerstone of new, sustainable and resource-efficient biotechnologies (e.g., for wastewater treatment) [16,17,18]. Members of the Planctomycetes phylum display cellular characteristics that are unique and strikingly homologous to those found in eukaryotes, making them of paramount interest for investigating fundamental aspects of bacterial cell physiology [19, 20]. Only the (finally) confirmed existence of a peptidoglycan layer supported their classification as Gram-negative bacteria [21, 22]. Pioneering work by van Teeseling et al. furthermore revealed the existence of a glycosylated cell surface layer in the planktonically growing anammox bacterium Ca. Kuenenia stuttgartiensis [23, 24]. Moreover, Boleij and co-workers reported a glycosylated surface layer protein in the closely related Ca. Brocadia sapporoensis [25]. In many archaea and bacteria, surface layer proteins are expressed and often further modified through post-translational modifications such as glycosylation [26, 27]. Amongst many potential roles, glycans may modulate the surface layer properties with regard to structure, hydrophilicity, provision of functional groups and shielding, or may increase protection [10, 28,29,30].
Interestingly, ammonia-oxidizing archaea (AOA) were found to express highly acidic surface layer proteins (SLPs), which are thought to support the acquisition of positively charged ammonium, the substrate that is further converted into nitrite through the nitrification process [31]. Given the unique cellular appearance of anammox bacteria and the analogy to AOA, both of which depend on the uptake of ammonium, this raises the question of the molecular-level characteristics and potential roles of the glycosylated surface layer recently discovered in anammox bacteria.

Historically, protein glycosylation in prokaryotes was initially associated predominantly with pathogens (and commensal bacteria), many of which display complex oligosaccharide structures, including highly specialized sugars such as nonulosonic acids (NulOs, where the NulOs found in animals are commonly referred to as sialic acids) [32]. These sugars, in addition to providing cellular protection, have been shown to play key roles in pathogenicity and host invasion, such as in C. jejuni, which is one of the most common causes of food-related infections globally [33,34,35]. Nevertheless, protein glycosylation is also found in non-pathogenic bacteria and archaea, commonly modulating surface layer proteins (SLPs), pili or flagella [9, 10, 35]. This has been shown to introduce remarkable properties, such as for the pilus glycosylation of the thermoacidophile Sulfolobus islandicus, which provides the physical resistance to survive under extreme conditions [30]. Yet, most likely due to the lack of large-scale data, there was a long-standing belief that glycosylation in (non-pathogenic) prokaryotes mediates structural, rather than functional, properties [36]. However, there is an increasing volume of literature demonstrating that glycans play multiple roles in all different types of bacterial systems [37, 38].

From an analytical perspective, the enormous chemical and structural diversity expressed in prokaryotes makes large-scale exploration exceptionally challenging. High-performance approaches that rely on glycan databases [39,40,41] cannot be readily applied to prokaryotic glycoproteomes with unknown oligosaccharide chemistry. Therefore, the glycoproteomic exploration of novel prokaryotic strains often still employs protein fractionation and carbohydrate-specific colorimetric staining [32]. The target proteins and associated carbohydrate structure(s) are then commonly identified following proteolytic digestion via mass spectrometric approaches. Here, peptides are sequenced and the data are searched for spectra that share peptide and oligosaccharide fragments [23, 25, 32, 42]. This approach takes advantage of the commonly observed monosaccharide fragments (oxonium ions) to differentiate glycopeptide spectra from those of unmodified peptides. However, this multi-step, manual procedure requires extensive experience and is extremely time-consuming. Furthermore, the oxonium markers that support (automated) identification for eukaryotic proteins are not necessarily part of the prokaryotic carbohydrate chains. Therefore, this strategy is heavily operator-biased and may leave a large number of oligosaccharide modifications unidentified.

Recently, open modification search approaches, which match fragment ions to a protein sequence database without prior consideration of the intact peptide mass, have advanced the search for unexpected peptide modifications [43,44,45,46]. Very recently, this approach has been combined with glycopeptide enrichment to analyze a range of pathogenic bacteria [47]. There, the enrichment improves spectral coverage, reduces the search space and qualifies observed mass shifts as carbohydrate-like modifications. Moreover, an additional ion mobility interface (FAIMS) showed promising outcomes even without prior enrichment procedures [48]. These developments have significantly advanced the identification of oligosaccharide-related modifications in prokaryotic proteomes. However, the large-scale discovery of prokaryotic protein glycosylation still remains dependent on specific enrichment procedures (by HILIC or specialized equipment), extensive fragmentation experiments (to ensure peptide sequence coverage), and proteome sequence database availability. Furthermore, the number of sequencing spectra and the database volume impact computational effort and sensitivity. Nevertheless, common large-scale glycoproteomics approaches have also been established for application to pure cultures. Most importantly, however, those approaches do not establish the strain-specific sugar components or additional oligosaccharide sequence information. Therefore, untargeted large-scale exploration that provides additional chemical and compositional information to support the physicochemical interpretation of the oligosaccharide chains, in particular when exploring non-pure or enrichment cultures, remains a bottleneck in microbial proteome research.

We established a systematic procedure to analyze yet-unexplored prokaryotic protein glycosylation directly from large-scale proteomics data. By this means, we first determine the strain- (or enrichment)-specific sugar components and subsequently the corresponding protein-linked oligosaccharide chains of individual strains in the culture (community). When investigating an enrichment culture of the globally relevant anammox bacteria, the new approach resolved a remarkably complex array of surface-layer oligosaccharides generated by two seemingly unrelated biosynthetic pathways simultaneously. The physicochemical interpretation of the observed array ultimately suggests an evolutionary link to charged surface layer proteins (SLPs) and the requirements of the metabolic lifestyle in anammox bacteria.

Results and discussion

The general microbial glycoproteomics approach

Establishing strain-specific monosaccharide components from proteomics data, the first step in our approach, takes advantage of the uniform fragment-ion patterns obtained when mass binning thousands of individual peptide sequencing spectra from shotgun proteomic data (Fig. 1a–b, SI-DOC-Fig-S1 and SI-EXCEL-T1). The mass patterns are highly comparable across species because they reflect the general amino acid repertoire used throughout all domains of life. The low mass range is dominated by the mostly intense fragments of the canonical set of amino acids and combinations thereof (mono-, di- and trimers). Carbohydrate-related signals, however, typically lie outside the amino acid fragment mass space due to their distinct chemical composition (SI-DOC-Fig-S12, SI-EXCEL-T12). Previously, mass binning had been proposed to search for stable amino acid modifications [49]. Here, we explore the out-of-range peaks from high-resolution data in search of (novel) carbohydrate-related signals [38]. Candidates are analyzed using a constructed sugar composition database containing more than 3300 theoretical chemical sugar compositions, and by checking for additional water losses and co-occurrence within sequencing spectra (SI-DOC-Fig-S12, SI-EXCEL-T35). To provide maximum differentiation from closely related compositions of peptide fragments, this step employs shotgun proteomic data for which the fragmentation spectra were acquired at very high mass resolution (140 K at m/z 200). Moreover, to provide in-depth chemical information on novel sugar derivatives, selected components (such as NulOs) were analyzed by additional in-source fragmentation experiments (SI-DOC-Fig-S3).
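The first step described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function names, the bin width, the frequency threshold and the toy spectra are assumptions made for illustration only. The sketch bins fragment m/z values across spectra, keeps bins that are frequent in the sample but absent from a non-glycosylated comparator proteome, and matches the surviving candidates against a small composition table within a ppm tolerance.

```python
from collections import Counter


def bin_fragments(spectra, bin_width=0.001):
    """Count, per m/z bin, the number of spectra containing a peak in that bin."""
    counts = Counter()
    for peaks in spectra:
        # A set ensures each bin is counted at most once per spectrum.
        counts.update({round(mz / bin_width) for mz in peaks})
    return counts


def candidate_bins(sample, comparator, bin_width=0.001, min_fraction=0.01):
    """Return m/z bins frequent in the sample and absent from the comparator."""
    sample_counts = bin_fragments(sample, bin_width)
    comparator_counts = bin_fragments(comparator, bin_width)
    threshold = min_fraction * len(sample)
    return sorted(
        index * bin_width
        for index, n in sample_counts.items()
        if n >= threshold and index not in comparator_counts
    )


def match_composition(mz, database, tol_ppm=5.0):
    """Return database entries whose reference m/z lies within tol_ppm of mz."""
    return [
        name
        for name, ref in database.items()
        if abs(mz - ref) / ref * 1e6 <= tol_ppm
    ]


# Toy data (assumed): each spectrum is a list of fragment m/z values.
sample = [[120.0808, 193.0707], [120.0808, 193.0707, 175.0601], [120.0808]]
comparator = [[120.0808], [120.0808]]
database = {"heptose oxonium": 193.0707, "HexNAc oxonium": 204.0867}

candidates = candidate_bins(sample, comparator, min_fraction=0.5)
annotations = {mz: match_composition(mz, database) for mz in candidates}
```

In this toy example, the 120.0808 m/z peak is discarded because it also occurs in the comparator, whereas the heptose signal survives and is annotated. A real implementation would additionally need mass recalibration and tolerance-aware binning so that peaks near a bin edge are not split across neighboring bins.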

Fig. 1: The mass binning glycoproteomics approach to explore prokaryotic protein glycosylation.

a The approach identifies the chemical composition of the proteome-specific carbohydrate components and establishes the related protein-linked oligosaccharide chains from large-scale (meta)proteomics data. Optional in-source fragmentation and metabolomics experiments provide additional chemical information to guide the exploration of biosynthetic routes or phylogenetic relations. b The first step, establishing the proteome-specific sugar components, performs mass binning of the complete set of fragmentation spectra acquired at very high mass resolution. Binned spectra thereby show highly comparable patterns across proteomes/species because they are the products of the universal set of amino acids (mirror bar chart). Carbohydrate fragments, however, have a characteristic chemical composition and, therefore, (usually) lie outside the amino acid composition space. Signals are automatically matched against a constructed database consisting of over 3300 theoretical carbohydrate compositions. The procedure is exemplified by the heptose fragment (193.071 m/z) observed in the Ca. Kuenenia stuttgartiensis enrichment (meta)proteome. The heptose signal (193.071 m/z) is unique to Kuenenia, but the neighboring mass peaks (−0.01 m/z and +0.026 m/z) are also observed in the non-glycosylated comparator proteome (small mirror bar chart) and are, therefore, amino acid related. c The second step, termed “parent ion offset binning”, establishes the actual oligosaccharide chains that are linked to the proteins of individual strains. This is achieved by binning the mass deltas obtained after subtracting the fragment masses (fm) from their parent (ion) peptide mass (pm). Spectra containing the same carbohydrate fragments will thereby repeatedly show mass deltas consistent with the mass of the oligosaccharide modification.
d The established oligosaccharide chains are then integrated into the metaproteomic database search (using, e.g., a metagenomics-constructed database) to identify the target proteins and related strains.

Because prokaryotes often produce species-unique carbohydrate derivatives, this strategy is highly advantageous as opposed to relying only on oxonium ion markers known from eukaryotic glycoproteins (e.g., HexNAc, Hex or NeuAc).
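To illustrate how an entry in a theoretical carbohydrate composition space relates to an observed marker ion, the following sketch computes oxonium ion m/z values from sum formulae. It assumes that the oxonium ion corresponds to the neutral monosaccharide minus one water plus one proton; the helper names are hypothetical and the monoisotopic masses are standard values, not taken from this study.

```python
# Monoisotopic masses (Da) of the elements used here.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503207,
    "N": 14.0030740048,
    "O": 15.99491461956,
}
PROTON = 1.00727646688
WATER = 2 * MONOISOTOPIC["H"] + MONOISOTOPIC["O"]


def neutral_mass(formula):
    """Monoisotopic mass of a sum formula given as {element: count}."""
    return sum(MONOISOTOPIC[element] * n for element, n in formula.items())


def oxonium_mz(formula):
    """m/z of the singly charged oxonium ion: [M - H2O + H]+."""
    return neutral_mass(formula) - WATER + PROTON


# Sum formulae of the free monosaccharides.
SUGARS = {
    "Hex": {"C": 6, "H": 12, "O": 6},
    "HexNAc": {"C": 8, "H": 15, "N": 1, "O": 6},
    "NeuAc": {"C": 11, "H": 19, "N": 1, "O": 9},
    "Hept": {"C": 7, "H": 14, "O": 7},
}

oxonium = {name: round(oxonium_mz(f), 4) for name, f in SUGARS.items()}
# A further water loss from the heptose oxonium ion gives the 175 m/z signal.
hept_water_loss = round(oxonium_mz(SUGARS["Hept"]) - WATER, 4)
```

The heptose value (193.0707 m/z) agrees with the fragment shown in Fig. 1b, and its water-loss partner corresponds to the 175 m/z signal discussed later in the text. Enumerating many such formulae (chain lengths, amino, acetyl or methyl substitutions) is one plausible way to build a larger composition table.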

The second step, termed “parent ion offset binning”, establishes the oligosaccharide chains that modify individual proteins (Fig. 1c). Here, the peak masses of fragmentation spectra that contain carbohydrate components, and that therefore (likely) derive from oligosaccharide-modified peptides, are subtracted from the parent peptide mass. This generates mass deltas that start at zero for the parent peptide mass. Fragmentation spectra from glycopeptides will thereby (repeatedly) show mass deltas consistent with the mass of the peptide-linked oligosaccharide chain(s). This process takes advantage of the predominant fragmentation of the carbohydrate chain over the peptide backbone when performing fragmentation by collision-induced dissociation, such as by higher-energy collisional dissociation (HCD) specific to Orbitrap mass spectrometers [50,51,52,53].

Finally, binning the parent offset mass deltas that originate from spectra containing the same carbohydrate fragments provides a histogram of the oligosaccharide chain mass(es) linked to the peptides. This procedure relies on the presence of the Y0 ion in the fragmentation spectrum (the unmodified, intact peptide fragment peak). Due to the strong fragmentation of the carbohydrate chains, however, this peak is very commonly observed in glycopeptide (CID/HCD) fragmentation spectra [50, 52, 53].
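The offset binning idea can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the implementation used here: spectra are represented as hypothetical `(precursor_mz, charge, fragments)` tuples, all fragments are assumed to be singly charged, and the bin width and marker tolerance are arbitrary.

```python
from collections import Counter

PROTON = 1.00727646688


def parent_offset_histogram(spectra, marker_mz, bin_width=0.01, tol=0.005):
    """Histogram of (parent mass - fragment mass) over marker-containing spectra.

    Each delta bin is counted once per spectrum, so a bin that recurs across
    many different peptides stands out as a candidate oligosaccharide mass.
    """
    histogram = Counter()
    for precursor_mz, charge, fragments in spectra:
        # Keep only spectra that contain the carbohydrate marker fragment.
        if not any(abs(f - marker_mz) <= tol for f in fragments):
            continue
        # Singly protonated mass of the intact (glyco)peptide.
        parent_mh = precursor_mz * charge - (charge - 1) * PROTON
        histogram.update({round((parent_mh - f) / bin_width) for f in fragments})
    return histogram


def make_spectrum(peptide_mh, glycan_mass, marker_mz, charge=2):
    """Toy glycopeptide spectrum: marker ion, Y0 ion and Y0 + half glycan."""
    precursor_mz = (peptide_mh + glycan_mass + (charge - 1) * PROTON) / charge
    return (precursor_mz, charge, [marker_mz, peptide_mh, peptide_mh + glycan_mass / 2])


HEXNAC_MARKER = 204.0867
GLYCAN = 406.1588  # two HexNAc residues (assumed example)

spectra = [make_spectrum(m, GLYCAN, HEXNAC_MARKER) for m in (900.45, 1100.56, 1300.67)]
spectra.append((650.33, 2, [120.0808, 500.25]))  # unmodified peptide, no marker

histogram = parent_offset_histogram(spectra, HEXNAC_MARKER)
glycan_bin_count = histogram[round(GLYCAN / 0.01)]
```

In the toy data, the delta between the parent mass and the Y0 ion falls into the same bin for all three glycopeptides, although their peptide masses differ, whereas deltas to the marker ion itself are peptide specific and do not accumulate. The unmodified spectrum is ignored because it lacks the marker fragment.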

Moreover, owing to the oligosaccharide backbone fragmentation, binned spectra contain additional oligosaccharide fragments, thereby providing sequence and in some cases even linkage-type (N/O) information [50, 53]. Although prokaryotic protein glycosylation shows large species and strain variability, the structural heterogeneity within one proteome is generally comparatively low. Hence, the binning approach is particularly useful for prokaryotic glycoproteomes. In addition, establishing the sugar components or oligosaccharide profiles does not require a proteome sequence database. Therefore, the data processing pipeline requires only a few minutes of processing time, e.g., when applied to single-run QE Orbitrap shotgun proteomics data. Finally, the established oligosaccharide profiles provide a database for recently developed high-performance glycopeptide annotation tools [40, 41], or can simply be integrated as variable modifications into (multi-round) search approaches to identify modified proteins and target strains (Fig. 1d). While only the intact oligosaccharide mass will provide a confident match during this procedure, determining the free peptide mass in sugar-fragment-containing spectra further makes it possible to differentiate between the intact oligosaccharide chain and additional fragments. Ultimately, the variable modification search provides the common statistical parameters, such as peptide scores and false discovery rates. Most importantly, however, the determined chemical and compositional information guides the physicochemical interpretation of the oligosaccharide chains and the exploration of biosynthetic routes or phylogenetic relations.
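The principle of treating an established oligosaccharide mass as a variable modification can be reduced to a simple mass-matching sketch. Real search engines additionally score fragment spectra and estimate false discovery rates, which is not shown here; the function name, peptide labels, masses and tolerance below are hypothetical.

```python
def match_glycopeptides(precursor_mass, peptide_masses, glycan_masses, tol_ppm=10.0):
    """Return (peptide, glycan) pairs whose summed mass matches the precursor."""
    hits = []
    for peptide, peptide_mass in peptide_masses.items():
        for glycan, glycan_mass in glycan_masses.items():
            theoretical = peptide_mass + glycan_mass
            if abs(precursor_mass - theoretical) / theoretical * 1e6 <= tol_ppm:
                hits.append((peptide, glycan))
    return hits


# Neutral masses (Da); the peptide entries are placeholders, not real sequences.
peptides = {"PEP_A": 1000.5000, "PEP_B": 1200.6000}
glycans = {"HexNAc2": 406.1587, "Hept3": 576.1902}

observed = 1000.5000 + 576.1902
hits = match_glycopeptides(observed, peptides, glycans)
```

Because only the intact oligosaccharide mass yields a match in this scheme, partially fragmented chains would go unassigned, which is why knowing the free peptide mass from the spectrum is helpful for telling the two cases apart.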

An unexpectedly complex array of sugars

To verify the developed “sugar-miner” approach for establishing sugar components directly from shotgun proteomics data, we processed a set of well-characterized control samples. In so doing, every control proteome sample showed sugar components reflecting the known oligosaccharide structures of the respective species (Fig. 2a, SI-EXCEL-T6). Moreover, the theoretically constructed carbohydrate composition space, used to match carbohydrate fragments in the mass binning data, only marginally overlapped with peptide related fragments (e.g., those found in the E. coli comparator proteome). One of the few observed coincidences, however, included the compositional overlap of peptide-related fragments with the sugar composition of bacillosamine (diNAcBac) (Fig. 2b, SI-EXCEL-T12, SI-DOC-Fig-S1). Furthermore, we processed all prokaryotic control samples also through the developed “parent offset binning” procedure—which provides the oligosaccharide chains—to confirm the generic nature of this approach (SI-DOC-Fig-S49).

Fig. 2: Carbohydrate profiles obtained from the anammox enrichment cultures (and reference samples) using the MS2 mass binning approach.

a The graph (from left to right) shows the carbohydrate profiles (m/z values are distributed along the y-axis) from the C. jejuni sample (C, control), Ca. Kuenenia stuttgartiensis enrichment culture (K), HEK protein sample (H, control), Ca. Brocadia sapporoensis enrichment culture (B), S. cerevisiae sample (Y, control) and H. volcanii sample (V, control). The carbohydrate profiles of the control samples established by the MS2 mass binning approach reflected the known oligosaccharide compositions of the individual species. Only sugar components that commonly provide very low-abundance or no carbohydrate-related fragments (oxonium ions), such as deoxyhexoses, were not observed. NulO and HexNAc fragments are prominent in the C. jejuni proteomics sample, NeuAc/HexNAc- and Hex-related signals in the HEK-derived sample, HexNAc- and hexose-related signals in the S. cerevisiae (yeast) proteomics sample, and hexuronic acid (HexA)- and hexose (Hex)-related signals in the H. volcanii sample. The latter was cultured at high salinity; therefore, the proteins appeared modified by only a single type of glycan structure. Fragments with the same color indicate water loss clusters (-H2O), which are a characteristic consequence of the oligosaccharide chain fragmentation process and therefore serve as indicators in the discovery process. The mass compositions and sugar type annotations for the Ca. Kuenenia stuttgartiensis enrichment proteome are detailed in the table on the right (c). The abbreviation “NulO” stands for nonulosonic acid; “NeuAc” for N-acetylneuraminic acid; “Pse” for pseudaminic acid; “HexNAc” for N-acetylhexosamine; “Hept” for heptose; “Hex” for hexose; “HexA” for hexuronic acid; “dHex” for deoxyhexose; and “Me” for methyl.
The sugar symbols are depicted in generic white and gray shades because a further classification into specific types of monosaccharides, beyond sum formulae, chain length and modifications, cannot be obtained from accurate mass experiments. The different gray scales were simply chosen to make the individual sugar symbols more distinguishable within graphs and depicted oligosaccharide structures. b The constructed carbohydrate chemical composition space. A large chemical composition space (>3300 theoretical compositions) was constructed and used to assign chemical compositions to non-peptide-related features. At high mass resolution (>100 K) and accuracy (<1 ppm), the mass overlap with amino acid-related fragments is very low. Mass recalibration using frequently observed amino acid fragments makes it possible to operate at very high mass accuracy (blue arrow) (color figure online).

When exploring the sugar components found in the Ca. Kuenenia stuttgartiensis enrichment proteome, we observed a surprisingly diverse repertoire of carbohydrate fragments, including yet-undescribed nonulosonic acid derivatives as well as seven-carbon sugars rarely observed in glycoproteins (Fig. 2a, c, SI-EXCEL-T6, SI-DOC-Table-S2–6 and SI-DOC-Fig-S10) [54]. In addition, when investigating the related oligosaccharide profiles using the above-described parent ion offset approach, we observed two completely unrelated types of oligosaccharide chains (Fig. 3a–d, SI-DOC-Fig-S11–15). One type of oligosaccharide resembled the recently described N-acetylhexosamine core, albeit containing nonulosonic acids not resolved in an earlier study (complex type structures; “X-type”) [23]. The second type of oligosaccharide structure consisted of homogeneous heptose chains (oligo-heptosidic; “O-type”). More surprisingly, when integrating the oligosaccharide masses into a (multi-round) variable modification search using a metagenomics-constructed database, both types of oligosaccharide structures were exclusively matched to the same surface layer protein (SLP) of Ca. Kuenenia stuttgartiensis (Fig. 4a–b, SI-EXCEL-T7-10, SI-DOC-Table-S7–8 and SI-DOC-Fig-S16–27), which was further confirmed by manual investigation of selected spectra (SI-DOC-Fig-S18–20). Moreover, while both oligosaccharides appeared to be O-linked, and the complex type glycan likely follows the attachment motif GT/S (SI-DOC-Fig-S21), no potential glycosylation motif could be deduced from the amino acid sequences of the oligo-heptosidic modified peptides. However, to unambiguously identify glycosylation motifs, further fragmentation experiments using electron-transfer dissociation (ETD) are required, because the employed higher-energy collisional dissociation (HCD) does not provide information on the modified amino acid residue(s).
Interestingly, to the best of the authors’ knowledge, a comparable complexity of unrelated glycans targeting the same SLP simultaneously has previously been observed only in archaea [9, 27]. Whether these glycans indeed involve different oligosaccharyl-transferases (O-Tases), only one O-Tase, or an additional consecutive transfer can only be determined by performing mutagenesis or heterologous expression experiments. This is particularly important because it has been shown previously that glycans with rather different compositions can be transferred via a single O-Tase to the same target protein [55, 56].

Fig. 3: Outline of identified sugar components and observed oligosaccharide profiles for the Ca. Kuenenia stuttgartiensis enrichment.

a The large pie chart outlines the proportions of the carbohydrate fragments identified in the Ca. Kuenenia stuttgartiensis enrichment proteome using the MS2 mass binning approach. The lower charts depict the proportions of spectra containing carbohydrate-related signals (oxonium ions) in the sequencing spectra, and the frequency of individual glycoforms across all spectra. b The graphs outline the oligosaccharide chains subsequently established using parent ion offset binning of fragmentation spectra containing the identified carbohydrate fragments (graphs labeled with 204, 275, 289, and 175 m/z). Oligosaccharide chains thereby appear in the histograms as repeatedly occurring mass deltas. The same parent ion offset approach applied to the complete, non-carbohydrate-filtered dataset does not reveal any identifiable, systematically recurring mass deltas (bottom graph). This revealed two completely unrelated oligosaccharide chains. The fragments 204, 261, 275, and 289 m/z belong to variations of a complex type oligosaccharide with a HexNAc core structure (X-type). The fragment 175 (and 193 m/z) retrieved a second, fully unrelated heptose type oligosaccharide chain (O-type). Squares represent HexNAcs (methylations are depicted by a dot), diamonds represent NulO variants, hexagons represent heptoses, triangles are deoxyhexoses, and dotted triangles are dimethyl-deoxyhexoses. Moreover, due to the predominant fragmentation of the oligosaccharide chains, nearly the complete sequence of the oligosaccharide can be derived. c The histograms outline the (intensity-normalized) low mass bins of fragmentation spectra where the complex type oligosaccharide (upper graph) or the oligo-heptosidic chains (O-type, lower graph) were identified. The observed sugar fragments correlate with the proposed composition of the individual carbohydrate chains (e.g., 204/261 m/z for complex, or 175/193 m/z for oligo-heptosidic).
d The histogram shows binning of mass deltas into very small bin sizes to establish the oligosaccharide compositions at very high mass accuracy (<7.5 ppm).

Fig. 4: Glycosylated proteins and strains present in the explored anammox enrichment cultures.

The metaproteomic analysis of the Ca. Kuenenia stuttgartiensis enrichment showed that ~98% of identified peptide sequences derived from Ca. Kuenenia stuttgartiensis (right bar, SI-EXCEL-T7 and T9). Moreover, the vast majority of glycosylated peptides could be assigned to the surface layer protein (KUST_250_3) of Ca. Kuenenia stuttgartiensis (b, right table, SI-EXCEL-T8 and T10, SI-DOC-Table-S8). This confirmed that both types of oligosaccharides (complex-type and oligo-heptosidic) target the same surface layer protein simultaneously. Furthermore, the metaproteomic analysis of the Brocadia enrichment culture showed that ~56% of the peptide sequences derived from Ca. Brocadia sapporoensis, with significant further proportions derived from at least three different Ignavibacteria strains (left bar, SI-EXCEL-T12 and T14). A modification search, including the identified oligosaccharide chains, confirmed that the putative surface layer protein of Ca. Brocadia sapporoensis is modified by a HexNAc core type oligosaccharide (204, squares). A second type of oligosaccharide chain (161/193 = hexagons) was assigned to multiple proteins from Ignavibacteria bacterium OLB4, and a third type of oligosaccharide (232, dark gray circles) to several proteins from Ignavibacteria bacterium UTCHB3 (a, left table, SI-EXCEL-T13, SI-DOC-Table-S9). Only the top matches for every database search are shown in the graph. Proteins with peptide matches at all three levels (i–iii) were considered confirmed (green circle; i=database search, ii=oxonium ions and iii=oligosaccharide mass deltas).
Peptides indicates the number of variable modification search matches (VM); oxonium indicates the number of VM matches with additional oxonium ion identifications; squares show the number of HexNAc core type oligosaccharide matches assigned to the same VM matches; gray hexagon counts the number of heptose-type oligosaccharide matches that were also assigned to VM matches; dark gray circle counts the number of 232 sugar-type oligosaccharide matches that were also assigned to the same VM matches.

Furthermore, when comparing these results to a laboratory enrichment of Ca. Brocadia sapporoensis, another anammox species within the Candidatus Brocadiaceae family, a similar carbohydrate profile was observed, albeit lacking the nonulosonic acid-related fragments found in Ca. Kuenenia (SI-EXCEL-T6 and T11, SI-DOC-Fig-S22). In addition, the observed sugar components established at least four different types of oligosaccharide chains (SI-DOC-Fig-S23–26, SI-DOC-Table-S3). One was identical to a recently reported HexNAc core oligosaccharide (204 m/z) [25], another oligosaccharide structure contained characteristic hexose/heptose residues (163/193 m/z), a third included characteristic methyl-deoxyhexose residues (161 m/z), and the fourth was based on a yet-unidentified derivative (232 m/z). When integrating the established oligosaccharide masses into the metaproteomic analysis using a specifically constructed metagenomic database, only the recently reported HexNAc core type oligosaccharide (204 m/z) could be assigned to Ca. Brocadia sapporoensis, where it exclusively modified the putative SLP, as also described in an earlier study [25]. The other types of oligosaccharide chains could be assigned to different proteins from Ignavibacteria bacterium OLB4 and Ignavibacteria bacterium UTCHB3, respectively, both of which are commonly observed community members in anammox enrichment cultures [57, 58] (Fig. 4a–b, SI-EXCEL-T12–14, SI-DOC-Table-S9).

The established oligosaccharide chains for Ca. Kuenenia stuttgartiensis were also confirmed by orthogonal HILIC glycopeptide enrichment procedures combined with an open modification search (using Byonic), as proposed very recently [47] (SI-DOC-Fig-S15). Moreover, we investigated an isolate of the Ca. Brocadia sapporoensis SLP separately, to confirm that it is modified by only one type of oligosaccharide (204 m/z). To this end, we performed a conventional SDS-PAGE [25] followed by in-gel proteolytic digestion and mass spectrometric analysis. The shotgun proteomic data were then processed by the developed pipeline. This showed exactly the same HexNAc-type oligosaccharide (204 m/z) as identified from the large-scale data and confirmed the lack of oligo-heptosidic chains found in Ca. Kuenenia stuttgartiensis (SI-DOC-Fig-S26). In summary, both strains—Ca. Kuenenia stuttgartiensis and Ca. Brocadia sapporoensis—share the HexNAc core type oligosaccharides (X-type), but they differ in regard to the presence of terminal nonulosonic acids and the second oligo-heptosidic chains. Nonetheless, the physiological importance of this extensive and complex surface glycosylation discovered for Ca. Kuenenia stuttgartiensis remains to be further investigated experimentally.

Discussion

Almost all archaea and many bacteria are entirely covered by surface layer proteins (SLPs) [28]. Moreover, SLPs frequently show either acidic or basic isoelectric points due to their propensity for charged amino acids [31]. Li et al. [31] demonstrated that the negatively charged SLPs of ammonia-oxidizing archaea (AOA) support the acquisition of ammonium (NH4+), which supposedly makes AOA competitive in ecosystems with low ammonium concentrations [31]. Moreover, Li et al. showed that the opposite case of a positively charged surface layer would create a nutrient barrier, preventing NH4+ from passing through the surface pores into the pseudo-periplasmic space [31, 59].

Ca. Kuenenia stuttgartiensis possesses a very comparable (hexagonally arranged) SLP array covering the entire bacterial cell [24], and the specific SLP has a particularly acidic (predicted) isoelectric point of ~4.25 and a net charge of −60 at physiological pH (Fig. 5a, SI-DOC-Fig-S27–31, SI-EXCEL-T15–18). Intriguingly, however, anammox bacteria not only depend on the acquisition of ammonium but also require negatively charged nitrite. Our findings raise the question of why Ca. Kuenenia stuttgartiensis possesses such a highly acidic SLP, and whether Ca. Kuenenia stuttgartiensis (or anammox bacteria in general) evolved additional modulations to avoid interference with substrate acquisition. This is of particular importance as nitrite is commonly the limiting substrate in engineered and natural ecosystems.
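For readers unfamiliar with how a predicted isoelectric point and net charge are obtained from a sequence, the following sketch applies the Henderson–Hasselbalch relation to the ionizable groups and finds the pI by bisection. The text does not state which predictor was used for the values above, so this is a generic illustration only: the pKa set is an assumed textbook-style approximation, and the toy sequences are not the SLP sequence.

```python
# Assumed approximate pKa values; published predictors use differing sets.
PKA_POSITIVE = {"K": 10.5, "R": 12.5, "H": 6.0}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}
PKA_N_TERM = 9.0
PKA_C_TERM = 2.3


def net_charge(sequence, ph):
    """Net charge of a peptide at a given pH (Henderson-Hasselbalch)."""
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_N_TERM))   # protonated N-terminus
    charge -= 1.0 / (1.0 + 10 ** (PKA_C_TERM - ph))  # deprotonated C-terminus
    for residue in sequence:
        if residue in PKA_POSITIVE:
            charge += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[residue]))
        elif residue in PKA_NEGATIVE:
            charge -= 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[residue] - ph))
    return charge


def isoelectric_point(sequence, low=0.0, high=14.0, iterations=60):
    """pH at which the net charge is zero, found by bisection.

    The net charge decreases monotonically with pH, so bisection converges.
    """
    for _ in range(iterations):
        mid = (low + high) / 2
        if net_charge(sequence, mid) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2


acidic_toy = "DDDDEEEEKK"  # toy sequence rich in acidic residues
basic_toy = "KKKKRR"       # toy sequence rich in basic residues

acidic_pi = isoelectric_point(acidic_toy)
acidic_charge = net_charge(acidic_toy, 7.4)
basic_pi = isoelectric_point(basic_toy)
```

A sequence dominated by aspartate and glutamate yields a low pI and a strongly negative net charge near neutral pH, which is the qualitative behavior reported for the Ca. Kuenenia stuttgartiensis SLP. Note that such sequence-based estimates ignore glycans and other modifications, including any charged sugar residues.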

Fig. 5: Physiology of the Ca. Kuenenia stuttgartiensis surface layer protein (SLP) and oligosaccharides.

a The surface layer protein (SLP) of Ca. Kuenenia stuttgartiensis is densely covered by two entirely different types of oligosaccharides (“X-type” and “O-type”, SI-DOC-Fig-S16). The dense layer supposedly provides shielding of the very acidic SLP. Interestingly, the investigated Ca. Kuenenia stuttgartiensis strain produces nonulosonic acids (NulOs), which possess an unmasked amine. Those have the potential to counterbalance the carboxylic acid groups. The sugar symbols are depicted in generic white and gray shades because a further classification into specific types of monosaccharides, beyond sum formulae, chain length and modifications, cannot be obtained from accurate mass experiments. The different shades of gray were simply chosen to make the individual sugars more distinguishable within the oligosaccharide structures. The oligosaccharide structures depicted on the cell surface layer (top graph) are colored in blue if those structures represent an X-type (complex type) structure, or in orange if they represent an O-type (oligo-heptosidic) oligosaccharide. The colors do not provide any further indications on the types of monosaccharides. b The Ca. Kuenenia stuttgartiensis surface layer protein shows a predicted pI (isoelectric point) of ~4.25 and a net charge of ~−60 at physiological pH. In fact, the surface layer protein is one of the most acidic proteins of the complete Ca. Kuenenia stuttgartiensis proteome. In contrast, the putative surface layer protein of Ca. Brocadia sapporoensis has a predicted pI of 5.4 and a substantially smaller net charge of ~−8 at physiological pH. Moreover, Ca. Brocadia also uses only a related form of the complex-type oligosaccharide to cover its much less acidic surface layer protein. The SDS-PAGE analyses show protein and sugar staining for the protein extracts from Ca. Kuenenia stuttgartiensis and the additional control strains H. volcanii (glycan-positive control), C. jejuni (glycan-positive control) and E. coli K12 (glycan-negative control). The left lanes each show the total protein staining (P; Brilliant Blue G staining solution), whereas the right lanes each show the carbohydrate staining (C; Pro-Q 488 Emerald staining kit).

Interestingly, Hu et al. recently revealed the ability of anammox bacteria to grow on neutral nitric oxide (NO) instead of nitrite. As NO was likely the first oxidized nitrogen form present on early Earth [60], it is tempting to speculate that the acquisition of the highly acidic SLPs (analogous to AOA, which only depend on positively charged ammonium) preceded the ability of anammox bacteria to utilise negatively charged nitrite. This seems further corroborated by the recent finding that anammox bacteria can also oxidize ammonium in the presence of an electrode as electron acceptor, thereby again eliminating the nitrite dependency [61]. Nevertheless, this would not explain how Ca. Kuenenia stuttgartiensis maintained the highly acidic surface layer while also depending on negatively charged nitrite. In this respect, the development of the here-discovered dense layer of (charge-balanced) oligosaccharides may have provided sufficient shielding of the highly acidic protein layer (Fig. 5b).

However, the investigated Ca. Kuenenia stuttgartiensis strain produces complex oligosaccharides, which also contain nonulosonic acids. Those sugars are commonly associated with enhancing cellular protection through surface diversification (e.g., in response to bacteriophage recognition) [12, 62, 63]. Yet, nonulosonic acids are highly acidic sugars and support a negative surface charge [12, 64]. Thus, although those sugars may be of advantage for cellular protection, it seems counterintuitive to invest cellular energy into an (even more) negatively charged surface layer.

Nevertheless, carboxylic acids can be chemically modified or balanced through basic counterparts, for example, through esterification of the carboxylic acid groups or by free amines, which are otherwise present in alkylated forms [65, 66]. Surprisingly, the chemical composition of the nonulosonic acids, additional in-source fragmentation and labeling experiments indeed indicate the presence of an unmasked amine (SI-DOC-Fig-S32 and S3). Those basic groups unequivocally have the potential to counterbalance the neighboring, highly acidic carboxylic acids (SI-DOC-Fig-S27) [64]. While zwitterionic sugar modifications have been detected (e.g., recently in glycans of non-vertebrates) [67], nonulosonic acids with free amines have been only rarely observed but were described, for example, in nerve and cancer cells under certain conditions [64, 66].

To place our findings in the context of the broader anammox physiology, we also investigated the Ca. Brocadia sapporoensis surface layer protein (SLP). This revealed a substantially smaller and less acidic surface layer protein (predicted pI ~5.4), which furthermore is modified by only one type of oligosaccharide (Fig. 5b, SI-DOC-Fig-S33–36). It should be mentioned that similar types of oligosaccharides have already been observed between taxonomically more distant strains [38, 68, 69]. However, the expressed physiological differences may contribute to the reported divergences between strains of Ca. Kuenenia and Ca. Brocadia, for example in regard to substrate affinity (i.e., the ability to thrive at low nutrient concentrations) or the tendency to grow in free-living planktonic form [70].

Ultimately, the closer investigation of the complex array of oligosaccharides provides new perspectives towards interpreting the appearance of the cell surface layer of anammox bacteria or, in particular, of the unique cell surface layer of Ca. Kuenenia stuttgartiensis. However, the hypothesized roles require further experiments to evidence the proposed shielding of the acidic surface layer protein, or to confirm any of the other (likely multiple) biological roles of the glycan array beyond cellular protection.

Conclusions

We established a universal procedure to explore prokaryotic protein glycosylation from non-pure cultures. The approach provides insights into the chemical identity of novel sugar components and identifies the related protein-linked oligosaccharide chains from large-scale (meta)proteomics data directly. By applying the approach to an anammox bacteria enrichment, we resolve a remarkably complex array of surface layer oligosaccharides. The identified glycans are produced by two apparently independent biosynthetic routes and densely cover a very acidic surface layer protein. Moreover, the investigated anammox strain accomplished charge-balancing of the highly specialized nine-carbon sugars. The molecular mechanisms by which both types of oligosaccharides are transferred to the same surface layer protein, however, remain subject for further studies. Ultimately, the physicochemical interpretation of the discovered spectrum of oligosaccharides suggests a broader link between the development of complex oligosaccharides, charged surface layer proteins and the metabolic lifestyle in anammox bacteria.

Materials and methods

Sample sources microbes

Anammox lab-scale enrichment cultures

Ca. Kuenenia stuttgartiensis was enriched as planktonic cells (~90% relative abundance based on metagenomic data) at 30 °C in a continuous-flow bioreactor equipped with a custom-made microfiltration module (pore size of 0.1 μm) as described elsewhere [71]. The reactor was fed with mineral medium supplemented with 45 mM of ammonium and nitrite [71]. Nitrite concentrations in the effluent were always below the detection limit (nitrite test strips MQuant, Merck, Darmstadt, Germany). Anoxic conditions were maintained via continuous sparging with Ar/CO2 (95%/5% v/v) at a rate of 10 ml/min. The reactor hydraulic and solids retention times were ~1.9 and 10.5 days, respectively, and the resulting steady-state OD600 was 1.0–1.1. The pH was controlled at 7.3 with a 1 M KHCO3 solution. The flocculent Ca. Brocadia sapporoensis enrichment was maintained at 30 °C under anoxic conditions using an identical continuous-flow bioreactor with a custom-made microfiltration module (pore size of 0.1 μm) fed with a concentrated medium of 60 mM ammonium and nitrite as originally described by Lotti et al., 2014 [72].

Comparator strains

Escherichia coli K12 MG1655 was obtained from NCCB, The Netherlands. Haloferax volcanii DSMZ 3757 (cultured at high salinity, >3.5 M NaCl) was obtained from The Leibniz Institute DSMZ, Germany. Campylobacter jejuni 9141 was obtained from Erasmus MC, The Netherlands, and Saccharomyces cerevisiae CEN.PK113-7D was obtained from an in-house collection. The control strains were cultured and harvested as described by Kleikamp et al., 2020 [12].

Mass spectrometry based proteomics

Cell lysis and protein extraction

A modified protocol from Kleikamp et al. was used to prepare whole protein extracts [73]. Briefly, 25 mg biomass (wet weight) was collected in an Eppendorf tube and solubilized in a suspension solution consisting of 200 µL B-PER reagent (78243, Thermo Scientific) and 200 µL TEAB buffer (50 mM TEAB, 1% (w/w) NaDOC, adjusted to pH 8.0) including 0.2 µL protease inhibitor (P8215, Sigma Aldrich). Furthermore, 0.1 g of glass beads (acid-washed, ~100 µm diameter, G4649-10G, Sigma Aldrich) was added and cells were disrupted using 3 cycles of bead beating on a vortex for 30 s, with cooling on ice for 30 s between cycles. Subsequently, a freeze/thaw step was performed by freezing the suspension at −80 °C for 15 min and thawing under shaking at elevated temperature using an Eppendorf incubator (ThermoMixer). The cell debris was pelleted by centrifugation using a bench top centrifuge at max speed, under cooling for 10 min. The supernatant was transferred to a new Eppendorf tube and kept at 4 °C until further processed. Protein was precipitated by adding 1 volume of TCA (trichloroacetic acid) to 4 volumes of supernatant. The solution was incubated at 4 °C for 10 min and the precipitate was subsequently pelleted at 14,000 rpm for 10 min. The obtained protein precipitate was washed twice using 250 µL ice cold acetone. The protein pellet was dissolved in 100 µL of 200 mM ammonium bicarbonate containing 6 M urea to a final concentration of ~100 µg/µL. To 100 µL protein solution, 30 µL of a 10 mM DTT solution was added and incubated at 37 °C for 1 h. Subsequently, 30 µL of a freshly prepared 20 mM IAA solution was added and incubated in the dark for 30 min. The solution was diluted to below 1 M urea using 200 mM bicarbonate buffer and an aliquot of ~25 µg protein was digested using sequencing grade trypsin (V511A, Promega) at 37 °C overnight (trypsin to protein ratio of ~1:50). Finally, protein digests were desalted using an Oasis HLB 96 well plate (WAT058951, Waters) according to the manufacturer's protocol. The purified peptide eluate was dried using a speed-vac concentrator. The protocol used for ZIC-HILIC extraction of glycopeptides can be found in the supplementary materials.
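The dilution and enzyme amounts in the digestion step follow from simple C1·V1 = C2·V2 arithmetic. The following Python sketch works through the numbers stated above; it assumes that the DTT, IAA and bicarbonate solutions contain no urea, and the function names are illustrative only and not part of the published protocol.

```python
# Dilution and enzyme arithmetic for the digestion step (illustrative sketch).
# Assumption: the DTT, IAA and dilution buffer solutions contain no urea.

def concentration_after_mixing(stock_molar, stock_ul, added_ul):
    """Concentration after adding solute-free liquid to a stock (C1*V1 = C2*V2)."""
    return stock_molar * stock_ul / (stock_ul + added_ul)

def buffer_needed_ul(current_molar, current_ul, target_molar):
    """Volume of solute-free buffer that dilutes a solution to target_molar."""
    final_ul = current_molar * current_ul / target_molar
    return max(final_ul - current_ul, 0.0)

def enzyme_mass_ug(protein_ug, protein_per_enzyme=50.0):
    """Enzyme mass for a given protein mass at a 1:protein_per_enzyme ratio."""
    return protein_ug / protein_per_enzyme

# 100 uL of 6 M urea, then 30 uL DTT and 30 uL IAA solution are added.
urea_after_alkylation = concentration_after_mixing(6.0, 100.0, 30.0 + 30.0)
# Buffer volume at which the urea concentration reaches exactly 1 M.
buffer_ul = buffer_needed_ul(urea_after_alkylation, 160.0, 1.0)
# Trypsin for a 25 ug protein aliquot at a ratio of 1:50.
trypsin_ug = enzyme_mass_ug(25.0)

print(f"urea after DTT/IAA addition: {urea_after_alkylation:.2f} M")
print(f"buffer to reach 1 M urea: {buffer_ul:.0f} uL")
print(f"trypsin for 25 ug protein: {trypsin_ug:.2f} ug")
```

Under these assumptions, the 160 µL reduced and alkylated sample is at 3.75 M urea, so more than 440 µL of buffer is required to fall below 1 M, and a 25 µg aliquot corresponds to about 0.5 µg trypsin.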

Whole cell lysate shotgun (meta)proteomics and targeted experiments

The vacuum dried peptide fractions were resuspended in H2O containing 3% acetonitrile and 0.1% formic acid under careful vortexing. An aliquot corresponding to ~250 ng protein digest was analyzed in each case using a one-dimensional shotgun proteomics approach [74]. Briefly, samples were injected into a nano-liquid-chromatography system consisting of an EASY nano LC 1200, equipped with an Acclaim PepMap RSLC RPC18 separation column (50 µm x 150 mm, 2 µm and 100 Å), and a QE plus Orbitrap mass spectrometer (Thermo Scientific, Germany). Unless otherwise specified, the flow rate was maintained at 300 nL/min over a linear gradient using H2O containing 0.1% formic acid as solvent A, and 80% acetonitrile in H2O and 0.1% formic acid as solvent B. Solvent gradients and acquisition modes used for the individual experiments are detailed in the following.

(A) High-resolution MS2 mass binning experiments: The peptides were analyzed using a gradient from 4% to 30% solvent B over 32.5 min, and finally to 70% solvent B over 12.5 min. The Orbitrap was operated in data-dependent acquisition (DDA) mode acquiring peptide signals from 500–1500 m/z at 70 K resolution with an AGC target of 3e6. The top 10 signals were isolated with a 1.6 m/z window and fragmented using an NCE of 30. The AGC target was set to 2e5, at a max IT of 100 ms, a fixed first mass of 120, and a resolution of 140 K. Dynamic exclusion was set to 20 s. Mass peaks with an unassigned charge state, or with charge states of 1, 7 and >7, were excluded from fragmentation.

(B) Whole cell lysate shotgun glycoproteomics: The peptides were analyzed using a gradient from 5% to 30% solvent B over 85 min, and finally to 75% B over 25 min. The Orbitrap was operated in data-dependent acquisition (DDA) mode acquiring peptide signals from 550 to 1500 m/z at 70 K resolution with an AGC target of 3e6. The top 10 signals were isolated with a 2.0 m/z window and fragmented using an NCE of 28. The AGC target was set to 2e5, at a max IT of 75/54 ms, a fixed first mass of 120, an isolation offset of 0.1 m/z, and a resolution of 17 K. Dynamic exclusion was set to 20 s. Mass peaks with an unassigned charge state, or with charge states of 1, 6 and >6, were excluded from fragmentation.

(C) Whole cell lysate shotgun proteomics: The peptides were analyzed using a gradient from 5% to 30% solvent B over 85 min, and finally to 75% B over 25 min. The Orbitrap was operated in data-dependent acquisition (DDA) mode acquiring peptide signals from 385 to 1250 m/z at 70 K resolution with an AGC target of 3e6. The top 10 signals were isolated with a 2.0 m/z window and fragmented using an NCE of 28. The AGC target was set to 2e5, at a max IT of 75/54 ms, a fixed first mass of 120, an isolation offset of 0.1 m/z, and a resolution of 17 K. Dynamic exclusion was set to 60 s. Mass peaks with an unassigned charge state, or with charge states of 1, 6 and >6, were excluded from fragmentation.

(D) Analysis of in-gel digested proteins: The peptides were analyzed using a gradient from 5% to 25% solvent B over 25 min, and finally to 60% solvent B over 10 min. The flow rate was maintained at 350 nL/min. The Orbitrap was operated in data-dependent acquisition (DDA) mode acquiring peptide signals from 400 to 1400 m/z at 70 K resolution with an AGC target of 3e6. The top 10 signals were isolated with a 1.6 m/z window and fragmented using an NCE of 28. The AGC target was set to 5e4, at a max IT of 150 ms, a fixed first mass of 120, an isolation offset of 0.1 m/z, and a resolution of 17 K. Dynamic exclusion was set to 60 s. Mass peaks with an unassigned charge state, or with charge states of 1, 6 and >6, were excluded from fragmentation.

(E) In-source fragmentation experiments: The peptides were analyzed using a gradient from 6% to 30% solvent B over 40 min, and finally to 60% B over 15 min. The flow rate was maintained at 350 nL/min. The Orbitrap was operated in positive ionization mode acquiring signals alternating between PRM and Full MS-SIM mode. The PRM mode was performed at an in-source CID of 75 eV, isolating the target sugar fragments with an isolation window of 0.4 m/z, at 0.1 m/z isolation offset and a loop count of 9. Fragmentation was performed using an NCE of 25, acquiring fragments at a resolution of 70 K, using an AGC target of 5e5 and a max IT of 150 ms. The fixed lowest mass was set to 50 m/z. The Full MS-SIM mode was operated with an in-source CID of 75 eV, acquiring full scan mass spectra at 70 K resolution, at an AGC target of 3e6 and a max IT of 60 ms, over a mass range of 140–1400 m/z. PRM carbohydrate fragment targets for the Ca. Kuenenia stuttgartiensis enrichment were set to 193, 175, 147, 204, 218, 261, 275, 163, 407; for the Ca. Brocadia sapporoensis enrichment to 147, 204, 218, 232, 334, 133, 352; and for the mammalian protein control sample to 147, 204, 292, 163.

SDS-PAGE, glycostaining and in-gel proteolytic digestion are described in the supplementary materials.

PEAKS database search

Whole cell lysate shotgun proteomics raw data (see B/C) were analyzed using PEAKS Studio X (Bioinformatics Solutions Inc., Canada) against databases constructed by metagenomics as described below. Database search was performed employing a two-round search strategy, where the first round was used to construct a focused protein sequence database, thereby allowing peptide spectrum matches up to 5% false discovery rate and protein matches without unique peptide assignments. The database search was performed allowing 50 ppm mass error, 0.01 Da fragment error tolerance, considering 2 missed cleavages, oxidation/deamidation as variable modifications and carbamidomethylation as fixed modification. The cRAP protein sequences were downloaded from ftp://ftp.thegpm.org/fasta/cRAP. The second-round search was performed including the identified oligosaccharide masses as variable modifications, allowing up to 2 variable modifications per peptide. Peptide spectra were filtered against 0.1% false discovery rate and reported protein identifications required ≥1 unique peptide, where protein identifications with ≥2 unique peptides were considered as significant.
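The two filters named in the search settings, a precursor mass tolerance in ppm and a false discovery rate cutoff, can be illustrated with a short Python sketch. The FDR estimate below is the generic target-decoy ratio (decoys divided by targets above a score threshold). It is shown only to make the concept concrete and is not a description of how PEAKS computes its FDR internally. All function names and example scores are hypothetical.

```python
# Generic illustration of a ppm tolerance check and a target-decoy FDR cutoff.
# Assumption: FDR is estimated as decoys/targets above a score threshold,
# which is a common convention and not necessarily the PEAKS implementation.

def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6

def within_tolerance(observed_mz, theoretical_mz, tol_ppm=50.0):
    return abs(ppm_error(observed_mz, theoretical_mz)) <= tol_ppm

def score_cutoff_at_fdr(psms, max_fdr):
    """Lowest score at which the running decoy/target ratio is <= max_fdr.

    psms is a list of (score, is_decoy) tuples; returns None if no
    threshold satisfies the requested FDR.
    """
    targets = decoys = 0
    cutoff = None
    for score, is_decoy in sorted(psms, key=lambda p: p[0], reverse=True):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= max_fdr:
            cutoff = score
    return cutoff

# Hypothetical peptide spectrum matches: (score, is_decoy)
example_psms = [(10, False), (9, False), (8, False), (7, True), (6, False)]
strict_cutoff = score_cutoff_at_fdr(example_psms, 0.05)
relaxed_cutoff = score_cutoff_at_fdr(example_psms, 0.25)
print(strict_cutoff, relaxed_cutoff)
```

In this toy example the stricter threshold keeps only the three top-scoring target matches, whereas the relaxed threshold also admits matches scoring below the single decoy hit. This mirrors the role of the permissive 5% first round and the stringent 0.1% final filter.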

Verification of glycopeptide spectrum matches

Matlab R2017b was further used to score glycopeptide spectrum matches for the additional presence of expected oxonium ions and for whether glycan structures were identified in the same scans by parent ion offset binning. Only proteins/species that provided spectra showing variable modification search identifications, oxonium ions and structural identifications in the same spectra were considered as confirmed matches. The BYONIC open modification search procedure is described in the supplementary materials.
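The confirmation logic described above requires three lines of evidence in the same spectrum. A minimal Python sketch of such a check is given below; it is not the authors' Matlab code. The ±7.5 ppm tolerance is borrowed from the fragment matching step of the glycosylation analysis and is an assumption here, as are the function names. The example targets are the widely used HexNAc oxonium ion (204.0867 m/z) and its water-loss fragment (186.0761 m/z).

```python
# Sketch of the three-way confirmation of a glycopeptide spectrum match.
# Assumptions: +/-7.5 ppm tolerance and at least one expected oxonium ion.

def has_peak(peaks_mz, target_mz, tol_ppm):
    """True if any peak lies within tol_ppm of target_mz."""
    tol = target_mz * tol_ppm * 1e-6
    return any(abs(mz - target_mz) <= tol for mz in peaks_mz)

def is_confirmed(peaks_mz, oxonium_mz, has_modification_id, has_offset_id,
                 tol_ppm=7.5, min_oxonium=1):
    """Require a variable modification identification, a parent ion offset
    identification and expected oxonium ions in the same spectrum."""
    found = sum(has_peak(peaks_mz, target, tol_ppm) for target in oxonium_mz)
    return has_modification_id and has_offset_id and found >= min_oxonium

# HexNAc oxonium ion and its water-loss fragment (well-known reference values)
hexnac_targets = [204.0867, 186.0761]

# Hypothetical MS2 peak lists
glyco_spectrum = [138.0551, 186.0760, 204.0866, 366.1400]
plain_spectrum = [147.1128, 175.1190, 258.1448]

print(is_confirmed(glyco_spectrum, hexnac_targets, True, True))
print(is_confirmed(plain_spectrum, hexnac_targets, True, True))
```

A spectrum that lacks either the database search identification or the offset-binning identification is rejected even when oxonium ions are present, which reflects the conservative "all three in the same spectrum" criterion.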

Glycosylation analysis

Sugar-miner

Identification of strain-specific carbohydrate fragments was established by the following steps: (A) mass binning of very high-resolution shotgun proteomics data; (B) establishing a theoretical chemical composition space for carbohydrate fragments; (C) annotation of (non-peptide) mass peaks with possible carbohydrate compositions; (D) verifying the annotations by investigating the presence of water-loss clusters, and by determining the co-occurrence (correlation) of the cluster peaks across the entire proteomics run. More specifically:

(A) Mass spectrometric shotgun raw data acquired at very high resolution (140 K) were converted using peak picking “vendor” into Mascot Generic Format (MGF), considering only second-level scans, using the msConvertGUI tool (ProteoWizard). For the comparator E. coli K12 sample an additional absolute intensity threshold peak filter of 25 K was applied. The MGF files were imported into the Matlab environment using the Matlab “textscan” function. The mass peaks from all second-level (MS2) scans were further combined into a single matrix, and masses in the range from 110–325 m/z were binned into 0.0001 m/z windows using the “histcounts” function. The obtained raw traces were further corrected for mass drifts by alignment to known amino acid fragment peaks (147.1128; 175.1190; 201.1234; 215.1390; 228.1343; 258.1448; 292.1292 m/z) using the “msalign” function, and further normalised to 100 using the “msnorm” function. The thereby generated raw traces were converted into (centroided) peak lists using the “mspeaks” function, employing a heightfilter of 0.02% (relative to the largest peak). The same procedure (except using a relative heightfilter of 0.1%) was applied simultaneously to the E. coli K12 non-glycosylated comparator strain dataset.

(B) An empirical carbohydrate fragment composition space was constructed considering the elemental composition space C5–14 H4–28 N0–2 O2–12 S0–1. The compositions were further filtered for realistic structures evaluating C/H and CO/N ratios, the degree of unsaturation (DBEs), the mass defect and the min/max absolute masses. For details see SI-EXCEL-T35 and SI-DOC-Fig-S2.

(C) To identify possible carbohydrate signals in the sample data, the established sample peak list was matched with the empirical composition space at a mass tolerance of 0.75 ppm. Mass peaks also present in the non-glycosylated E. coli K12 comparator at a relative level >0.5 were considered as amino-acid-related.

(D) The established carbohydrate fragment candidates were further evaluated for the presence of water-loss clusters (−18.01 mass deltas between fragments), commonly observed when fragmenting carbohydrate compounds. To ensure that the assigned water-loss clusters/pairs are part of the same parent structure, the clusters were evaluated for co-occurrence within the same scans across the shotgun proteomics dataset. For this, the conventional mass resolution shotgun dataset (of the same sample) was converted to the open mzXML data format using the msConvertGUI tool (ProteoWizard), considering first- and second-level scans. The mzXML files were imported into the Matlab environment using the “mzxmlread” function. The obtained “mzxmlstruct” structure was processed using “mzxml2peaks” and “arrayfun” to extract first- and second-level spectral information. By doing so, an extracted ion chromatogram was collected (within a ±7.5 ppm mass window) for every carbohydrate candidate, containing information about the occurrence across the scans. To evaluate the degree of co-occurrence (=correlation) between individual carbohydrate fragments, the matrix was further converted into a “correlation matrix”, by dividing the total occurrence by the number of scans shared between fragments. A correlation of 1 indicates that all scans are shared, whereas a correlation of 0 means that two carbohydrate fragments do not share any scans. To avoid accumulation of background signals, a minimum intensity of 5E4 was required. Only water-loss pairs/clusters with a correlation >0 between, and >0.5 within the entire water-loss cluster were considered for further processing. In addition, the carbohydrate signals were required to reach a minimum relative intensity (0.01) or a minimum number of counts (10) across the entire dataset. The established carbohydrate compounds were exported to an Excel table.
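Steps (C) and (D) above can be sketched compactly in Python. The example below annotates peaks against a small candidate composition list at 0.75 ppm, searches for water-loss pairs and scores scan co-occurrence. It is a simplified illustration and not the published Matlab pipeline. In particular, co-occurrence is computed here as shared scans divided by all scans containing either fragment, which is one possible reading of the described correlation, and the absolute tolerance for the water-loss delta, the candidate list and all names are assumptions.

```python
# Simplified sketch of composition matching, water-loss pairing and scan
# co-occurrence. Monoisotopic masses are standard reference values.

MONO = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}
ELECTRON = 0.00054858
WATER = 2 * MONO["H"] + MONO["O"]  # 18.0105647 Da

def cation_mz(formula):
    """m/z of a singly charged cation given as an element-count dict."""
    return sum(MONO[el] * n for el, n in formula.items()) - ELECTRON

def annotate(peaks_mz, candidates, tol_ppm=0.75):
    """Map candidate names to observed peaks within tol_ppm."""
    hits = {}
    for name, formula in candidates.items():
        target = cation_mz(formula)
        for mz in peaks_mz:
            if abs(mz - target) / target * 1e6 <= tol_ppm:
                hits[name] = mz
    return hits

def water_loss_pairs(masses, tol_da=0.0005):
    """Pairs (heavier, lighter) separated by one water (assumed tolerance)."""
    return [(hi, lo) for hi in masses for lo in masses
            if abs((hi - lo) - WATER) <= tol_da]

def co_occurrence(scans_a, scans_b):
    """Shared scans divided by all scans with either fragment (0 to 1)."""
    a, b = set(scans_a), set(scans_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical candidate compositions (HexNAc oxonium ion and water loss)
candidates = {
    "HexNAc": {"C": 8, "H": 14, "N": 1, "O": 5},
    "HexNAc-H2O": {"C": 8, "H": 12, "N": 1, "O": 4},
}
peaks = [175.11895, 186.07609, 204.08665]  # 175.119 is a peptide fragment

hits = annotate(peaks, candidates)
pairs = water_loss_pairs(sorted(hits.values()))
shared = co_occurrence([1, 2, 3, 4], [2, 3, 4, 5])
print(hits, pairs, shared)
```

With the thresholds stated in the text, a pair such as the one found here would be retained only if its co-occurrence exceeded the required correlation and the signals passed the intensity or count filters.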

Glyco-mod-pro

The oligosaccharide profiles modifying the proteins were established by the following steps: (A) identifying scans in a shotgun proteomics run which contain the identified carbohydrate fragments; (B) calculating the “parent ion offset” numbers and (C) binning the offset numbers to identify the reoccurring mass deltas (=oligosaccharide chains). More specifically:

(A) Mass spectrometric raw files of conventional shotgun proteomics analysis runs were converted into mzXML files using the msConvertGUI tool (ProteoWizard). Files were further imported into the Matlab environment using the “mzxmlread” function. Furthermore, a table of the exact masses of the identified carbohydrate fragments, as established by the previous sugar-mining step, was provided as an Excel table and imported into the Matlab environment using the “xlsread” function. The constructed “mzxmlstruct” structure was further processed using the “mzxml2peaks” and “arrayfun” functions to extract first- and second-level spectral information. A matrix was created containing mass/charge (m/z) values, ion intensities, scan numbers (and related parameters) of mass peaks matching the identified carbohydrate fragments, within a tolerance of ±7.5 ppm.

(B) The parent ion offset was further calculated for every second-level (MS2) spectrum containing a particular carbohydrate fragment. Briefly, for a particular fragment, all scans containing the carbohydrate fragment were collected. Scans with a unique parent ion mass (2 digits) were processed one by one, using a Matlab “for” loop. First, the complete second-level (MS2) spectrum was extracted. Peaks with a mass delta to neighboring masses indicating a charge state >1 were deconvoluted to singly charged analogs. The scan was only further processed when the carbohydrate intensity was above the specified intensity threshold. The complete mass peak list was then subtracted from its (singly charged) parent ion mass. The thereby generated (negative) mass numbers (=parent offsets) were collected in a separate matrix. By assuming a minimum peptide mass of 500 Da (or roughly 5 amino acids), offset numbers <(500 – (parent mass)) were excluded. The generation of the parent offset numbers was repeated for every scan containing a particular carbohydrate fragment.

(C) The collected parent offset numbers (for a particular carbohydrate fragment) were finally trimmed to the specified mass range (0–2000 Da, unless otherwise specified), converted into absolute values and binned into 0.01 Da windows, and visualized using the “histogram” function. Alternatively, the parent ion offset binning of the complete shotgun proteomics dataset was performed using exactly the same approach as described for spectra filtered for the occurrence of certain carbohydrate fragments. Furthermore, the intensity normalized low mass bins for spectra containing a certain oligosaccharide chain were generated by binning the low mass range, after normalizing every mass peak within a scan to the total peak intensity (100). Peaks with a relative abundance below 0.5 were not further considered. High-resolution mass binning to obtain the accurate mass of the oligosaccharide modification was performed by binning a focused mass range from 350–425, or 1150–1250 m/z, respectively, at a bin size of 0.01 units. Oligosaccharide variable modification masses for PEAKS database search were obtained by annotating the most abundant isotope within an oligosaccharide isotope cluster (after mass binning at a bin size of 0.05 m/z, from 150–2000 Da, and normalization) using the “mspeaks” function.

Methods used for metabolomics experiments of released nonulosonic acids and activated sugars can be found in the supplementary materials, SI-DOC-Table-S6 and SI-DOC-Fig-S10.
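The core of the parent ion offset binning can be illustrated with a short Python sketch. It assumes that all masses have already been deconvoluted to singly charged values and that scans have been pre-filtered for the carbohydrate fragment of interest; charge deconvolution and intensity thresholds are omitted. Function names, the minimum count and the example scans are hypothetical and do not reproduce the published Matlab implementation.

```python
# Sketch of parent ion offset binning on singly charged masses.
from collections import Counter

def parent_offsets(parent_mass, fragment_masses, min_peptide_mass=500.0):
    """Absolute parent-minus-fragment offsets for one MS2 scan.

    Offsets below (min_peptide_mass - parent_mass) are discarded, which is
    equivalent to ignoring fragments lighter than the minimum peptide mass.
    """
    offsets = []
    for fragment in fragment_masses:
        offset = fragment - parent_mass  # negative number
        if offset < min_peptide_mass - parent_mass:
            continue
        offsets.append(abs(offset))
    return offsets

def recurring_deltas(scans, bin_width=0.01, max_offset=2000.0, min_count=2):
    """Bin offsets of all scans and return bins seen at least min_count times."""
    counts = Counter()
    for parent_mass, fragment_masses in scans:
        for offset in parent_offsets(parent_mass, fragment_masses):
            if 0.0 <= offset <= max_offset:
                counts[round(offset / bin_width)] += 1
    return {round(index * bin_width, 2): n
            for index, n in counts.items() if n >= min_count}

# Two hypothetical glycopeptide scans carrying the same 203.08 Da modification:
# (singly charged parent mass, singly charged fragment masses)
example_scans = [
    (1203.58, [204.09, 600.30, 1000.50]),
    (1703.83, [204.09, 800.40, 1500.75]),
]
deltas = recurring_deltas(example_scans)
print(deltas)
```

Offsets that originate from ordinary peptide backbone fragments differ from scan to scan, whereas the loss of a common oligosaccharide chain produces the same mass delta in many scans and therefore accumulates in a single bin.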

Genomics

Metagenomic sequencing

DNA from the Ca. Brocadia sapporoensis enrichment culture was extracted using the DNeasy UltraClean Microbial Kit (Qiagen, The Netherlands). Following extraction, DNA was checked for quality by gel electrophoresis and by using a Qubit 4 Fluorometer (Thermo Fisher Scientific, USA). Metagenomic sequencing was performed by Novogene Ltd. (Hong Kong, China). Briefly, for library construction, a total amount of 1 μg DNA per sample was used as input material. Sequencing libraries were generated using the NEBNext Ultra DNA Library Prep Kit (NEB #E7645, USA) following the manufacturer’s recommendations. The DNA sample was fragmented by sonication to a size of 350 bp, then DNA fragments were end-polished, A-tailed, and ligated with a full-length adapter for further PCR amplification. PCR products were purified (AMPure XP system) and libraries were analyzed for their size distribution using an Agilent 2100 Bioanalyzer, and quantified using real-time PCR. The clustering of the index-coded samples was performed on a cBot Cluster Generation System according to the manufacturer’s instructions. After cluster generation, the library preparations were sequenced to generate paired-end reads (HiSeq sequencing platform, Illumina Inc., US).

Sampling from the Ca. Kuenenia stuttgartiensis enrichment culture and sequencing is described in Lawson et al., 2020 [75].

Metagenome assembly and binning

Raw reads were quality checked with FastQC v0.11.7 (www.bioinformatics.babraham.ac.uk/projects/fastqc/), and low-quality reads were trimmed using Trimmomatic v0.39 [76] with the default settings for paired-end reads. Trimmed reads were assembled using metaSPAdes v3.13.0 [77] with default settings, resulting in 51,275 scaffolds of ≥ 1 kb. Metagenome binning was performed using three different binning algorithms: BusyBee Web [78], MaxBin 2.0 v2.2.4 [79] and MetaBAT2 v2.12.1 [80]. The three bin sets were supplied to DAS Tool v1.1.0 [81] for consensus binning to obtain the final optimized bins, which resulted in 47 metagenome-assembled genomes (MAGs). Genome bins were assessed for completeness and contamination using CheckM v1.0.12 [82]. As a result, 27 high-, 19 medium-, and 1 low-quality MAGs in accordance with minimum information about metagenome-assembled genome (MIMAG) standards [83] were reconstructed. MAGs were classified taxonomically using GTDB-Tk v1.0.2 and the Genome Taxonomy Database (release 89). The reconstructed genomes were annotated through the NCBI Prokaryotic Genome Annotation Pipeline [84]. Annotation of the protein-coding genes was performed using the GhostKOALA tool [85] (accessed October 2019) for Kyoto Encyclopedia of Genes and Genomes (KEGG) enzyme codes and supported with BLASTp (E value <1e-20) [86] searches against the NCBI non-redundant protein database. Phylogenetic analysis, proteome isoelectric point calculation and proteome sequence homology determination are outlined in the supplementary materials.
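The quality tiers referred to above can be expressed as simple thresholds on completeness and contamination. The Python sketch below uses the commonly cited MIMAG cutoffs (high: >90% completeness and <5% contamination; medium: ≥50% and <10%; low: <50% and <10%). It is a simplification, because the high-quality tier additionally requires rRNA and tRNA genes, which are not checked here. The bin names and values are hypothetical and are not taken from this study.

```python
# Simplified MIMAG-style quality tiers from completeness and contamination
# percentages. Assumption: rRNA/tRNA criteria for the high-quality tier are
# not evaluated in this sketch.
from collections import Counter

def mimag_quality(completeness, contamination):
    if completeness > 90.0 and contamination < 5.0:
        return "high"
    if completeness >= 50.0 and contamination < 10.0:
        return "medium"
    if completeness < 50.0 and contamination < 10.0:
        return "low"
    return "unclassified"

# Hypothetical bins: name -> (completeness %, contamination %)
example_bins = {
    "bin_01": (98.2, 1.1),
    "bin_02": (72.5, 3.4),
    "bin_03": (41.0, 2.0),
    "bin_04": (95.0, 7.5),
}

tally = Counter(mimag_quality(c, x) for c, x in example_bins.values())
print(dict(tally))
```

A bin such as the hypothetical bin_04 illustrates why both metrics matter: despite its high completeness, its contamination places it in the medium-quality tier.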