Introduction

Microbial utilization of organic matter is one of the drivers regulating the marine carbon cycle1. Most of the marine organic matter is biosynthesized in the surface ocean, and about 20% is exported to the ocean’s interior via various mechanisms2,3. Particulate organic matter (POM) has been proposed as the main trophic basis for heterotrophic microbes in the deep sea because the bioavailability of the bulk of dissolved organic matter (DOM) is low for deep-sea microbes4,5,6. The POM flux driven by zooplankton is proposed to become prominent in the deep ocean as most sinking algal material is consumed within the top 1000 m7,8,9. Additionally, zooplankton can release large quantities of urea10,11, which serves as an additional ammonia source supporting nitrification-mediated dissolved inorganic carbon (DIC) fixation in the aphotic zone12. Thus, zooplankton activity and the associated release of urea increase organic matter availability in the meso- and bathypelagic ocean and sustain the metabolic activity of deep-sea microbes13,14,15. Gammaproteobacteria are the main producers of extracellular enzymes16. Even in the deep ocean, they can hydrolyze POM, and their metabolic activity there is comparable to that of other taxonomic groups despite being impacted by high hydrostatic pressure16,17. Yet, the relatively low abundance of Gammaproteobacteria based on 16S rRNA genes18 appears to contrast with their activity and might be indicative of significant cell losses due to viral lysis of this bacterial group19.

All these findings indicate potentially intrinsic links between zooplankton, bacteria, archaea, and viruses in the carbon cycling process. However, evidence supporting these connections remains poor. Specifically, the relative contribution of zooplankton and algae to the POM pool throughout the oceanic water column remains unclear. Whether urea facilitates DIC fixation in the global ocean is also not well documented, as is the role of Gammaproteobacteria in POM remineralization and their interaction with viruses.

We characterized the structure and function of marine microbial communities using a metaproteomics approach to assess the relative abundance of proteins in individual taxa. We also focused on microbial enzyme expression profiles to determine the links between organic matter supply and microbial activities. Proteins are essential biomolecules for microorganisms but also significant sources of organic matter. While the protein expression level is a direct response to the (micro)environment20,21,22, the abundance of proteins is a measure of the contribution of individual populations to the total biomass in a specific depth layer of the oceanic water column23. Despite the relatively low resolution of metaproteomics compared to metagenomic and metatranscriptomic analysis, the non-amplification nature of mass spectrometry analysis supports comparison at the protein level across superkingdoms. It produces semi-quantitative results on the contribution of each taxon to the total protein pool23.

In this work, we show that the taxonomic origin and the expression pattern of the marine microbial protein pool are size-fractionated. The differences between size fractions highlight the essential role of microbial activities in mediating the flux of organic matter from the POM pool to the DOM pool.

Results and discussion

Protein profiles of the marine plankton community

We collected 61 metaproteomics samples from 22 stations between 5 m and 4000 m depth from the major ocean basins and three size-fractions (>0.8 μm, 0.2–0.8 μm, and <0.2 μm, 48 samples from 16 stations covered all three size-fractions) to cover eukaryotes, bacteria, archaea, and viruses (Figs. 1A, S1A–C, Supplementary Data 1, see Methods). Data-dependent metaproteomic analysis heavily relies on the completeness of the sequence database for spectral identification. Thus, a well-curated gene catalog from global scale metagenomic and metatranscriptomic surveys24,25,26,27 significantly improves the metaproteomic protein identification and functional profiling. We employed an optimized database construction strategy for robust protein identification28. Metagenomic assemblies from the same sampling stations were combined with publicly available metagenomics/metatranscriptomics assemblies to obtain a deep coverage of microorganisms in the ocean (including micro-eukaryotes, bacteria, archaea, and viruses) throughout the entire water column from global ocean expeditions24,25,26,27. A two-step search approach was implemented to minimize the high false discovery rate in protein identification caused by a large database29. Searching against such a comprehensive database covering marine micro-eukaryotes, prokaryotes, and viruses (Table S1), we identified 234,550 protein entries (Table S2, see Methods). Among them, 90% were bacterial (156,187) and eukaryotic (55,494) sequences, while archaeal (7163) and viral (6213) sequences contributed less than 10% (Table S3). The taxonomic occurrence of identified proteins at the superkingdom level was similar to the corresponding genes in the database (Fig. 1A). A clear difference was found, however, when grouping the proteins/genes according to the lowest taxonomic linkages (taxonomic IDs, Fig. 1A). 
While a linear relationship was found between the protein and the corresponding gene occurrence in the database, protein sequences affiliated to Alteromonadales (including Alteromonadaceae and Alteromonas) showed high occurrence (~1%) in the metaproteome, although the occurrence of the corresponding genes in the database was low (<0.1%) (Fig. 1A). In the gene database derived from metagenomic and metatranscriptomic analyses, less than 50% of predicted genes were taxonomically and functionally classified24,25,30. In the metaproteome, however, more than 70% of the identified protein sequences (ca. 150,000 protein sequences) were taxonomically classified at the class level and functionally annotated in at least one functional database (Fig. S1D, E), which is similar to other metaproteomic results covering both micro-eukaryotes and prokaryotes31. The function of identified proteins in the metaproteome based on the Clusters of Orthologous Groups (COG) was also distinct from that of the corresponding genes in the database (Fig. 1B). Proteins involved in amino acid metabolism had a high occurrence in bacteria and archaea, and proteins involved in intracellular trafficking, secretion, and cytoskeleton interlinking showed high occurrence in eukaryotes. A clear size-fractionated pattern in sequence richness was observed between superkingdoms (Fig. 1C): eukaryotic proteins were rich in the >0.8 µm size fraction, bacterial and archaeal proteins dominated the 0.2–0.8 µm fraction, and viral proteins were mainly found in the <0.2 µm fraction. Depth stratification was also pronounced for eukaryotes and archaea, while the richness of bacterial proteins did not change with depth (Fig. 1D). Previous reports show that protein occurrence in marine microorganisms (micro-eukaryotes, prokaryotes and viruses) varies across different size-fractions (particulate, free-living and dissolved fractions)32,33,34,35. 
Protein occurrence correlated strongly with protein relative abundance (Fig. 1E, R2 ≥ 0.49, p < 0.001). The major contributors to the marine protein pool varied across size fractions. Proteins from cyanobacteria, eukaryotic algae, and zooplankton (copepods) exhibited a high occurrence and relative abundance in the >0.8 µm fraction. Heterotrophic bacterial proteins showed high occurrence and relative abundance in the <0.8 µm (0.2–0.8 µm and <0.2 µm) fractions with taxonomic variations. For example, proteins from SAR11, Chloroflexi, and Bacteroidetes showed high occurrence and relative abundance in the 0.2–0.8 µm fraction, but the occurrence and relative abundance of proteins from Alteromonadales were high in the <0.2 µm fraction. Proteins from prokaryotic autotrophs such as Thaumarchaea also exhibited a high occurrence and relative abundance in the 0.2–0.8 µm fraction. Among viral proteins, sequences related to Myoviridae were the most abundant (in both occurrence and relative abundance) in the <0.2 µm fraction (Fig. 1E).
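The comparison of protein occurrence with relative abundance rests on a rank correlation (Fig. 1E reports a two-sided Spearman test). As a minimal sketch of how such a coefficient could be computed, the Python below ranks both variables (averaging ties) and correlates the ranks. The function names and the example values are hypothetical and are not taken from the dataset; the p-value calculation is omitted.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-taxon means (percent); illustrative values only.
occurrence = [0.1, 0.5, 1.0, 2.0, 4.0]
relative_abundance = [0.2, 0.4, 1.5, 1.2, 5.0]
rho = spearman_rho(occurrence, relative_abundance)
```

Because only ranks enter the calculation, the coefficient is insensitive to the skewed abundance distributions typical of such data.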

Fig. 1: Protein sequences identified in the metaproteome.

A Relationship between protein occurrence and the occurrence of corresponding genes in the database. The protein/gene sequences were grouped according to NCBI taxonomy IDs. The dashed line represents the 1:1 ratio. The inserted bar-plot shows the taxonomic composition in the gene database and total identified proteins at superkingdom level. B Comparison between gene occurrence and protein occurrence according to clusters of orthologous groups (COGs). C, D Rarefaction curves of detected proteins from different size-fractions and depths. E Relationship (two-sided Spearman’s rank correlation coefficient) between protein occurrence and protein relative abundance in the metaproteome dataset. Protein sequences were grouped according to NCBI taxonomy IDs. Each datapoint represents the mean value of relative abundance (x-axis) and protein occurrence (y-axis) from the same depth layer in each fraction. Taxonomic labels are given to the datapoint where the highest mean relative abundance was detected. Uncl, unclassified; α-, alphaproteobacteria; β-, betaproteobacteria; γ-, gammaproteobacteria; δ-, deltaproteobacteria. Source data are provided as a Source Data file.

We further performed a functional assessment of the metaproteome using KEGG-orthologue (KO) based analysis36. Protein sequences were grouped into 3817 KOs (Supplementary Data 3). Compared to the metagenomic and metatranscriptomic analysis, the total number of KOs was relatively low in our metaproteome (Fig. S2A, B), partly due to the limited sample size and the resolution of mass spectrometry. However, KOs with high relative abundances in the metagenome and metatranscriptome were also highly abundant in the metaproteome (Fig. S2C, D). The functional composition (based on the relative abundance of KOs) of the metaproteome, however, was significantly different (Figs. 2A, S2C, PERMANOVA, p < 0.05) from the metagenome and metatranscriptome for both the eukaryotic-enriched (>0.8 µm) and prokaryotic-enriched (0.2–3 µm) community24,25,30. Diversity analysis on KOs showed that the metaproteomics dataset had the lowest alpha diversity but a high beta diversity (Wilcoxon test, p < 0.05, Fig. 2B, C). Clusters of size fractions were found in the metagenomic and metatranscriptomic datasets (Fig. 2A), as well as at the metaproteome level (Fig. 2D, PERMANOVA, p < 0.05). However, the metaproteomic samples were collected from disparate ocean regions (Fig. S1A). The metaproteome KO profile of the >0.8 μm size-fraction exhibited the lowest alpha diversity but the highest variance (Bray-Curtis dissimilarity, Fig. 2E, F). In contrast, in the <0.8 μm size-fraction (both 0.2–0.8 μm and <0.2 μm), alpha-diversity was highest and variance lowest (Wilcoxon test, p < 0.05, Fig. 2E, F). These results suggest that the high variance in the metaproteome is driven by the KO profile of the >0.8 μm size-fraction, where protein diversity was low within sites but high across sites (Fig. 2E, F). As the samples were collected from diverse sites (Fig. S1A), differences in epipelagic biogeochemistry might have affected the phytoplankton community and shaped the diversity of particles37, which might have led to the high beta-diversity in the >0.8 μm size-fraction. A similar pattern was also found for the prokaryotic metatranscriptomes (Fig. 2B, C). The changes in the protein profiles between samples (Fig. 2B, C) imply that the genome (particularly the prokaryotic genome) responds to the environment by transcribing and translating proteins/enzymes adapted to specific functions with adequate expression levels20,38. This observation is consistent with the fact that while the protein-coding gene abundance reflects long-term adaptive mechanisms of microbial colonization between depths or size-fractions26,39, the transcription response25 and translational regulation of protein synthesis20 control the near-instantaneous microbial interaction with the environment such as particles38,40.
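The contrast between low within-site (alpha) diversity and high across-site (beta) diversity can be illustrated with a small calculation. The sketch below assumes the Shannon index as the alpha-diversity measure, which is an assumption because the text does not name the index, and uses the Bray-Curtis dissimilarity named above for the between-sample comparison. The KO identifiers and intensities are hypothetical placeholders.

```python
import math


def to_relative(intensities):
    """Convert raw KO intensities to relative abundances that sum to 1."""
    total = sum(intensities.values())
    return {ko: v / total for ko, v in intensities.items() if v > 0}


def shannon(profile):
    """Shannon index H' = -sum(p * ln p); assumed alpha-diversity measure."""
    return -sum(p * math.log(p) for p in profile.values() if p > 0)


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two KO abundance profiles."""
    kos = set(a) | set(b)
    diff = sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in kos)
    total = sum(a.get(k, 0.0) + b.get(k, 0.0) for k in kos)
    return diff / total


# Hypothetical samples: each site is dominated by a few KOs (low alpha
# diversity), but the two sites share no KOs (maximal beta diversity).
site_a = to_relative({"KO_1": 8.0, "KO_2": 2.0})
site_b = to_relative({"KO_3": 5.0, "KO_4": 5.0})

alpha_a = shannon(site_a)
alpha_b = shannon(site_b)
beta_ab = bray_curtis(site_a, site_b)
```

In this toy case each profile contains only two KOs, yet the dissimilarity between the sites reaches its maximum of 1, mirroring the pattern described for the >0.8 μm fraction.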

Fig. 2: Diversity profiles of the microbial metaproteome.

A Principal coordinate analysis of KO based functional profiles obtained from the metaproteome, metagenome, and metatranscriptome of the planktonic community. Euk, samples enriched in micro-eukaryotes (samples collected in the >0.8 µm fraction); Prok, samples enriched in prokaryotes (samples collected in the 0.2–3 µm fraction). Diversity comparison shows that the metaproteome (n = 61) has low α-diversity (B) but high β-diversity (C) in comparison to the metagenomics (eukaryotic metagenome, n = 158; prokaryotic metagenome, n = 345) and metatranscriptomics (eukaryotic metatranscriptome, n = 157; prokaryotic metatranscriptome, n = 521) datasets; the significance test (Wilcoxon test) was made using the metaproteome as reference. D Principal coordinate analysis of KO based functional profiles of metaproteomes between depths and size-fractions. Size-fraction based clustering patterns are observed (PERMANOVA, p < 0.05). Metaproteomic samples collected from the >0.8 μm (n = 19) fraction show lower α-diversity (E) but higher β-diversity (F) than samples collected from the 0.2–0.8 μm (n = 22) and <0.2 μm (n = 20) fractions. Box shows median and interquartile range (IQR); whiskers show 1.5 × IQR of the lower and upper quartiles or range; outliers extend to the data range. Statistics are based on the Wilcoxon test (two-sided). ****p < 0.0001; ns, not significant. Source data are provided as a Source Data file.

To determine the key enzymes/proteins in the different size fractions, we identified 1630 KOs (40% of total KOs) whose relative abundance differed significantly either among size-fractions (1387 KOs) or depth-strata (412 KOs) (Wilcoxon test, p < 0.05, Figs. S3, 4, Supplementary Data 4). There were 178 KOs differentially abundant in both depth layers and size-fractions, predominantly originating from eukaryotes (ca. 60%, pie chart in Fig. S3A). However, bacteria dominated the differentially abundant unique KOs in size-fractions (ca. 70%) and depth (55%) (pie chart in Fig. S3A). The highest number of differentially abundant KOs was found between the <0.2 μm and 0.2–0.8 μm size-fraction, and the difference between epi- and bathypelagic was substantial (Fig. S3B–D). In contrast, the number of differentially abundant KOs was relatively low between the <0.2 μm and >0.8 μm fraction (Fig. S3B). This KO distribution suggests that the bacterial proteins in the <0.2 μm fraction likely originate from the particle-attached community retained on the 0.8 μm pore-size filters.

Functional annotations of microbial protein profiles revealed that photosynthesis, nitrification/denitrification, microbial chemotaxis, and motility exhibited different expression profiles between size-fractions (Fig. S4, Supplementary Data 4). Fourteen KOs were predicted to be responsible for the functional clustering between the size-fractions using a machine-learning random forest41 classification (Figs. 3, S5). Among these KOs were enzymes involved in C1 metabolism (carbon-monoxide dehydrogenase, CoxL), CO2 fixation (Rubisco, RbcL), nitrification/denitrification (nitrate reductase/nitrite oxidoreductase, NarG/NxrA), sulfur metabolism (dimethylsulfide dehydrogenase, DdhA) and transporter proteins, all different in relative abundance in the different size-fractions (Fig. 3B, C). Only enzymes related to photosynthesis (RbcL) were depth-related (Fig. 3A), consistent with the dominance of phytoplankton in the euphotic zone. It is also noticeable that ribosomal proteins (RplV, RplX, RpsI, RpsR) exhibited their highest relative abundance in the <0.2 µm fraction. Ribosomal proteins are cytoplasmic proteins mediating protein synthesis inside cells. High expression levels of transcripts encoding ribosomal proteins were reported during viral infection processes because viruses employ a cellular metabolism for viral protein synthesis42. The high relative abundance of ribosomal proteins found in the <0.2 µm fraction suggests infected bacterial cells exhibit a high translation activity (probably due to viral replication) and release cellular components (including ribosomal proteins) into the ambient water after cell lysis.

Fig. 3: Relative abundance profiles and variable importance of the top 14 proteins/enzymes identified using random forest classification.

These 14 KOs were identified as feature KOs distinguishing KO profiles in metaproteomes collected from different size-fractions. The boxplot shows the relative abundance of these 14 KOs in different depth layers (A) and size-fractions (B). The importance of the 14 KOs in the different size-fractions predicted by random forest analysis is shown in the dot plot (C). The Mean Decrease Accuracy shows how much accuracy the model loses by excluding each variable, and the mean decrease in Gini coefficient reflects how each variable contributes to the homogeneity of the nodes and leaves in the random forest result. Box shows median and interquartile range (IQR); whiskers show 1.5 × IQR of the lower and upper quartiles or range; outliers extend to the data range. Epi, samples collected from epipelagic (<200 m, n = 15); Meso, samples collected from mesopelagic (200–1000 m, n = 16); Bathy, samples collected from bathypelagic (>1000 m, n = 30); >0.8 μm, samples collected from the >0.8 μm fraction (n = 19); 0.2–0.8 μm, samples collected from the 0.2–0.8 μm fraction (n = 22); <0.2 μm, samples collected from the <0.2 μm fraction (n = 20). Source data are provided as a Source Data file.

Zooplankton-supported deep-sea particulate protein flux

The taxonomic composition of the metaproteome also exhibited a size-clustering pattern (Fig. 4A, Supplementary Data 5). Eukaryotic and bacterial proteins together constituted 70–80% of the total proteome in our metaproteomic dataset, with varying contributions among the size-fractions (Fig. 4A). In the >0.8 μm size-fraction, the ratio between bacterial and eukaryotic proteins (Bac:Euk) was about 1 (Fig. 4A). In the 0.2–0.8 μm size-fraction, however, the Bac:Euk ratio of proteins was 3, and in the <0.2 μm fraction ~5. The increase in the Bac:Euk ratio of proteins towards the smaller size-fractions, particularly in the bathypelagic (Wilcoxon test, p < 0.05, Supplementary Data 5), indicates a shift in the source of organic matter, where eukaryotic and bacterial proteins contribute equally to the particle fraction but bacterial proteins dominate the dissolved protein pool.
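The Bac:Euk ratio is a simple quotient of relative abundances summarized per size-fraction. A minimal sketch of that arithmetic follows, assuming each sample is a record holding the bacterial and eukaryotic share of the proteome and that the per-fraction summary is the median of the per-sample ratios. The field names, the choice of the median, and the sample values are hypothetical and chosen only to reproduce ratios of roughly 1, 3, and 5.

```python
from statistics import median


def bac_euk_ratio(sample):
    """Ratio of bacterial to eukaryotic protein relative abundance."""
    return sample["bacteria"] / sample["eukaryota"]


def median_ratio_by_fraction(samples):
    """Group samples by size-fraction and take the median Bac:Euk ratio."""
    grouped = {}
    for sample in samples:
        grouped.setdefault(sample["fraction"], []).append(bac_euk_ratio(sample))
    return {fraction: median(ratios) for fraction, ratios in grouped.items()}


# Hypothetical relative abundances (percent of total proteome).
samples = [
    {"fraction": ">0.8 um", "bacteria": 40.0, "eukaryota": 40.0},
    {"fraction": "0.2-0.8 um", "bacteria": 60.0, "eukaryota": 20.0},
    {"fraction": "<0.2 um", "bacteria": 75.0, "eukaryota": 15.0},
]

ratios = median_ratio_by_fraction(samples)
```

A ratio near 1 corresponds to equal contributions of both groups, while values above 1 indicate a protein pool increasingly dominated by bacteria.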

Fig. 4: Changes in taxonomic composition of the eukaryotic, bacterial, and viral community in the metaproteome dataset.

A Relative abundance of proteins of Archaea, Bacteria, Eukaryotes and Viruses in each size-fraction. B Relative abundance of proteins of algae, zooplankton, and fungi in each size-fraction. C Relative abundance of proteins of major bacterial groups in each size-fraction. D Relative abundance of proteins of viruses infecting different bacterial hosts in each size-fraction. The points and ranges show the medians, the 25th and 75th IQR. Epi, samples collected from epipelagic (<200 m, n = 15); Meso, samples collected from mesopelagic (200–1000 m, n = 16); Bathy, samples collected from bathypelagic (>1000 m, n = 30); >0.8 μm, samples collected from the >0.8 μm fraction (n = 19); 0.2–0.8 μm, samples collected from the 0.2–0.8 μm fraction (n = 22); <0.2 μm, samples collected from the <0.2 μm fraction (n = 20). Source data are provided as a Source Data file.

By grouping eukaryotic proteins into taxonomic categories (zooplankton, algae, and fungi, Table S4), changes in eukaryotic protein profiles were observed throughout the water column (Figs. 4B, S6, Supplementary Data 6). Zooplankton-derived proteins exhibited a weak depth-related trend in the >0.8 μm size-fraction (Wilcoxon test, Epi- vs. Mesopelagic, p = 0.095; Epi- vs. Bathypelagic, p = 0.111, Fig. 4B, Supplementary Data 5). In the meso- and bathypelagic, however, the relative abundance of zooplankton proteins (ca. 30%) was three times higher than that of algal proteins (5–10%) in both the >0.8 μm and 0.2–0.8 μm size-fractions (Wilcoxon test, p < 0.05, Fig. 4B). This difference is in sharp contrast to the epipelagic, where the relative abundances of algal and zooplankton proteins were similar in the >0.8 μm size-fraction (Wilcoxon test, p > 0.05, Fig. 4B). In the >0.8 μm size-fraction in particular, the relative abundance of algal proteins significantly decreased from the epipelagic (median = 14.0%, IQR = 12.7–18.5%) to the bathypelagic layer (median = 6.3%, IQR = 4.0–7.0%) (Wilcoxon test, p < 0.05).

Proteins are an essential component of total biomass, and the changes in protein source reflect the variation in organic matter supply37. The attenuation of algal proteins with water column depth is consistent with the fact that sinking phytoplankton are insufficient to sustain the deep-sea microbiome4,9. It has been suggested that zooplankton-derived POM (fecal pellets, carcasses) and DOM7,43,44 become the primary carbon source in the deep sea, supporting microbial activity in the bathypelagic ocean4. This also implies that sequestration mechanisms like the gravity pump (fast-sinking zooplankton fecal pellets and carcasses packed into sinking particles)8,43 and the zooplankton migration pump (living zooplankton, 150–500 m)3,45 substantially contribute to the carbon flux into the meso- and bathypelagic ocean2,3,7. These different carbon pumps directly influence the quantity and composition of deep-sea POM, such as marine snow37.

Gammaproteobacteria are an essential source of bacterial protein

Alpha- and Gammaproteobacteria were the two major heterotrophic bacterial groups with gammaproteobacterial proteins dominating all three size-fractions, especially the <0.2 μm fraction, which is in contrast to the 16S rRNA profile (Figs. 4C, S7A, Supplementary Data 7). The ratio between Gammaproteobacteria and Alphaproteobacteria in the metaproteome in all size-fractions was significantly higher than the 16S rRNA-based ratio (Wilcoxon test, p < 0.05, Fig. S7B), suggesting that Gammaproteobacteria substantially contribute to protein production despite their relatively low abundance. Taxonomic analysis showed that while Alteromonadales and Oceanospirillales were the major contributors to gammaproteobacterial proteins in all fractions, the dominating groups in Alphaproteobacteria varied between fractions (Fig. S8A, B). This variability led to differences in the ratio of Gamma-/Alpha-proteobacterial proteins between fractions (Fig. S7B). For example, the cell abundance (16S rRNA gene) of the most abundant Alphaproteobacterium in the 0.2–0.8 μm fraction, Pelagibacterales (SAR11), was almost 30 times higher than Alteromonadales in the epipelagic (Fig. S9A). However, their protein abundance was lower than that of Alteromonadales throughout the water column (Fig. S9B). Cell-size measurements showed that deep-sea Alteromonas spp. had a larger (1.2 times) biovolume than Pelagibacterales (SAR11) (Wilcoxon test, p < 0.05, Fig. S9C, D). Thus, the smaller biovolume of SAR11 than Alteromonadales might have resulted in a low protein yield. This is consistent with the microfluidic mass sensor-based analysis that the dry mass of SAR11 is five- to twelve-fold lower than Prochlorococcus and almost two orders of magnitude lower than Vibrio (V. splendidus strain 13B01)46. As proteins account for about 50–60% of bacterial dry weight47, protein abundance can be used as a proxy for microbial biomass and add additional value in parameterizing ecological and biogeochemical models48.

Remarkably, gammaproteobacterial proteins dominated (ca. 80%) the <0.2 μm size-fraction (Figs. 4B, S8C, D). While proteins collected in the >0.8 μm and 0.2–0.8 μm size-fraction originated mainly from intact cells, proteins collected in the <0.2 μm fraction consisted of cell-free extracellular enzymes and proteins released from microorganisms. In our dataset, cell-free extracellular enzymes and protein debris in the <0.2 µm fraction were primarily of gammaproteobacterial origin (Fig. S8C, D). Signal peptides indicate protein/enzyme secretion into the environment49 and the yields of extracellular enzymes indicate their major role in POM solubilization and assimilation16,50. Ten to 15% of the proteins in the <0.2 μm fraction were associated with signal peptides and hence were actively secreted as cell-free extracellular enzymes (Fig. S10A). In contrast, cell-associated extracellular enzymes detected in the 0.2–0.8 μm and >0.8 μm size-fraction varied in their relative abundance (Fig. S10B, C), but the functional composition of the cell-free and cell-associated extracellular enzymes was similar (Fig. S10D–F). Hydrolytic enzymes accounted for only <20% of the extracellular enzyme pool (Fig. S10D–F). Oxidoreductases involved in the oxidative degradation of algal polysaccharides51, in producing reactive oxygen species and mediating metal bioavailability52 contributed >40% to the extracellular enzyme pool (Fig. S10D–F). Oxidoreductases dominated both the extracellular and cytoplasmic enzyme pool (Fig. S10G–I) but their composition differed (Fig. S11), with cell-free oxidoreductases in the <0.2 μm fraction mainly acting on the CH-OH group (EC 1.1). Such functional difference suggests distinct enzymatic activities in extracellular substrate processing and cellular metabolism. 
Notably, the cell-associated extracellular enzymes might contribute to the cell-free extracellular enzyme pool when the cell is lysed (either due to viral lysis or cell decay), leading to an overestimate of the cell-free extracellular enzyme pool. However, we found significant differences between cell-associated and cell-free extracellular enzymes, which makes it unlikely that previously cell-associated extracellular enzymes contributed substantially to the dissolved enzyme pool. For example, although oxidoreductases were the most abundant extracellular enzymes in all fractions (Fig. S10D–F), the enzyme class EC 1.17 was relatively abundant in the 0.2–0.8 µm fraction but hardly contributed to the <0.2 µm or >0.8 µm fraction (Fig. S11A). Similarly, the enzyme class EC 1.8 was found in both the 0.2–0.8 µm and >0.8 µm fractions but was barely detected in the <0.2 µm fraction (Fig. S11A). Such changes suggest marginal interference of cell-associated extracellular enzymes with the cell-free extracellular enzyme pool and show metabolic adaptations of microbes present in the different size fractions (0.2–0.8 µm vs. >0.8 µm).

As zooplankton proteins are a major particulate protein source in the deep ocean (Fig. 4B), the dominance of extracellular enzyme production suggests that Gammaproteobacteria utilize zooplankton-derived POM, which might support their cellular metabolism for growth, respiration, and/or the mitigation of oxidative stress caused by high hydrostatic pressures17. Leucine incorporation rates, used as a proxy for heterotrophic microbial activity, revealed that the metabolic activity of Alteromonas is, although reduced by the high hydrostatic pressure in the deep ocean, still comparable to pressure tolerant groups like SAR20217. Leucine incorporation rates of Alteromonas spp. contributed 25–50% of the mean leucine incorporation rate in the deep ocean (Fig. S9E, Supplementary Data 10), but the cell abundance of Alteromonas spp. was relatively low (~10³ cells ml⁻¹)18,53, which suggests disproportionately high cell loss. Consistently, we found that proteins released from cell decay (proteins without signal peptide) accounted for 80% of the dissolved protein pool in the <0.2 μm size-fraction (Fig. S10A), where Alteromonadales contributed 40–50% of dissolved gammaproteobacterial protein (Figs. 4C, S8D). This protein percentage indicates that lysed Alteromonadales cells might be a major source for the dissolved protein pool.

Viral lysis and zooplankton grazing are the major causes of cell death in microbes8,19,27. High viral lysis rates were observed in marine detrital particles as a significant fraction of deep-sea heterotrophic microbes preferentially associated with particles16,54. Genetic analysis suggests that particle-attached microbes have a higher growth efficiency than free-living microbes in the deep sea55,56. Thus, the lytic infection of active Gammaproteobacteria would efficiently convert cellular organic matter into DOM19,57. In our dataset, viral proteins constituted 1–5% of the total proteome (Fig. 4A), with Myo-, Podo-, and Siphoviridae comprising most of the viruses (Fig. S12A). About 3–18% of viral proteins could be linked to putative hosts (Supplementary Data 8). Linking viral proteins to the putative host (see Methods, Figs. 4D, S12B, C) revealed that the relative abundance of viruses infecting Gammaproteobacteria was highest in all fractions (viruses reproducing in the cell, Fig. 4D, Wilcoxon test, p < 0.05, Supplementary Data 9). Viral proteins detected in the >0.2 μm (>0.8 μm and 0.2–0.8 μm, Fig. 4C) fraction might be viruses actively infecting host cells42,58. Recent metagenomic examination of cell-associated viral communities in the deep ocean indicates that Alpha- and Gammaproteobacteria are the major hosts of deep-sea viruses59. A remarkable niche separation has also been reported in host preference of deep-sea viruses where Gammaproteobacteria are primarily hosts on particles59. However, no metagenomic data are available from the <0.2 µm fraction in the deep ocean, which might lead to underestimating deep-sea viruses. Viruses rely entirely on their host cells’ translation machinery to produce proteins such as capsid proteins, which are essential for viral replication. 
Previous results showed that the transcripts of host ribosomal proteins become upregulated within the first hour of viral infection and are probably released together with viral progenies during the lytic process58. In our dataset, the high relative abundance of ribosomal proteins in the <0.2 µm fraction (Fig. 3) provides a strong indication of viral intervention in the host machinery for viral propagation60,61, where viruses use bacterial ribosomal proteins to synthesize viral proteins and the ribosomal proteins are released into ambient water after cell lysis. The high relative abundance of viruses infecting Gammaproteobacteria in the >0.8 µm and 0.2–0.8 µm fractions suggests an active generation of progeny viruses in gammaproteobacterial cells, eventually resulting in cell lysis and release of gammaproteobacterial cellular proteins into the <0.2 μm size-fraction (Figs. 4C, S8D). Potentially, Gammaproteobacteria might be more “fragile” than other bacteria, and cells might break during filtration, which would have biased our conclusion. However, in our filtration process, we applied 1.5–2.0 bar air pressure, much lower than pressures necessary for cell disruption (French press, 1300–2700 bars). Hence, selective rupture of Gammaproteobacteria cells is highly unlikely. Also, Gammaproteobacteria cannot easily pass through the 0.2 μm filter pores as SAR11 is much smaller than Alteromonadales, which were well retained by our filtration setup. Contamination due to growth in the filtrate is also unlikely because the growth rate of Gammaproteobacteria in unamended deep-sea waters is very slow6, and we used SDS for growth inhibition in the <0.2 µm fraction.

To further examine the possible contribution of Gammaproteobacteria to the DOM pool, we analyzed the metagenomic profile from the <0.2 µm fraction collected by the Tara Oceans expedition27. We also found a high relative abundance of gammaproteobacterial DNA in the metagenomes of the <0.2 μm fraction in the epipelagic realm with a taxonomic profile similar to our metaproteome (Fig. S13A). Classification at the order level showed that Sphingomonadales and Rhodobacterales from Alphaproteobacteria (Fig. S13B) and Alteromonadales and Oceanospirillales from Gammaproteobacteria (Fig. S13C) were the major groups in the metagenomes of the <0.2 μm fraction, which is consistent with our metaproteome data (Fig. S8). This further indicates a high contribution of Gammaproteobacteria to the DOM pool, which contrasts starkly with their 16S rRNA based relative abundance (Fig. S7A). Thus, the cellular components released to the DOM pool, together with the high yields of extracellular enzymes, indicate a thus far overlooked role of Gammaproteobacteria in the deep ocean’s carbon cycle (Fig. 4A, B). However, this conclusion has to be taken with caution as no direct lytic activity was detected on Gammaproteobacteria.

A close virus-host interaction was also found for Cyanobacteria in the epipelagic layer (Fig. 4D). Cyanobacteria were most abundant in the >0.8 μm size-fraction (ca. 30%) in the epipelagic and decreased in relative abundance with depth (ca. 2% in the bathypelagic, Fig. 4C, Supplementary Data 6). Along with the cyanobacterial hosts, cyanophages were abundant in the epipelagic waters in the >0.8 µm and <0.2 µm fraction (Fig. 4D), indicating intensive viral infection of cyanobacteria. The relative abundance of the viral photosystem-II (psbA) was similar to the relative abundance of cyanobacterial psbA, especially in the >0.8 μm size-fraction (Figs. 4B, C, S12D) reflecting close virus-host interactions58. These results are consistent with those from metatranscriptomic analyses, where 50% of psbA transcripts originated from cyanophages, confirming the major role of viruses in regulating photosynthetic processes in the sunlit surface ocean62. Stable isotope labeling proteomics63,64 together with lysis rate measurements54 will provide an in-depth view on virus mediated carbon flux in the global ocean.

Bacteroidetes-derived proteins were also abundant in the epipelagic but decreased in abundance with depth (Wilcoxon test, p < 0.05) in both the >0.8 μm and 0.2–0.8 μm size-fractions (Fig. 4C), consistent with the results of 16S rRNA analysis (Fig. S7A). Recently, it has been shown that high hydrostatic pressure substantially inhibits the metabolic activity of Bacteroidetes and Alteromonas in the deep sea17; however, compared to Alteromonas, Bacteroidetes suffer higher oxidative stress under high pressure conditions17. Also, Bacteroidetes need excessive trimethylamine to maintain their morphology in the deep sea65. The different adaptive mechanisms of Alteromonas and Bacteroidetes might shape their distinct depth profiles throughout the water column.

Urea fuels nitrification-mediated dark inorganic carbon fixation

Thaumarchaea and Nitrospinae were the major chemolithoautotrophs in our metaproteome, with the highest relative abundances detected in the 0.2–0.8 μm size-fraction in the mesopelagic (Fig. 5A). The relative abundance of Nitrospinae in the metaproteome was higher than expected from the 16S rRNA analysis (Figs. 5A, S7C, D) because Nitrospinae cells are larger compared to other bacterial taxa15. In contrast, thaumarchaeal cells have a small biovolume, as revealed by microscopic analyses (only 60% of SAR11, Fig. S9C, D). Ammonia monooxygenase (AmoA) and nitrite oxidoreductase (NxrA) are the key enzymes used by Thaumarchaea and Nitrospinae, respectively, for energy harvesting to fuel dark DIC fixation. In our metaproteome, the relative abundance of NxrA was almost two orders of magnitude higher than that of AmoA (Fig. 5B), which compensates for the lower abundance of Nitrospinae compared to Thaumarchaea (Fig. S7C). Note, however, that tryptic digestion and ionization efficiency may differ between proteins/peptides from different taxonomic/functional groups. Still, the high expression of NxrA in our analysis supports previous reports that NxrA is one of the most abundant oxidoreductases in the mesopelagic20,66. Nitrospinae also exhibit one order of magnitude higher DIC fixation rates15 but lower (three- to four-fold) energy conversion efficiency67 than Thaumarchaea. Thus, a high cell-specific oxidation rate is suggested for Nitrospinae, and the high expression level of NxrA (Fig. 5B) found in our dataset suggests homeostasis of the nitrogen flux between Nitrospinae and Thaumarchaea67. However, the difference in the mortality rate between Thaumarchaea and Nitrospinae was also suggested as an alternative explanation for differences in cell abundance and energy thermodynamics in oxygen-depleted areas68.

Fig. 5: Nitrification process revealed in the metaproteome.

A Relative abundance of proteins of nitrifiers in each size-fraction. B Relative abundance of key enzymes involved in nitrification in the 0.2–0.8 μm size-fraction. C Relative abundance of urease in each size-fraction. D Taxonomic breakdown of urease in each size-fraction. The x-axis represents metaproteomic samples from different stations, and the depth layer is indicated in the horizontal color bar on top of the panels. The points and ranges show the medians, the 25th and 75th IQR. Epi, samples collected from epipelagic (<200 m, n = 15); Meso, samples collected from mesopelagic (200–1000 m, n = 16); Bathy, samples collected from bathypelagic (>1000 m, n = 30); >0.8 μm, samples collected from the >0.8 μm fraction (n = 19); 0.2–0.8 μm, samples collected from the 0.2–0.8 μm fraction (n = 22); <0.2 μm, samples collected from the <0.2 μm fraction (n = 20). Source data are provided as a Source Data file.

The first nitrification step is ammonia oxidation, providing energy for DIC fixation. Previous reports suggest that, due to insufficient ammonia supply, urea might serve as an alternative ammonium source for Thaumarchaea as they can use urease for urea cleavage14,20,53. In the 0.2–0.8 μm size-fraction almost 90% of the urease present in the mesopelagic (Fig. 5C, D) was expressed by Thaumarchaea on the metaproteome level, which reached their highest relative abundance in the mesopelagic layers (Figs. 4A, 5A, S7C). This high contribution of thaumarchaeal urease coincided with the high contribution of zooplankton-derived proteins to the total protein pool in the mesopelagic realm (Fig. 4B). In the open ocean, zooplankton can excrete large amounts of urea and its concentration is significantly correlated with zooplankton biomass10,69. Nitrifiers like Thaumarchaea can directly use the urea excreted by copepods12. Thus, the activity of zooplankton in the mesopelagic ocean might provide POM for heterotrophs and indirectly support dark DIC fixation of Thaumarchaea via the release of urea. Although the highest relative abundance of thaumarchaeal urease was found in the mesopelagic in the 0.2–0.8 µm fraction, we also detected thaumarchaeal urease in the deep-sea (Fig. 5D). Genes encoding thaumarchaeal urease were found in deep-sea metagenomes70 and Thaumarchaea contribute 10–20% to the total prokaryotic production in Atlantic deep waters53. The detection of thaumarchaeal urease in the bathypelagic metaproteome further supports the hypothesis that deep-sea Thaumarchaea also use urea as an ammonia source for DIC fixation53,70. Nitrospinae were also reported to use urease to cleave urea15,68, but Nitrospinae-related urease was not found in our metaproteome. We constructed a phylogenetic tree (Fig. S14) of UreC protein sequences identified in our metaproteome together with reference UreC sequences of Gammaproteobacteria (Alteromonadales, Oceanospirillales), Alphaproteobacteria (SAR11, Rhodobacterales, Rhodospirillales), AOA (Nitrosopumilus), Cyanobacteria (Synechococcus) and Nitrospinae68,71. Although the Nitrospinae-UreC sequences were placed close to the gammaproteobacterial UreC, none of the UreC sequences identified in our analysis fell into the Nitrospinae-UreC cluster. We further added the Nitrospinae-UreC sequences to our database to check whether any peptides had been overlooked in our analysis due to the low number of Nitrospinae-UreC sequences in the database, but still no Nitrospinae-UreC-related peptides were detected. Such results are similar to other metaproteomic studies15,20, indicating a low abundance of Nitrospinae-related urease in the open ocean. Future metaproteomic sampling in marine oxygen minimum zones might provide a stronger signal for Nitrospinae urease as the gene and the enzymatic activity are primarily reported in oxygen-depleted seawaters68,71.

In addition, niche partitioning was found in heterotrophic bacterial urea utilization, where urease in the >0.8 μm size-fraction was expressed by Alphaproteobacteria while gammaproteobacterial urease dominated the <0.2 μm size-fraction (Fig. 5D). As bacterial urease is located in the cytoplasmic space, the dominance of gammaproteobacterial urease in the <0.2 µm fraction further suggests that Gammaproteobacteria are actively utilizing urea when the cell is intact and release urease into ambient seawater after cell lysis. This supports our conclusion that active Gammaproteobacteria may be exposed to high viral lysis rates.

We also detected denitrification-related enzymes such as dissimilatory nitrate reductase (NapA/NarG) in the metaproteome (Fig. S15C, D). In general, the denitrification enzymes exhibited the highest relative abundance (0.3–0.5% of the total proteome) in the >0.8 µm fraction (Fig. S15C). This supports the hypothesis that detrital particles provide a niche for anaerobic microbial metabolism in the dark ocean72. Enzymes involved in aerobic respiration (CoxA/CyoA/CcoN/CydA, cytochrome oxidase) were drastically reduced in the >0.8 μm size-fraction in the mesopelagic (Fig. S15A, B) compared to the epipelagic waters. However, only NapA, NarG, and NirK were found in the >0.8 µm fraction. In contrast, nitrous oxide reductase (NosZ) and hydrazine synthase (HzsA), two typical enzymes involved in anaerobic denitrification, were found in the 0.2–0.8 µm and <0.2 µm fractions. The relative abundance of denitrification enzymes in the 0.2–0.8 µm and <0.2 µm fractions was 0.05% and 0.02% of the total proteome, respectively (Fig. S15D); NosZ and HzsA contributed only <0.5% to the total denitrification enzymes. Phylogenetic analysis further revealed that the NosZ protein sequence detected in our metaproteome showed high similarity (Fig. S15E) to known NosZ sequences. This suggests the capability of marine bacteria to convert N2O to N2, but the microenvironment might primarily determine the expression. Also, the NosZ in the dissolved fraction (<0.2 µm) might originate from cell lysis because NosZ is a cell-associated extracellular oxidoreductase, and it contributed marginally to the dissolved protein pool (1 × 10⁻⁴%). A similar analysis was also made for HzsA (Fig. S15F). The HzsA sequences in the metaproteome were similar to hydrazine synthase in Ca. Scalindua rubra [ODS33869.1].

Knowledge of the composition and functional variability of the ocean’s microbiome is crucial to understanding the biogeochemical processes in the different strata of the water column1. By characterizing and quantifying the protein abundance of the entire microbial consortium, our metaproteomic analyses provide semi-quantitative results on the taxonomic contribution to the marine protein pool. Our metaproteomic approach revealed that zooplankton detritus contributes about 30% to the eukaryotic protein pool in the meso- and bathypelagic ocean (Fig. 4B), and likely serves as a major POM source (zooplankton migration pump, fecal pellets and carcasses packed into sinking particles) dominating over phytoplankton flux in the meso- and bathypelagic ocean.

Besides providing POM, urea released from zooplankton might be one of the major ammonia sources supporting nitrification in the mesopelagic zone, as indicated by the relatively high abundance of thaumarchaeal urease at depth (Fig. 5). The expression of thaumarchaeal urease in the 0.2–0.8 µm fraction in the mesopelagic provides support for the tentative link between nitrification and zooplankton activity12,14. Hence, dark DIC fixation in the mesopelagic is likely supported by the zooplankton release of urea, ultimately resulting in additional organic carbon becoming available to the microbial community in the dark ocean. We also found significant differences between the relative abundance of nitrification enzymes (AmoA vs. NxrA). These results provide insight into the DIC fixation in the mesopelagic mediated by marine nitrifiers.

Gammaproteobacteria primarily utilize POM by secreting extracellular enzymes (Figs. 4C, S8C). Gammaproteobacteria also maintain their metabolic activity under high pressure conditions (Fig. S9C). Despite their relatively low abundance (~10³ cells ml−1)18,53, Gammaproteobacteria contribute 10–30% to the bacterial protein pool in the 0.2–0.8 µm fraction (Fig. 4C). However, they also experience significant cell losses via grazing pressure and/or viral lysis (Fig. S8D). The detection of viral proteins in the size-fractions >0.8 µm and 0.2–0.8 µm together with the high relative abundance of ribosomal proteins in the <0.2 µm fraction suggests dynamic viral-host interactions between gammaproteobacterial hosts and their corresponding viruses (Figs. 3, 4C, D). The viral lysis of active Gammaproteobacteria appears to be critical in converting cellular organic matter into DOM. Eventually, it contributes to the bioavailable DOM pool in the dark ocean, a process known as the viral shunt57,73.

Our characterization of the bulk microbiota at the protein level provides holistic and direct information on trophic interactions mediating the carbon flux in the deep ocean. Our data suggest that zooplankton-derived POM and inorganic nutrients substantially support the carbon cycle in the meso- and bathypelagic oceans. We also provided evidence on how the metabolic activities of heterotrophic and autotrophic microbes, together with viral lysis convert POM and inorganic nutrients into labile DOM.

Methods

Sampling and filtration

About 100–400 L of seawater were sequentially filtered through 0.8 μm and 0.2 μm pore-size polycarbonate membranes (Isopore, 142 mm diameter, Millipore) at 22 stations in the Pacific, Atlantic and Southern Ocean (Fig. S1A, Supplementary Data 1). Two large volume filtration holders (Sartorius) in conjunction with diaphragm pumps operated at a positive pressure not exceeding 1.5–2.0 bar were used. The filtration process generally finished within 1.5–2 h. The filtrates were amended with SDS to a final concentration of 0.1% (w/w) to avoid protein aggregation and inhibit bacterial growth. The 0.2 µm filtrate was further concentrated with tangential flow filtration driven by peristaltic pumps and using low protein binding membranes with a molecular weight cutoff of 5000 Da. To reach a final volume of ~50 ml of viral and dissolved proteins, a 0.5 m² ultrafiltration cassette (Pellicon 2 Ultracel membrane, Millipore) and a 200 cm² polyethersulfone cartridge with a hold-up volume <2 ml (Vivaflow 200 module, Sartorius) were combined. The filters and the concentrates were immediately frozen in liquid nitrogen and stored at −80 °C until extraction. Eukaryotic detritus like algal aggregates, zooplankton carcasses and fecal pellets, cyanobacteria, and particle-attached heterotrophic bacteria were mainly collected in the >0.8 μm fraction and free-living prokaryotes in the 0.2–0.8 μm fraction; viruses together with dissolved proteins/enzymes, either secreted or released, were obtained in the <0.2 μm fraction27,30,74,75.

Metagenomics and metaproteomics extraction and sequencing

From the 0.2–0.8 μm filters, slices of about one-eighth to one-quarter were used for DNA extractions, corresponding to ~20 L of sample76. Filters were lysed with a buffer containing 0.75 M sucrose, 50 mM Tris-HCl (pH 8), 40 mM EDTA (pH 8), lysozyme (1 µg ml−1), proteinase K (2.5 mg ml−1) and 5% SDS. The supernatant was cleaned with phenol:chloroform:IAA (25:24:1) and precipitated with 35% isopropanol. The total DNA was sequenced on an Illumina NextSeq 500 platform. Protein extraction from the remaining filter was performed in the lab using lysis buffer containing 7 M urea, 2 M thiourea, 1% DTT, 2% CHAPS, and protease inhibitor cocktail. The mixture was homogenized by bead-beating and sonicated at high power with pulses of 10 s over 10 min. The supernatant from the slurry and the concentrates were separately concentrated to 250 μL with a 3000 Da Amicon Ultra-15 Centrifugal Filter Unit (Millipore). The protein fraction was precipitated with cold ethanol overnight at −20 °C and resuspended in 7 M urea and 2 M thiourea. The protein pellet of each sample was digested using the filter-aided in-solution trypsin digestion method (1:100, w/w)77. The tryptic peptide pellets were dissolved in 4% (v/v) acetonitrile, 0.1% (v/v) formic acid. After desalting (C18), the peptides were sequenced on a Q-Exactive™ Hybrid Quadrupole-Orbitrap™ Mass Spectrometer (ThermoFisher Scientific). At least 2 replicates per sample were loaded on C18 reverse-phase columns (EASY-Spray 500 mm, 2 µm particle size, ThermoFisher Scientific). Separation was achieved with a 90 min gradient from 98% solution A (0.1% formic acid in high purity water) and 2% solution B (90% ACN and 0.1% formic acid) at 0 min to 40% solution B at 90 min with a flow rate of 300 nL min−1. Nano-electrospray ionization MS/MS measurements were performed with the following settings: full scan range 350–1800 m/z, resolution 120,000, max. 20 MS2 scans (activation type CID), repeat count 1, repeat duration 30 s, exclusion list size 500, exclusion duration 30 s, charge state screening enabled with the rejection of unassigned and +1 charge states, minimum signal threshold 500. The mass spectrometry proteomics data were deposited to the ProteomeXchange Consortium via the PRIDE78 partner repository with the dataset identifier PXD034421.

Acquisition of gene catalogs for the marine planktonic community

Marine eukaryotic and viral sequences were downloaded from previous publications27,30. A prokaryotic gene catalog was constructed from metagenomic reads downloaded from the National Center for Biotechnology Information (NCBI) website, together with an in-house database of sequencing results (Supplementary Data 2). Reads from the metagenomic dataset were assembled individually using Megahit (v1.1.2) with default settings79. Subsequently, putative genes were predicted on contigs longer than 200 bp using Prodigal (version 2.6.3) under metagenome mode (-p meta)80 and further clustered at 90% similarity (-c 0.9 -G 0 -aS 0.9) using CD-HIT (v4.6.8)81 to construct the prokaryotic gene database for downstream metaproteomic analysis. To retrieve the relative abundance of prokaryotic taxa in the metagenome, miTAG analysis82 was performed by extracting 16S rRNA genes using SortMeRNA83 for downstream analysis with LotuS84. Analysis of gene-based operational taxonomic units (mOTUs)85 was also performed for both the metagenomic and metatranscriptomic datasets to cover the microbial abundance and activity, respectively.

Proteomic annotation and analysis

The construction of a robust database is key to interpreting metaproteomic samples; thus, we used an optimized database construction strategy28. We combined the predicted genes from our in-situ metagenomic assembly with publicly available gene catalogs from metagenomic/metatranscriptomic assemblies of the global ocean Tara, Malaspina, and Bio-Geotraces expeditions21,24,25,30 to obtain a deep coverage of microorganisms (including micro-eukaryotes and viruses) throughout the entire water column. All these sequences were concatenated and de-replicated to construct a non-redundant database to avoid biases introduced by over-represented sequences. All mass spectrometry files from different size fractions were searched against the same database. Due to the large size of the gene catalogs, a two-step database searching method was used29. Briefly, in the first search, the MS/MS spectra of proteomic samples were pooled and searched using the SEQUEST-HT86 engine against proteins in the databases of eukaryotes, prokaryotes, and viruses with a loose false discovery rate (FDR) of 10%. Sequences identified in this step were exported to a refined database for the second search, where the proteomic samples were analyzed separately. In this step, the FDR was set to 1% for protein selection and the scoring function Xcorr threshold was set to 1 per charge (2 for +2 ions, 3 for +3 ions, etc.). The variable modifications were set to acetylation of the N-terminus and methionine oxidation, with a mass tolerance of 10 ppm for the parent ion and 0.8 Da for the fragment ion. We allowed 2 missed and non-specific cleavages and only dynamic modifications were used. Percolator in Proteome Discoverer 2.1 (ThermoFisher Scientific) was used for validation. A minimum of two peptides and one unique peptide were required for protein identification; no protein grouping was used in this analysis.
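The charge-dependent Xcorr cutoff and the two-peptide rule described above can be illustrated with a minimal sketch. It assumes each peptide-spectrum match is a plain dictionary with hypothetical keys (`protein`, `peptide`, `charge`, `xcorr`, `unique`), and it reads "two peptides and one unique peptide" as at least two distinct peptides of which at least one is unique. FDR control with Percolator is not reproduced here, and this is not the Proteome Discoverer implementation.

```python
from collections import defaultdict


def passes_xcorr(charge, xcorr, per_charge=1.0):
    """Charge-dependent cutoff: Xcorr >= 1 per charge (2 for +2, 3 for +3, ...)."""
    return xcorr >= per_charge * charge


def identified_proteins(psms, min_peptides=2, min_unique=1):
    """Return proteins supported by enough distinct peptides after the Xcorr cutoff.

    psms is an iterable of dicts with the hypothetical keys
    'protein', 'peptide', 'charge', 'xcorr' and 'unique' (bool).
    """
    peptides = defaultdict(set)  # all distinct peptides per protein
    unique = defaultdict(set)    # distinct unique peptides per protein
    for psm in psms:
        if not passes_xcorr(psm["charge"], psm["xcorr"]):
            continue
        peptides[psm["protein"]].add(psm["peptide"])
        if psm["unique"]:
            unique[psm["protein"]].add(psm["peptide"])
    return sorted(
        protein
        for protein, peps in peptides.items()
        if len(peps) >= min_peptides
        and len(unique.get(protein, set())) >= min_unique
    )
```

A protein with a single surviving peptide, or with two shared peptides and no unique one, is dropped under this reading of the criteria.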
Functional annotation of identified protein sequences was performed by searching against EggNOG87, KEGG36 and Pfam88, using emapper89. Taxonomic affiliation of sequences was determined using the lowest common ancestor algorithm (LCA, diamond blastp --top 10 –sallseqid -outfmt 102) adapted from DIAMOND (v2.0.9)90 blast by searching against the non-redundant (NR) database (downloaded from NCBI in March 2022). The top 10% hits with an e-value < 1 × 10⁻⁵ were used for taxon determination (--top 10). Metabolic annotation for proteins/enzymes involved in photosynthesis, nitrification, denitrification and respiration was done using DIAMOND (v2.0.9)90 (--max-target-seqs 1, --max-hsps 1) searching against metabolic marker protein databases (https://bridges.monash.edu/collections/_/5230745)91. Since these are highly conserved sequences, they require a high level of discrimination to differentiate them (e.g., NxrA and NarG share the same KO). SignalP (v5.0)49 was used to detect the presence of signal peptides for extracellular enzymes of bacterial origin. The gram-positive mode was used for sequences affiliated to Actinobacteria and Firmicutes. Sequences with hits on COG0804|COG0831|COG0832|COG0829|COG0378 in the EggNOG database and/or K01427|K01428|K01429|K01430|K14048 in the KEGG annotation were kept as urease and the taxonomic affiliation of urease was determined via LCA analysis as described above. Sequence alignment of UreC sequences was conducted using MAFFT (online server, version 7, with default settings)92. A phylogenetic tree of UreC sequences was constructed using FastTree (2.1)93 and visualized using Interactive Tree of Life (iTOL, v6)94. All UreC sequences for the phylogenetic analysis are available via FigShare (https://doi.org/10.6084/m9.figshare.24570104). Virus-host prediction was achieved by blasting viral proteins against predicted viral genes derived from previous reports on marine viral communities collected from the entire water column27,59.
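The lowest common ancestor idea used for taxonomic assignment can be sketched as follows. This is an illustration only, not DIAMOND's internal code: the hit records, their keys (`lineage`, `bitscore`, `evalue`) and the helper names are hypothetical, and "top 10%" is interpreted here as hits scoring within 10% of the best bit score.

```python
def select_hits(hits, max_evalue=1e-5, top_percent=10.0):
    """Keep hits below the e-value cutoff whose bit score lies within
    top_percent of the best remaining bit score (assumed reading of 'top 10%')."""
    passing = [h for h in hits if h["evalue"] < max_evalue]
    if not passing:
        return []
    best = max(h["bitscore"] for h in passing)
    cutoff = best * (1.0 - top_percent / 100.0)
    return [h for h in passing if h["bitscore"] >= cutoff]


def lowest_common_ancestor(lineages):
    """Return the longest shared prefix of root-to-leaf lineages."""
    if not lineages:
        return []
    shared = []
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared


def assign_taxon(hits):
    """Combine hit selection and LCA into one taxonomic assignment."""
    return lowest_common_ancestor([h["lineage"] for h in select_hits(hits)])
```

When the retained hits disagree at the order level, the assignment falls back to the class they share, which is why some proteins can only be attributed to a broad taxon.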
We assumed that viral proteins with similarity >90% share the same host. Protein quantification was conducted with the normalized area abundance factor (NAAF), a chromatographic label-free method based on peak area95. The NAAF is calculated as:

$${{NAAF}}_{i}=\frac{{x}_{i}/{L}_{i}}{{\sum }_{j}{x}_{j}/{L}_{j}}$$
(1)

where x_i represents the peak area of a peptide and L_i represents the length of the peptide. Only the peak areas of unique peptides (peptides not shared with other proteins or protein groups) and razor peptides (a peptide shared by multiple proteins is assigned to the protein with the highest number of unique peptides and, in case of a tie, the shortest protein length) were employed for the quantitation.
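As a worked illustration of Eq. (1), the following minimal sketch normalizes length-corrected peak areas so that they sum to one within a sample. The function name and the dictionary-based input are assumptions made for this example; the entries are whatever units the peak areas and lengths were reported for.

```python
def naaf(peak_areas, lengths):
    """Normalized area abundance factor following Eq. (1).

    peak_areas and lengths are dicts keyed by the same identifiers;
    each value x_i / L_i is divided by the sum over all entries.
    """
    ratios = {key: peak_areas[key] / lengths[key] for key in peak_areas}
    total = sum(ratios.values())
    if total == 0:
        raise ValueError("sum of length-normalized peak areas is zero")
    return {key: value / total for key, value in ratios.items()}


example = naaf({"A": 100.0, "B": 300.0}, {"A": 10, "B": 10})
```

Because every value is divided by the same sample-wide sum, NAAF values are relative abundances and are comparable across samples only in that relative sense.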

Measurements of prokaryotic cell size

Cell sizes of prokaryotes used here were analyzed in the study of Amano et al.17 by measuring the area of 4′,6-diamidino-2-phenylindole (DAPI)-stained cells that in addition were labeled with catalyzed reporter deposition fluorescence in situ hybridization (CARD-FISH) using group specific oligonucleotide probes. Briefly, the CARD-FISH samples were collected from ~450–4000 m at 7 stations in the Atlantic and Southern Ocean where 4 stations overlap with the sampling stations of the metaproteomic analysis (Supplementary Data 1). Target organisms filtered onto 0.2 µm-pore size filters were visualized with the 5’-horseradish-peroxidase-labeled oligonucleotide probes: Alt1413 probe for Alteromonas/Colwellia96, a mix of SAR11-152R, SAR11-441R, SAR11-542R and SAR11-732R probes for the SAR11 clade97 and a mix of Cren537 and GI-554 probes for Thaumarchaea98,99. Original images were taken on an epifluorescence microscope (Axio Imager M2, Carl Zeiss). Image analysis was conducted with the ACMEtool3 (Zeder, M. 2005-2021, Software for Biology, http://www.technobiology.ch) by segmenting the DAPI signals from the 8-bit grayscale images and sorting CARD-FISH positive signals. After manually checking the detection of the cells and their morphology, the cell volume (V) was calculated via the area size and perimeter of each DAPI signal by assuming the rod-model:

$$V=\pi {r}^{2}\cdot \left(l-2r\right)+\frac{4}{3}\pi {r}^{3}$$
(2)

where r is radius of a cell and l is the length of a cell. Although size estimates using DAPI can be smaller than the actual cell size as discussed previously100,101, we found that DAPI-stained cell volumes correspond to the amount of DNA in a cell and to the cell biomass102.
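The rod model of Eq. (2) treats a cell as a cylinder of length l − 2r capped by two hemispheres of radius r. A minimal sketch of that calculation is given below; the function name is hypothetical, and the step that derives r and l from the area and perimeter of each DAPI signal is not reproduced.

```python
import math


def rod_volume(length, radius):
    """Volume of a rod-shaped cell: cylinder plus two hemispherical caps.

    length is the total cell length l and radius is r, in the same unit;
    the result is in that unit cubed.
    """
    if radius <= 0 or length < 2 * radius:
        raise ValueError("require radius > 0 and length >= 2 * radius")
    cylinder = math.pi * radius ** 2 * (length - 2 * radius)
    caps = (4.0 / 3.0) * math.pi * radius ** 3
    return cylinder + caps
```

For a coccoid cell with l = 2r the cylinder term vanishes and the expression reduces to the volume of a sphere, which is a quick sanity check on the formula.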

Determination of cell-specific leucine incorporation rates

Cell-specific leucine incorporation rates were measured in a previous study17. Briefly, the size of the silver grain halo around each DAPI-positive cell was measured using Axio Vision SE64 Re4.9 (Carl Zeiss) following microautoradiography performed on CARD-FISH processed samples (MICRO-CARD-FISH). The halo areas were converted to leucine incorporation rates in mol day−1 with an equation obtained from correlating the total area of the halos with the bulk leucine incorporation rates103.

Statistical analysis and visualization

We used a machine-learning random forest41 classification to identify feature KOs that distinguish the different size-fractions using the randomForest104 package in R. Random forest classification was carried out with a relative abundance table of KOs from 61 samples. Only KOs with a relative abundance >=1% in at least one sample were kept as prediction features (231 out of 3817 KOs). The relative abundance table of KOs was randomly split into “training data” (containing 70% of the samples, 41 out of 61) and “testing data” (containing 30% of the samples, 20 out of 61). A random forest model was built using the “training data” by classifying the relative abundance of KOs against size-fractions. KOs were ranked according to feature importance in a forest of 1000 decision trees (ntree = 1000). The number of decision trees (ntree) was determined by the Out-of-Bag (OOB) error for different values of ntree (ntree = c(100, 200, 500, 1000, 1500)). A minimal OOB error was found when ntree = 1000. The number of marker KOs was identified using a 10-fold cross-validation with the rfcv() function. The cross-validation error stabilized when using at least the top 14 most important KOs (Fig. S5). The random forest model was further applied to the “testing data” to examine prediction accuracy with the predict() and confusionMatrix() functions. The accuracy, sensitivity, and specificity of the model can be found in the Supplementary Data 11. The classification rule for each decision tree is provided in Supplementary Data 12. Other statistics and visualization were also performed using packages in R. Specifically, vegan105, ggplot2106, circlize107, pheatmap108 were used for ordination, diversity calculation, and visualization, respectively.
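The feature filtering and the train/test partition that precede model fitting can be sketched as below. This is a plain-Python illustration under stated assumptions, not the R workflow itself: the table layout (sample → KO → relative abundance in %), the function names and the fixed seed are hypothetical, and the random forest fitting with randomForest, rfcv() and confusionMatrix() is not reproduced.

```python
import random


def filter_features(table, min_abundance=1.0):
    """Keep KOs whose relative abundance (%) reaches min_abundance
    in at least one sample. table maps sample -> {KO: abundance}."""
    return sorted(
        {ko for row in table.values() for ko, value in row.items()
         if value >= min_abundance}
    )


def split_samples(sample_ids, n_train, seed=0):
    """Randomly partition sample identifiers into training and testing sets."""
    ids = sorted(sample_ids)
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    rng.shuffle(ids)
    return ids[:n_train], ids[n_train:]


# 61 samples split into 41 training and 20 testing samples, as in the text
train_ids, test_ids = split_samples([f"sample_{i}" for i in range(61)], n_train=41)
```

Keeping the testing samples out of both feature ranking and cross-validation is what allows the reported accuracy to reflect prediction on unseen samples.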

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.