Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

# Towards omics-based predictions of planktonic functional composition from environmental data

## Abstract

Marine microbes play a crucial role in climate regulation, biogeochemical cycles, and trophic networks. Unprecedented amounts of data on planktonic communities were recently collected, sparking a need for innovative data-driven methodologies to quantify and predict their ecosystemic functions. We reanalyze 885 marine metagenome-assembled genomes through a network-based approach and detect 233,756 protein functional clusters, from which 15% are functionally unannotated. We investigate all clusters’ distributions across the global ocean through machine learning, identifying biogeographical provinces as the best predictors of protein functional clusters’ abundance. The abundances of 14,585 clusters are predictable from the environmental context, including 1347 functionally unannotated clusters. We analyze the biogeography of these 14,585 clusters, identifying the Mediterranean Sea as an outlier in terms of protein functional clusters composition. Applicable to any set of sequences, our approach constitutes a step towards quantitative predictions of functional composition from the environmental context.

## Introduction

Planktonic organisms play an essential role in biogeochemical cycles through the capture and export of carbon into the deep ocean, nitrogen fixation, remineralization of organic matter, or the production of dimethyl-sulfur, hence impacting global climate1,2,3,4,5. The understanding and modeling of such biogeochemical functions is key for predicting the global functioning of oceanic ecosystems, and especially their response to climate change6,7,8. These biogeochemical functions are usually modeled by simulating the dynamics of plankton functional types (PFT) that are theoretical entities grouping planktonic organisms according to shared functional capacities (e.g., calcifiers, nitrogen fixers, or silicifiers)6. This approach allows to incorporate the functional diversity of marine plankton into biogeochemical models8,9,10,11 but often relies on a priori and restricted choices of the considered types of planktonic organisms and of their physiological rates or parameters12. For example, bacteria are often lacking an explicit representation in global PFT models9,11, even though more than 1030 bacterial cells inhabit the ocean’s subsurface3. To tackle this issue, recent works proposed to switch towards data-driven modeling of planktonic communities and their impact on the environment, notably through the use of high-throughput sequencing data10,13,14,15,16.

Next-generation sequencing technologies have led to significant advances in the knowledge of the taxonomic and functional diversity of planktonic organisms5,17,18. Bioinformatics workflows allow the assembly of metagenome-assembled genomes (MAGs), which are near-complete genomes retrieved from DNA fragments coming from environmentally sequenced individuals of one or a few closely related populations19,20,21,22. MAGs can be taxonomically annotated using multi-marker gene approaches, and organism-level functional profiles can be drawn from their genomic content19,20,21. Reads from environmental meta-omics datasets can also be mapped to their reconstructed sequences to obtain abundance measurements both at MAG and single protein level21,22,23. MAGs can be considered as representative of the genetic potential of natural populations, hence allowing to retrieval of genomes of cultivable, uncultivable, or even unknown species present in the environment. They constitute a promising tool for investigating as a whole the functional potential of known and unknown planktonic life forms.

Recently, a genomics-based model revealed that the gene content of planktonic communities is more relatable to biogeochemical gradients than taxonomic content10. In another study, omics data were used to quantitatively estimate global nitrogen fixers abundance through machine learning algorithms24. It illustrates how quantitative, data-driven biogeochemical models can be built from global omics datasets. However, these studies focused only on a relatively small number of well-described genes (e.g., nif or amtB genes, involved in dinitrogen and ammonium fixation, respectively)10,24, far from exploiting the rich functional diversity observed in omic samples. In this way, the large proportion of unknown sequences detected in environmental meta-omics datasets, that is to say, the open reading frames (ORFs) which can not be linked to any biological functions (usually around 40% for bacteria and archaea, and about 50% for eukaryotes), is as yet untapped4,5,23,25,26,27. Besides, many meta-omics studies have either focused on semi-quantitative diversity and interactions surveys at global scales25,28, on specific taxonomic groups (e.g., Collodaria29), or on particular biological functions (such as nitrogen fixation or mixotrophy21,24,30). A recent study has grouped protein sequences of marine planktonic bacteria and archaea according to their annotated metabolic pathways to investigate their differential abundance and expression, mainly focusing on pre-selected biogeochemical functions such as photosynthesis or nitrogen fixation23. By investigating the response of biogeochemistry-related protein groups to environmental conditions, significant differences in terms of presence and expression were identified between polar and non-polar areas, and between mesopelagic and surface depths23. These results highlight the potential of function-clustering-based approaches for deciphering global ocean biogeochemistry but could be further extended by skipping any sequence pre-selection step requiring database-dependent metabolic pathways annotations.

In this study, we followed a similar approach while avoiding any a priori choices of particular genes or metabolic pathways. We used 51 quantitative and qualitative environmental variables to detect both known and unknown protein clusters that are sensitive to environmental gradients. We re-analyzed 885 high-quality MAGs from marine planktonic Bacteria (n = 820) and Archaea (n = 65), assembled by Delmont et al.21. using 93 Tara Oceans picoplanktonic metagenomes from the surface of the global ocean. With these almost 2 million sequences, we built functional clusters of proteins using a sequence similarity network (SSN), i.e., a graph in which nodes are protein sequences, and edges represent the similarity and coverage between each pair of sequences31,32,33,34,35. Such approaches allow for the construction of sequence clusters putatively homogenous in function31 and were recently used to investigate the genomic basis of functional diversity in bacteria and archaea36, in a lineage of eukaryotes33, or in natural microbial communities35. Particularly, we are here interested in knowing if the abundance of some protein clusters could be predicted from environmental data in the oceanic ecosystem. For example, is the distribution of biogeochemistry-related protein clusters more sensitive to environmental gradients than one of the other clusters? We thus explored the biogeography of environment-related protein clusters in light of their potential functional and/or taxonomic annotation, in order to identify the ones being specific to certain environmental conditions, such as oligotrophic or particularly cold waters.

We introduce here a data-driven, large-scale, fast, and automatable approach, potentially applicable to any set of environmental sequences, which involves (1) the network-based construction of sequence clusters, putatively homogeneous in function, (2) the functional annotation of these clusters, (3) the calculation of environmental abundance values for each of these protein clusters through environmental reads re-mapping, and (4) the description of statistical relationships between cluster abundances and environmental gradients through machine learning and constrained ordination methods. We then present a biogeographical analysis of known and unknown bacterial and archaeal protein functional clusters (PFCs) identified as sensitive to environmental gradients in the global ocean, with no a priori choice of specific functions or taxa. Particularly, we investigate the biogeography of 14,585 PFCs from which the abundance is predictable from the environmental context. We identify biogeographical provinces as the best predictors of PFCs distribution, and the Mediterranean Sea as an outlier in terms of PFCs composition. Our results demonstrate the potential of omics-based predictions of planktonic communities functional composition based on environmental data.

## Results

### From SSN to PFCs

We analyzed the 1,914,171 proteins from 885 MAGs from marine plankton, recovered from 12 geographically bound assemblies of metagenomic sets corresponding to a total of 93 Tara Oceans samples from the 0.2 to 3 µm and 0.2 to 1.6 µm size fractions21. A flowchart of our bioinformatic pipeline is available in Supplementary Fig. 1. 39.6% of the MAGs’ proteins (757,457) were involved in our SSN, i.e., they had at least one similarity relationship with another protein that satisfied the chosen threshold of 80% similarity and 80% coverage (see “Methods”). In total, 51.1% of the network proteins could be annotated to 4922 unique molecular function IDs in the KEGG database37, associated with 327 distinct metabolic pathways (a full list of these pathways is available in Supplementary Data 1). In total, 85.2% of the network proteins were annotated to 17,009 eggNOG functional descriptions38,39.

The SSN involved 233,756 connected components (CCs), i.e., groups of nodes (here proteins) connected together by at least one path and disconnected from the rest of the network. According to KEGG and eggNOG databases, 15.3% and 48.5% of the CCs remained without any functional annotation (i.e., all sequences from the CC were unmatched in the databases, or had a match but were not yet linked to any biological function, Table 1), and 14.8% were functionally unannotated for both databases. We ranked the functional homogeneity of CCs involving at least one functional annotation from 0 (all annotations in the CC are different) to 1 (all annotations in the CC are the same) and found mean homogeneity scores of 0.99 over 1 for KEGG annotations and 0.94 over 1 for eggNOG ones (see “Methods” for score calculation details). Only 88 (0.04%) CCs had a homogeneity score below 0.5 in both annotation databases, all with sizes below five proteins. 177 CCs (0.07%) had a score below 0.8 in both databases, all under 12 proteins in size. These CCs were kept in the analysis while tagged as poorly homogenous. We thereafter considered each CC as a PFC, numbered from #1 to #233,756.

To check for the influence of taxonomic relationships between the MAGs on our PFCs, we computed different metrics based on MAGs taxonomic annotations provided by Delmont et al.21. (Table 1). This taxonomic annotation based on 43 single-copy core genes allowed to annotate 100% of the MAGs at the domain level, and 95% of the MAGs at the phylum level, the remaining 5% corresponding to Bacteria MAGs of unidentified phyla21. Only 1330 PFCs (0.6%) mixed proteins from the Archaea and Bacteria domains. PFCs were very homogeneous at the phylum level, then the homogeneity decreased at lower taxonomic rank, meaning that PFCs studied here were generally not specific from a single class, order, family, genus, or MAG (Table 1). In total, 7834 PFCs (3.4%) were only composed of proteins with no functional annotation in KEGG and eggNOG databases, and no taxonomic annotation under the phylum level. Their sizes ranged from 2 to 30 proteins (mean of 2.62). Their 20,552 proteins came from Euryarchaeota MAGs (12,458; 60.6%), Bacteria MAGs of unidentified phylum (2742; 13.3%), Candidatus Marinimicrobia MAGs (2451; 11.9%), Proteobacteria MAGs (1528; 7.4%), Acidobacteria MAGs (1031; 5%), Verrucomicrobia MAGs (103; 0.5%), Planctomycetes MAGs (89; 0.4%), Bacteroidetes MAGs (79; 0.4%), Chloroflexi MAGs (59; 0.3%) and Candidate Phyla Radiation MAGs (12; 0.05%). We hereafter considered these functionally and taxonomically unknown PFCs as “dark” PFCs40,41. Their nucleotidic sequences are available in separate supplementary files (see “Data availability”). The abundance of dark PFCs was significantly different from the abundance of other PFCs in 85 samples over 93 (two-sided Wilcoxon rank-sum test, p-value < 0.05). The median abundance of dark PFCs was higher than the one of other PFCs in 36 of these 85 samples, and lower in the 49 others. Further details on dark PFCs’ abundances are available in section I of Supplementary notes.

### Identification of PFCs highly related to environmental gradients

To identify the PFCs that responded the most to environmental gradients, we first selected the 228,914 clusters with non-zero variance abundance profiles (i.e., at least 10% of distinct abundance values across all samples, and less than a 95 to 5 ratio between the most and the second most observed abundance value, please see “Methods” for more details), to avoid the creation of constant or near-constant training and/or test sets. We then built random forest regression models for each of these 228,914 clusters. We used the sequence abundances as response variables or labels, and 52 environmental variables as explanatory variables (see “Methods” for details of model training and tuning). More than half of the random forest regression models showed a clear statistical signal: 130,651 models (55.9%) had R2 values over 0.25, corresponding to PFCs linked to environmental conditions, and 14,585 (6.4%) had R2 values over 0.5 (Fig. 1A), corresponding to PFCs highly linked to environmental gradients (hlePFCs), from which the abundances were potentially predictable from the environmental context (Fig. 1B). The mean R2 value over all models was 0.29, with a maximum of 0.88 (Fig. 1A). Longhurst biogeographical provinces42 were detected as the most important predictor in 98,450 models (43.0%) and were in the top three most important predictors in 167,039 models (73.0%) (Fig. 1C). Among models with biogeographical provinces as the best predictor, the mean R2 reached 0.30. The temperature was in the top three most important predictors in 12,673 models (5.5%). Models with temperature as the best predictor had a mean R2 of 0.43, which was the fourth-highest value of all quantitative variables, behind sunshine duration (0.49), ammonium at 5 m depth (0.46), and annual density (0.43).

We focused on the 14,585 PFCs associated with models showing R2 values over 0.5, hereafter called “hlePFCs” for hlePFCs PFCs. 246 KEGG pathways were associated with the 14,585 hlePFCs, i.e., 75% of the pathways identified on the full network were detected in hlePFCs. Supplementary Data 1 gives a detailed list of all pathways detected in our network, along with their total number of occurrences in PFCs, and their number of occurrences in hlePFCs.

The functional homogeneity of the 14,585 hlePFCs was similar to the one of the total 233,756 PFCs (Tables 1, 2). In parallel, 9.2% of the hlePFCs were functionally unannotated (e.g., only composed of unannotated proteins in the KEGG and the eggNOG functional databases), while 14.8% of the PFCs were functionally unannotated.

Proportions of taxonomically homogeneous PFCs were similar between hlePFCs (Table 2) and total PFCs (Table 1), all above 90% at the phylum, class, order, and family level when considering only PFCs including at least one protein taxonomically annotated. At the genus level, the proportion of taxonomically homogeneous PFCs decreased from 91.9% in total PFCs to 66.6% in hlePFCs (Tables 1 and 2). Hence, hlePFCs tended to be shared by more genera compared to all PFCs, while retaining a high functional homogeneity. The proportion of taxonomically unannotated hlePFCs was lower than the one of total PFCs at the phylum and class levels but was higher at the order, family, and genus levels (Tables 1 and 2).

The mean R2 values of models associated with the 7834 dark PFCs (i.e., PFCs without any functional annotation and without taxonomic annotation below the phylum level) was of 0.27, and 166 (i.e., 2.1%) of them were selected among the 14,585 hlePFCs. These 166 dark hlePFCs corresponded to 357 proteins belonging to 51 unique MAGs, annotated as Proteobacteria (8 MAGs, 186 proteins), Euryarchaeota (19 MAGs, 40 proteins), Candidatus Marinimicrobia (13 MAGs, 97 proteins), Bacteroidetes (2 MAGs, 2 proteins), Chloroflexi (1 MAG, 5 proteins), Verrucomicrobia (1 MAG, 2 proteins) and Bacteria unannotated at Phylum level (7 MAGs, 25 proteins). Forty-six of these 51 MAGs were estimated to be over 70% complete, and the remaining five showed completion estimates between 50 and 70%21.

### Global biogeography of the PFCs hlePFCs

The canonical correspondence analysis (CCA) achieved on the 14,585 hlePFCs to investigate their biogeographical repartition had an R2 value of 68.2% and was significant (p-value < 0.001). The first axis (15.38% of explained variance) opposed warm samples from the Indian Ocean (CCA1 > 0) to saline samples from the Mediterranean (CCA1 < 0) (Fig. 2). The second axis (13.17%) opposed cold and nutrient-rich samples from the Southern Ocean (CCA2 > 0) to warmer and more oligotrophic samples. Samples from geographically close biogeographical provinces appeared close to each other in the CCA space, with samples from the Southern Ocean and the Atlantic zones on the bottom, the Pacific Ocean in the middle, and the Indian Ocean on the right of the CCA1-CCA2 space (Fig. 2). The closest sample from the Mediterranean ones in the CCA space was from the surface of the closest Atlantic station to the strait of Gibraltar (station TARA_004) at the entrance of the Mediterranean Sea (sample highlighted by a black arrow in Fig. 2).

Combining the CCA results with the functional annotation of hlePFCs, we observed that the vast majority of metabolic pathways were not enriched in particular environmental conditions (Fig. 3). Among the few exceptions, Atrazine degradation was lightly enriched in the Mediterranean Sea (CCA10), while RNA polymerase and AMPK signaling pathways were lightly enriched in nutrient-rich, cold waters (Fig. 3). Pathways related to biogeochemical functions (e.g., carbon fixation pathways in bacteria/archaea, Nitrogen metabolism, or methane metabolism) or linked to ecological interactions between organisms (e.g., biosynthesis of antibiotics, quorum sensing, or ABC transporters) were present homogeneously in the CCA space (Supplementary Fig. 3). Functionally unannotated hlePFCs were more abundant around −1 and above 1 along CCA1 (Fig. 2), corresponding to hlePFCs associated with Mediterranean samples and Indian Ocean samples, respectively. The 166 dark hlePFCs were associated with Mediterranean samples, and almost absent from polar samples (Fig. 5B, Supplementary Fig. 3).

We then examined the position of hlePFCs associated with different levels of taxonomic annotations in the CCA space. hlePFCs containing sequences from the phylum Candidatus Marinimicrobia were found mostly in Mediterranean samples and were almost absent from polar waters (Fig. 4A). Overall, the various phyla did not show any strong associations to a particular niche in the CCA space (cf. the central positions and large standard deviation bars for the various phyla on Fig. 4A). Similarly, most of the classes’ barycenters on Fig. 4B showed important standard deviation bars, with the exception of the cyanobacterial class of Prochlorales which seemed restrained to warm and oligotrophic waters (Fig. 4B). No particular order, family, or genus was found to be strongly associated with Mediterranean samples, and the genus Polaribacter was the only taxa strongly associated with polar samples (barycenter position on CCA2 = −1.46).

In total, 3345 hlePFCs were highly overabundant in the Mediterranean Sea (CCA1), corresponding to 8736 proteins from 193 different MAGs of 10 classes. 85 of these MAGs originated from the same assembly performed by Delmont et al.21. on Mediterranean samples, and accounted for 5537 proteins (63.4% of the 8736 present in the zone, Supplementary Fig. 4). A similar yet less marked pattern was observed along CCA2, with 189 of the 697 (26%) proteins of hlePFCs correlated to cold and rich waters (CCA2 < −2) coming from MAGs of the Southern Ocean assembly (Supplementary Fig. 4).

Our analysis allowed us to identify environmental variables driving the abundance of functionally unannotated hlePFCs. For example, PFC #90,382 was composed of eight unannotated proteins coming from four different MAGs (3 Flavobacteriales, 1 Gammaproteobacteria), and had a strong response to high temperature (Fig. 5A), as well as other environmental variables (R2 value of 0.501 for the associated random forest model). Conversely, PFCs #102,286 (two proteins coming from the same Saprospiraceae, R2 value of 0.68), #210,456 (two proteins from two distinct Flavobacteriaceae, R2 value of 0.83), and #233,673 (two proteins from two distinct Flavobacteriales, R2 value of 0.63) were highly linked to cold temperature (Fig. 5A). PFCs #172,397, #172,465 and #26,732 were dark hlePFCs overabundant in Mediterranean samples (Fig. 5B).

Positions of all functionally and/or taxonomically unannotated hlePFCs in the CCA space, the most important drivers of their abundance according to random forest models, the nucleotidic sequences of their proteins, and their MAGs of origins are all publicly accessible (link in “Data availability”).

### Robustness of the observed biogeographical patterns

In total, 6.2% of the 233,756 PFCs that were created in this study were used in our biogeographical analysis. In section II of Supplementary notes, we provide a similar biogeographical analysis, this time including the 130,651 PFCs showing R2 values above 0.25, i.e., 55.9% of the total 233,756 PFCs. A CCA based on these 130,651 PFCs allowed to identify two samples from the surface and deep chlorophyll maximum of station 93 as strong outliers due to their singular position along CCA1 (Supplementary Fig. 5A). The outlying nature of these two samples was explained by the strong overabundance of PFCs composed of proteins from 5 MAGs of the genus Pseudoalteromonas (Supplementary Figs. 6 and 7). The general patterns observed using the 14,585 hlePFCs (Fig. 2) were confirmed by the third and fourth dimensions of the CCA on 130,651 PFCs (Supplementary Figure 5B).

## Discussion

The functional composition of bacterial/archaeal planktonic communities is driven by interactions between multiple environmental factors rather than by single variables

Building statistical models including 52 environmental variables to test for their effect on PFCs abundance, we were able to quantify the impact of each variable both globally and in each individual random forest model. Our results hence give access to the most influential predictors of 228,914 PFC’s abundances (see “Data availability”), while pushing towards a consideration of other variables in addition to temperature and oxygen when studying bacterial/archaeal communities functional composition. Indeed, water temperature is commonly presented as the most influential determinant of the taxonomic and functional composition of bacterial/archaeal communities5,23,43. Here, we found temperature to be one of the best quantitative predictors of PFC abundance. Still, it was determined as less important than other quantitative variables like iron, carbonate, or oxygen concentrations, but also salinity or bathymetry. However, when the temperature was the most important variable in a model, it increased the accuracy of abundance predictions, and it had a strong influence on the biogeography of hlePFCs, showing how key this variable is on at least some ecosystemic functions. Among all environmental predictors, Longhurst biogeographical provinces42 were by far the most important variable in our random forest models and were well distinguished on the CCA triplot (Fig. 2). Longhurst provinces represent homogeneous areas both in terms of physico-chemical and ecological conditions42, and the fact that temperature is one of the main variables used to define such provinces could explain why it was not among the most important variables in many statistical models, as the two variables could bring redundant information in regression trees. Here we thus suggest that functional composition is more impacted by interactions between multiple variables, than by one or a few variables like it has been previously suggested5,23. This result adds to a similar observation made by a recent global biogeographical analysis of planktonic communities, finding Longhurst biogeographical provinces to match the distribution of viruses, bacteria, and eukaryotes smaller than 20 μm44. Sampling depth had among the lowest impacts on our regression models, confirming the weak differences in functional composition between surface and deep chlorophyll maximum samples of picoplankton23.

### Identifying PFCs and metabolic pathways associated with particular environmental conditions

The identification of biogeographical provinces as best predictors of PFC’s abundance can be interpreted as a consequence of the Baas Becking hypothesis “everything is everywhere, but the environment selects”45,46, which implies that all microbes are potentially ubiquitous, but dominant taxa depend on the environmental niche. A precedent study observed this pattern at the protein level, by comparing protein families sampled in different ecosystems such as seawater, sludge water, or soils, and showing that the ecosystem type had more impact on protein families composition than geographical distance46. Another study based on meta-omics data found the environmentally-mediated selection to have a strong impact on the biogeography of the cosmopolitan SAR11 order Pelagibacterales43. Here, by identifying Longhurst biogeographical provinces to be the best predictors of PFCs abundance, we verify that environmental niches are the most determinant drivers of marine bacterial/archaeal communities functional composition. However, the fact that 44.1% of our PFCs showed poor responses to environmental conditions challenges the extent of applicability of the Baas Becking hypothesis at the protein level, at least within a single ecosystem. It could be explained by the high decoupling observed between functional diversity and taxonomic diversity among marine bacterial/archaeal communities17. Indeed, functional redundancy among bacteria/archaea can lead to stable functional diversity even with high taxonomic variability47. In our analysis, we chose to focus on changes in communities functions because it could lead to more stable abundance measures than when relying on taxonomic entities, and provide thus more valuable information on the ecosystem functioning and associated biogeochemical functions17,23,47.

Still, 55.9% (130,651 over 233,756) of the PFCs were linked to environmental gradients, and 6.2% (14,585 over 233,756) showed strong responses to particular environmental conditions. Among these 14,585 PFCs identified as hlePFCs, we observed a clear distinction between the ones associated with polar nutrient-rich waters and those abundant in tropical nutrient-poor ones, which is coherent with classical observations in marine ecology30,48. Metabolic pathways like RNA polymerase or Ribosome could be associated with eutrophic conditions and colder waters (Fig. 3), potentially reflecting the higher growth potential and metabolic activity of micro-organisms in the eutrophic conditions of the polar summer49. No particular metabolic pathway could be associated with warm and oligotrophic waters, but proteins from a class of Cyanobacteria were overrepresented in hlePFCs abundant in these waters, which may reflect that cyanobacteria are particularly abundant in subtropical waters50. Biogeochemistry-related pathways such as methane metabolism, carbon fixation in photosynthetic organisms, carbon fixation pathways in bacteria/archaea, or nitrogen metabolism were correlated to a wide range of physico-chemical conditions (Supplementary Fig. 3). Hence, such key biogeochemical functions seem ubiquitously present in the global ocean but can be achieved by different actors and protein families depending on the environmental conditions.

More surprisingly, we identified Mediterranean samples as clear outliers, with an important part of hlePFCs showing higher abundances in Mediterranean samples than elsewhere. These samples could not be characterized by the relative over-abundance of particular metabolic functions, and only displayed a light over-abundance of proteins from MAGs of the Candidatus Marinimicrobia phylum, which is composed of poorly known and yet uncultivable bacteria of potentially high biogeochemical impact51. Our strongest hypothesis to explain this pattern lies in the fact that the Mediterranean Sea is a semi-enclosed sea that experienced multiple isolations and colonization events52. For some pelagic species, the strait of Gibraltar constitutes a phylogeographic barrier causing genetic contrasts between Atlantic and Mediterranean populations52,53. Here, we identified most Atlantic samples (especially the South Atlantic ones) to be closer to Pacific samples than to Mediterranean ones in terms of hlePFCs composition, with the exception of one which came from the mouth of the strait of Gibraltar. The Mediterranean Sea was the biogeographical zone exhibiting the strongest over-abundance of locally assembled proteins, while even hlePFCs associated with Southern Ocean samples contained many proteins from Atlantic-assembled MAGs. This way, hlePFCs overabundant in Mediterranean samples shared only very few links with proteins from MAGs of other assemblies in our SSN, highlighting their functional and taxonomical originality. We then propose that the strait of Gibraltar and the Suez canal could shape the genetic and functional structure of some planktonic bacterial/archaeal populations, as it is observed in some eukaryotic species52,53.

Finally, our analysis of PFCs showing R2 values above 0.25 (Section II of supplementary notes) highlighted station 93 as a strong outlier in PFC composition. This station is the closest to the Chilean coast in our dataset and corresponds to an upwelling zone, which makes it an original station in terms of environmental conditions. It was characterized by the over-abundance of PFCs associated with the Pseudoalteromonas genus. Pseudoalteromonas is a cosmopolitan genus of Gammaproteobacteria that is known for producing biologically active molecules notably including toxins, anti-bacterial and anti-fungal agents, sometimes used as weapons against other organisms and which can be pharmaceutically relevant54,55. Pseudoalteromonas bacteria are known for being able to survive in extreme environments56 and are not commonly associated with upwelling areas, although they are known to interact with macroalgae in such nutrient-rich coastal environments57. Here, we give the public access to a set of 1928 PFCs associated with the Pseudoalteromonas genus that was particularly abundant in a strong upwelling area, among which 127 were unannotated in both KEGG and eggNOG databases (see “Data availability”). These clusters could be related to ecological interactions between Pseudoalteromonas bacteria and algae, but the originality of station 93 in our dataset as a coastal and upwelling area prevents us from drawing any strong conclusions on their ecological role here. In section II of the Supplementary notes, we discuss why these PFCs were not found among hlePFCs, and highlight how the inclusion of more samples from similar coastal and/or upwelling areas would help to strengthen our results concerning these Pseudoalteromonas related PFCs.

### Mining the unknown to identify potential key organisms and proteins

The main originality of our approach is its ability to take into account both annotated and unannotated sequences. It enables the identification of PFCs composed of functionally unannotated sequences, of taxonomically unannotated sequences, and of both, leading here to the inclusion of at least 15% more proteins than methods excluding functionally unannotated sequences. By including 7834 PFCs corresponding to 20,552 protein sequences that could not be annotated under the phylum level nor to a biological function, we propose an original way to highlight the response of dark omics abundance to environmental gradients. While a previous study estimated that the inclusion of dark omics sequences could increase by up to 58% the amount of analyzed sequences40, we provide here a pragmatic bioinformatic pipeline that helps to extend our knowledge in environmental microbiology.

It is often proposed that most of the unidentified microbial diversity could come from rare organisms, described as part of the “rare biosphere”, and which are considered as diversity reservoirs able to respond rapidly to environmental changes58,59. Our results partly corroborate this theory (see section I of Supplementary notes), but we also found the 7834 microbial dark PFCs to be relatively overabundant in 41% of our samples. This can be explained by the fact that 72.5% of the proteins from our dark PFCs came from Candidatus Marinimicrobia and Euryarchaeota MAGs, two yet poorly studied and uncultivable phyla identified as highly abundant in the global ocean, and potentially impacting biogeochemistry51,60,61.

Our analysis allowed us to describe the biogeography of 1347 functionally unannotated hlePFCs, which might participate in metabolic pathways involved in functional responses to peculiar environmental conditions. They included 166 dark hlePFCs, mainly related to Proteobacteria and Candidatus Marinimicrobia MAGs, the latter being associated with Mediterranean samples. In addition, more than half (52.9%) of the models associated with dark PFCs showed R2 values above 0.25, and the mean R2 values over the 7834 models associated with dark PFCs were similar to the one over all models. We then show that the response of dark PFCs to environmental gradients is comparable to the one of taxonomically and functionally annotated PFCs.

Our method being applicable to any set of sequences, we predict that an accumulation of similar results on multiple datasets will help identify recurrent unannotated protein clusters linked to specific environmental niches62. It could further help to target wet-lab studies towards the description of unknown proteins particularly adapted to specific conditions, like subtropical nutrient-poor waters or oxygen minimum zones51. However, functionally unannotated hlePFCs sometimes contained proteins from only one MAG, and in this case their response to environmental gradients could be a reflection of the global abundance of this MAG instead of a real functional level response. We then advise future wet-lab investigations to mainly select PFCs involving proteins from different MAGs. To pave the way for such further analyzes, we have provided all nucleotide sequences for each microbial dark matter PFC, as well as the statistics associated with their response to environmental gradients (see “Data availability”).

### Towards more global quantitative studies of meta-omics at the function level

Statistical models in this study were based on the abundances of each PFC in 93 metagenomic samples. For each random forest model, 75% of the samples (i.e., 70 samples) were used as a training set. Even though each model was run 10 times on 10 distinct training sets, it remains a relatively low amount of samples to do abundance predictions and extrapolations at the global ocean scale (as a way of comparison, 181 samples allowed to predict diatoms abundance from environmental data in a Chinese river63). Hence, machine learning models were not used to provide extrapolated predictions in this study, but to detect PFCs hlePFCs and the main drivers of their biogeography. However, as more and more omics datasets are collected in the global ocean23,26,64,65, we assume that similar approaches could be conducted with much more samples in the near future, which should increase models’ performances. By using less than 100 samples, we were nonetheless able to obtain 14,585 models with R2 values over 0.5. It highlights the potential of such quantitative approaches for predicting the abundance of key protein families in the global ocean. Moreover, our dataset was only composed of metagenomics samples, when it is hypothesized that a big part of bacterial/archaeal communities response to environmental change comes from variations in gene expression66. This assumption was recently disputed23, but applying our method to metatranscriptomes in the future would allow using the environmental context to predict protein expressions instead of metagenome sequence abundances, which could help to improve the accuracy of models predictions.

In the future, biogeochemical modeling should benefit from our ability to quantify and predict biological functions using environmental and omics data10,13,14,24,67. Through our quantitative and data-driven analysis, we have shown one illustration of how metagenomics data can be used without a priori choices of a taxon or metabolic function. We (1) identified qualitative variables such as ocean regions and Longhurst provinces to be more informative than single quantitative variables to predict the functional composition of marine bacterial and archaeal communities, (2) identified temperature not to be a better predictor of PFC abundance in comparison to other correlated quantitative variables such as oxygen or salinity, (3) identified Mediterranean samples as outliers in terms of PFC composition, which to our knowledge had not been observed before using meta-omics data, (4) investigated dark PFCs response to environmental fluctuations, whereas the majority of studies ignore them41. Finally, we provide public access to PFCs along with their taxonomic and functional annotations, but also information on their relationships with particular environmental contexts, the predictability of their abundances, and their most important environmental drivers. We have then paved the way for more quantitative analysis taking advantage of the richness of global omics datasets, both at the functional and taxonomic level, which should in the long term increase our ability to better predict future global climate.

## Methods

### Samples collection and MAGs

We focused our study on the 885 non-eukaryotic MAGs made publicly available21. The whole bioinformatic workflow designed to build these MAGs, as well as all the links leading to the fasta files and Anvi’o68 profiles for each MAG can be found at http://merenlab.org/data/tara-oceans-mags/. These MAGs were built from 93 Tara Oceans metagenomes retrieved from 61 surface samples and 32 deep chlorophyll maximum samples collected worldwide in the global ocean, using a size filter targeting free-living microorganisms (0.2–3 μm). Original metagenomes are available under the European Bioinformatics Institute (EBI) repository with project ID ERP001736. To date, the work achieved by Delmont et al.21. constitutes the only database of manually refined MAGs constructed using the Tara Oceans project data. Automated binning efforts provided larger numbers of MAGs and focused on multiple sizes fractions20, but are subject to higher binning errors, causing sometimes obvious contigs misplacement (as discussed here: https://bjtully.github.io/posts/2018/10/re-visiting-tmed-mags/). Further information on the MAGs’ genomic features, such as their completion or GC content, can be found in Supplementary Table 5 of Delmont et al.21.

### Gene detection and quantification

Prodigal v2.6.369 was run to retrieve the nucleotide and protein sequences of each detected gene for each of the 885 MAGs. By concatenation, one nucleotide and one protein fasta files were created, containing each in total 1,914,171 sequences. The nucleotide sequences were then used for the mapping and quantification step (hereafter developed) whereas the protein sequences were used for building the SSN (cf. next paragraph).

### Building an SSN from 885 bacterial/archaeal MAGs

An SSN is a graph object in which vertices correspond to sequences and edges represent the similarity and coverage between pairs of sequences31,32,33,34,35. Diamond v0.8.2271 was used in blastp mode to compute the percentage of similarity between every pair of proteins detected in the MAGs, using options “−e 1e−3 −p 30–sensitive”. An SSN was built with the diamond output using 80% identity and 80% coverage threshold. This coverage threshold is commonly used in SSN studies33,35,72 and we also tested 4 other similarity thresholds: 70%, 75%, 85%, and 90%. We selected the intermediary 80% identity threshold to minimize the amount of singletons while maximizing the functional homogeneity between linked proteins.

### Extracting, annotating, and quantifying PFCs in the SSN

An SSN is made of singletons (vertice or sequence without any homology with other sequences) and CCs (subgraphs composed of at least two vertices disconnected from the rest of the network). In our case, a CC corresponds to a group of at least two protein sequences that are linked together (directly or via neighbors), and that have no link with other groups of sequences in the SSN. We assume that the proteins contained in a CC potentially share a similar molecular function31,32,33,72. The term “protein family” is often used to describe such clusters of homologous proteins, but as this term is usually used to deal with evolutionary relationships, we here prefer the use of PFC.

Our SSN was composed of 233,756 PFCs, including 757,457 proteins (i.e., 1,156,714 singletons were excluded from the analysis). These proteins were functionally annotated using eggNOG mapper v4.5.138,39 and KofamScan v1.2.037. EggNOG emapper was run using the diamond mode and the–no_annot flag. It produced a table containing seed orthologous sequences for 677,684 of our proteins (89.5%), the rest of them not being similar enough from any sequence in the eggNOG database. The annotation phase was then launched on these 677,684 proteins, using the seed orthologous sequences table as input to the emapper function, and the–annotate_hits_table flag. We obtain an annotation table with GO IDs, KEGG IDs, and eggNOG descriptions. KoFamScan was launched with default options and -mapper flag. The KEGG API was then used to retrieve KEGG pathways ID and descriptions for each KEGG ID identified by KoFamScan in our protein catalog. To assess for the functional homogeneity in our PFCs, we computed a homogeneity score Fhom:

$${N}_{{annot}} \,> \, 1\Rightarrow {F}_{{\hom }}=1-\frac{{N}_{{annot}}}{{N}_{{prot}}}$$
(1)
$${N}_{{annot}}=1\Rightarrow {F}_{{\hom }}=1$$
(2)

With Nannot the number of unique annotation terms found in the PFC (either KEGG IDs or eggNOG terms), and Nprot the number of proteins in the PFC.

As multiple eggNOG terms can exist for similar functions (e.g., “UBA-ThiF-type NAD FAD-binding protein” and “UBA-THIF-type NAD FAD-binding”), they can lead to artifactually low homogeneity scores. For this reason, PFCs with low homogeneity scores obtained with the EggNOG database were tagged as poorly homogeneous but were kept in the analysis.

Statistics on functionally unannotated PFCs presented in Tables 1 and 2 include both (1) query sequences that did not match to any reference in public databases, and (2) query sequences that match to one or multiple references in public databases but could not yet be associated to any biological function.

To assess taxonomic diversity in our PFCs, we used the taxonomic annotation of the 885 MAGs provided by Delmont et al.21. This taxonomic annotation was inferred from 43 single-copy core genes through the combined use of CheckM73, RAST74, and manual BLAST searches (see21 for further details).

We computed a mean abundance for each PFC in each of the 93 metagenomes, using relative protein abundances (see Gene detection and quantification). We obtained an abundance table composed of 233,756 rows, corresponding to PFCs, and 93 columns, corresponding to the 93 Tara Oceans metagenomes used in the study.

### Environmental dataset

For each of the 93 Tara Oceans metagenomes, we retrieved the environmental context from Faure et al. (https://figshare.com/articles/Data_MixoBioGeo_Faure_et_al_2018/6715754)30. To complete this environmental dataset, we added 10 climatology variables retrieved from the World Ocean Atlas:75 temperature, salinity, density, conductivity, dissolved oxygen, percent oxygen saturation, apparent oxygen utilization, silicate, phosphate, and nitrate. Finally, we added the upper limit of the size fraction as an environmental variable, as it varied across samples. For temperature, salinity, and conductivity we retrieved the mean and the mean seasonal anomaly at each sampling point (precision of 1°) over the 2005–2012 period. Only the mean was retrieved for density. For the 6 other variables, we retrieved the mean and the mean seasonal anomaly at each sampling point (precision of 1°) over all available years. In total, we obtained 74 environmental variables, which we reduced to 51 by getting rid of near-zero variance variables and too highly correlated ones, using options “nzv” and “corr” from the preProcess function of the caret package v6.076 in R v3.5.377. A detailed description of these variables is available in Supplementary Data 2. We then scaled and centered the 51 selected environmental variables, and used a k-nearest neighbors approach to replace NA values (6.6% of the data) by the mean of the concerned variable in the five nearest samples in terms of global environmental profile (knnImpute option from caret’s preProcess function76).

### Identification of PFCs varying along environmental gradients

Among the 233,756 PFCs, we detected 4842 (2,1%) clusters with near-zero variance using caret preProcess function76 with default parameters, i.e., they had less than 10% of abundance values across all samples that were distinct, and a ratio between the most common abundance value and the second most common one that was higher than 95–5. These clusters were removed from further statistical analysis. We built a random forest regression model for each of the remaining 228,914 PFCs, using the environmental variables as predictors of cluster relative abundance. To suppress eventual biases linked to over/underfitting due to training set selection, each model was launched 10 times using 10 different training sets built using 75% of the 93 samples available. For each iteration, i.e., for each pair of the training set and PFC, a random forest regression model was trained using a fivefold cross-validation process, and the number of randomly tested predictors at each split was optimized between a minimum of 5 and a maximum of 9 (the default value being of 7, the floored square root of the number of environmental variables), while the number of trees was fixed to 500. The model minimizing the root mean square error (RMSE) was selected for each iteration and used to compute the mean prediction error over the 10 iterations, as well as the mean R2, and the mean rank of importance in the model for each environmental predictor. For each iteration, the selected model was also used to produce predictions over the test set, and the mean R2 of these predictions was computed over the 10 iterations. This way, we have for each PFC an R2 based on the out-of-bag samples of the cross-validation process, and an R2 based on predictions over the test set (Fig. 1B). The mean R2 obtained from the out-of-bag samples was used to discriminate PFCs following significant environmental gradients from the ones showing no response to the environmental context. Specifically, we considered every PFCs associated with a model with a mean R2 value over the arbitrary threshold of 0.5 to be hlePFCs, and over 0.25 to be linked to environmental gradients. Different thresholds ranging from 0.25 to 0.75 were tried for the definition of hlePFCs, thresholds higher than 0.5 tended to select too few PFCs, and mainly the ones overabundant in the Mediterranean Sea, while too low thresholds tended to diminish the R2 value and readability of the CCA (cf next section). All random forest models were launched using the rf function of the randomForest v4.6 R package78 through the train function of the Caret package76.

We used a CCA to describe in a more integrated way the relationships between PFCs and environmental variables. The CCA used the relative abundance table of all PFCs linked to environmental gradients (mean R2 of random forest regressions > 0.5) as response variables, and 17 selected environmental variables as explanatory variables: biogeographical province, ocean region, season moment (i.e., early/middle/late), temperature, depth, depth of the euphotic zone, conductivity seasonal anomaly, sea surface temperature gradient, moon phase proportion, depth of the O2 minimum, calcite saturation state, fluorescence, NO3 at 5 m, chlorophyll a, total alkalinity, salinity, iron at 5 m. The 17 environmental variables were selected through a backward and forward stepwise selection based on the AIC criterion79.

Using positions of PFCs in the two first dimensions of the CCA space (29.09% of variance), we computed a barycenter position for each metabolic pathway detected among hlePFCs (Fig. 3). Similarly, we computed barycenters for phyla, classes, and genomic assemblies in the CCA space (Fig. 4, Supplementary Fig. 3). Finally, convex hulls englobing all PFCs associated with a pathway were drawn for a selection of pathways corresponding to (1) pathways linked to inter-organisms interactions, (2) pathways associated to a priori selected biogeochemical functions, and (3) pathways composed of only unknown sequences (Supplementary Fig. 3).

The exact same methods were applied to compute the biogeographical analysis of the PFCs associated with the models with R2 values above 0.25 (see Section II of Supplementary materials).

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

## Data availability

Instructions on how to build or download the MAGs and metagenomes used in this study are available at http://merenlab.org/data/tara-oceans-mags/. Tools and databases used for functional annotations are available at http://eggnog-mapper.embl.de/ and https://www.genome.jp/tools/kofamkoala/. All other data used in this study are available at 10.6084/m9.figshare.12030795, including fasta files containing nucleotide sequences of all proteins in PFCs, hlePFCs, dark PFCs, PFCs associated with Station 93, and PFCs associated with Station 93 linked with Pseudoalteromonas MAGs. In this figshare repository, we also provide summary tables including all PFC and random forest associated statistics (e.g., all homogeneity and unknown scores, R2 values, variables importances) for each PFCs, hlePFCs, and dark PFCs. Finally, we offer tables at the single protein level showing the PFC ID, taxonomic and functional annotations, and nucleotide sequences of each protein in PFCs, hlePFCs, and dark PFCs.

## Code availability

All bash, perl, and R codes necessary to reproduce our analysis are available at https://github.com/EmileFaure/MAGsProteinFunctionalClusters80.

## References

1. 1.

Falkowski, P. G., Barber, R. T. & Smetacek, V. Biogeochemical controls and feedbacks on ocean primary production. Science 281, 200–206 (1998).

2. 2.

Guidi, L. et al. Plankton networks driving carbon export in the oligotrophic ocean. Nature 532, 465–470 (2016).

3. 3.

Whitman, W. B., Coleman, D. C. & Wiebe, W. J. Prokaryotes: the unseen majority. Proc. Natl Acad. Sci. 95, 6578–6583 (1998).

4. 4.

Ferrera, I., Sebastian, M., Acinas, S. G. & Gasol, J. M. Prokaryotic functional gene diversity in the sunlit ocean: stumbling in the dark. Curr. Opin. Microbiol 25, 33–39 (2015).

5. 5.

Sunagawa, S. et al. Structure and function of the global ocean microbiome. Science 348, 1261359–1261359 (2015).

6. 6.

Le Quéré, C. et al. Ecosystem dynamics based on plankton functional types for global ocean biogeochemistry models. Glob. Change Biol. 11, 2016–2040 (2005).

7. 7.

Litchman, E. et al. Global biogeochemical impacts of phytoplankton: a trait-based perspective. J. Ecol. 103, 1384–1396 (2015).

8. 8.

Follows, M. J., Dutkiewicz, S., Grant, S. & Chisholm, S. W. Emergent biogeography of microbial communities in a model ocean. Science 315, 1843–1846 (2007).

9. 9.

Aumont, O., Ethé, C., Tagliabue, A., Bopp, L. & Gehlen, M. PISCES-v2: an ocean biogeochemical model for carbon and ecosystem studies. Geosci. Model Dev. 8, 2465–2513 (2015).

10. 10.

Coles, V. J. et al. Ocean biogeochemistry modeled with emergent trait-based genomics. Science 358, 1149–1154 (2017).

11. 11.

Leles, S. G., Valentin, J. L. & Figueiredo, G. M. Evaluation of the complexity and performance of marine planktonic trophic models. Acad. Bras. Ciênc 88, 1971–1991 (2016).

12. 12.

Flynn, K. J. S. T. et al. Acclimation, adaptation, traits and trade-offs in plankton functional type models: reconciling terminology for biology and modelling. J. Plankton Res. 37, 683–691 (2015).

13. 13.

Mock, T. et al. Bridging the gap between omics and earth system science to better understand how environmental change impacts marine microbes. Glob. Change Biol. 22, 61–75 (2016).

14. 14.

Stec, K. F. et al. Modelling plankton ecosystems in the meta-omics era. Are we ready? Mar. Genomics 32, 1–17 (2017).

15. 15.

Louca, S. et al. Integrating biogeochemistry with multiomic sequence information in a model oxygen minimum zone. Proc. Natl Acad. Sci. USA 113, E5925–E5933 (2016).

16. 16.

Martini S, et al. Functional trait-based approaches as a common framework for aquatic ecologists. Limnol Oceanogr.

17. 17.

Louca, S., Wegener Parfrey, L. & Doebeli, M. Decoupling function and taxonomy in the global ocean microbiome. Science 353, 1272–1277 (2016).

18. 18.

Delmont TO, et al. Functional repertoire convergence of distantly related eukaryotic plankton lineages revealed by genome-resolved metagenomics. bioRxiv 2020; 2020.10.15.341214.

19. 19.

Parks, D. H. et al. Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nat. Microbiol 2, 1533–1542 (2017).

20. 20.

Tully, B. J., Graham, E. D. & Heidelberg, J. F. The reconstruction of 2,631 draft metagenome-assembled genomes from the global oceans. Sci. Data 5, 170203 (2018).

21. 21.

Delmont, T. O. et al. Nitrogen-fixing populations of Planctomycetes and Proteobacteria are abundant in surface ocean metagenomes. Nat. Microbiol 3, 804–813 (2018).

22. 22.

Nielsen, H. B. et al. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nat. Biotechnol. 32, 822–828 (2014).

23. 23.

Salazar, G. et al. Gene expression changes and community turnover differentially shape the global ocean metatranscriptome. Cell 179, 1068–1083.e21 (2019).

24. 24.

Tang, W. & Cassar, N. Data-driven modeling of the distribution of diazotrophs in the global ocean. Geophys Res Lett. 46, 12258–12269 (2019).

25. 25.

de Vargas, C. et al. Eukaryotic plankton diversity in the sunlit ocean. Science 348, 1261605–1261605 (2015).

26. 26.

Acinas SG, et al. Metabolic architecture of the deep ocean microbiome. bioRxiv 2019; 635680.

27. 27.

Carradec, Q. et al. A global ocean atlas of eukaryotic genes. Nat. Commun. 9, 1–13 (2018).

28. 28.

Lima-Mendez, G. et al. Determinants of community structure in the global plankton interactome. Science 348, 1262073 (2015).

29. 29.

Biard, T. et al. Biogeography and diversity of Collodaria (Radiolaria) in the global ocean. ISME J. 11, 1331–1344 (2017).

30. 30.

Faure E, et al. Mixotrophic protists display contrasted biogeographies in the global ocean. ISME J 2019.

31. 31.

Atkinson, H. J., Morris, J. H., Ferrin, T. E. & Babbitt, P. C. Using sequence similarity networks for visualization of relationships across diverse protein superfamilies. PLoS ONE 4, e4345 (2009).

32. 32.

Forster, D. et al. Testing ecological theories with sequence similarity networks: marine ciliates exhibit similar geographic dispersal patterns as multicellular organisms. BMC Biol. 13, 16 (2015).

33. 33.

Meng, A. et al. Analysis of the genomic basis of functional diversity in dinoflagellates using a transcriptome-based sequence similarity network. Mol. Ecol. 27, 2365–2380 (2018).

34. 34.

Bittner, L. et al. Some considerations for analyzing biodiversity using integrative metagenomics and gene networks. Biol. Direct. 5, 47 (2010).

35. 35.

Lopez P, Halary S, Bapteste E Highly divergent ancient gene families in metagenomic samples are compatible with additional divisions of life. Biol Direct. 2015; 10.

36. 36.

Cheng S, et al. Sequence similarity network reveals the imprints of major diversification events in the evolution of microbial life. Front Ecol Evol. 2014; 2.

37. 37.

Aramaki T, et al. KofamKOALA: KEGG ortholog assignment based on profile HMM and adaptive score threshold. bioRxiv. 2019; 602110.

38. 38.

Huerta-Cepas, J. et al. Fast genome-wide functional annotation through orthology assignment by eggNOG-mapper. Mol. Biol. Evol. 34, 2115–2122 (2017).

39. 39.

Huerta-Cepas, J. et al. eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res. 44, D286–D293 (2016).

40. 40.

Bernard, G., Pathmanathan, J. S., Lannes, R., Lopez, P. & Bapteste, E. Microbial dark matter investigations: how microbial studies transform biological knowledge and empirically sketch a logic of scientific discovery. Genome Biol. Evol. 10, 707–715 (2018).

41. 41.

Vanni C, et al. Light into the darkness: unifying the known and unknown coding sequence space in microbiome analyses. bioRxiv. 2020; 2020.06.30.180448.

42. 42.

Longhurst AR. Ecological Geography of the Sea. Academic Press; 1998.

43. 43.

Delmont TO, et al. Single-amino acid variants reveal evolutionary processes that shape the biogeography of a global SAR11 subclade. eLife; 8.

44. 44.

Richter DJ, et al. Genomic evidence for global ocean plankton biogeography shaped by large-scale current systems. bioRxiv 2019; 867739.

45. 45.

Wit, R. D. & Bouvier, T. ‘Everything is everywhere, but, the environment selects’; what did Baas Becking and Beijerinck really say? Environ. Microbiol 8, 755–758 (2006).

46. 46.

Fondi, M. et al. “Every gene is everywhere but the environment selects”: global geolocalization of gene sharing in environmental samples through network analysis. Genome Biol. Evol. 8, 1388–1400 (2016).

47. 47.

Louca, S. et al. High taxonomic variability despite stable functional structure across microbial communities. Nat. Ecol. Evol. 1, 1–12 (2016).

48. 48.

Ibarbalz, F. M. et al. Global trends in marine Plankton diversity across kingdoms of life. Cell 179, 1084–1097.e21 (2019).

49. 49.

Alonso‐Sáez, L., Sánchez, O., Gasol, J. M., Balagué, V. & Pedrós‐Alio, C. Winter-to-summer changes in the composition and single-cell activity of near-surface Arctic prokaryotes. Environ. Microbiol. 10, 2444–2454 (2008).

50. 50.

Flombaum, P. et al. Present and future global distributions of the marine Cyanobacteria Prochlorococcus and Synechococcus. Proc. Natl Acad. Sci. USA 110, 9824–9829 (2013).

51. 51.

Hawley, A. K. et al. Diverse Marinimicrobia bacteria may mediate coupled biogeochemical cycles along eco-thermodynamic gradients. Nat. Commun. 8, 1–10 (2017).

52. 52.

Patarnello, T., Volckaert, Fa. M. J. & Castilho, R. Pillars of Hercules: is the Atlantic–Mediterranean transition a phylogeographical break? Mol. Ecol. 16, 4426–4444 (2007).

53. 53.

Lowe, C. D., Martin, L. E., Montagnes, D. J. S. & Watts, P. C. A legacy of contrasting spatial genetic structure on either side of the Atlantic-Mediterranean transition zone in a marine protist. Proc. Natl Acad. Sci. USA 109, 20998–21003 (2012).

54. 54.

Bowman, J. P. Bioactive compound synthetic capacity and ecological significance of marine bacterial genus pseudoalteromonas. Mar. Drugs 5, 220–241 (2007).

55. 55.

Holmström, C. & Kjelleberg, S. Marine Pseudoalteromonas species are associated with higher organisms and produce biologically active extracellular agents. FEMS Microbiol Ecol. 30, 285–293 (1999).

56. 56.

Parrilli E, et al. The art of adapting to extreme environments: the model system Pseudoalteromonas. Phys Life Rev 2019.

57. 57.

Hettle, A. G. et al. Insights into the κ/ι-carrageenan metabolism pathway of some marine Pseudoalteromonas species. Commun. Biol. 2, 1–14 (2019).

58. 58.

Logares, R. et al. Patterns of rare and abundant marine microbial eukaryotes. Curr. Biol. 24, 813–821 (2014).

59. 59.

Lynch, M. D. J. & Neufeld, J. D. Ecology and exploration of the rare biosphere. Nat. Rev. Microbiol 13, 217–229 (2015).

60. 60.

Santoro, A. E., Richter, R. A. & Dupont, C. L. Planktonic marine. Archaea. Annu Rev. Mar. Sci. 11, 131–158 (2019).

61. 61.

Francis, C. A., Roberts, K. J., Beman, J. M., Santoro, A. E. & Oakley, B. B. Ubiquity and diversity of ammonia-oxidizing archaea in water columns and sediments of the ocean. Proc. Natl Acad. Sci. USA 102, 14683–14688 (2005).

62. 62.

Wyman, S. K., Avila-Herrera, A., Nayfach, S. & Pollard, K. S. A most wanted list of conserved microbial protein families with no known domains. PLoS ONE 13, e0205749 (2018).

63. 63.

Shin Y, et al. The prediction of diatom abundance by comparison of various machine learning methods. Math. Probl. Eng. 2019, 5749746 (2019).

64. 64.

Vorobev A, et al. Transcriptome reconstruction and functional analysis of eukaryotic marine plankton communities via high-throughput metagenomics and metatranscriptomics. bioRxiv 2019; 812974.

65. 65.

Planes, S. et al. The Tara Pacific expedition—a pan-ecosystemic approach of the “-omics” complexity of coral reef holobionts across the Pacific Ocean. PLoS Biol. 17, e3000483 (2019).

66. 66.

Moran, M. A. et al. Sizing up metatranscriptomics. ISME J. 7, 237–243 (2013).

67. 67.

Reed, D. C., Algar, C. K., Huber, J. A. & Dick, G. J. Gene-centric approach to integrating environmental genomics and biogeochemical models. Proc. Natl Acad. Sci. USA 111, 1879–1884 (2014).

68. 68.

Eren, A. M. et al. Anvi’o: an advanced analysis and visualization platform for omics data. PeerJ 3, e1319 (2015).

69. 69.

Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinform. 11, 119 (2010).

70. 70.

Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods 14, 417–419 (2017).

71. 71.

Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59–60 (2015).

72. 72.

Rizzolo, K. et al. A widely distributed diheme enzyme from Burkholderia that displays an atypically stable bis -Fe(IV) state. Nat. Commun. 10, 1–10 (2019).

73. 73.

Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015).

74. 74.

Aziz, R. K. et al. The RAST server: rapid annotations using subsystems technology. BMC Genom. 9, 75 (2008).

75. 75.

Boyer T, et al. The World Ocean Atlas 2018: improvements and uses of climatological mean fields. AGU Fall Meet Abstr 2018; 13.

76. 76.

Kuhn M. Building predictive models in R using the caret Package. J Stat Softw 2008; 28.

77. 77.

R Core Team. R: A Language and Environment for Statistical Computing. 2019. R Foundation for Statistical Computing, Vienna, Austria; 1998.

78. 78.

Liaw A, Wiener M. Classification and Regression by randomForest. 2002; 2: 5.

79. 79.

Legendre P, Legendre LFJ. Numerical Ecology. Elsevier Science; 1998.

80. 80.

Faure E, Ayata SD, Bittner L. Towards omics-based predictions of planktonic functional composition from environmental data. MAGsProteinFunctionalClusters, https://doi.org/10.5281/zenodo.4742385. 2021.

## Acknowledgements

We would like to particularly thank Loïs Maignien, Olivier Aumont, Eric Pelletier, Christian Tamburini, Sébastien Monchy, Thomas Mock, and Ingrid Obernosterer for the insightful discussions concerning this study. We also thank the Meren Lab (https://merenlab.org/) and all the people involved in Tara Oceans for producing the data we used and making them publicly available. We would also like to thank all participants of the GOBITMAP and GREENOCEAN workshops for their useful advice. Finally, EF would like to thank Raphaël Berthier and Jean-Olivier Irisson for their advice concerning machine learning, as well as François Duchenne, Elise Kerdoncuff, Benoît Pérez and Eric Bapteste for their helpful comments. This work was funded mainly by our salary as French State agents and therefore by French taxpayers’ taxes. EF acknowledges a 3-year Ph.D. grant from the “Interface Pour le Vivant” (IPV) doctoral program of Sorbonne Université. SDA acknowledges the CNRS for her two sabbatical years as visiting researcher at ISYEB in 2018-2020. LB acknowledges the Institut Universitaire de France for her 5-year nomination as Junior Member (2020-2025). Additional support was provided by the Institut des Sciences du Calcul et des Données (ISCD) of Sorbonne Université through the support of the sponsored junior team FORMAL (From ObseRving to Modeling oceAn Life). EF acknowledges the financial help of the Korean Institute of Ocean Science and Technology to attend the IMBeR 2019 conference, which led to very helpful discussions concerning this work.

## Author information

Authors

### Contributions

E.F., S.D.A. and L.B. conceived the study. E.F. processed and analyzed the data, with inputs from S.D.A. and L.B. E.F., S.D.A. and L.B. wrote the manuscript.

### Corresponding author

Correspondence to Emile Faure.

## Ethics declarations

### Competing interests

The authors declare no competing interests.

Peer review information Nature Communications thanks Tom Delmont and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Rights and permissions

Reprints and Permissions

Faure, E., Ayata, SD. & Bittner, L. Towards omics-based predictions of planktonic functional composition from environmental data. Nat Commun 12, 4361 (2021). https://doi.org/10.1038/s41467-021-24547-1

• Accepted:

• Published: