Main

The sunlit surface ocean is dominated by heterotrophic bacterioplankton, particularly those belonging to the SAR11 clade of Alphaproteobacteria (Pelagibacterales)2. SAR11 bacteria are globally distributed and abundant, constituting 20–45% of prokaryotic cells and around 18% of the biomass in the surface ocean and having an estimated global population size of 2.4 × 10²⁸ cells1,7. Like other bacteria adapted to oligotrophic (nutrient-poor) environments, they exhibit a small size (cell volume of 0.02–0.06 µm³)8, extremely streamlined genome (1.2–1.4 Mb), and limited metabolic versatility2,9,10. SAR11 bacteria rely largely on the uptake of dissolved organic matter (DOM) to meet their requirements for carbon, nitrogen, sulfur and phosphorus, and are highly active consumers of low molecular mass DOM, accounting for 30–60% of assimilation of amino acids, taurine, glucose and dimethylsulfoniopropionate (DMSP) in the surface ocean3,11,12,13,14,15. Owing to their high abundance, SAR11 bacteria have major biogeochemical importance; for example, they produce climate-active gases such as methane3 and dimethylsulfide4, and divert carbon from the biological carbon pump through respiration of dissolved organic carbon16 (DOC). Thus, understanding the physiology and metabolic capabilities of SAR11 bacteria is critical to our understanding of marine ecosystems.

Fig. 1: Ubiquitous SAR11 bacteria use SBP-dependent transporters to take up DOM from the ocean and have a critical role in oceanic cycles.

Schematic of a SAR11 bacterium showing the SBP-dependent transporters of Ca. P. ubique HTCC1062 and their relationship with key pathways of carbon, nitrogen and sulfur metabolism. Metabolites identified here as SBP ligands are highlighted in colour: blue, amino acids and their derivatives; red, carbon sources; yellow, sulfur-containing metabolites; orange, C1 sources; purple, inorganic ions. Metabolic pathways are based on reactions proposed here (Extended Data Fig. 9) and in refs. 28,29,42,44,56,57. [C1] refers to a C1 unit oxidized to CO2 via the THF-linked oxidation pathway. The ABC transporter structure shown is an X-ray crystal structure of MalEFGK2 from E. coli (Protein Data Bank (PDB) 2R6G) and the structure representative of the TRAP and TTT architecture is a model of the TRAP transporter SiaPQM from Haemophilus influenzae58. DMS, dimethylsulfide; glucose-6P, glucose-6-phosphate; MMPA, methylmercaptopropionate; PEP, phosphoenolpyruvate; ribulose-5P, ribulose-5-phosphate; THF, tetrahydrofolate; TMAO, trimethylamine-N-oxide. Figure partly created with BioRender.com.

To compete for nutrients in the oligotrophic ocean environment, SAR11 bacteria rely heavily on solute-binding proteins (SBPs) to facilitate uptake of specific growth substrates. SBPs are associated with three families of membrane transport systems: ATP-binding cassette (ABC) transporters, which represent the most abundant high-affinity substrate uptake systems in bacteria, as well as tripartite ATP-independent periplasmic (TRAP) transporters and tripartite tricarboxylate transporters (TTTs)17,18. In Gram-negative bacteria, SBPs bind their substrates with high affinity in the periplasm (with dissociation constant (Kd) values typically in the nanomolar to micromolar range) and deliver them to a transmembrane protein complex that facilitates translocation of the substrate across the inner membrane against a concentration gradient, which is driven by coupling to ATP hydrolysis (ABC transporters) or an electrochemical gradient (TRAP and TTT transporters). Consistent with the physiological importance of SBP-dependent transporters for high-affinity substrate uptake, SAR11 bacteria devote a large proportion of their streamlined proteome to SBPs5,6; for example, SBPs represented around 67% of SAR11-derived spectra in metaproteomic analysis of environmental samples from the Sargasso Sea5.

The high abundance1,7 and substrate uptake activity3,11,12,13,14,15 of SAR11 bacteria in the surface ocean, combined with the abundance of SBP-dependent transporters in these bacteria5,6, suggests that a small number of transport proteins in SAR11 bacteria make a substantial contribution to global assimilation of key components of low molecular mass DOM in the surface ocean. However, the properties of these transporters and their specific functions (that is, the transported metabolites) are mostly unknown, which limits our knowledge of the full range of DOM that can be assimilated by SAR11 bacteria, nutrient exchange within the marine microbial community, and the molecular mechanisms for high-affinity substrate uptake. Although homology-based predictions are available, these predictions of protein function have limited accuracy, especially for functionally diverse protein superfamilies such as the ABC and TRAP transporters19,20,21. Transport proteins can also be characterized experimentally through radioassays of substrate uptake in cultured cells, and this approach has been used to characterize the broad-specificity osmolyte transporter of the SAR11 bacterium ‘Candidatus (Ca.) Pelagibacter ubique’22. However, the difficulty of cultivating slow-growing and fastidious SAR11 bacteria makes this a challenging approach for high-throughput characterization of SBP-dependent transporters. Furthermore, owing to the genetic intractability of SAR11 bacteria, the observed transport activity cannot be linked to specific transporter genes, limiting integration of the resulting physiological data with existing multi-omics datasets to uncover the broader geochemical and ecological significance of transport activity. 
Given these limitations of in vivo approaches, we hypothesized that an in vitro approach, based on heterologous expression, purification and biochemical characterization of SBPs, would be an effective alternative strategy to elucidate the functions and properties of SBP-dependent transporters in SAR11 bacteria. This approach is supported by the fact that the specificity and affinity of substrate uptake by SBP-dependent transporters is mainly determined by the binding specificity and affinity of the corresponding SBPs23, and has been proven to be a valuable method for discovery of new metabolic pathways24,25.

Here we used this approach to systematically interrogate the function of all SBP-dependent transporters in the genome of the prototypical SAR11 bacterium Ca. P. ubique strain HTCC1062. Using high-throughput screening together with rigorous structural and biophysical characterization, we identified the function of the majority of these transporters. Revision of homology-based functional predictions enabled us to accurately interpret patterns of SAR11 transporter abundance in global ocean metagenome and metatranscriptome datasets and identify new transport capabilities and potential carbon sources for SAR11 bacteria. In particular, we identified a high-affinity, broad-specificity transporter for C4 and C5 dicarboxylates that is widely found among SAR11 ecotypes and abundantly distributed in metagenomic and metatranscriptomic datasets, implicating these dicarboxylates as major physiologically relevant carbon sources. Finally, we show how the identification of systematic trends in SBP properties, including their extremely high binding affinity, moderately high binding specificity and limited functional redundancy, provides insight into the evolutionary success of SAR11 bacteria in the oligotrophic ocean environment.

Functional profiling of SAR11 SBPs

We identified 18 SBPs through genomic analysis of Ca. P. ubique strain HTCC1062 (Methods, Supplementary Table 1); these SBPs are found widely across SAR11 bacteria, and conversely, most of the SBPs that are abundant across SAR11 bacteria are represented in this strain (Extended Data Fig. 1). Expression of most of these SBPs in cultured and/or environmental SAR11 cells has been previously demonstrated by proteomic analysis5,6 (Supplementary Table 2). Fourteen of the SBPs yielded soluble protein upon heterologous expression in Escherichia coli strains BL21(DE3) or SHuffle T7 and were successfully purified. Close homologues of two of the remaining SBPs (denoted SAR11_0271* and SAR11_1346*) from a different SAR11 strain (‘Ca. Pelagibacter’ sp. HIMB1321) could also be expressed and purified, whereas the remaining two SBPs, SAR11_0266 and SAR11_1290, could not be expressed in soluble form under any tested condition, nor refolded in vitro from insoluble material (Methods). Two of the SBPs, SAR11_1179 and SAR11_1238, were predicted to represent proteins that are found widely in bacteria and bind inorganic solutes with high specificity: phosphate and iron(III), respectively. Thus, the functional predictions for these two proteins were directly tested and confirmed by differential scanning fluorimetry (DSF) and isothermal titration calorimetry (ITC) (SAR11_1179; Extended Data Fig. 2a–c) or UV–vis spectroscopy (SAR11_1238; Extended Data Fig. 2d–i) rather than high-throughput screening.

In the remaining cases, the tentative function of each SBP was first identified by high-throughput screening of metabolite libraries by DSF. First, the target protein was screened by DSF against a commercially available metabolite library, representing a set of around 330 unique metabolites, including many common carbon, nitrogen, phosphorus and sulfur sources (full list in Supplementary Table 3). This library was supplemented with a manually curated set of around 40 metabolites that are known to be important for Ca. P. ubique and other marine bacteria26,27,28,29 (for example, osmolytes, sulfonates, and vitamin derivatives) or that were considered to be potential ligands on the basis of the computational annotations of SBP function (for example, opines). Metabolites that resulted in an increase in melting temperature (ΔTM) of the protein of at least 2 °C by DSF were considered to be potential ligands (Supplementary Table 4). Second, a representative subset of the resulting hits was selected, and binding of this subset of ligands to the target protein was confirmed and rank-ordered by repeating DSF with each ligand at a fixed concentration (10 mM) (Fig. 2). Finally, to provide further evidence that the observed increases in TM were a result of specific, high-affinity protein–ligand interactions rather than non-specific protein stabilization, the DSF experiments were repeated with a range of ligand concentrations (Supplementary Fig. 1). Using this workflow, tentative functions were identified for 15 SBPs—that is, all proteins that could be expressed and purified, except SAR11_1068. We showed previously that SAR11_1068 does not have the annotated function (cyclohexadienyl dehydratase activity) and reported extensive but ultimately unsuccessful efforts to identify its function30. The protein was subjected to further high-throughput screening in this work, but no potential ligands were identified.
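The hit-calling rule used in the first screening step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' analysis code: the function name `call_hits`, the dictionary layout and the example melting temperatures are hypothetical, and only the 2 °C threshold is taken from the text.

```python
from typing import Dict, List, Tuple

# Hit threshold stated in the text: an increase in TM of at least 2 °C.
DELTA_TM_THRESHOLD_C = 2.0


def call_hits(tm_apo_c: float,
              tm_with_ligand_c: Dict[str, float],
              threshold_c: float = DELTA_TM_THRESHOLD_C) -> List[Tuple[str, float]]:
    """Return (ligand, delta_TM) pairs with delta_TM >= threshold, largest shift first."""
    shifts = {ligand: tm - tm_apo_c for ligand, tm in tm_with_ligand_c.items()}
    hits = [(ligand, shift) for ligand, shift in shifts.items() if shift >= threshold_c]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical melting temperatures in °C, for illustration only.
example_tms = {"ligand_A": 61.5, "ligand_B": 52.4, "ligand_C": 54.0}
hits = call_hits(tm_apo_c=52.0, tm_with_ligand_c=example_tms)
print(hits)
```

Ranking hits by the size of the shift mirrors the second step of the workflow, in which a subset of hits is re-tested at a fixed ligand concentration and rank-ordered.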

Fig. 2: Ca. P. ubique HTCC1062 possesses SBPs with varying levels of binding specificity.

We measured the change in denaturation temperature (ΔTM) in the presence of 10 mM ligand by DSF. Ligands that resulted in a significant increase in TM (≥2 °C) at a concentration of 10 mM are shown. Protein–ligand interactions that were verified by ITC are labelled with the measured Kd value for the interaction. Columns represent the mean of two technical replicates shown as individual data points. Only proteins with a high-affinity ligand confirmed by ITC are shown. The DSF results show only significant protein–ligand interactions in vitro; whether these interactions are biologically important depends on the binding affinity and the environmental concentration of the ligand. Dagger indicates that the value is from the literature31. DHPS, 2,3-dihydroxypropane-1-sulfonate; GABA, γ-aminobutyrate.

Next, the function of each SBP was confirmed by ITC, enabling accurate quantification of binding affinity, which is important to establish the physiological relevance of the observed protein–ligand interactions. We typically performed titrations for 2 to 5 ligands for each protein, for a total of 32 protein–ligand interactions, aiming to select ligands with a range of ΔTM values to enable estimation of binding affinity for a broader range of ligands, on the basis of the assumption that protein–ligand interactions that yield similar ΔTM values in DSF experiments for a given protein are similar in binding affinity. Using ITC, most of the protein–ligand interactions identified by DSF could be verified (Fig. 2, Supplementary Fig. 2 and Supplementary Table 5), and at least one high-affinity ligand (with Kd < 500 nM) could be identified for each SBP, except SAR11_0271* and SAR11_0797 (Supplementary Note 1). The interaction between SAR11_1302 and TMAO was confirmed by ITC in another report while this work was in progress31. Thus, in total, based on the DSF and ITC data, 13 out of the 18 SAR11 SBPs of Ca. P. ubique HTCC1062 could be confidently assigned a binding function (Fig. 1 and Supplementary Table 6). Further evidence in support of the functional assignment for six SBPs could be obtained on the basis of a metabolome screening approach24 using X-ray crystallography and/or gas chromatography–mass spectrometry (GC–MS) (Extended Data Fig. 3 and Supplementary Note 2). In addition, the physiological relevance of the proposed binding functions is highlighted by the identification of SBPs that bind known substrates of Ca. P. ubique HTCC1062 such as amino acids, d-glucose, DMSP and taurine with high affinity3,11,12,13,14 and the fact that the measured concentrations of these substrates in the surface ocean frequently exceed the measured Kd values of the corresponding SBPs (Supplementary Note 3 and Supplementary Table 7).

The SBPs of Ca. P. ubique HTCC1062 showed remarkably high binding affinities, with multiple SBPs having Kd values as low as 20–30 pM (Fig. 3a,b). Seven out of the thirteen SBPs had Kd values of <5 nM, which are below the quantification limit of direct ITC experiments; thus, to obtain accurate Kd values for these interactions, we also performed competitive ITC binding experiments (Supplementary Table 5). In the case of SAR11_1210, titration with l-arginine in the presence of d-octopine as a competing ligand indicated a Kd value between 10 pM and 100 pM for l-arginine, but variable results were obtained with different concentrations of d-octopine (Supplementary Data 1). Thus, to confirm the high affinity of this interaction, we also performed a protein–protein competition experiment in which SAR11_1210 was mixed with a previously characterized arginine-binding protein, ArgT from Salmonella enterica (Kd = 15 nM), and then titrated with l-arginine. Fitting the resulting data to a two-sets-of-sites binding model yielded a Kd of 32 pM for the interaction between SAR11_1210 and l-arginine (Fig. 3c). We also solved the crystal structure of SAR11_1210 complexed with l-arginine, which showed an unusual binding mode involving a direct interaction between the ligand and the flexible hinge region linking the two α/β domains of the SBP, suggesting a possible structural basis for the high binding affinity (Fig. 3e, Extended Data Fig. 4 and Supplementary Note 4). Finally, in the case of SAR11_0769, titration with d-glucose reproducibly yielded a biphasic binding isotherm (Fig. 3d), which most probably reflects differential binding of the α and β anomers of d-glucose, as supported by a crystal structure of SAR11_0769 complexed with β-d-glucose (Fig. 3f, Extended Data Fig. 5 and Supplementary Note 5). Fitting the ITC data to a competitive binding model enabled estimation of the upper limit of Kd (lower limit of affinity) for the high-affinity anomer as approximately 27 pM. 
A systematic survey of literature data (n = 206 SBPs) revealed that the typical range of SBP Kd values for organic solutes is 10–1,000 nM, with a lower limit of 200–400 pM (log10 Kd values −6.76 ± 1.15 (mean ± s.d.), Fig. 3a). Together, these results provide robust evidence that some SBPs in SAR11 bacteria exceed the previously established limits of SBP affinity for organic solutes.
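To put the literature statistics in more familiar units, the log10 Kd summary can be converted back to concentrations. The sketch below uses only values quoted in the text (mean and s.d. of log10 Kd, and the 32 pM Kd of SAR11_1210 for l-arginine); the function names are hypothetical, and expressing the result as a number of standard deviations is an illustrative reading that assumes Kd is in molar units before taking the logarithm.

```python
import math

MEAN_LOG10_KD = -6.76  # literature mean of log10(Kd in M), from the text
SD_LOG10_KD = 1.15     # literature s.d. of log10(Kd in M), from the text


def log10_kd_to_nm(log10_kd: float) -> float:
    """Convert log10(Kd in M) to a Kd in nM."""
    return 10 ** log10_kd * 1e9


def sd_below_mean(kd_molar: float) -> float:
    """Number of s.d. by which log10(Kd) lies below the literature mean."""
    return (MEAN_LOG10_KD - math.log10(kd_molar)) / SD_LOG10_KD


geometric_mean_nm = log10_kd_to_nm(MEAN_LOG10_KD)               # about 174 nM
one_sd_low_nm = log10_kd_to_nm(MEAN_LOG10_KD - SD_LOG10_KD)     # about 12 nM
one_sd_high_nm = log10_kd_to_nm(MEAN_LOG10_KD + SD_LOG10_KD)    # about 2,450 nM
z_sar11_1210 = sd_below_mean(32e-12)                            # about 3.2 s.d.

print(f"geometric mean Kd: {geometric_mean_nm:.0f} nM")
print(f"mean -/+ 1 s.d.: {one_sd_low_nm:.1f} nM to {one_sd_high_nm:.0f} nM")
print(f"SAR11_1210 (32 pM) lies {z_sar11_1210:.1f} s.d. below the mean")
```

On this reading, a Kd of 32 pM sits a little more than three standard deviations below the literature mean on the log scale.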

Fig. 3: SBPs from Ca. P. ubique HTCC1062 exhibit ultra-high binding affinity.

a, Comparison of Kd values for SBPs from Ca. P. ubique HTCC1062 with bacterial SBPs previously reported in the literature (n = 206; Supplementary Data 2). In cases where the SBP binds multiple ligands, data for the highest-affinity ligand are shown. SBPs with Kd < 5 nM are highlighted by red shading. b, Kd values for the highest-affinity interaction for each SBP from Ca. P. ubique HTCC1062. Values for SAR11_0655 and SAR11_0769 represent upper bounds on Kd (lower bounds on affinity). Bars represent mean of 2–5 technical replicates (independent titrations), shown as individual data points. Data from the literature are shown for SAR11_1302 (ref. 31). c,d, Determination of binding parameters for high-affinity interactions using competitive ITC experiments. Data fitting was performed in MicroCal PEAQ software. c, Simultaneous titration of SAR11_1210 and SeArgT with l-arginine. Data are representative of two replicates (separate titrations). d, Titration of SAR11_0769 with d-glucose, showing a biphasic binding curve. Data are representative of four replicates (separate titrations). The negative control titration (buffer plus d-glucose) is shown in grey. e, Binding mode of l-arginine in the crystal structure of SAR11_1210 (1.32 Å resolution). f, Binding mode of d-glucose in the crystal structure of SAR11_0769 (1.86 Å resolution).

Reinterpreting transport gene function

Comparison of the experimentally determined functions of each SBP with the homology-based predictions indicated that the accuracy of the predictions was low (Supplementary Table 6). The binding specificities of four proteins, SAR11_0807, SAR11_1179, SAR11_1203 and SAR11_1238, were correctly predicted as taurine, phosphate, citrate and iron(III), respectively. The predictions of SAR11_0769, SAR11_0953 and SAR11_1346 as a sugar-binding protein, general amino acid-binding protein and branched-chain amino acid-binding protein were broadly correct, although experimental characterization enabled identification of specific ligands. By contrast, 7 out of the 15 testable functional annotations were incorrect; in 5 of these cases, the binding specificity could be determined experimentally. For example, SAR11_1336 (encoded by potD), which was annotated as a putative spermidine or putrescine-binding protein, showed broad specificity for glycine betaine, DMSP and other osmolytes. The binding specificity of this protein matches the transport activity of a broad-specificity osmolyte transporter that was previously characterized in vivo, which was putatively attributed to SAR11_0797 (proX)22. Overall, these results show that the SBP-dependent transporters of SAR11 bacteria transport a narrower range of nitrogen sources and a broader range of carbon sources and exhibit less functional redundancy than predicted32.

The assignment of transport capabilities to specific genes enables integration of the functional data with existing genomic, transcriptomic, and proteomic data. For example, functional assignment of the Ca. P. ubique HTCC1062 SBPs enabled analysis of the geographical distribution of various transport capabilities across SAR11 and other marine bacteria using ocean metagenome and metatranscriptome data. First, using the Ocean Gene Atlas tool33, we analysed the abundance of homologues of the characterized SBPs in metagenome and metatranscriptome datasets from the Tara Oceans project (Extended Data Figs. 6 and 7), which enabled us to identify abundantly transcribed transport genes that might contribute to assimilation of various components of DOM and determine whether transport gene expression correlates with known patterns of nutrient limitation and uptake. We also performed a separate metagenome analysis limited to SAR11 bacteria, estimating the percentage of SAR11 bacteria to contain a given SBP gene at each site based on the relative abundance of different SAR11 genomospecies34, which yielded similar patterns of SBP abundance (Supplementary Fig. 3).
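One plausible reading of the SAR11-restricted analysis is a recruitment-weighted presence/absence calculation: at each site, the share of SAR11 recruitment that maps to genomes carrying the gene. The sketch below illustrates that idea only. The function name, the dictionary inputs and the example numbers are hypothetical, and the actual procedure in the Methods may differ in normalization or handling of incomplete genomes.

```python
from typing import Dict


def fraction_with_gene(recruitment: Dict[str, float],
                       has_gene: Dict[str, bool]) -> float:
    """Recruitment-weighted fraction of SAR11 genomes carrying a gene at one site.

    recruitment: genome id -> relative recruitment at the site (any consistent unit).
    has_gene:    genome id -> True if the genome encodes a homologue of the SBP.
    Genomes missing from has_gene are treated as lacking the gene.
    """
    total = sum(recruitment.values())
    if total <= 0:
        raise ValueError("no SAR11 recruitment at this site")
    carrying = sum(value for genome, value in recruitment.items()
                   if has_gene.get(genome, False))
    return carrying / total


# Hypothetical site with three genomes, for illustration only.
site_recruitment = {"genome_1": 6.0, "genome_2": 3.0, "genome_3": 1.0}
gene_presence = {"genome_1": True, "genome_2": False, "genome_3": True}
site_fraction = fraction_with_gene(site_recruitment, gene_presence)
print(f"{site_fraction:.0%} of SAR11 recruitment comes from genomes with the gene")
```

Because incomplete genomes can lack a gene that the organism actually carries, a calculation of this kind gives a lower bound on the true proportion.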

Consistent with the global abundance of SAR11 bacteria and the broad distribution of the characterized SBPs among SAR11 bacteria (Extended Data Fig. 1), and consistent with another recent analysis35, homologues of most of the Ca. P. ubique HTCC1062 SBP genes were present at high abundance across stations in the metagenome and metatranscriptome datasets, including surface, deep chlorophyll maximum (DCM) and mesopelagic samples (Extended Data Figs. 6 and 7). In the metatranscriptome dataset, the mean abundance (fraction of mapped reads) of these SBP genes across surface stations varied from 3 × 10⁻⁵ (SAR11_0655) to 3 × 10⁻³ (SAR11_0953); for comparison, the mean total abundance of SBP transcripts across surface stations was estimated to be 2.7 × 10⁻², indicating that the SBPs of Ca. P. ubique HTCC1062 and their putatively isofunctional homologues account for a substantial proportion (around 40%) of SBP transcripts in the surface ocean. Although there is not necessarily a quantitative correlation between transcript abundance and the rate of substrate uptake, these results show qualitatively that the functional assignments of the Ca. P. ubique HTCC1062 SBPs are potentially significant in the broader context of global DOM assimilation. For example, the DMSP/glycine betaine transport gene SAR11_1336 showed a mean transcript abundance of 1.6 × 10⁻³. The high abundance of SAR11_1336 suggests that, together with a recently described and unrelated DMSP-specific transport protein36, SAR11_1336 and its homologues in other bacteria may make a significant contribution to global microbial uptake of DMSP, a metabolite that has an important role in the marine sulfur cycle and climate regulation via its microbial conversion to the climate-active gas dimethylsulfide37.
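The abundances quoted above can be turned into shares of the total SBP transcript pool with simple arithmetic. The sketch below uses only the numbers given in the text; the variable and function names are hypothetical, and the "implied" summed abundance is back-calculated from the stated figure of around 40% rather than taken from the underlying data.

```python
# Mean total abundance of SBP transcripts across surface stations (from the text).
TOTAL_SBP_ABUNDANCE = 2.7e-2

# Mean transcript abundances (fraction of mapped reads) quoted in the text.
MEAN_ABUNDANCE = {
    "SAR11_0655": 3e-5,    # lowest value reported
    "SAR11_1336": 1.6e-3,  # DMSP/glycine betaine transport gene
    "SAR11_0953": 3e-3,    # highest value reported
}


def share_of_sbp_transcripts(abundance: float,
                             total: float = TOTAL_SBP_ABUNDANCE) -> float:
    """Fraction of all SBP transcripts accounted for by one gene family."""
    return abundance / total


shares = {gene: share_of_sbp_transcripts(a) for gene, a in MEAN_ABUNDANCE.items()}

# Summed abundance implied by the stated share of around 40% (about 1.1e-2).
implied_characterized_total = 0.40 * TOTAL_SBP_ABUNDANCE

for gene, share in shares.items():
    print(f"{gene}: {share:.1%} of SBP transcripts")
print(f"implied summed abundance: {implied_characterized_total:.2e}")
```

On these figures, SAR11_1336 and its homologues alone correspond to roughly 6% of SBP transcripts at surface stations, and SAR11_0953 to roughly 11%.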

In both the metagenome and metatranscriptome analyses, transporters for sulfonates, amino acids, TMAO, glycine betaine, DMSP and dicarboxylates showed a near-universal distribution and particularly high abundance, whereas transporters for l-pyroglutamate, phosphate, iron(III) and d-glucose showed a geographically limited distribution (Extended Data Figs. 6 and 7). These results are consistent with the known contribution of SAR11 bacteria to uptake of taurine, amino acids and DMSP across different environments, compared with ecotype-specific and geographically variable uptake of d-glucose (ref. 2). Similar patterns of SBP gene abundance were typically observed in the metagenome and metatranscriptome datasets, consistent with high constitutive expression and limited transcriptional regulation of most SBP genes in SAR11 bacteria, with the exception of SBPs for phosphate and iron, which showed higher expression in regions of known phosphate and iron limitation38,39. Notably, these interpretations of the metagenomic and metatranscriptomic data are contingent on accurate functional annotation; for example, misidentification of the SAR11 osmolyte transporter as SAR11_0797 would suggest a much more limited role for DMSP and glycine betaine uptake and a broader role for polyamine uptake across SAR11 and other marine bacteria35,40.

Novel functions of SAR11 SBPs

In addition to identifying transporters for known substrates, functional characterization of SBPs also enabled identification of new transport capabilities. SAR11_0655 (l-pyroglutamate) and SAR11_1361 (C4 and C5 dicarboxylates) represent new classes of ABC transporters and previously unknown transport capabilities of SAR11 bacteria. l-Pyroglutamate, which binds to SAR11_0655 with Kd < 5 nM, was an unexpected ligand, as it is a non-proteinogenic amino acid that is not known to be a significant component of DOC16. Although the occurrence of SAR11_0655 is limited among SAR11 ecotypes and is restricted mainly to high latitudes (Extended Data Figs. 1, 6 and 8), other SAR11 bacteria appear to achieve l-pyroglutamate uptake using an alternative transporter (Supplementary Note 6), suggesting that l-pyroglutamate is widely utilized. Analysis of genome context suggested a putative pathway for utilization of exogenous l-pyroglutamate as a source of l-glutamate (Extended Data Fig. 9). The fact that SAR11 bacteria, despite their extremely streamlined genome, retain specific and high-affinity transporters for l-pyroglutamate indicates that this amino acid must be a widely available and useful source of carbon and/or nitrogen in the ocean. More generally, given the significant challenges of identifying environmentally important metabolites in heterogeneous, dilute and variable DOC16, this result suggests that identification of new transport capabilities from characterization of SBPs from oligotrophic marine bacteria might be a useful approach to identify new environmentally significant ocean metabolites from the DOC pool.

SAR11_1361 showed binding of a broad range of dicarboxylates that participate in the tricarboxylic acid (TCA) cycle (Figs. 1 and 2). This gene is known to be associated with carbon starvation in SAR11 bacteria; transcription and/or expression is upregulated upon carbon limitation in the dark41 (that is, energy-starved conditions) and downregulated upon nitrogen and sulfur limitation42,43. Analysis of genome context also suggested a putative pathway (via SAR11_1354) for utilization of exogenous glutarate, which was subsequently confirmed to be a substrate of SAR11_1354 by 1H-NMR and a ligand of SAR11_1361 by DSF (Extended Data Fig. 9). These results, together with the identification of a specific and high-affinity citrate-binding protein (SAR11_1203), suggest a broad capacity of Ca. P. ubique HTCC1062 to assimilate dicarboxylates and TCA cycle intermediates. Genomic and biogeographical analysis indicated that the capability for dicarboxylate uptake is also widely distributed among SAR11 bacteria: the dicarboxylate transport protein SAR11_1361 shows a broader distribution among SAR11 genomes than the glucose transport protein SAR11_0769 (Fig. 4a,b and Extended Data Fig. 1), and shows a broader geographical distribution in the Tara Oceans metagenome and metatranscriptome datasets, including both coastal and open ocean samples (Fig. 4c–f), despite the fact that SAR11_1361 has a much more limited phylogenetic distribution among bacteria (Extended Data Fig. 10). In the context of uncertainty surrounding the carbon sources that are universal to SAR11 bacteria44 (Supplementary Note 7), the identification of an SBP with high affinity (Kd < 10 nM) and broad specificity for C4 and C5 dicarboxylates that is conserved among SAR11 ecotypes despite stringent genome streamlining, and widely distributed and highly transcribed throughout the ocean, provides strong evidence that these dicarboxylates are physiologically important carbon sources in SAR11 bacteria.

Fig. 4: The dicarboxylate transport gene SAR11_1361 is abundantly distributed across the surface ocean.

a,b, Estimated proportion of SAR11 bacteria containing homologues of SAR11_0769 (a) and SAR11_1361 (b) in surface stations from the Tara Oceans dataset, based on metagenome recruitment values of 159 SAR11 genomes at each location34 and the presence or absence of an SBP homologue in the corresponding genomes (Methods). The completeness of the genomes used for this analysis was 77 ± 15% (mean ± s.d.)34; thus, the given values are underestimates. For comparison, the proportion of SAR11 bacteria belonging to the Ca. P. ubique HTCC1062 clade and subclade of SAR11 bacteria at each location is shown in Supplementary Fig. 7. c–f, Abundance of SAR11_0769 and SAR11_1361 homologues in surface metagenome and metatranscriptome datasets from the Tara Oceans project. SAR11_0769 and SAR11_1361 homologues were obtained from the Ocean Gene Atlas v2.0 and filtered using an e-value threshold of 10⁻³⁰ and sequence identity threshold of 40%. Abundance is expressed as the fraction of mapped reads in each sample and is represented by point area on a linear scale. c, SAR11_0769 homologues in the Tara Oceans metagenome datasets. d, SAR11_1361 homologues in the Tara Oceans metagenome datasets. e, SAR11_0769 homologues in the Tara Oceans metatranscriptome datasets. f, SAR11_1361 homologues in the Tara Oceans metatranscriptome datasets.
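The homologue filter described in the caption for panels c–f can be sketched as follows. This is a hedged illustration only: the `Hit` record, its field names and the example hits are hypothetical and do not reflect the Ocean Gene Atlas output format, and treating both thresholds as inclusive is an assumption. Only the threshold values (e-value of 10⁻³⁰ and 40% identity) come from the text.

```python
from dataclasses import dataclass
from typing import Iterable, List

MAX_E_VALUE = 1e-30          # e-value threshold stated in the caption
MIN_IDENTITY_PERCENT = 40.0  # sequence identity threshold stated in the caption


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one homologue hit in one sample."""
    gene_id: str
    e_value: float
    identity_percent: float
    abundance: float  # fraction of mapped reads


def filter_homologues(hits: Iterable[Hit]) -> List[Hit]:
    """Keep hits that pass both thresholds (assumed inclusive)."""
    return [h for h in hits
            if h.e_value <= MAX_E_VALUE and h.identity_percent >= MIN_IDENTITY_PERCENT]


def total_abundance(hits: Iterable[Hit]) -> float:
    """Summed abundance of the retained homologues in a sample."""
    return sum(h.abundance for h in hits)


# Hypothetical hits, for illustration only.
example_hits = [
    Hit("hit_1", 1e-80, 72.0, 2.0e-4),  # passes both thresholds
    Hit("hit_2", 1e-12, 65.0, 5.0e-4),  # fails the e-value threshold
    Hit("hit_3", 1e-45, 31.0, 1.0e-4),  # fails the identity threshold
    Hit("hit_4", 1e-30, 40.0, 3.0e-4),  # exactly at both thresholds, kept
]
kept = filter_homologues(example_hits)
print([h.gene_id for h in kept], total_abundance(kept))
```

Applying both criteria together is stricter than either alone, which reduces the chance that distant paralogues with a different binding function inflate the abundance estimate.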

Specificity and affinity of SAR11 SBPs

Systematic characterization of SBPs provided a global view of transporter specificity and affinity in Ca. P. ubique HTCC1062, providing broad insight into the physiology of oligotrophic bacteria. It has long been hypothesized that oligotrophic bacteria with streamlined genomes, including SAR11 bacteria, rely on broad-specificity transporters to enable transport of a broad range of substrates with a limited number of transporters9. Indeed, a broad-specificity osmolyte transporter had previously been identified22, and three more broad-specificity SBPs for amino acids and dicarboxylates were characterized in this work. However, the majority of SBPs (at least 8 out of 13) showed high binding specificity, suggesting a more nuanced view of uptake specificity in oligotrophic bacteria (Supplementary Note 8). Genome streamlining results in reduction of metabolic genes in addition to transporter genes; thus, broad-specificity transporters are associated with a risk of futile uptake of metabolites that cannot be utilized, especially given the high compositional complexity of ocean DOC. Our results show that Ca. P. ubique HTCC1062 is highly selective in its substrate uptake, using a small number of broad-specificity transporters mainly for metabolites that can be utilized without dedicated catabolic pathways, including amino acids and TCA cycle intermediates. The remaining transporters show high specificity and mainly cover specific gaps in the broad-specificity transporters; indeed, there is little redundancy in binding specificity between the SBPs, except for some overlap between the two broad-specificity amino acid-binding proteins. 
The high specificity of these transporters does not result from a negative tradeoff between specificity and affinity; for example, SAR11_0953 is estimated to have nanomolar affinity for around 15 proteinogenic amino acids (on the basis of measured ΔTM and Kd values), with the highest affinity (Kd = 550 pM) for l-glutamate, demonstrating that broad specificity is compatible with high affinity (Supplementary Note 9). Furthermore, three out of the four broad-specificity transporters (SAR11_0953, SAR11_1336 and SAR11_1346) appear to be widely distributed among Proteobacteria, indicating that use of broad-specificity transporters is not unique to oligotrophic bacteria (Extended Data Fig. 8 and Supplementary Fig. 4). Overall, these considerations suggest that oligotrophic bacteria probably show greater selectivity in substrate uptake than previously assumed.

Our results revealed that a systematic increase in SBP binding affinity is a major adaptation of Ca. P. ubique HTCC1062 to low substrate concentrations in the oligotrophic environment. The binding affinity of the Ca. P. ubique HTCC1062 SBPs was remarkably high on average, and substantially exceeded the known range of SBP binding affinity in some cases (Fig. 3a). Kd values in the picomolar to low nanomolar range were observed in most cases, in concordance with the picomolar to low nanomolar concentrations of amino acids and other substrates typically observed in the surface oligotrophic ocean45,46 (Supplementary Note 3 and Supplementary Table 7) and picomolar to low nanomolar uptake affinities (specifically, Ks + [S], the sum of the half-saturation constant (Ks) and the in situ substrate concentration) for various metabolites in environmental samples from the surface ocean45,47, for which the corresponding transport proteins have generally not been identified22. Although the SBPs may have slightly different properties in vitro compared with their native cellular environment, a strong correlation is generally observed between in vitro properties of SBPs and the in vivo properties of the corresponding transporters23. In addition, the physiological relevance of the observed binding affinity in SBPs is indicated by several considerations: (1) the observed Kd of 2.0 nM for the interaction between SAR11_1336 and glycine betaine is in excellent agreement with the previously measured Ks value for the corresponding transporter22 (0.89 nM); (2) mathematical models of ABC transporter activity indicate that uptake affinity should be greater than SBP binding affinity when SBP concentration is high48 (as in SAR11 bacteria); and (3) the binding affinity of an SBP has physiological significance itself, because it determines the concentration at which substrates can be accumulated in the periplasm49. 
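The practical consequence of a picomolar Kd at nanomolar substrate concentrations can be illustrated with the standard single-site binding relation, in which fractional occupancy equals [S] / (Kd + [S]). The sketch below is a simplification under stated assumptions: it treats binding as 1:1 at equilibrium and ignores ligand depletion, and the 1 nM substrate concentration is a hypothetical value chosen for illustration. The two Kd values (32 pM for SAR11_1210 and 15 nM for S. enterica ArgT, both with l-arginine) are taken from the text.

```python
def fractional_occupancy(ligand_molar: float, kd_molar: float) -> float:
    """Fraction of binding sites occupied for 1:1 binding at equilibrium.

    Assumes the free ligand concentration equals ligand_molar (no depletion).
    """
    if ligand_molar < 0 or kd_molar <= 0:
        raise ValueError("concentration must be >= 0 and Kd must be > 0")
    return ligand_molar / (kd_molar + ligand_molar)


KD_SAR11_1210 = 32e-12  # Kd for l-arginine, from the text
KD_SE_ARGT = 15e-9      # Kd of S. enterica ArgT for l-arginine, from the text
ASSUMED_ARGININE = 1e-9  # hypothetical 1 nM free l-arginine, for illustration

occupancy_sar11 = fractional_occupancy(ASSUMED_ARGININE, KD_SAR11_1210)  # about 0.97
occupancy_argt = fractional_occupancy(ASSUMED_ARGININE, KD_SE_ARGT)      # about 0.06

print(f"SAR11_1210: {occupancy_sar11:.0%} occupied")
print(f"SeArgT:     {occupancy_argt:.0%} occupied")
```

Under these assumptions, the SAR11 protein would be almost fully loaded at a concentration where the lower-affinity homologue is mostly empty, which is the intuition behind linking SBP affinity to substrate capture in dilute environments.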
Whereas previous work has shown how extreme selective pressure driven by large population size under low-nutrient conditions has driven systematic adaptation of SAR11 bacteria at the genome and cellular levels (for example, reduction of GC content9 and increase in periplasmic volume8), this work shows that systematic adaptation of the biophysical properties of SBPs is another important factor in the evolutionary success of SAR11 bacteria. We speculate that the evolutionary tradeoffs underlying ultra-high binding affinity in SBPs may also be an important factor shaping the physiology of SAR11 and other oligotrophic bacteria (Supplementary Note 10).

The identification of SBPs with unprecedented binding affinity in the genome of Ca. P. ubique HTCC1062 resolves uncertainty about the discrepancy between the observed affinity of substrate uptake by microbial communities in the ocean and the binding affinity of previously characterized transporters for substrate uptake. To explain this apparent discrepancy, various alternative mechanisms for high-affinity substrate uptake in oligotrophic bacteria have been proposed. For example, a recent modelling study showed that the uptake affinity of an ABC transporter depends on both SBP concentration and binding affinity, and suggested that oligotrophic bacteria might use high SBP expression to achieve high uptake affinity without increasing SBP binding affinity48; by contrast, our results show (without invalidating this model) that high uptake affinity can be explained without accounting for periplasmic SBP concentration. As another example, the observation that the Kd of known phosphate-binding proteins (around 1 µM) is much higher than concentrations of inorganic phosphate in phosphate-depleted regions (less than 5 nM) led to the proposal of an alternative mechanism for accumulation of inorganic phosphate in the periplasm of oligotrophic bacteria49,50. Of note, the phosphate-binding protein of Ca. P. ubique HTCC1062 does indeed have relatively low binding affinity (Kd of 133 nM), which may reflect the challenge of discriminating phosphate from sulfate, which is present at a concentration of around 28 mM in the ocean; although phosphate-binding proteins from sulfate-rich environments can achieve a discrimination factor of greater than 10^5 (ref. 51), there is presumably a biophysical limit on discrimination of these two anions due to their physicochemical similarity. 
Consistent with the hypothesis that the binding affinity of phosphate-binding proteins may be constrained by the requirement for discrimination of phosphate and sulfate, SAR11_1179 showed a small but significant decrease in apparent binding affinity in the presence of 28 mM sulfate (6.7-fold decrease to 890 nM, P < 0.0001, two-tailed t-test on log10 Kd values (Extended Data Fig. 2c)).

Discussion

Systems-level approaches based on metatranscriptomics and related methods are highly valuable for profiling putative biological functions in complex microbial communities across different environments, providing insight into their ecological and biogeochemical functions52,53. However, a limitation of these methods is that they depend on homology-based predictions of protein function, which vary markedly in accuracy between protein families and are usually not validated19. Here we have shown how targeted functional characterization of environmentally abundant proteins can be integrated with existing multi-omics and physiological data to provide insight over multiple biological scales, ranging from mechanisms of functional adaptation at the molecular level to global patterns of substrate uptake capabilities in SAR11 bacteria. We anticipate that improved computational annotation and continued experimental annotation of protein function will be essential to extract maximum value from increasingly high-resolution ocean microbiome datasets and fulfil the broader goal in microbial ecology of bridging microbial gene function and ocean ecosystems biology on a planetary scale54,55.

Methods

Identification of SBP genes

Nineteen candidate SBP genes in the genome of Ca. P. ubique strain HTCC1062 were identified through a search of the TransportDB 2.0 database59 (http://membranetransport.org; accessed 22 January 2020). One of these genes, SAR11_0371, was annotated as a ‘possible transmembrane receptor’ in UniProt and showed a non-canonical predicted domain structure consisting of a short SBP-like domain (170 amino acids) followed by a coiled coil domain and unidentified C-terminal domain. Additionally, genome context analysis showed that, unlike the other ABC SBP genes in Ca. P. ubique HTCC1062, SAR11_0371 was not colocalized with genes encoding the membrane permease or ATP-binding cassette components of an ABC transport system. Thus, SAR11_0371 was considered not to represent the SBP component of an SBP-dependent transport system and was excluded from the analysis. We also attempted to identify additional SBP genes through a search of the UniProt database for proteins in Ca. P. ubique belonging to Pfam clans CL0177 (PBP; periplasmic binding protein) and CL0144 (Periplas_BP; periplasmic binding protein like); however, this search did not return any additional candidate genes.

Cloning

The protein sequence of each SBP from Ca. P. ubique HTCC1062 was obtained from the UniProt database. Signal sequences were predicted using the SignalP 5.0 server60 and removed. The protein sequences were then back-translated and codon-optimized for expression in E. coli, and the resulting genes were obtained as synthetic DNA from Twist Bioscience or Integrated DNA Technologies. The synthetic genes were cloned into the NdeI/XhoI site of the pET-28a(+) expression vector by In-Fusion cloning using the In-Fusion HD Cloning Kit (Takara Bio), yielding expression constructs with an N-terminal hexahistidine tag and thrombin tag. Correct assembly of each expression vector was confirmed by Sanger sequencing (FASMAC). The putative csiD gene, SAR11_1354, and several homologues of the Ca. P. ubique HTCC1062 SBPs (Supplementary Table 8) were cloned similarly into the pET-28a(+) vector, except that the thrombin tag was removed from the constructs of SAR11_1354, SAR11_0266 (Fub), or SAR11_1290 (SAR324). The sequences of oligonucleotides and synthetic genes used in this study are listed in Supplementary Table 9.

Optimization of protein expression

Protein expression was initially tested in E. coli BL21(DE3) cells grown in Luria-Bertani (LB) and Terrific Broth (TB) media at 30 °C and 17 °C. SAR11_0655 showed optimal soluble expression in LB medium at 17 °C, SAR11_1203 showed optimal soluble expression in TB medium at 30 °C, and eight proteins (SAR11_0797, SAR11_0807, SAR11_0864, SAR11_1068, SAR11_1179, SAR11_1210, SAR11_1238, and SAR11_1361) showed optimal soluble expression in TB medium at 17 °C. Next, the remaining proteins were tested for expression in E. coli SHuffle T7 cells (New England Biolabs) in TB medium at 17 °C; this strain expresses the disulfide bond isomerase DsbC, which can increase soluble recombinant expression of cytoplasmic proteins by promoting correct formation of disulfide bonds. Soluble expression of SAR11_0769, SAR11_0953, SAR11_1302, and SAR11_1336 was achieved under these conditions. Due to the lack of soluble expression for the remaining four proteins (SAR11_0266, SAR11_0271, SAR11_1290 and SAR11_1346), we also tested expression of one or two close homologues of each protein (Supplementary Table 8). The SAR11_0271 homologue from ‘Ca. Pelagibacter’ sp. HIMB1321 (denoted SAR11_0271*) could be expressed in soluble form in SHuffle T7 cells in TB medium at 17 °C, while the SAR11_1346 homologue from the same species (denoted SAR11_1346*) could be expressed in soluble form in BL21(DE3) cells in TB medium at 17 °C. SAR11_0271* and SAR11_1346* share 91.4% and 88.9% sequence identity, respectively, with the corresponding proteins from Ca. P. ubique HTCC1062, and the binding site residues are completely conserved (Supplementary Fig. 5), indicating that the functions and properties of the homologous SBPs are likely to be identical. Neither homologue of SAR11_0266 or SAR11_1290 could be expressed in soluble form in BL21(DE3) or SHuffle T7 cells. Expression of SAR11_0266 and SAR11_1290 without His6 or thrombin tags also yielded insoluble protein.

Protein expression was typically evaluated by SDS–PAGE analysis as follows. Cells transformed with the relevant expression vector by electroporation were spread from a frozen glycerol stock onto an LB agar plate containing 0.2% (w/v) glucose and 25 µg ml−1 kanamycin and incubated at 30 °C overnight. The cells were then scraped into a small volume of LB medium and used to inoculate 3 ml of the relevant growth medium containing 25 µg ml−1 kanamycin in a 10 ml round bottom tube at a starting OD600 of 0.05. The culture was incubated at 37 °C with shaking at 220 rpm until the OD600 reached 0.5. One-millilitre aliquots were transferred to clean round bottom tubes and isopropyl β-d-1-thiogalactopyranoside (IPTG) was added to a final concentration of 0.5 mM. The induced cultures were incubated with shaking at 220 rpm at 17 °C overnight or 30 °C for 3 h. A 500-µl aliquot of each culture was resuspended in lysis buffer (20 mM Tris, 0.5 M NaCl, 1% (v/v) Triton X-100, pH 8.0) and incubated at room temperature for 10 min. The cell lysate was centrifuged at 21,000g for 5 min (4 °C). The soluble fraction of the cell lysate was transferred to a tube containing 30 µl cOmplete His-Tag purification Ni-NTA resin (Roche) suspended in 500 µl buffer A (8 M urea, 20 mM Tris, 0.5 M NaCl, pH 8.0), while the insoluble fraction of the cell lysate was dissolved in 500 µl buffer A, centrifuged at 21,000g for 5 min, and then transferred to a tube containing 30 µl Ni-NTA resin suspended in 500 µl buffer A. In both cases, the resin was incubated at room temperature for 10 min, washed twice with 500 µl buffer A, and then eluted by incubation with 50 µl buffer B (8 M urea, 20 mM Tris, 0.5 M NaCl, 0.5 M imidazole, pH 8.0) at room temperature for 5 min. Fifteen microlitres of supernatant was mixed with 5 µl of 4× SDS–PAGE sample loading buffer and heated at 90 °C for 10 min, then loaded onto a 4–15% pre-cast SDS–PAGE gel (Bio-Rad). The gel was run at 200 V for 30 min and visualized with Coomassie Blue.

Large-scale protein expression and purification

For expression and purification of the Ca. P. ubique SBPs, E. coli BL21(DE3) or SHuffle T7 cells transformed with the relevant expression vector were spread from a frozen glycerol stock onto an LB agar plate containing 0.2% (w/v) glucose and 25 µg ml−1 kanamycin, and incubated at 30 °C overnight. The cells were then scraped into 3 ml LB medium, and 500 µl of the resulting cell suspension was used to inoculate 500 ml LB or TB medium supplemented with 25 µg ml−1 kanamycin in a 2 l or 3 l flask, preheated at 37 °C. The culture was incubated at 37 °C with shaking at 220 rpm until the OD600 reached 0.5, then cooled briefly in an ice-water bath until the temperature reached ~25 °C. IPTG was added to a concentration of 0.5 mM, and the culture was incubated at 17 °C with shaking at 220 rpm for a further 16 h. Cells were pelleted by centrifugation (3,300g, 15 min, 4 °C) and frozen at −20 °C until use. For protein purification, cells were thawed on ice, resuspended in 100 ml Ni binding buffer (20 mM Tris, 500 mM NaCl, 20 mM imidazole, pH 8.0), and lysed by sonication. After addition of 500 U Benzonase Nuclease (Sigma-Aldrich) to digest DNA, the cell lysate was centrifuged at 10,000g for 1 h (4 °C). The supernatant was filtered through a 0.45-µm syringe filter and then loaded onto a 1 ml HisTrap HP column (Cytiva) equilibrated with Ni binding buffer using an ÄKTA Pure FPLC system (Cytiva). For purification under native conditions, the column was washed with 10 ml Ni binding buffer followed by 10 ml Ni wash buffer (20 mM Tris, 500 mM NaCl, 44 mM imidazole, pH 8.0), and then the target protein was eluted in 10 ml Ni elution buffer (20 mM Tris, 500 mM NaCl, 500 mM imidazole, pH 8.0). 
For purification under denaturing conditions, the column was washed with denaturing Ni binding buffer (8 M urea, 20 mM Tris, 250 mM NaCl, 20 mM imidazole, pH 8.0) at 1 ml min−1 for 30 min after loading of the clarified cell lysate, and the target protein was eluted with 10 ml denaturing Ni elution buffer (8 M urea, 20 mM Tris, 250 mM NaCl, 250 mM imidazole, pH 8.0). Proteins purified under native conditions were concentrated to 400 µl using a 10 kDa molecular weight cut-off (MWCO) Amicon Ultra-4 centrifugal spin concentrator (Merck-Millipore) and purified by size-exclusion chromatography using a Superdex 200 Increase 10/300 column (Cytiva), eluting in DSF buffer (20 mM HEPES, 0.3 M NaCl, pH 7.50). For storage, proteins were concentrated to a volume of 0.5–2 ml and glycerol was added to a concentration of 10% (v/v). The protein was then flash-frozen in 100–200-µl aliquots in liquid nitrogen and stored at −80 °C until use. ArgT from S. enterica was expressed from a pETMCSIII plasmid and purified as described previously61.

Protein refolding

In most cases, protein purified under denaturing conditions was diluted to a concentration of 0.5 mg ml−1 and volume of 10–30 ml in denaturing Ni binding buffer (8 M urea, 20 mM Tris, 250 mM NaCl, 20 mM imidazole, pH 8.0) and transferred to 10 kDa MWCO SnakeSkin dialysis tubing (Thermo Scientific). The protein was then dialysed against 2 l dialysis buffer (20 mM Tris, 150 mM NaCl, pH 8.0) at 4 °C with three buffer changes over a period of 24 h. The protein was collected and exchanged into DSF buffer using a 10 kDa MWCO Amicon Ultra-15 centrifugal concentrator, then concentrated to 400 µl and purified by size-exclusion chromatography as described above. For SAR11_1346*, an improved yield of monomeric protein was obtained by refolding through rapid dilution: 2 ml of denatured protein (5 mg ml−1 in denaturing Ni binding buffer) was added dropwise with stirring to 40 ml pre-chilled refolding buffer (20 mM Tris, 150 mM NaCl, 10% (v/v) glycerol, pH 8.0) and incubated at 4 °C with stirring for 20 h. The protein was then concentrated and purified by size-exclusion chromatography as above.

Differential scanning fluorimetry

DSF experiments were performed using a StepOnePlus Real-Time PCR System and StepOne software (Applied Biosystems) based on literature protocols62,63. Reaction mixtures were prepared in twin.tec Real-Time PCR Plates (Eppendorf) and contained 5× SYPRO Orange (Sigma-Aldrich), 2.5 µM protein, and 2 µl 10× ligand in a total volume of 20 µl DSF buffer. The plate was sealed with optically clear sealing film and centrifuged at 2,000g for 1 min before loading into the real-time PCR instrument. The temperature was ramped at a rate of 1% (approximately 1.33 °C min−1), typically over a 60 °C window centred on the melting temperature (TM) of the target protein. Fluorescence was monitored using the ROX channel. TM values were determined by taking the derivative of fluorescence intensity with respect to temperature and fitting the resulting data to a quadratic equation in a 6 °C window in the vicinity of the TM in R software.
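
As an illustration of the TM determination described above, the following Python sketch takes the derivative of fluorescence with respect to temperature and fits a quadratic to the points within a 6 °C window around the derivative maximum. It is not the R code used in this study: the function names, the central-difference derivative and the synthetic two-state melting curve are assumptions made for illustration only.

```python
import math


def fit_parabola_vertex(xs, ys):
    """Least-squares fit of y = a*u**2 + b*u + c, with u = x - mean(x).

    Returns the x position of the parabola vertex.
    """
    x0 = sum(xs) / len(xs)
    u = [x - x0 for x in xs]
    s = [sum(ui ** k for ui in u) for k in range(5)]                # sums of u^0..u^4
    t = [sum(ui ** k * yi for ui, yi in zip(u, ys)) for k in range(3)]
    # Augmented normal equations for the unknowns (a, b, c).
    m = [[s[4], s[3], s[2], t[2]],
         [s[3], s[2], s[1], t[1]],
         [s[2], s[1], s[0], t[0]]]
    # Gaussian elimination with partial pivoting.
    for i in range(3):
        p = max(range(i, 3), key=lambda r: abs(m[r][i]))
        m[i], m[p] = m[p], m[i]
        for r in range(i + 1, 3):
            f = m[r][i] / m[i][i]
            for c in range(i, 4):
                m[r][c] -= f * m[i][c]
    coef = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        coef[i] = (m[i][3] - sum(m[i][j] * coef[j] for j in range(i + 1, 3))) / m[i][i]
    a, b, _ = coef
    return x0 - b / (2 * a)


def melting_temperature(temps, fluorescence, window=6.0):
    """Estimate TM as the vertex of a quadratic fitted to dF/dT near its maximum."""
    mid_t = temps[1:-1]
    # Central-difference derivative at interior points (an assumed choice).
    dfdt = [(fluorescence[i + 1] - fluorescence[i - 1]) / (temps[i + 1] - temps[i - 1])
            for i in range(1, len(temps) - 1)]
    peak_t = mid_t[max(range(len(dfdt)), key=dfdt.__getitem__)]
    selected = [(t, d) for t, d in zip(mid_t, dfdt) if abs(t - peak_t) <= window / 2]
    return fit_parabola_vertex([t for t, _ in selected], [d for _, d in selected])


# Synthetic two-state melting curve (hypothetical data, not from this study).
demo_temps = [25 + 0.5 * i for i in range(121)]
demo_fluor = [1 / (1 + math.exp((55.0 - t) / 2.0)) for t in demo_temps]
demo_tm = melting_temperature(demo_temps, demo_fluor)
```

Fitting a quadratic over a narrow window, rather than simply taking the highest derivative point, lets the estimate fall between sampled temperatures and reduces sensitivity to noise in any single reading.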

Proteins were initially screened for binding to metabolites in four Phenotype MicroArray plates, PM1 to PM4 (Biolog). The contents of each well were dissolved in 50 µl (PM1 to PM3) or 20 µl (PM4) sterile filtered water, giving a concentration of approximately 10–20 mM in each well63. The plates were then sealed with aluminium sealing films and stored at −80 °C. Prior to use, the plates were thawed at room temperature and then shaken at 30 °C until the compounds had redissolved. Two microlitres of each compound was added to 18 µl reaction mixture prepared as described above. A 2 °C increase in TM compared with the median value across the plate was taken as indicative of binding63,64.
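
The hit criterion above can be sketched in a few lines of Python. The well names and TM values are hypothetical, and treating a shift of exactly 2 °C as a hit (>=) is an assumption, because the text does not specify how the boundary case was handled.

```python
from statistics import median


def call_hits(tm_by_well, threshold=2.0):
    """Return {well: TM shift} for wells whose TM exceeds the plate median by the threshold."""
    plate_median = median(tm_by_well.values())
    return {well: tm - plate_median
            for well, tm in tm_by_well.items()
            if tm - plate_median >= threshold}


# Hypothetical TM values (degrees C) for five wells of one screening plate.
example_plate = {"A1": 50.1, "A2": 49.9, "A3": 50.0, "A4": 56.3, "A5": 50.2}
example_hits = call_hits(example_plate)
```

Using the plate median as the reference assumes that most wells contain non-binding compounds, so the median approximates the TM of the unliganded protein.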

For screening of individual compounds and confirmatory assays, compounds were dissolved at a concentration of 100 mM in ligand buffer (0.1 M HEPES pH 7.5), and the pH was adjusted with 1 M NaOH or 1 M HCl if necessary (specifically, if the pH of a 10 mM solution of the compound diluted in DSF buffer fell outside the range 6.5–8.0). These stock solutions were stored at −20 °C. Two microlitres of each compound was directly added to 18 µl reaction mixture, giving a final concentration of 10 mM, or first diluted 10-fold or 100-fold in DSF buffer to give final concentrations of 1 mM or 0.1 mM in the assay. A list of chemicals used for screening, including the supplier and catalogue number, is provided in Supplementary Table 3. Sodium (R)- and (S)-2,3-dihydroxypropane-1-sulfonate were synthesized from (R)- and (S)-3-chloro-1,2-propanediol following a literature protocol65 and verified by 1H and 13C NMR.

In the case of the TRAP and TTT SBPs, SAR11_0864 and SAR11_1203, we hypothesized that a metal ion might be required for high-affinity binding, due to the biphasic melting curve observed in the presence of isethionate in Biolog screening experiments, suggesting the presence of a mixture of active and inactive protein (SAR11_0864) or due to the discord between the highly charged ligand and the largely uncharged binding site of the SBP (SAR11_1203). Therefore, we tested the effect of the addition of metal ions (Mg2+, Ca2+, K+, Zn2+, Mn2+, Co2+, Ni2+, Fe2+ and Fe3+) on binding of isethionate to SAR11_0864 and citrate to SAR11_1203 by DSF (Supplementary Fig. 6). DSF experiments were performed using refolded protein as described above, with the addition of 1 mM metal ion and 1 mM ligand. Based on these results, and considering the concentration of each metal ion in seawater66, 10 mM CaCl2 (SAR11_0864) or 53 mM MgSO4 (SAR11_1203) were included in subsequent DSF and ITC binding experiments for these SBPs.

Isothermal titration calorimetry

ITC experiments were performed using a MicroCal PEAQ-ITC system (Malvern Panalytical). Protein samples were refolded and freshly purified (not frozen), and protein and ligand samples were prepared in the same batch of DSF buffer used for size-exclusion chromatography to minimize the heat of dilution. For SAR11_0864 and SAR11_1203, calcium chloride (final concentration 10.3 mM) or magnesium sulfate (final concentration 53 mM), respectively, was added to the protein and ligand samples. Experiments were performed at 25 °C with stirring at 700 rpm and 10 µcal s−1 reference power. Titration parameters were varied depending on the protein yield, the fraction of active protein, and the affinity and enthalpy of the interaction. In a typical titration, 35 µM protein was titrated with 1× 0.4-µl and 19× 1.6-µl injections of ligand, with the ligand concentration chosen to give >1.5-fold molar excess of ligand to active protein at the end of the titration. ITC experiments were generally performed at least in duplicate.

For simple 1:1 binding interactions, the association constant (Ka), enthalpy (ΔH), and stoichiometry (n) of the interaction were determined by fitting the data to the one-set-of-sites model in MicroCal PEAQ-ITC analysis software. In the case of the SAR11_0769 + d-glucose interaction, thermodynamic parameters were estimated through Bayesian fitting to a modified competitive binding model, which incorporated an additional parameter to account for the fraction of the ligand in each anomeric form, and a two-sets-of-sites model implemented in pytc software67; the latter model is equivalent to the two-sets-of-sites model in the MicroCal software, except without the minor correction for heat associated with the displaced volume for each injection (for consistency with the other models in pytc). Thermodynamic parameters for the SAR11_0953 + l-glutamate, SAR11_1203 + citrate, SAR11_1210 + l-arginine, SAR11_1336 + glycine betaine, and SAR11_1346* + l-leucine interactions were determined through competitive displacement experiments68, in which l-phenylalanine, cis-aconitate, d-octopine, glycine, or l-serine (respectively) were included at a fixed concentration in the cell to reduce the apparent binding affinity for the ligand of interest. The data for these competitive binding experiments were analysed by Bayesian fitting to the competitive binding sites model in pytc software. To confirm the high affinity of the SAR11_1210 + l-arginine interaction, a competitive binding experiment was performed where SAR11_1210 and ArgT from S. enterica (which has a Kd of 15 nM for l-arginine) were included in the cell together at the same concentration (28 µM) and titrated with l-arginine. Similarly, for the SAR11_1210(E108A) + l-arginine interaction, a mixture of SAR11_1210(E108A) and SAR11_1210 (35 µM each) was titrated with l-arginine. 
For these titrations, the data were fitted to a two-sets-of-sites binding model as described above to obtain thermodynamic parameters for both protein–ligand interactions. For all analyses, the heat of dilution was assumed to be a small constant value and included as a fitted parameter in the model. The validity of this assumption was confirmed for each ligand by performing a control titration where the ligand was injected into DSF buffer.
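
To illustrate why a weaker competitor in the cell brings a very tight interaction into the measurable range, the sketch below uses the standard approximate relation for competitive displacement, Kd,app = Kd × (1 + [B]/Kd,B), which holds when the competitor B is in large excess over the protein. This is a simplification for intuition only; the analysis described above fits the full competitive binding model in pytc. The function names and numerical values are hypothetical.

```python
def apparent_kd(kd_ligand, kd_competitor, competitor_conc):
    """Apparent Kd (M) of the tight ligand when a competitor is present in excess."""
    return kd_ligand * (1.0 + competitor_conc / kd_competitor)


def true_kd(kd_apparent, kd_competitor, competitor_conc):
    """Invert the relation above to recover the Kd (M) of the tight ligand."""
    return kd_apparent / (1.0 + competitor_conc / kd_competitor)


# Hypothetical example: a 1 nM ligand measured in the presence of 1 mM of a
# competitor that binds with a Kd of 10 µM appears roughly 100-fold weaker.
example_apparent = apparent_kd(1e-9, 1e-5, 1e-3)
example_recovered = true_kd(example_apparent, 1e-5, 1e-3)
```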

Spectrophotometric analysis of iron(iii) binding

Binding of iron(iii) to SAR11_1238 was analysed using a spectrophotometric assay based on literature protocols69,70. UV–vis spectra were recorded at room temperature (25 °C) in a 96-well plate from 300 nm to 630 nm with 1 nm bandwidth using a Multiskan GO spectrophotometer (Thermo Scientific). An initial protein concentration of 100 µM and an initial volume of 200 µl were used for all spectrophotometric assays. First, purified SAR11_1238 was thawed and exchanged into 50 mM Tris, 200 mM NaCl buffer (pH 8.0) using a centrifugal concentrator, and the spectrum of the resulting protein sample was recorded. To prepare unliganded protein for iron-binding assays, the protein was exchanged into 50 mM Tris, 200 mM NaCl, 20 mM sodium citrate buffer (pH 8.0) by three rounds of 30-fold dilution and concentration, allowing chelation and removal of the metal ligand. Citrate was then removed by four rounds of 30-fold dilution and concentration with 50 mM Tris, 200 mM NaCl buffer (pH 8.0). Binding assays were performed by titrating the unliganded protein (200 µl of 100 µM solution) with 8× or 10× 5-µl injections of 800 µM iron(iii) solution, which was prepared from iron(iii) chloride and a 2.5-fold molar excess of trisodium citrate (which ensures that the iron(iii) remains soluble) in ultrapure water. To confirm that SAR11_1238 binds iron(iii) rather than the iron(iii)–citrate complex, the protein was also titrated under the same conditions with 800 µM ammonium iron(ii) sulfate; under the aerobic conditions of the assay, iron(ii) is rapidly oxidized to iron(iii)69. UV–vis spectra were recorded 1 min (iron(ii)) or 15 min (iron(iii)) after each injection. Finally, a competitive binding assay with citrate was used to estimate the affinity of SAR11_1238 for iron(iii). 
The protein was saturated with a twofold molar excess of iron(iii) solution, diluted to a volume of 1 ml, and then dialysed against 500 ml of 50 mM Tris, 200 mM NaCl buffer (pH 8.0) at 4 °C overnight to remove excess iron(iii) and citrate. The protein was then concentrated to 100 µM and titrated with 5-µl injections of 8 twofold serial dilutions of 500 mM sodium citrate (adjusted to pH 8.0 in 50 mM Tris, 200 mM NaCl buffer). The absorbance at 440 nm was recorded 5 min after each addition. The data were fitted to a hyperbolic curve, yielding an apparent Kd of 9.0 mM for citrate. Given that citrate has a Kd of ~10−17 M for iron(iii), this implies that SAR11_1238 has a Kd for iron(iii) on the order of ~10−19 M, similar to previously characterized iron(iii)-binding proteins70,71.
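
The order-of-magnitude estimate above can be reproduced with a simple exchange-equilibrium argument. The sketch below is one possible reading rather than a description of the calculation actually performed: it assumes 1:1 iron(iii)–citrate and iron(iii)–protein complexes, neglects dilution during the titration, and assumes that at the midpoint half of the protein-bound iron(iii) has been transferred to citrate, which is present in large excess. The function name is hypothetical.

```python
import math


def protein_kd_from_competition(kd_competitor, protein_total, competitor_midpoint):
    """Estimate the protein Kd (M) from the competitor concentration at half displacement.

    For the exchange P-Fe + C <-> P + Fe-C, the equilibrium constant equals
    Kd(protein) / Kd(competitor). At the midpoint [P] = [P-Fe], so the constant
    reduces to [Fe-C] / [C], with [Fe-C] taken as protein_total / 2.
    """
    exchange_constant = (protein_total / 2.0) / competitor_midpoint
    return kd_competitor * exchange_constant


# Values from the text: citrate Kd ~1e-17 M for iron(iii), 100 µM protein,
# apparent Kd of 9.0 mM for citrate.
kd_iron = protein_kd_from_competition(1e-17, 100e-6, 9.0e-3)
order_of_magnitude = round(math.log10(kd_iron))
```

Because the citrate Kd for iron(iii) is itself only known approximately, the result is meaningful only to within an order of magnitude.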

X-ray crystallography

For the SAR11_0769/d-glucose and SAR11_1210/l-arginine structures, the proteins were first expressed and purified by nickel affinity chromatography under native conditions as described above. After addition of a 20-fold molar excess of d-glucose (SAR11_0769) or l-arginine (SAR11_1210), the protein was purified further by size-exclusion chromatography on a HiLoad 26/600 Superdex 75 pg column (Cytiva), eluting in 3× crystallization buffer (60 mM HEPES, 150 mM NaCl, pH 7.5). Fractions containing the target protein were collected, and d-glucose (SAR11_0769) or l-arginine (SAR11_1210) was added to a concentration of 30 µM. The protein was concentrated to a volume of ~500 µl, diluted threefold in water to reduce the NaCl concentration to 50 mM, and then concentrated further to 12 mg ml−1. For the SAR11_0769/d-galactose and SAR11_0655/l-pyroglutamate structures, the proteins were expressed and purified in the same way, except that no ligands were added. Protein crystals were obtained using the vapour diffusion method in hanging drops at 20 °C, then cryoprotected and flash-frozen in liquid nitrogen. Crystallization and cryoprotection conditions for each protein are given in Supplementary Methods. X-ray diffraction data were collected on beamline BL32XU at the SPring-8 synchrotron (Harima, Japan), using the ZOO suite for automated data collection72. The data were automatically indexed, integrated, scaled and merged in XDS73 using KAMO74. The structures were solved by molecular replacement in Phaser75 or MOLREP76. For SAR11_1210, the structure of an opine-binding protein from Agrobacterium fabrum (PDB ID 5OT8) was used as a search model; in the remaining cases, an AlphaFold2 model was used77. The structures were then refined by iterative real-space and reciprocal-space refinement in REFMAC78, Phenix79, and COOT80. Data collection and refinement statistics are given in Supplementary Table 10 and Supplementary Table 11. Structures were visualized in PyMOL.

Gas chromatography–mass spectrometry

SBPs purified under native conditions were exchanged into 200 mM ammonium acetate using a PD-10 desalting column (Cytiva) and concentrated to ~1 mM. A 10-nmol aliquot of protein was mixed with 10 µl of 300 µM α-methylglucopyranoside (as an internal control) and 200 µl methanol. The mixture was agitated at 1500 rpm at 24 °C for 10 min and then centrifuged at 21,000g for 20 min at 4 °C. The supernatant was evaporated to dryness using a vacuum evaporator, redissolved in 20 µl anhydrous pyridine, and derivatized by addition of 30 µl N-methyl-N-(trimethylsilyl)trifluoroacetamide (MSTFA) containing 1% trimethylchlorosilane (Supelco) followed by incubation at 70 °C for 1 h. In the case of SAR11_1361, the dried sample was instead dissolved in 20 µl of 20 mg ml−1 methoxyamine hydrochloride in anhydrous pyridine and incubated at 37 °C for 90 min with agitation at 750 rpm before addition of the MSTFA mixture. The derivatized samples were injected immediately onto an Agilent 7890 A GC System (Agilent Technologies) equipped with a PAL COMBI-XT autosampler (CTC Analytics) and connected to a PEGASUS 4D GC×GC TOF-MS instrument (LECO) operating in one-dimensional mode. The GC was fitted with a DB-1MS column (Agilent Technologies) with 30 m length, 0.25 mm internal diameter, and 0.25 µm film thickness. The instrument was operated in pulsed split mode with a split ratio of 2 and injection volume of 1 µl. The inlet temperature was 250 °C. Helium was used as the carrier gas with a flow rate of 1 ml min−1. The GC oven temperature was held at 70 °C for 5 min, then raised at 12 °C min−1 to 300 °C, and finally held at 300 °C for 10 min. Mass spectrometry data were collected from 50 to 500 m/z after a 6.5-min solvent delay. The ion source and transfer line temperatures were 250 °C and the ionization energy was 70 eV. Data analysis and spectral database searches against the NIST database were performed using ChromaTOF software (LECO). 
Protein-derived samples were analysed before control samples to prevent carryover.

Biogeographical analysis

Biogeographical analysis was performed using the Ocean Gene Atlas v2.0 server33. Abundance data for each SBP gene from Ca. P. ubique HTCC1062 in the Tara Oceans OM-RGC_v2_metaG and OM-RGC_v2_metaT datasets was obtained through a BLAST search with a stringent e-value threshold of 10−30. To avoid inclusion of homologous SBPs with different transport functions, hits with a sequence identity of less than 40% (for ABC SBPs) or 55% (for TRAP and TTT SBPs) compared with the corresponding HTCC1062 SBP were excluded from the analysis.
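
The filtering criteria above can be expressed as a short Python sketch. The hit records, field names and gene identifiers are hypothetical and do not reflect the actual output format of the Ocean Gene Atlas server; only the thresholds are taken from the text.

```python
E_VALUE_MAX = 1e-30
# Minimum percentage identity to the HTCC1062 SBP, by transporter family.
MIN_IDENTITY = {"ABC": 40.0, "TRAP": 55.0, "TTT": 55.0}


def filter_hits(hits, sbp_family):
    """Keep hits that pass the e-value threshold and the family-specific identity cut-off."""
    min_identity = MIN_IDENTITY[sbp_family]
    return [hit for hit in hits
            if hit["evalue"] <= E_VALUE_MAX and hit["identity"] >= min_identity]


# Hypothetical hits for an ABC-type SBP query.
example_hits = [
    {"gene": "OM-RGC.hypothetical_1", "evalue": 1e-80, "identity": 72.5},
    {"gene": "OM-RGC.hypothetical_2", "evalue": 1e-35, "identity": 38.0},  # identity too low
    {"gene": "OM-RGC.hypothetical_3", "evalue": 1e-12, "identity": 45.0},  # e-value too high
]
example_kept = filter_hits(example_hits, "ABC")
```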

To estimate the total abundance of SBP transcripts, abundance data for each of the 38 PFAM families in CL0177 (PBP; periplasmic binding protein) and CL0144 (Periplas_BP; periplasmic binding protein like), excluding the transferrin family (PF00405) and any families that contain solely enzymes or transcription factors (PF00800, PF01379, PF01634, PF02621, PF03466, PF09084), were obtained using a hmmer search of the OM-RGC_v2_metaT dataset with an e-value threshold of 10−10. Hits were obtained for 26 out of 31 PFAM families. For each PFAM family, the corresponding hidden Markov model (HMM) was obtained from the InterPro database81. The protein sequences from the hmmer search were then aligned to this HMM using hmmalign and used to construct a new HMM using hmmbuild in HMMER3.4 (http://hmmer.org). A second hmmer search of the OM-RGC_v2_metaT dataset, with a less stringent e-value threshold of 10−5, was then conducted using the resulting HMM. The hits from all 52 searches were combined and redundant hits were removed, resulting in a total of 211,222 unique SBP genes. The two-step search recovered 94% of the 23,879 genes identified as homologues of the Ca. P. ubique HTCC1062 SBPs in the BLAST analysis before application of a sequence identity threshold; the remaining 1,267 genes were also added to the list of SBP genes. Finally, the total abundance of SBP genes at each site was calculated.

To estimate the percentage of SAR11 bacteria at a site containing a given SBP from Ca. P. ubique HTCC1062, we used the recruitment values of 159 SAR11 genomes in the Tara Oceans metagenome dataset calculated by Haro-Moreno et al.34. The presence of a homologue of each SBP in each of the corresponding genomes was determined by BLAST using a 50% sequence identity and 50% coverage threshold. The relative abundance of SAR11 bacteria containing a given SBP homologue was then calculated for each station. Plots were generated using R and GraphPad Prism.
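
One plausible way to carry out the final calculation is to weight each genome by its recruitment value at a station and sum over the genomes that encode a homologue. The text does not give the exact formula, so the recruitment-weighted percentage below is an assumption, and the genome names and values are hypothetical.

```python
def percent_with_sbp(recruitment_by_genome, genomes_with_homologue):
    """Recruitment-weighted percentage of SAR11 at one station encoding the SBP homologue."""
    total = sum(recruitment_by_genome.values())
    if total == 0:
        return 0.0  # no SAR11 recruitment at this station
    with_sbp = sum(value for genome, value in recruitment_by_genome.items()
                   if genome in genomes_with_homologue)
    return 100.0 * with_sbp / total


# Hypothetical recruitment values for three genomes at one station.
example_recruitment = {"genome_1": 10.0, "genome_2": 30.0, "genome_3": 60.0}
example_percent = percent_with_sbp(example_recruitment, {"genome_1", "genome_3"})
```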

Phylogenetic analysis

Protein sequences homologous to the SBP of interest were identified via a BLAST search of the UniProtKB Reference Proteomes and Swiss-Prot databases82. The resulting sequences were filtered to remove a small number of unusually long sequences (>20% greater than mean length) and aligned in MUSCLE v3.8.31 (ref. 83). The alignment was trimmed in trimAl v1.2 using the automated1 option84 and then used to generate a maximum-likelihood phylogeny in FastTree v2.1.11, using LG + Γ20 as the substitution model85. For each protein sequence in the tree, the fraction of conserved binding site residues, compared with the corresponding protein from Ca. P. ubique HTCC1062, was estimated. The binding site residues were obtained from the crystal structure (SAR11_0769) or estimated from an AlphaFold2 model86,87. For this analysis, the following substitutions were treated as conservative: S/T, I/M, V/L, I/V, L/M, D/E, Q/N, A/V, F/Y, Y/W, F/W. Phylogenetic tree figures were generated using the ggtree package in R88. Figures showing taxonomic distribution (Extended Data Fig. 8b) were generated using Krona89.
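
The binding-site conservation score can be sketched as follows, using the conservative substitutions listed above. The input representation (two equal-length strings holding the aligned binding-site residues of the reference and query proteins) and the treatment of gaps as non-conserved are assumptions; the example residues are hypothetical.

```python
# Conservative substitution pairs taken from the list in the text.
CONSERVATIVE = {frozenset(pair) for pair in
                ("ST", "IM", "VL", "IV", "LM", "DE", "QN", "AV", "FY", "YW", "FW")}


def is_conserved(ref_residue, query_residue):
    """True if the residues are identical or form a listed conservative pair."""
    return (ref_residue == query_residue
            or frozenset((ref_residue, query_residue)) in CONSERVATIVE)


def fraction_conserved(ref_site, query_site):
    """Fraction of aligned binding-site positions that are conserved."""
    if len(ref_site) != len(query_site) or not ref_site:
        raise ValueError("binding-site strings must be non-empty and of equal length")
    conserved = sum(is_conserved(r, q) for r, q in zip(ref_site, query_site))
    return conserved / len(ref_site)


# Hypothetical five-residue binding site: four positions conserved, one not (W -> A).
example_fraction = fraction_conserved("STDFW", "TSEYA")
```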

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.