Siderophores as an iron source for picocyanobacteria in deep chlorophyll maximum layers of the oligotrophic ocean

Prochlorococcus and Synechococcus are the most abundant photosynthesizing organisms in the oceans. Gene content variation among picocyanobacterial populations in separate ocean basins often mirrors the selective pressures imposed by the region’s distinct biogeochemistry. By pairing genomic datasets with trace metal concentrations from across the global ocean, we show that the genomic capacity for siderophore-mediated iron uptake is widespread in Synechococcus and low-light adapted Prochlorococcus populations from deep chlorophyll maximum layers of iron-depleted regions of the oligotrophic Pacific and S. Atlantic oceans: Prochlorococcus siderophore consumers were absent in the N. Atlantic ocean (higher new iron flux) but constituted up to half of all Prochlorococcus genomes from metagenomes in the N. Pacific (lower new iron flux). Picocyanobacterial siderophore consumers, like many other bacteria with this trait, also lack siderophore biosynthesis genes indicating that they scavenge exogenous siderophores from seawater. Statistical modeling suggests that the capacity for siderophore uptake is endemic to remote ocean regions where atmospheric iron fluxes are the smallest, especially at deep chlorophyll maximum and primary nitrite maximum layers. We argue that abundant siderophore consumers at these two common oceanographic features could be a symptom of wider community iron stress, consistent with prior hypotheses. Our results provide a clear example of iron as a selective force driving the evolution of marine picocyanobacteria.


Data sources
The collection of genomes searched in this study (MARMICRODB, https://zenodo.org/record/3520509) was previously introduced as a reference database for quantifying Prochlorococcus and SAR11 abundance in marine metagenomes [1]. These genomes were selected to provide a comprehensive representation of the marine environment and a nonredundant and simplified characterization of terrestrial and host-associated systems. MARMICRODB includes 18758 bacterial and archaeal single cell, metagenome-assembled, and isolate genomes and eukaryotic transcriptomes from a variety of sources [2][3][4][5][6][7][8][9][10][11] as well as over 7000 viral genomes from NCBI RefSeq [12].
Prochlorococcus clade cell concentrations from the Hawai'i Ocean Time-series Station ALOHA and the Bermuda-Atlantic Time-series Study BATS site were accessed from the Biological and Chemical Oceanography Data Management Office (https://www.bco-dmo.org/dataset/3381). Clades were measured by quantitative PCR as described in a prior study [15].
Trace metal and other chemical concentrations were obtained from the GEOTRACES Intermediate Data Product IDP2017 version 2 (accessed January 2019) [16]. Samples were included from sections GA02 [17,18], GA03, GA10 [19], and GP13, which all had paired metagenomic data. Biogeochemical data from the Tara Oceans project was downloaded from https://doi.pangaea.de/10.1594/PANGAEA.875579. Climatological mean dissolved iron from the MIT Darwin model (v0.1 llc90, http://darwinproject.mit.edu/) was obtained using pycmap v0.1.2 [20]. This model version is modified from earlier versions [21,22] and includes biogeochemical cycling of nutrient elements and a planktonic marine ecosystem model incorporating functional and size diversity with multiple links between trophic levels. It is driven by the physical ocean circulation model from ECCO v4 [1]. The iron cycle parameterization in the MIT Darwin model includes a dominant iron source through aeolian dust deposition and a dominant sink due to particle scavenging and dissolution/complexation with organic ligands. The model includes two chemical species -Fe (inorganic free ferric iron) and FeL (organically complexed Fe). The total dissolved iron concentration consists of the sum of these two species.

Preprocessing biogeochemical and hydrographic measurements
Raw fluorescence or derived chlorophyll a concentrations from hydrographic CTD casts on the GEOTRACES cruises were obtained from the IDP17 [16]. At each sampling station, the fluorescence signal from the CTD profile was smoothed using a 10-meter rolling mean, then scaled at each depth by the profile standard deviation, and finally divided by the maximum value to obtain a normalized and scaled fluorescence profile from each sampling location. Normalization was necessary because fluorometer data from across the datasets was either in uncalibrated factory units, derived chlorophyll a concentrations, or raw signal and individual instrument calibration coefficients were unavailable. This normalization procedure discards the fluorescence signal magnitude, but it allows for a between-sample comparison of fluorescence profile shape and DCM depth. The DCM depth was defined as depths with fluorescence values within 90% of the peak subsurface fluorescence signal and contiguous to the maximum fluorescence depth. The full DCM depth range was defined as 50% of maximal subsurface fluorescence.
When metagenome bottle samples did not align precisely with the depth/coordinates of the hydrographic casts or trace metal casts, the nearest CTD cast (within a 100 km radius) was matched to discrete bottle data from within a 15-meter depth range. Coordinates were matched using the shortest distance between two geospatial points according to the Haversine formula. Despite this relaxed, fuzzy-matching approach, some combinations of variables were missing for some observations in the dataset. Generally, features like salinity, temperature, and macronutrient concentrations were present for most metagenome observations. However, in fewer than 5% of all samples, we needed to impute at least one hydrographic parameter or macronutrient variable. We imputed missing salinity, temperature, macronutrient, and DCM depth observations by the missForest algorithm [23] using chaining random forest prediction and imputation from all other available variables except siderophore receptor abundance. This process was implemented with the ranger package [24] in missRanger v2.1.3 (https://github.com/mayer79/missRanger). The GEOTRACES trace metal data was not as complete as the macronutrient and hydrographic data. For example, measured dissolved iron or copper concentrations from GEOTRACES IDP17 were only available for roughly half of the total metagenome observations. Therefore, we also used random forest imputation for trace metal measurements, but we only imputed trace metal categories with no less than 50% missing measurements. Missing trace metal observations were imputed separately for each ocean basin (N. Atlantic, S. Atlantic, N. Pacific, S Pacific, Mediterranean and Red Seas, and Indian Ocean). The imputation workflow is documented at https://github.com/slhogle/cyano-sidero-ocean.

Estimation of siderophore transport cluster prevalence in genomes
Many of the analyzed Prochlorococcus and Synechococcus genomes were incomplete because they were sequenced from material isolated from single cells. Therefore, genome completeness was taken into account when estimating the proportion of genomes with the siderophore transport cluster. It was assumed that there was no bias in the content of genome recovery in the single-cell genome sequencing and assembly processes, which is reasonable for the latest single-cell technologies [25]. The number of missing base pairs in each incomplete assembly was estimated with checkM using a custom set of 730 single-copy, core genes from Prochlorococcus and Synechococcus genomes (https://doi.org/10.5281/zenodo.3719132). The corrected prevalence of the TonB dependent receptor (TBDR) gene ( ) was estimated as: where for each taxonomic group/clade, is the length of the TBDR gene in base pairs, is the length of the genome assembly in base pairs, is the completeness estimate,¯ is the average length of all TBDR genes from each taxonomic group/clade, and is the total number of assemblies from each taxonomic group/clade. The TBDR gene always co-occurred with other transport components (ABC ATPase, ABC permease, ABC solute binding protein, TonB, ExbB, ExbD) as an operon-like structure within the picocyanobacterial genomes. Hence, it was assumed that the corrected prevalence of this individual gene represents that of the entire siderophore transport cluster.

Quality control of metagenome sequencing reads
Raw metagenomic reads were preprocessed using the bbtools software suite (https://jgi.doe.gov/data-and-tools/bbtools/) as described in detail earlier [1]. Briefly, reads were trimmed of sequencing adapters using kmer matches to Illumina adapter sequences from the 3' end of the read. A maximum Hamming distance of one was required for matches to reference kmers. Reads with three or more 'Ns' or with an average quality score of less than Q20 were discarded. Overlapping reads were trimmed based on insert size (tbo=t) if adapter kmers were not identified and then both 4/16 reads trimmed to the minimum length of the read pair. Reads shorter than 60 bp after all adapter trimming steps were discarded. Reads were also discarded if they contained common Illumina sequencing artifacts (based on kmer matches, k=31). Reads were overlapped using the bbmerge tool [26] using default parameters with a minimum insert size of 35 bp and a minimum overlapping sequence of 12 bp. All unmerged and orphaned reads that passed quality control were also retained for downstream annotation and analysis.

Metagenome read classification
Metagenome reads were mapped to MARMICRODB using Kaiju v1.6.0 [27] as described before [1]. Briefly, Kaiju was run in greedy mode, tolerating a maximum of five amino acid mismatches per match, a minimum match length of 11, a minimum BLOSUM62 score of 65, and excluded alignments with a Kaiju "expect value" (analogous to BLAST e value) less than 0.05. Query sequences with low-complexity regions were filutered using the SEG filtering algorithm from the BLAST+ package [28]. The majority of reads (54% ± 10%) were classified to at least one reference database sequence.

Processing and analysis of metagenomic count data
For each sample, reads with the best match to any Prochlorococcus or Synechococcus TonB dependent receptor (TBDR) gene were summed then length-normalized (reads per kilobase; RPK) to the median length of all Prochlorococcus or Synechococcus TBDR genes. A collection of 730 single-copy core genes (COGs) (https://doi.org/10.5281/zenodo.3719132) was used to estimate the abundance of Prochlorococcus and Synechococcus genomes in each sample. For each sample, reads mapping to each core COG were length-normalized to the median length of the COG. Samples with median Prochlorococcus or Synechococcus core COG coverage was less than 100 RPK were excluded. Outlier core COGs were excluded if they were not present in all samples and were outside the 10th and 90th RPK percentiles of all COGs across all samples. The outlier removal process resulted in 277 Prochlorococcus and 186 Synechococcus core COGs that were robust and stable across all samples (i.e., had no significant differences between ocean basins or depth ranges). The abundance of Prochlorococcus or Synechococcus genomes in each sample was taken as the median of the 277 or 186 length-normalized core COG abundances. The frequency of siderophore consumers was calculated as the ratio of length-normalized TBDR abundance to the abundance of Prochlorococcus or Synechococcus genomes (i.e., median COG abundances). Assuming that TBDRs are single-copy per genome, this value should approximate the proportion of Prochlorococcus or Synechococcus genomes in a metagenomic sample with the potential to consume siderophores.

Random Forest training and testing
A random forest regression was trained to predict the frequency of Prochlorococcus siderophore consumers using 45 different environmental features as input. These environmental features included trace metal data from GEOTRACES, macronutrient data from GEOTRACES and Tara Oceans, modeled climatological dFe means from the MIT Darwin model (http://darwinproject.mit.edu/), frequencies of different Prochlorococcus ecotypes, and water column chlorophyll a profiles. Samples were restricted to the upper 300 meters of the water column and where the median RPK of marker genes was higher than 100. Random forest hyperparameters were tuned using a grid of potential parameter combinations. Model training was performed using nested ten-fold cross-validation, reserving 20% of the data for estimating final model performance. Specifically, the number of variables to split at each tree node (mtry), the minimum number of observations to partition at each node (min.node.size), and the 5/16 forest size (ntrees) were tuned. Hyperparameters were selected by maximizing the coefficient of determination from correlation (model 2 ) and minimizing the root mean square error (RMSE). Performance metrics were averaged over the ten cross-validation splits of the training data.
Random forest regression internally ranks predictor variables by their relative importance for completing a given prediction task, but it can not by itself assess whether the influence of a variable is greater than expected by chance. The latter task is necessary, for example, if one is interested in understanding the mechanisms governing a system rather than the simple construction of a "black-box" predictive tool.
The Boruta heuristic was used to assess variable importance with 100 random permutations for feature selection and importance ranking [29]. Briefly, Boruta is designed to identify all dataset features more relevant to a random forest regression task than by chance alone. In practice, Boruta will identify variables that may be redundant/correlated or that weakly interact with other features to produce a correlation with the outcome variable greater than expected by chance. During one algorithm iteration, Boruta doubles each feature in a dataset, randomly shuffles the doubled variables' observed values, and constructs a random forest model with the original and randomized features. Then by using the precomputed importance rankings from the model construction step, Bortua compares the importance of the 'true' features to the maximum importance of the randomly shuffled 'shadow' features using a two-sided test of equality (p value cutoff of 0.01). This process is then repeated 100 times or until all features have a consistent designation of importance. By iterating this process, the algorithm identifies dataset features that consistently underperform or overperform relative to the randomized input variables.

Seasonal effects from Generalized Additive Models
Seasonal effects in time-series metagenomes were estimated using Generalized Additive Mixed Models and Linear Mixed-Effect Models implemented through the mgcv v1.8-26 [30] nlme v3.1-148 libraries in R v3.6.2. To decompose seasonal effects a cyclic spline term was fit to a numerical variable for "day of the year", which we assume is representative of the season. A global trend term was also fit to the cumulative time since sampling onset. Residuals are modeled as a time series process with a continuous AR(1) correlation structure to account for lack of residual independence. Seasonal effects were considered significant if the "day of the year" smooth term was statistically significant (p < 0.05). Fig.  2B shows the estimated smooth term for "day of the year" with 95% confidence intervals shaded.

Dataset and code availability
The data and code for generating all figures and tables and supporting the text's conclusions are available from https://github.com/slhogle/cyano-sidero-ocean.
The MARMICRODB dataset, including a comprehensive description, raw protein sequence files, Kaiju formatted databases, scripts, and instructions for using the resource, is available from https://doi.org/10.5281/zenodo.3520509.

6/16
Supplementary Figures: Supplementary Figure S1. Gene diagrams of siderophore transport clusters in marine picocyanobacteria Gram-negative bacteria import Fe-bound siderophores in a multistep process involving three protein modules: 1) an outer membrane, TonB-dependent receptor (TBDR) that recognizes and binds Fe-loaded siderophores, 2) an energy transducing complex (TonB/ExbB/ExbD) that translocates siderophores through the TBDR via proton motive force, and 3) an ATP-binding cassette (ABC) complex consisting of an inner membrane-bound permease/ATPase complex and a periplasmic solute binding protein [31]. Bacteria synthesize siderophores from secondary metabolite gene clusters encoding nonribosomal peptide synthetases and polyketide synthases [32]. The genes encoding both siderophore biosynthesis and uptake are typically clustered on the chromosome in an operon-like structure and are tightly regulated. Cartoon gene diagrams show the absolute position of siderophore transport components. Diagrams are partitioned by clade (colored/shaded boxes). The last number following each genome is the contig number of the gene cluster. ABC ATPase, ATP-binding cassette transporter ATPase component (COG1120); ABC Perm., ATP-binding cassette transporter membrane permease component (COG0609); ABC SBP, ATP-binding cassette transporter periplasmic solute binding component (COG0614); TBDR, outer membrane TonB-dependent receptor protein (COG4771); TonB, energy transducing protein TonB (COG0810); ExbB/ExbD, biopolymer transport proteins (COG0811).

7/16
Supplementary Figure S2. Comparing the sequence similarity of picocyanobacterial TonB dependent receptor (TBDR) genes A) t-Distributed Stochastic Neighbor Embedding (tSNE) dimensionality reduction of pairwise sequence similarity score between 625 TonB dependent receptor (TBDR) amino acid sequences with a pairwise blastp e value < 1e-10 to any marine picocyanobacterial TBDR. Sequences that share greater amino acid similarity cluster more closely together. The colored subset represents all sequences with an e value < 1e-40 to a marine picocyanobacterial TBDR, and they are colored by their taxonomic origin. B) Focused view of the sequence similarity scores of the TBDR from LLI Prochlorococcus strain MIT0917 and the most closely related TBDRs. Each point is a sequence comparison to the MIT0917 sequence, and the vertical axis shows the blastp bitscore of the comparison. The comparisons are bitscore rank-ordered (x-axis), showing only those from the top 25%. C) Unrooted maximum likelihood phylogenetic tree of the 625 TBDRs from A) and B). Branches are colored by bootstrap score out of 300 bootstraps. The inner ring highlights the same highlighted sequences from A) and B), and the outer ring categorizes sequences by picocyanobacterial clade.

8/16
Supplementary Figure S3. Biogeography of Synechococcus siderophore consumers Sampling locations of Synechococcus isolate and single-cell genomes used for this study. Point size is proportional to the total number of sampled genomes from each location, and point color is proportional to the percentage of genomes with siderophore uptake systems at that location. Oceanographic stations ALOHA and BATS are denoted for reference.
Supplementary Figure S4. A) Dissolved Fe (dFe) profiles from two prior studies at BATS [33] and ALOHA [34]. The DCM region is highlighted in grey. B) Change in dFe concentration with depth for the two locations. is estimated by taking the derivative of a thin plate smooth fit to dFe as a function of depth. Derivatives are estimated from Generalized Additive Models via finite differences. The magnitude of the rate of change is highest at the BATS DCM, which implies a larger relative flux of dFe from below the DCM than at ALOHA.

11/16
Supplementary Figure S6. Other predictive variables of Prochlorococcus siderophore consumer abundance Normalized abundance of Prochlorococcus siderophore receptors plotted against the remaining environmental variables with the greatest predictive contribution to the random forest model (Fig. 4). Each point is a metagenomic observation colored by ocean basin. Small squares represent samples with missing data that was imputed for each variable (see methods). Dissolved Fe climatological means are from the MIT Darwin model. The remaining measurements are in situ chemical, biological, or hydrographic measurements from GEOTRACES and Tara Oceans.