Differential global distribution of marine picocyanobacteria gene clusters reveals distinct niche-related adaptive strategies

The ever-increasing number of available microbial genomes and metagenomes provides new opportunities to investigate the links between niche partitioning and genome evolution in the ocean, especially for the abundant and ubiquitous marine picocyanobacteria Prochlorococcus and Synechococcus. Here, by combining metagenome analyses of the Tara Oceans dataset with comparative genomics, including phyletic patterns and genomic context of individual genes from 256 reference genomes, we show that picocyanobacterial communities thriving in different niches possess distinct gene repertoires. We also identify clusters of adjacent genes that display specific distribution patterns in the field (eCAGs) and are thus potentially involved in the same metabolic pathway and may have a key role in niche adaptation. Several eCAGs are likely involved in the uptake or incorporation of complex organic forms of nutrients, such as guanidine, cyanate, cyanide, pyrimidine, or phosphonates, which might be either directly used by cells, for example for the biosynthesis of proteins or DNA, or degraded to inorganic nitrogen and/or phosphorus forms. We also highlight the enrichment of eCAGs involved in polysaccharide capsule biosynthesis in Synechococcus populations thriving in both nitrogen- and phosphorus-depleted areas vs. low-iron (Fe) regions, suggesting that the complexes they encode may be too energy-consuming for picocyanobacteria thriving in the latter areas. In contrast, Prochlorococcus populations thriving in Fe-depleted areas specifically possess an alternative respiratory terminal oxidase, potentially involved in the reduction of Fe(III) to Fe(II). Altogether, this study provides insights into how phytoplankton communities populate oceanic ecosystems, which is relevant to understanding their capacity to respond to ongoing climate change.

adjacency matrix was calculated in 'signed' mode (i.e., considering correlated and anti-correlated CLOGs separately), by using the Pearson correlation between pairs of CLOGs (based on their relative abundance in every sample) and raising it to the power 12, which allowed to obtain a scalefree topology of the network. Modules were identified by setting the minimum number of genes in each module to 100 and 50 for Synechococcus and Prochlorococcus, respectively, and by forcing every gene to be included in a module. The eigengene of each module, i.e. the first principal component of gene abundances at the different stations for this module (representative of the relative abundance of genes of a given module at each Tara Oceans station) was then correlated to environmental parameters and to the relative abundance of ESTUs. Finally, genes in each module with the highest correlation to the eigengene (a measurement called 'membership'), were extracted in order to identify the most representative genes of each module.

Identification of differentially distributed clusters of adjacent genes (eCAGs)
Results on individual niche-related genes identified by WGCNA were then integrated with knowledge on gene synteny in reference genomes (Datasets 7 and 8). For each WGCNA module, we defined eCAGs by searching adjacent genes of the module in the 256 reference genomes. In order to be considered as belonging to the same eCAG, two genes of the same module must be less than 6 genes apart in 80% of the genomes in which these two genes are present. The 80% cut-off allowed us to take into account the incompleteness of some reference genomes, and notably MAGs and SAGs. This method led us to identify clusters of adjacent genes in reference genomes, comprising genes displaying a similar distribution pattern, called eCAGs. It is worth noting that the association of an eCAG with a specific niche is totally independent from gene synteny. Indeed, this association is based on the fact that all genes in an eCAG necessarily come from the same WGCNA module, which is itself associated with a niche. Thus, the potential absence of whole syntenic genome regions in MAGs due to assembly biases cannot lead to false associations of an eCAG with a particular environment. Furthermore, none of the eCAGs mentioned in the text (Dataset 6) were exclusively present in MAGs, excluding the risk of false positives due to MAG assembly biases.
A network of eCAGs was then built for each WGCNA module, considering the number of genomes in which these genes are adjacent (Figs. 4, S3 and S4). A graph representation of eCAGs was displayed using the graph embedder (GEM) algorithm [8] or the Fruchterman-Reingold algorithm [9] implemented in the R-package igraph [10]. These are force-directed algorithms, meaning that node layout is determined by the forces pulling nodes together and pushing them apart. In other words, its purpose is to position the nodes of a graph so that the edges of more or less equal length are gathered together and to avoid as many crossing edges as possible. The first algorithm was used to draw an unweighted and undirected global graph of all eCAGS (Fig. 4). The second algorithm was used to draw, for each WGCNA module separately, unweighted and undirected graphs of eCAGS where link thickness corresponds to the number of genomes in which the eCAG members are less than six genes apart (Figs. S3, S4).

Prochlorococcus modules
In order to better interpret the global distribution of picocyanobacterial gene content, gene modules obtained by WGCNA were correlated to the available environmental parameters (Figs. 2A-B, S1A-B) and the relative abundance of Prochlorococcus or Synechococcus ESTUs at each station ( Fig.  2C-D, S1A-B). The brown module, corresponding to genes preferentially found in Fe-limited HNLC areas and strongly associated with the presence of HLIIIA, HLIVA and LLIB, is described in the main text. The blue module was found to be associated with warm, low-chlorophyll oligotrophic regions with low N and P concentrations and high Fe availability ( Fig. 2A), where ESTUs HLIIA dominate the Prochlorococcus community and HLIIB-D were also present at lower abundance (Fig.  2C, Fig. S1A). The turquoise module seems to correspond to genes present in cold, chlorophyllrich waters, colonized by LLIA ESTUs, and to a lower extent to LLIC and LLID, but anti-correlated with HLII ESTUs. The turquoise module gathers station TARA-070, where LLIA dominates (Fig.  1A), as well as stations dominated either by HLIA or the coldest stations dominated by HLIIA ESTUs (TARA-0146 and 149), the common point between all these stations being a strong relative abundance of LLIA (Figs. 1 and S1A). Finally, the red module seems to be characteristic of cold, Fe-rich, N-and P-depleted waters, and strongly correlated to HLIA and anti-correlated to HLII-IV and LLIB ESTUs ( Fig. 2A-C), corresponding to assemblages mainly found at the highest latitude stations of the Tara Oceans transect (TARA_066, 068, 093, 094, 133, 150, 151, 152) as well as all stations in the Mediterranean Sea (Fig. S1A).

Synechococcus modules
The same analysis performed on Synechococcus genes shows that the yellow module is correlated with phosphate and ammonium concentrations and strongly anti-correlated to Fe availability, and thus corresponds to genes found in HNLC areas. Accordingly, this module is correlated to ESTUs CRD1A, CRD1C, EnvAA and EnvBA, previously reported to dwell in Fe-depleted areas ( [2,8]; Figs. 2B-D). Although the midnight blue module is only positively correlated with oxygen concentration, it is most strongly associated with ESTU IA and IVA-C, known to colonize cold, coastal, or mixed open ocean waters at high latitude [2] and anti-correlated with ESTUs IIA and IIIA/B (Fig. 2C-D). In terms of distribution, genes of this module are only found in two upwelling stations (TARA_093, 133), as well as in a cold station sampled in winter at the northernmost Atlantic station of the Tara Ocean transect, TARA_152 (Figs. 1B, S1B in this study and Fig.4 in [2]). The tan module was found in cold, chlorophyll-rich waters with a high relative abundance of ESTUs IA and IVA-C ( Fig. 2B-D) and was also detected in the most strongly mixed waters of the Tara Oceans dataset, notably the upwelling stations (TARA_067, 093), at TARA_145, a cold station sampled in winter, North of the Gulf stream as well as in northern Atlantic stations of the Tara Ocean transect (TARA_150, 151, 152, Fig. S1B). The purple module is found in waters with high salinity, iron-rich, P-depleted waters, and associated with IIIA/B, WPC1A and all SC 5.3 ESTUs, known to co-occur in low-P areas of the world ocean (Fig. 2B-D). Consistently, it was specifically found in the Mediterranean Sea and the only station of the Gulf of Mexico (TARA_142, Fig. S1B). Finally, the salmon module was associated with warm, Fe-rich waters. This module was most strongly associated with ESTU IIA and to a lesser extent with the fairly rare ESTUs VIIA and 5.3B and its eigengene accordingly has higher values at stations dominated by ESTU IIA (Fig. S1).    Black dots represent Tara Oceans stations for which Prochlorococcus or Synechococcus read abundance was too low to reach the threshold limit.  Syn-eCAG_010 Syn-eCAG_011 The size of the circle is proportional to the relative abundance of Synechococcus as estimated based on the single-copy core gene petB and this gene was also used to estimate the relative abundance of other genes in the population. Black dots represent Tara Oceans stations for which Synechococcus read abundance was too low to reach the threshold limit. The size of the circle is proportional to the relative abundance of Synechococcus as estimated based on the single-copy core gene petB and this gene was also used to estimate the relative abundance of other genes in the Synechococcus population. Black dots represent Tara Oceans stations for which Synechococcus read abundance was too low to reach the threshold limit.
A B