Nitrogen-fixing populations of Planctomycetes and Proteobacteria are abundant in the surface ocean

Nitrogen fixation in the surface ocean impacts the global climate by regulating the microbial primary productivity and the sequestration of carbon through the biological pump. Cyanobacterial populations have long been thought to represent the main suppliers of the bio-available nitrogen in this habitat. However, recent molecular surveys of nitrogenase reductase gene revealed the existence of rare non-cyanobacterial populations that can also fix nitrogen. Here, we characterize for the first time the genomic content of some of these heterotrophic bacterial diazotrophs (HBDs) inhabiting the open surface ocean waters. They represent new lineages within Planctomycetes and Proteobacteria, a phylum never linked to nitrogen fixation prior to this study. HBDs were surprisingly abundant in the Pacific Ocean and the Atlantic Ocean northwest, conflicting with decades of PCR surveys. The abundance and widespread occurrence of non-cyanobacterial diazotrophs in the surface ocean emphasizes the need to re-evaluate their role in the nitrogen cycle and primary productivity.


Introduction
Marine microbial communities play a critical role in biogeochemical fluxes and climate [1,2,3], but their activity in the euphotic zone of low latitude oceans is often limited by nitrogen availability [4,5]. Thus, biological fixation of gaseous dinitrogen in the surface ocean is a globally important process that contributes to the ocean's productivity and enhances the sequestration of carbon through the biological pump [6,7]. Microbial populations that can fix nitrogen are termed diazotrophs, and although they encompass a wide range of archaeal and bacterial lineages [8,9], diazotrophs within the bacterial phylum Cyanobacteria in particular are considered to be responsible for a substantial portion of nitrogen input in the surface ocean [10,11,12]. Studies employing cultivation and flow cytometry [13,14,15,16,17] characterized multiple cyanobacterial diazotrophs and shed light on their functional life-styles [18,19,20]. PCR surveys of the nitrogenase reductase gene nifH have indicated that the ability to fix nitrogen is also found in lineages of the phyla Proteobacteria, Firmicutes, and Spirochetes [9,21,22], suggesting the presence of heterotrophic bacterial diazotrophs (HBDs) that contribute to the introduction of nitrogen in the surface ocean. Quantitative surveys of non-cyanobacterial nifH genes indicated that HBDs are diverse but relatively rare in open ocean [23,24,25,26,27,28], and efforts to access their genomic contents through cultivation have not yet been successful, limiting our understanding of their ecophysiology. Figure 1 Geographically bounded metagenomic co-assemblies. Dots in the map correspond to the geographic location of 93 metagenomes from the TARA Oceans project. Each dot is associated with a metagenomic set (corresponding to a geographic region) for which we performed a metagenomic co-assembly (n=12). The number of metagenome-assembled genomes (MAGs) we characterized from each metagenomic set varied from 13 to 141 after the removal of redundant MAGs, for a total of 957 non-redundant MAGs encompassing the three domains of life.
Here, we used metagenomic co-assembly and binning strategies to create a non-redundant database of archaeal, bacterial and eukaryotic genomes from the TARA Oceans project [29]. We characterized nearly one thousand microbial genomes from the surface samples of four oceans and two seas, which revealed novel nitrogen-fixing Planctomycetes and Proteobacteria populations. In contrast to previous findings of PCR-dependent surveys, our analyses indicate that putative HBDs are not only diverse but also abundant in the open surface ocean habitat.

Results
The 93 TARA Oceans metagenomes we analyzed represent the planktonic size fraction (0.2-3 micrometer) of 61 surface samples and 32 samples from the deep chlorophyll maximum layer of the water column (Table S1). 30.9 billion metagenomic reads (out of 33.7 billion) passed the quality control criteria. We performed 12 metagenomic co-assemblies (1.14-5.33 billion reads per set) using geographically bounded samples (Figure 1), and identified 42.2 million genes in scaffolds longer than 1,000 nucleotides (Table  S2 summarizes the assembly statistics). We applied a combination of automatic and manual binning to each co-assembly output, which resulted in 957 manually curated, non-redundant metagenome-assembled genomes (MAGs) (Figure 1).
Recovery of these MAGs complements decades of cultivation efforts by providing genomic context for lineages missing in culture collections (e.g., Euryarchaeota, Candidatus Tarascapha, and the Candidate Phyla Radiation), and allowed us to search for diazotrophs within a large pool of mostly novel marine microbial populations.
Genomic stability of a well-studied nitrogen-fixing symbiotic population at large scale Our genomic collection included six cyanobacterial MAGs, one of which (ASW 00003), contained genes that encode the catalytic (nifH, nifD, nifK ) and the biosynthetic proteins (nifE, nifN and nifB ) required for nitrogen fixation [8]. ASW 00003, which we recovered from the Atlantic southwest metagenomic co-assembly, showed remarkable similarity to the genome of the symbiotic cyanobacterium Candidatus Atelocyanobacterium thalassa [30] isolated from the North Pacific gyre (GenBank accession: CP001842.1). Besides their comparable size of 1.43 Mbp (ASW 00003) and 1.46 Mbp (isolate), their average nucleotide identity was 99.96% over the 1.43 Mbp alignment. Ca. A. thalassa is a diazotrophic taxon that lacks key metabolic pathways and lives in symbiosis with photosynthetic eukaryotic cells [19,31]. While the high genomic similarity between ASW 00003 and the Ca. A. thalassa isolate genome demonstrates the accuracy of our metagenomic workflow, this finding also emphasizes the genomic stability of a well-studied nitrogenfixing symbiotic population at large scale, possibly due to a strong selective pressure from the eukaryotic hosts.

Genomic evidence for nitrogen fixation by Planctomycetes and Proteobacteria
Besides the cyanobacterial MAG, we also identified seven Proteobacteria and two Planctomycetes MAGs in our collection that contained the complete set of genes for nitrogen fixation. To the best of our knowledge, these MAGs (HBD-01 to HBD-09) represent the first genomic evidence of putative HBDs inhabiting the surface of the open ocean ( Figure 2, Panel A). They were obtained from the Pacific Ocean (n=6), Atlantic Ocean (n=2), and Indian Ocean (n=1), and displayed relatively large genome sizes (up to 6 Mbp and 5,390 genes) and a GC-content ranging from 50% to 58.7%. One of the Proteobacterial MAGs resolved to the genus Desulfovibrio (HBD-01). The remaining MAGs from this phylum represented new lineages within the orders Desulfobacterales (HBD-02), Oceanospirillales (HBD-03; HBD-04; HBD-05) and Pseudomonadales (HBD-06; HBD-07) ( Figure 2, Panel A). The phylogenetic assignment of one Planctomycetes MAG (HBD-08) with a low completion estimate (33.5%) could not be resolved beyond the phylum level, possibly due to missing phylogenetic marker genes for taxonomic inferences. However, the length of this MAG (4.03 Mbp) suggests that its completion based on single-copy core genes may have been underestimated, as we have observed in previous studies [32,33]. The closest match for the second Planctomycetes MAG (HBD-09) was the family Planctomycetaceae (order Planctomycetales) when using a collection of single copy genes. This MAG contained a large fragment of the 16S rRNA gene (1,188 nt; available in Table 4) for which the best match to any characterized bacterium in the NCBI's non-redundant database was Algisphaera agarilytica (strain 06SJR6-2, NR 125472) with 88% identity. The novelty of the 16S rRNA gene sequence suggests that this MAG represents a new order (and possibly a new class) within the Planctomycetes.
We placed the nine HBDs in a phylogenomic analysis of the 432 Proteobacteria and 43 Planctomycetes MAGs using a set of 37 phylogenetic marker genes ( Figure 2, Panel B; interactive interface at https://anviserver.org/merenlab/tara hbds ). The two Deltaproteobacteria HBDs were closely related to each other, but not adjacent in the phylogenomic tree. The HBDs within Oceanospirillales (n=3), Pseudomonadales (n=2) and Planctomycetes (n=2) formed three distinct phylogenomic lineages. These results suggest that closely related populations of diazotrophs inhabit the surface ocean, and nitrogen fixation genes occur sporadically among diverse putatively heterotrophic marine microbial lineages.
Our initial binning results included 120 redundant MAGs that we observed multiple times in independent co-assemblies (Table S5). Although they are not present in our final collection of 957 non-redundant MAGs (allowing for a better assessment of the relative abundance of microbial populations), we used this redundancy to investigate the stability of the phylogeny and functional potential of populations recovered from multiple geographical regions. For instance, we characterized the genomic content of HBD-06 from the Atlantic northwest (5.49 Mbp) and from each of the three Pacific Ocean regions (5.55 Mbp, 5.33 Mbp, and 5.29 Mbp in the regions PON, PSW, and PSE, respectively) (Table S5). Average nucleotide identity values between the Atlantic MAG and the three Pacific MAGs ranged from 99.89% to 99.97% over more than 97% of the genome length. We observed similar trends for HBD-07 and HBD-09 (Table S5). The complete set of nitrogen fixation genes was present in all these redundant MAGs, demonstrating a large-scale stability of this functional trait in these HBDs.
The proportion of genes of unknown function was on average 27.6(+/-2.63)% for the Proteobacteria HBDs and 49.3(+/-0.5)% for the Planctomycetes HBDs, exposing an important gap in our functional understanding of especially the latter taxonomic group of diazotrophs. The grand total of 37,582 genes we identified in the nine HBDs encoded 5,912 known functions (Table S6) Table S6). The relatively weak overlap of known functions between these groups indicates that the ability to fix nitrogen in marine populations may not be associated with a tightly defined functional life style.
Functional differences between HBDs included multiple traits within the biological nitrogen cycle. For instance, hydrolysis of urea to ammonia and CO2 was characteristic to the Gammaproteobacteria HBDs, while oxidation of glycine to ammonia was only detected in the Deltaproteobacteria HBDs and the Planctomycetes HBD-08 (Table S6). These pathways might represent alternative ammonia sources to avoid the costly nitrogen fixation. In addition, we detected the complete denitrification pathway in the representative MAG and all three genomic replicates of HBD-06, suggesting that widespread HBDs may also be involved in nitrogen loss in the surface ocean. However, we only identified this pathway in HBD-06 and a non-diazotroph MAG (ION 00025 affiliated to the genus Labrenzia). Thus, the scarcity of nitrogen fixation (10/957) and denitrification (2/957) pathways in our genomic database suggest the metabolic potential of HBD-06 to be highly singular. Other functional differences between HBDs included genes related to the regulation of nitrogen fixation. Single copies of the coding genes nifX, nifQ, nifO, nifW and nifT were characteristic to Oceanospirillales and Pseudomonadales HBDs, while the two-component nitrogen fixation transcriptional regulator fixJ was only detected in the Planctomycetes HBDs. We also identified a relatively small set of 271 functions (4,608 genes) common to the nine HBDs ( Figure 2, panel C). Besides various housekeeping genes and the full set of genes for nitrogen fixation, they included genes coding for chemotaxis and flagellar proteins, transporters for ammonia, phosphate and molybdenum (required for nitrogen fixation), and multiple nitrogen regulatory proteins (Table S6). They also included 477 genes coding for transcriptional regulators. HBDs also shared complete pathways for biosynthesis of acyl-CoA and acetyl-CoA (beta-oxidation and pyruvate oxidation), glycolysis, ABC-2 type transport systems, and a two-component regulatory system for phosphate starvation response (Table S5). Thus, the HBDs we characterized appeared to be involved in different steps of the nitrogen cycle and possessed distinct strategies regulating nitrogen fixation, but shared traits related to energy conservation, motility, nutrient acquisition and gene regulatory processes. Swimming motility was a common trait we observed in all the HBDs, which may be an indication of a free-living life style rather than the symbiotic lifestyles observed in some cyanobacterial diazotrophs.

The taxonomy of HBDs is coherent with the phylogeny of nitrogen fixation genes
Our phylogenetic analysis of the catalytic genes nifH and nifD from a wide range of diazotrophs placed our HBDs in four distinct lineages (Figure 3). We also included in this analysis the genomic replicates we had removed from our non-redundant genomic collection. These replicates clustered with their representative MAGs in the phylogenetic tree, which indicates that widespread HBDs maintain near-identical nitrogen fixation genes. HBD-01 (Desulfovibrio) and HBD-02 (Desulfobacterales) were clustered with close taxonomic relatives. In addition, the Gammaproteobacteria HBDs were most closely related to reference genomes of the genera Pseudomonas and Azotobacter from the same class. The agreement between their taxonomy and placement in the functional gene-based phylogeny suggests that diazotrophic traits in the heterotrophic bacterial populations inhabiting the surface ocean may have been transferred vertically rather than horizontally. Finally, the nifD and nifH genes we identified in the Planctomycetes HBDs formed distinct phylogenetic clusters, which was particularly apparent for nifD (Figure 3, panel A). These results suggest nitrogen fixation represents a functional trait rooted within the phylum Planctomycetes.
HBDs are not only diverse but also abundant in the surface ocean The cumulative relative abundance of the Proteobacteria and Planctomycetes HBDs in the metagenomic dataset averaged to 0.01% and 0.05%, respectively (Table S3). In comparison, Ca. A. thalassa had a relative abundance averaging to 0.006%. The detection of Proteobacteria and Planctomycetes HBDs was very low in the Mediterranean Sea and Red Sea (0.00064% on average). In contrast, they were substantially enriched in metagenomes from the Pacific Ocean (0.14% in average) compared to the other regions ( Figure 4). In fact, the Pacific Ocean metagenomes contained 81.4% of the 17.8 million reads that were recruited by HBD MAGs from the entire metagenomic dataset. In particular, the two most abundant Proteobacteria and Planctomycetes HBDs (HBD-06 and HBD-09) showed a broad distribution ( Figure 4) and were significantly enriched in this ocean (Welch's test; p-value <0.005). HBD-06 was also abundant in the northwest region of the Atlantic Ocean and to a lesser extent in the Southern Ocean, revealing that the ecological niche of a single HBD population can encompass multiple oceans and a wide range of temperatures (Table S3). Interestingly, HBD-07 and HBD-08, which are phylogenetically and functionally closely related to HBD-06 and HBD-09, respectively, were not only less abundant, but they also exhibited a different geographical distribution ( Figure 4). We could not explain the increased signal for the nine HBDs in few geographic regions using temperature, salinity, or the concentration of essential inorganic chemicals including oxygen, phosphate and nitrate (Table S1).
To reconcile the abundance of nitrogen fixing populations in the surface of the open ocean with the inclusion of new HBDs described in this study, we used the previous PCR-based estimations of the abundance of non-cyanobacterial nifH genes. It is estimated that each litre of water in the surface ocean contains about 0.5 billion archaeal and bacterial cells [34]. Quantitative PCR surveys have estimated that all noncyanobacterial nifH genes combined generally range from 10 to 1,000 copies per litre and rarely reach 0.1 million copies [23,24,25,26,27]. One of the most abundant nifH phylotypes, gamma-24774A11, comprises up to 20,000 copies per litre in the Pacific Ocean, and is considered to represent one of the most important groups of non-cyanobacterial diazotrophs in the surface ocean [27,35,36]. In contrast, the same assumptions, and our abundance estimations based on metagenomic mapping, suggest that the nine populations of HBDs characterized in this study represent 0.5 million cells per litre on average (and up to 3.2 million cells) in the surface of the Pacific Ocean. HBD-06 alone represented on average more than 0.3 million cells per litre in this region. For a more direct comparison between relative abundance values derived from shotgun metagenomics and quantitative PCR surveys, we also recruited short metagenomic reads from TARA Oceans using a database of the nifH genes we identified in this study and gamma-24774A11 (Table  S7). Mapping results showed that the nifH genes we identified in HBDs recruited 38.5 to 535 times more reads compared to gamma-24774A11 worldwide (i.e., when considering the 93 metagenomes altogether). These results altogether suggest that HBD populations can be up to about 2.5 orders of magnitude more abundant in the surface ocean than previously thought.

A binomial naming for the HBDs we characterized from multiple geographic regions
Most of the MAGs we characterized in the present study correspond to novel genera, but the lack of culture representatives is preventing a formal taxonomical characterization of these new lineages. As an exception, we tentatively named the HBDs we independently characterized from multiple geographic regions (i.e., those for which we have genomic replicates) using the candidatus status and binomial naming system: 'Candidatus Azoaequarella praevalens' gen. nov., sp. nov., (HBD-06; representative genome is ANW 00006 with a completion of 98.1%) and 'Ca. Azopseudomonas oceani' gen. nov., sp. nov., (HBD-07; representative genome is ANW 00019 with a completion of 91.2%) within the order Pseudomonadales (unknown family), and 'Ca. Azoplanctomyces absconditus' gen. nov., sp. nov., (HBD-09; representative genome is PSW 00018 with a completion of 97.3%) within the Phylum Planctomycetes (unknown order and family). Note that ANI between 'Ca. Azoaequarella praevalens' and 'Ca. Azopseudomonas oceani' was 83.1%, supporting their assignment to different genera but suggesting they might belong to the same family. These three candidate species represent widespread and abundant nitrogen-fixing populations in the surface of the open ocean (Figure 4). Here is a brief explanation of their naming: Orphan nifH genes from other archaeal and bacterial phyla are also abundant The non-redundant collection of 957 curated MAGs in which we searched for HBDs represent only 2.07% of the scaffolds in our assembly of TARA Oceans metagenomes. To identify more nifH genes, we also investigated the remaining 'orphan' scaffolds (42,193,607 genes). Our search based on amino acid similarity with the HBD database resulted in the recovery of nine additional non-redundant nifH genes (Table S7). Eight of them originated from the Pacific Ocean metagenomic co-assemblies, substantiating the unequal distribution patterns for nitrogen fixation genes we observed at the MAG level. The phylogenetic analysis on these nifH genes affiliated them with Elusimicrobia (n=2), Firmicutes (n=2), Proteobacteria (n=1), Spirochaeta (n=1), Verrucomicrobia (n=1), a group of uncultured bacteria (n=1), and Euryarchaeota (n=1) ( Figure S1). This primer-independent analysis of nifH genes identified a small but phylogenetically remarkably wide range of lineages. In particular, the Euryarchaeota nifH (Id-1645572 from the region PON; 858 nucleotides-long) had several mismatches with commonly used nifH PCR primers [37]. Its best match in the NCBI non redundant database was a nifH gene of a Methanocalculus MAG (e-value=2e-93) obtained from an oil reservoir in Alaska [38]. These nine nifH genes had a mean coverage 21.7 to 56.3 times higher than gamma-24774A11 worldwide (Table S7). These findings suggest that although their genomic content has yet to be determined, abundant non-cyanobacterial diazotrophic populations in the surface ocean habitat are not limited to Planctomycetes and Proteobacteria.

Discussion
The nine heterotrophic bacterial diazotrophs (HBDs) we describe in this study provide first genomic insights into widespread and abundant nitrogen-fixing surface ocean populations that are not affiliated with Cyanobacteria. These HBDs include two Planctomycetes populations, which represents the first observation of diazotrophs in this phylum. These findings complement decades of PCR surveys and cultivation efforts, and corroborate the relevance of metagenomic assembly and binning strategies to improve our understanding of microbial communities inhabiting the largest biome on Earth.
How relevant are these HBDs to the gross nitrogen budget of the world's oceans? Although our study provides evidence to support the existence of thriving lineages of HBDs in the surface ocean, such as the recovery of near-identical population-level genomes from multiple geographic regions, it does not explain their role and impact on the nitrogen cycle. For instance, 'Ca. Azoaequarella praevalens' displayed functional capacity for both nitrogen fixation and denitrification, exposing a rare metabolic duality between nitrogen gain and loss. The remaining HBDs appear to lack genes for denitrification, however, the extent by which their relative abundances correlate with nitrogen gain for the surrounding microbial populations is unclear. These observations pose the necessity for more targeted analyses to understand the contribution of each HBD to nitrogen bioavailability in the surface ocean.
The HBDs we characterized were most abundant in the Pacific Ocean and the northwest of the Atlantic Ocean. One potential explanation for the differential distribution of HBDs at large scale could be a technical one: we may have failed to recover abundant nitrogen fixing populations from other geographic regions due to the limitations associated with metagenomic binning. To address this concern, we also focused on the occurrence of nifH genes in the assembled scaffolds that were not binned into any population genome and their abundance in the surface ocean. While this investigation revealed additional putative diazotrophs within other archaeal and bacterial phyla, their distribution patterns also mirrored those we observed at the level of population genomes. These findings altogether suggest that our study recovered the most abundant HBDs in the TARA Oceans metagenomes, and that their distribution may be driven by differential resource availability and niche specialization. Our analyses revealed a diverse group of HBDs enriched only in a few geographic regions. Although the available metadata did not seem to explain the niche boundaries for HBDs, the iron bioavailability may be playing a role: this element is particularly important for photosynthesis [39,40,41] and is known to represent a limiting factor for cyanobacterial diazotrophs in the Pacific Ocean [7]. Thus, marine systems co-limited by nitrogen and iron may represent appropriate ecological niches for HBDs, where they could be the main sources of nitrogen input into the surface ocean.
Overall, our study reveals that populations of HBDs within Planctomycetes and Proteobacteria, as well as putative diazotrophs within other archaeal and bacterial phyla, can be abundant in the surface ocean, occasionally across wide ecological niches spanning a large range of temperatures. Our investigation takes advantage of unprecedented amount of shotgun metagenomic sequencing data to investigate the diversity of nifH genes without primer bias. While our findings substantiate the previous observations made through PCR amplicon surveys regarding the diversity of HBDs in the surface of the oceans [23,24,25,26,27], they also demonstrate that amplicon surveys may have underestimated the abundance of HBDs by multiple orders of magnitude. This emphasizes the need to re-assess the role of heterotrophic bacteria in oceanic primary productivity and the nitrogen cycle. Cultivation efforts and additional environmental surveys will be essential to characterize the life-styles of these non-cyanobacterial populations, and to determine environmental conditions in which they fix nitrogen.

Methods
The TARA Oceans metagenomes. We acquired 93 metagenomes from the European Bioinformatics Institute (EBI) repository under the project ID ERP001736 and quality filtered the reads using the illumina-utils library [42] v1.4.1 (available from https://github.com/meren/illumina-utils). Noisy sequences were removed using the program 'iu-filter-quality-minoche' with default parameters, which implements a noise filtering described by Minoche et al. [43]. Table S1 reports accession numbers and additional information (including the number of reads and environmental metadata) for each metagenome.
Metagenomic co-assemblies, gene calling, and binning. We organized the dataset into 12 'metagenomic sets' based upon the geographic coordinates of metagenomes (Table S1). We co-assembled reads from each metagenomic set using MEGAHIT [44] v1.0.3, with a minimum scaffold length of 1 kbp, and simplified the scaffold header names in the resulting assembly outputs using anvi'o [33] v2.0.2 (available from http://merenlab.org/software/anvio). For each metagenomic set, we then binned scaffolds >2.5 kbp (>5 kbp for the Southern Ocean) following the workflow outlined in Eren et al. 33. Briefly, (1) anvi'o profiled the scaffolds using Prodigal [45] v2.6.3 with default parameters to identify genes (Table S2), and HMMER [46] v3.1b2 to identify genes matching to archaeal [47] and bacterial [48,49,50,51] single-copy core gene collections, (2) we used Centrifuge [52] with NCBI's NT database to infer the taxonomy of genes (as described in http://merenlab.org/2016/06/18/importing-taxonomy), (3) we mapped short reads from the metagenomic set to the scaffolds using Bowtie2 [53] v2.0.5 and stored the recruited reads as BAM files using samtools [54], (4) anvi'o profiled each BAM file to estimate the coverage and detection statistics of each scaffold, and combined mapping profiles into a merged profile database for each metagenomic set. We then clustered scaffolds with the automatic binning algorithm CONCOCT [48] by constraining the number of clusters per metagenomic set to 100 to minimize the 'fragmentation error' (when multiple clusters describe one population), with the exception of the Southern Ocean (25 clusters) and the Pacific Ocean southeast (150 clusters) metagenomic sets. Finally, we manually binned each CONCOCT cluster (n=1,175) using the anvi'o interactive interface. See the Supplementary Information for details. Table S8 reports the genomic features (including completion and redundancy values) of the bins we characterized.
Identification and curation of metagenome-assembled genomes (MAGs). We defined all bins with >70% completion or >2 Mbp in length as MAGs (Table S2). We then individually refined each MAG as outlined in Delmont and Eren [55], and renamed scaffolds they contained accordingly to their MAG ID in order to insure that the names of all scaffolds in MAGs we characterized from the 12 metagenomic sets were unique.
Taxonomic and functional inference of MAGs. We used CheckM [56] to infer the taxonomy of MAGs based on the proximity of 43 single-copy gene markers within a reference genomic tree. We also used Centrifuge, RAST [57] and manual BLAST searches of single copy core genes against the NCBI's nonredundant database to manually refine the CheckM taxonomic inferences, especially regarding the archaeal and eukaryotic MAGs. We also used the occurrence of bacterial single-copy genes to identify MAGs affiliated to the Candidate Phyla Radiation (as described in http://merenlab.org/2016/04/17/predicting-CPR-Genomes/). Table S4 reports our curated taxonomic inference of MAGs. We used KEGG (the 04/14/2014 release) to identify functions and pathways in MAGs. We also used RAST to identify functions in 15 MAGs that contained the complete set of nitrogen fixation genes (originally identified from the KEGG pathways). Tables S6 and S9 report the RAST and KEEG results, respectively. We used Gephi [58] v0.8.2 to generate a functional network using the Force Atlas 2 algorithm to connect MAGs and RAST functions. We correlated node sizes to the number of edges they contained, which resulted in larger nodes for MAGs compared to functions.
Characterization of a non-redundant database of MAGs. We concatenated all scaffolds from the genomic database of MAGs into a single FASTA file and used Bowtie2 and samtools to recruit and store reads from the 93 metagenomes. We used anvi'o to determine the coverage values, detection and relative distribution of MAGs and individual genes across metagenomes (Table S10). We determined the Pearson correlation coefficient of each pair of MAGs based upon their relative distribution across the 93 metagenomes using the function 'cor' in R [59] (Table 5). We finally used NUCmer [60] to determine the average nucleotide identity (ANI) of each pair of MAG affiliated to the same phylum for improved performance (the Proteobacteria MAGs were further split at the class level) ( Table 5). MAGs were considered redundant when their ANI reached to 99% (minimum alignment of >75% of the smaller genome in each comparison) and the Pearson correlation coefficient above 0.9. We then selected a single MAG to represent a group of redundant MAGs based on the largest 'completion minus redundancy' value from single-copy core genes for Archaea and Bacteria, or longer genomic length for Eukaryota. This analysis provided a non-redundant genomic database of MAGs. We performed a final mapping of all metagenomes to calculate the mean coverage and detection of these MAGs (Table S3, Supplementary Information).
Statistical analyses. We used STAMP [61] and Welch's test to identify non-redundant MAGs that were significantly enriched in the Pacific Ocean compared to all the other regions combined. Table S3 reports the p-values for each MAG.
World maps. We used the ggplot2 [62] package for R to visualize the metagenomic sets and relative distribution of MAGs in the world map.
Phylogenomic analysis of MAGs. We used PhyloSift [63] v1.0.1 with default parameters to infer associations between MAGs in a phylogenomic context. Briefly, PhyloSift (1) identifies a set of 37 marker gene families in each genome, (2) concatenates the alignment of each marker gene family across genomes, and (3) computes a phylogenomic tree from the concatenated alignment using FastTree [64] v2.1. We rooted the phylogenomic tree to the phylum Planctomycetes with FigTree [65] v1.4.3, and used anvi'o to visualize it with additional data layers.
Identification of additional nifH sequences in orphan scaffolds. We used DIAMOND [66] to generate a database of nifH genes we identified in the nine HBDs, and to search for additional nifH amino acid sequences within the genes Prodigal identified in scaffolds longer than 1,000 nucleotides. We considered only hits with an e-value of <1e-50, and defined them as nifH genes only when (1) 'nitrogenase' was the top blastx hit against the NCBI's nr database, and (2) the characteristic [4Fe-4S]-binding site (Prosite signature PDOC00580) was present in their amino acid sequence.
Mean coverage of nifH genes across 93 metagenomes. We concatenated all nifH genes (orphan genes as well as the ones affiliated with HBDs) into a single FASTA file along with the additional reference sequence gamma-24774A11. To estimate the mean coverage of all nifH sequences, we recruited reads from all metagenomes and profiled the resulting mapping results with anvi'o as described above. Table S7 reports the mean coverage estimates.
Phylogenetic analysis of nifD and nifH genes. We imported the amino acid sequences of nifD and nifH genes identified in this study into ARB v.5.5-org-9167 [67], which built databases by identifying representative, non-redundant protein reference sequences by blastp searches against the NCBI nr database. In ARB, we aligned sequences to each other using ClustalW [68], manually refined alignments, and calculated phylogenetic trees with PhyML [69] using the 'WAG' amino acid substitution model, and a 10% conservation filter.

Figure designs.
We used the open-source vector graphics editor Inskape (available from http://inkscape.org/) to finalize all the figures.