Alternative strategies of nutrient acquisition and energy conservation map to the biogeography of marine ammonia-oxidizing archaea

Ammonia-oxidizing archaea (AOA) are among the most abundant and ubiquitous microorganisms in the ocean, exerting primary control on nitrification and nitrogen oxides emission. Although united by a common physiology of chemoautotrophic growth on ammonia, a corresponding high genomic and habitat variability suggests tremendous adaptive capacity. Here, we compared 44 diverse AOA genomes, 37 from species cultivated from samples collected across diverse geographic locations and seven assembled from metagenomic sequences from the mesopelagic to hadopelagic zones of the deep ocean. Comparative analysis identified seven major marine AOA genotypic groups having gene content correlated with their distinctive biogeographies. Phosphorus and ammonia availabilities as well as hydrostatic pressure were identified as selective forces driving marine AOA genotypic and gene content variability in different oceanic regions. Notably, AOA methylphosphonate biosynthetic genes span diverse oceanic provinces, reinforcing their importance for methane production in the ocean. Together, our combined comparative physiological, genomic, and metagenomic analyses provide a comprehensive view of the biogeography of globally abundant AOA and their adaptive radiation into a vast range of marine and terrestrial habitats.

urea utilizing genes in natural marine AOA communities by calculating the ratio of reads that mapped to AOA ureC genes (encoding the alpha subunit of urease) with the criterion of E-value ≤ 1 × e -10 , reads sequence identity ≥ 80%, and reads length ≥ 100 bp divided by reads that mapped to AOA amoA genes (Fig S12). We found that ureC genes were distributed throughout the upper ocean water column at the HOT station (125-1,000 m) (Fig. S12). The recruitment ratio of AOA ureC genes versus amoA genes indicates that ~40-100% of marine AOA contain urea utilizing genes (Fig. S12). Similarly, high ureC:amoA ratios were also observed in the equatorial Pacific Ocean (22-55%) [11], the Gulf of Mexico (~10-15%) [8], and the San Pedro Ocean Time-Series (SPOT) site (~60-100%) [12], together highlighting the pervasiveness and importance of urea as an alternative substrate for AOA in the ocean.

Sample source, culture maintenance and genome sequencing
All AOA species were maintained in liquid mineral medium in the dark without shaking.
In brief, the filter was removed from the Sterivex cartridge, aseptically cut into small pieces with flame-sterilized scissors, and transferred to a sterile 15 ml Falcon tube. Two milliliters of fresh sucrose lysis buffer containing 50 mM Tris-HCl at pH 8.0, 40 mM EDTA at pH 8.0, and 0.75 M sucrose was added to the Falcon tube. Immediate flash freezing of the sample in liquid nitrogen and subsequent thawing at room temperature were conducted to release DNA from AOA cells.
After one freeze and thaw cycle, 465 µl of 10% SDS and 15 µl of 20mg/ml proteinase K were added, and the contents were mixed by vortexing for 2 minutes. The tube was incubated at 37°C for 2 hours with agitation at 30 minutes intervals. The aqueous phase was transferred to a spin column, and DNA was collected using Qiagen Blood & Tissue kit according to the manufacturer's recommendations.
Genomic DNA was sequenced by a combination of the Illumina MiSeq platform with paired-end (PE) reads of 250 bp and the Pacific Biosciences (Pacbio) RSII SMRT sequencing platform using a 20 kb SMRTbell template library. De novo assembly of the Pacbio sequence reads was conducted using the Hierarchical Genome Assembly tool in the PacBio SMRT Analysis Software (RS_HGAP_Assembly.3) [14]. Illumina sequence reads were trimmed and filtered by Trimmomatic (version 0.36) to remove low-quality reads [15]. The initial Pacbio-based assemblies were error corrected with high-quality Illumina reads using Pilon (version 1.23) [16].
The corrected genomes were annotated through NCBI Prokaryotic Genome Annotation Pipeline [17].
Ca. Nitrosarchaeum sp. strain AC2 and Ca. Nitrosotenuis sp. strain DW1 were enriched from near-shore sediments of Lake Acton (39°57'N, 84°74'W, USA) and Lake Delaware (40°39'N, 83°05'W, USA), respectively [18]. These two freshwater AOA enrichment cultures were maintained in HEPES buffered freshwater SCM supplemented with 0.5 mM NH4Cl at 30°C under dark as described previously [18]. Cells were harvested at mid-exponential phase and genomic DNA was extracted using a sucrose lysis method as described above. DNA extracted from each culture was sequenced on a SMRT cell using PacBio Sequel platform with insert size of approximately 10 kb. The Pacbio sequence reads were clustered according to the Latent Strain Analysis algorithm [19], and the clustered reads were de novo assembled by CANU (version 1.8) [20]. The longest contig obtained from each culture was circularized containing an overlapping region of more than 2,000 bp at both ends of the contig. Two complete genomes were annotated through NCBI Prokaryotic Genome Annotation Pipeline [17].  shaking. Cells were harvested at late-exponential or early stationary phase, and genomic DNA was extracted using the modified phenol-chloroform extraction method as described previously [21]. Extracted DNA was sequenced using Illumina HiSeq 2000 platform with PE of 150 bp.
Illumina sequence reads were assembled using CLC's de novo assembly algorithm (version 6.0, CLC Bio, Qiagen, Germany), and the resulting contigs were curated by CodonCode Aligner (version 3.7, CodonCode Corp.) as previously described [22]. Three assembled genomes were annotated with the Rapid Annotation using Subsystem Technology (RAST) pipeline [23], and the annotations were checked and corrected by NCBI Prokaryotic Genome Annotation Pipeline [17].
A pure marine AOA culture of Ca. Nitrosopumilus zosterae strain NM25 was obtained from the eelgrass sediments in Tanoura Bay of Shimoda, Shizuoka, Japan [24]. Strain NM25 was maintained in HEPES buffered SCM supplemented with 1 mM NH4Cl and 100 µM αketoglutaric acid [13] at 30°C in the dark with slow stirring. Cells were harvested at lateexponential or early stationary phase via centrifugation, and genomic DNA was extracted from cell pellets with the ISOPLANT II kit (Nippon Gene, Tokyo, Japan) following the manufacturer's instructions. Extracted DNA was sequenced using the Illumina HiSeq1000 platform, and the trimmed reads were assembled using the GS De novo assembler version 2.6 (Roche). The assembled draft genome was annotated with the RAST pipeline [25] and deposited at DDBJ using DDBJ Fast Annotation and Submission Tool (DFAST) [26]. with k-mer of 35 and contig length > 500 bp. AOA MAGs were recovered using the differential coverage binning method [27]. MAG quality was further improved by using PE tracking to remove incorrectly binned contigs and recruit the short contigs that were filtered in the preliminary binning process. Scaffolding of contigs in MAGs was performed using SSPACE [15].

Sampling sites and metagenome-assembled genomes
Thermophilic AOA MAG US01 was recovered from a hyperthermal hot spring in Ulu Slim, Malaysia. The spring has a near-neutral pH and a maximum temperature at 104°C. Hot spring water and sediment samples were collected in September 2011 at in situ temperature of 90°C.
The collected samples were kept at ambient temperature and immediately transferred to the laboratory, where they were stored at 4°C before DNA extraction. Equal volumes of water and sediment samples were used for DNA extraction following the protocol described previously [28]. Extracted DNA was sequenced by Illumina MiSeq with PE of 100 bp and insert size of 300 bp. Quality-filtered (average Q value > 30) metagenome reads were assembled using CLC's de novo assembly algorithm (version 6.04, CLC Bio) with k-mer of 35 and contig length > 500 bp. A thermophilic AOA MAG was recovered using the differential coverage binning method [27]. Differential coverage of assembled contigs were estimated based on mapping the instructions with a few modifications as previously described [29]. Extracted DNA was sequenced on the Illumina Hiseq X Ten platform with PE of 150 bp. Clean sequence reads were quality trimmed and assembled into contigs using IDBA-UD (Version 1.1.1) with the parameters: -mink 65, -maxk 145, -steps 10 [30]. The ORFs within contigs were predicted by Prodigal with the '-p meta' option (Version 2.6.3) [31]. The initial metagenome binning was performed by MetaBAT (Version 2.12.1) [32] with the modified sensitivity and specificity settings as previously described [33]. A hadopelagic marine AOA MAG was recovered using MetaBAT with the default sensitivity and specificity settings (Version 2.12.1) [32].
Metagenome binning and taxonomic assignments were performed as described elsewhere [33].
Briefly, the initial binning was performed by setting different parameters of sensitivity and specificity in MetaBAT (Version 2.12.1) [32]. Subsequently, all of the bins that retrieved from MetaBAT were pooled for post-dereplication by DAS Tool (Version 1.1) [34]. immediately transferred to liquid nitrogen and stored at -80°C until further processing. DNA extracts were obtained with the phenol-chloroform extraction method as previously described [35]. Extracted DNA was sequenced using the Illumina Hiseq X Ten platform with PE of 150 bp and insert size of 350 bp. Clean metagenome reads were quality trimmed and assembled independently using IDBA-UD (Version 1.1.1) with the following parameters: -mink 70, -maxk 100, -steps 10, -pre_correction [30]. Assembled contigs with length > 10 kb were selected for metagenome binning based on the analysis of tetranucleotide frequencies, GC content, and coverage values as described elsewhere [36].

Genomic feature analysis
Completeness, contamination, and coding density of AOA culture genomes and MAGs were assessed by CheckM [37]. Detailed results can be found in Table S1. Average nucleotide identity (ANI) was estimated by pyani.py (https://github.com/widdowquinn/pyani) using BLASTN to align genomic fragments. All predicted genes were searched against the Clusters of Orthologous Group (COG) database [38]. The conserved metabolic pathways shared by ≥ 95% of the complete or nearly complete AOA genomes were constructed based on the Kyoto Encyclopedia of Genes and Genomes (KEGG) reference pathway map [39]. The presence and absence of each pathway enzyme involved in ten major AOA metabolic pathways, such as ammonia oxidation and electron transfer, carbon fixation, phosphorus utilization, and cobalamin biosynthesis, were assessed within available culture genomes and high-quality MAGs (close to or more than 90% completeness and close to or less than 5% contamination) [40]. Detailed results can be found in Table S4.

Core genome and pan-genome analyses
AOA genome proteins were clustered into homolog cluster groups (HCGs) using OrthoMCL based on all-against-all BLASTP with the thresholds of pairwise coverage of 50% and identity of 50% [41]. Orthologs and paralogs were identified as the reciprocal best similarity pairs that were found between and within species, respectively [42]. Core genome represents the HCG genes shared by all AOA species genomes, MAGs, and SAGs, and pan-genome represents the genes present in at least one AOA genome. To estimate core genome and pan-genome sizes, the shared/unique gene contents with the number of AOA genomes (N) ranging from 2 to 45 were determined (Nmax = 25 for marine AOA species; Nmax = 12 for Prochlorococcus species [43]; Nmax = 7 for SAR11 species [44]) . For each N, a total of C(45, N) combinations of genomes were calculated. If the number of combinations was over 5,000, 5,000 random combinations were sampled for core genome and pan-genome analyses.

Phylogenomic analysis
The phylogenomic trees of AOA were constructed based on concatenated alignments of 71 conserved single-copy homologous proteins from 37 complete or nearly complete AOA cultured species genomes and 7 high-quality AOA MAGs. These marker proteins were identified based on the cluster of HCGs, and the alignments were carried out by MAFFT (version 7.221) [45].
The alignments were edited with Gblocks (version 0.91b) to identify conserved regions [46].
These protein sequences were concatenated as a single evolutionary unit. ProtTest (version 3.4) was employed to select the best-fit model of amino acid substitution according to the AIC and BIC values [47]. Subsequently, the maximum likelihood phylogenomic trees were built by RAxML (version 8.0.26) using the LG+I+G+X model on the basis of 100 bootstrap replications [48].

Evolution experiment and mutation analysis
Since the isolation of Nitrosopumilus maritimus strain SCM1 [49], it has been continuously transferred from 2007 to 2018 in HEPES buffered SCM supplemented with 1 mM NH4Cl at 30°C in the dark without shaking. Growth was monitored by microscopy, ammonia consumption, and nitrite accumulation as described previously [5,13]. Late exponential or early stationary phase cultures were transferred to fresh medium (0.25% inoculum), and the purity of cultures was routinely monitored by microscopic inspection and by the absence of bacterial growth in marine broth medium as described previously [5,13]. The specific growth rates (µ; h -1 ) were estimated by determining the slope of Ln nitrite concentrations versus time during exponential growth. The generation time (g; h) was calculated as g = ln(2)/µ. Respiratory activity and NO accumulation of N. maritimus cultures were measured with O2 (Unisense AS, Denmark) and NO (amiNO-600, Innovative Instruments, Sarasota, Florida) microsensors, respectively, in custom-built 35 ml of glass vials as described previously [50].
The genome of N. maritimus was initially sequenced in 2007 [51]. Cultures were harvested in May 2011 and June 2016 and stored at -80°C for subsequent genome re-sequencing. Genomes of evolved cultures were sequenced to > 50× coverage on Illumina MiSeq platform using sequenced runs of 2 × 250 PE reads. Raw reads were trimmed and filtered by Trimmomatic (version 0.36) to remove low-quality reads [15]. The high-quality reads were mapped to the reference N. maritimus published genome using Bowtie2 (version 2.3.4) [52]. The PCR duplicates of sequence reads were identified and removed by Samtools (version 1.6) [53].

Distribution and diversity of marine AOA genotypic groups and functional genes in the global ocean
Twenty-five marine AOA species genomes were clustered in 7 genotypic groups to represent populations from distinct phylogenetic lineages and ecological habitats. To investigate the overall distribution of these genotypic groups in the upper ocean (< 1,000 m), competitive fragment recruitment was conducted to determine the relative recruitment to available marine AOA species genomes in GOS and Tara Oceans metagenomic databases. In addition, the vertical distribution of these genotypic groups from epi-to hadopelagic waters was determined by competitive fragment recruitment analysis using metagenomic datasets of the North Pacific The recruited amino acid sequences were further classified into seven marine AOA genotypic groups.
To calculate the relative abundance of functional genes in marine AOA natural populations, we first trimmed the raw sequencing reads of Tara Oceans and deep ocean metagenomic samples by Trimmomatic (version 0.36) [15]. The trimmed reads were aligned via DIAMOND (version 0.9.24) [56] to the datasets of marine AOA functional genes (including amoA, amt, pstB, pit, mpnS, and ureC) found in available species genomes, MAGs and SAGs. The average fraction of marine AOA populations that possess pstB, pit, mpnS, and ureC genes was estimated by calculating the ratio of length-normalized counts for reads that mapped to these accessory functional genes with the criterion of E-value ≤ 1 × e -10 , reads sequence identity ≥ 80%, and reads length ≥ 100 bp divided by reads that mapped to AOA amoA genes, assuming each marine AOA cell contains one copy of the amoA gene (Table S4). The counts of metagenome reads that mapped to AOA high-affinity and low-affinity amt genes from environmental samples were calculated using HTSeq (version 0.9.1) and normalized by gene length [57].
To investigate the relative distribution of two homologous genes, A-type and V-type AOA atpA throughout the water column, we first mapped the trimmed metagenomic reads to the assembled scaffolds of metagenomic samples collected from epi-to hadopelagic waters using Bowtie2 (version 2.3.4) [52]. The metagenomic sequences of A-type and V-type AOA atpA genes were searched via BLASTP (version 2.2.28+) [55] with an E-value of ≤ 1 × e -10 and an identity of ≥ 80% against an in-house marine AOA species genome database containing atpA genes found in available AOA species genomes, MAGs and SAGs. The counts of metagenome reads that mapped to AOA A-atpA and V-atpA genes from environmental samples were calculated using HTSeq (version 0.9.1) and normalized by gene length [57].
To gain insight into the global diversity of marine AOA mpnS, atpA, and pstB genes, we compiled a collection of representative sequences of these genes from the Tara Oceans [58] and ALOHA [59] gene catalogs. For this purpose, we compiled mpnS, atpA, and pstB sequences from reference culture genomes, MAGs, and SAGs, aligned them using Clustal Omega [60] and constructed HMM profiles using HMMbuild [61]. These HMM profiles were used to screen the Tara Oceans and ALOHA gene catalogs using HMMsearch [61] using E-value of ≤ 1 × e -10 .
These candidate genes were then length filtered (only genes > 200 bp length passed this filter), and initial phylogenetic trees including gene sequences from reference genomes were constructed using the ETE toolkit [62] (workflow: standard_trimmed_fasttree). Outlier sequences were manually filtered using the phylogenetic trees as guidance and new trees were generated using the same workflow. The phylogenetic trees combining reference sequences and environmental sequences were then visualized using iTOL [63].
Data and materials availability: AOA genomes and metagenomes sequence data are available in the NCBI, JGI, BIGD, or DDBJ databases, and their accession numbers are listed in Table S6.
All other data products associated with this study are available from the corresponding authors upon request. Table S1. Summary information of AOA species genomes and MAGs as well as the genomes of other relevant microbial groups. Table S2. The 71 genes shared among 44 and 65 ammonia-oxidizing thaumarchaeotal genomes that were used to construct the concatenated phylogenomic trees in Figure 2 and Figure S1, respectively. Table S3. Average nucleotide identity (ANI) between two AOA genomes and fraction of genes shared between the two genomes. Table S4. Distribution of major metabolic pathway genes in the analyzed AOA genomes. Numerals indicate the numbers of homologs in each genome. Darker shades of red represent an increasing number of homologs, and white represents the lack of respective COGs in the genome. Table S5. Description of the non-synonymous and synonymous mutations observed during continuous culturing of Nitrosopumilus maritimus. Table S6. The list of AOA genomes and metagenomes used in this study. Sequence data are available through accession numbers in NCBI, JGI, BIGD, or DDBJ databases.