Introduction

Marine Thaumarchaeota are abundant, nitrifying chemolithotrophs that carry out ammonia oxidation and fix inorganic carbon [1,2,3], and they therefore contribute significantly to important nitrogen and carbon cycles in the oceans. They are often found in high abundances just below the deep chlorophyll maximum in the upper ocean [4, 5], and in the deeper mesopelagic ocean, they can comprise > 25% of the total prokaryotic cells [6, 7]. Understanding the factors that control their growth and abundance is critical to our view of N and C cycling in the oceans [5, 8, 9], but there is limited knowledge about top–down controls of these important microbes. In particular, viruses that infect marine Thaumarchaeota, or any mesophilic oceanic archaea for that matter, have yet to be isolated in culture. The only culture-based observation of a marine archaeal virus comes from virus-like particles observed in the culture of a hyperthermophilic, deep-sea hydrothermal vent archaeon, Pyrococcus abyssi [10, 11]. A wide diversity of archaeal viruses, however, has been isolated from other habitats, including, for example, haloviruses from hypersaline evaporated salterns, but the isolation and knowledge of viruses infecting bacteria (bacteriophage) still far outweigh those of archaeal viruses [12,13,14].

While culture-based approaches for discovering new archaeal viruses have had limited success, new culture-independent, high-throughput sequencing approaches provide valuable means for discovering new viruses in general [15]. Several recent metagenomic and single-cell genome studies specifically have provided evidence for the existence of viruses that infect marine Thaumarchaeota. These include recovery of marine prokaryotic fraction fosmids with similarity to Pro-Nvie1, a probable provirus found in the genome of Nitrososphaera viennensis EN76, a terrestrial thaumarchaeon [16, 17]; virus sequences recovered from a thaumarchaeal single-cell amplified genome [18]; probable virus genes in the genome of Candidatus Nitrosomarinus catalina SPOT01, a cultured, marine thaumarchaeon [19]; and an 11.6 kb contig, GOV_bin_4552_contig-100_2, assembled from a viral fraction metagenome that contains viral capsid genes and a thaumarchaeal amoC gene [21]. AmoC is a subunit of the ammonia monooxygenase responsible for ammonia oxidation, from which Thaumarchaeota derive energy [22].

Viruses often acquire metabolic genes from their hosts through horizontal gene transfer, and these so-called auxiliary metabolic genes (AMGs) are thought to bolster the metabolism of the infected host cells [23,24,25]. Identification of AMGs in viral genomes provides a convincing piece of evidence to connect viral sequences to their probable host(s). The recent discovery of a thaumarchaeal amoC gene on the viral contig GOV_bin_4552_contig-100_2 therefore provides strong evidence that this contig represents a thaumarchaeal virus. Recently developed k-mer-based tools, such as VirHostMatcher, can also help predict the probable host of metagenomic viral sequences by matching them to host genome sequences with which they have the highest similarity in nucleotide word usage patterns [20, 26,27,28]. These tools take advantage of the phenomenon that many viruses exhibit nucleotide usage patterns similar to those of their host, probably because of strong selective pressure to use codons similar to those of their host.

In order to identify additional new thaumarchaeal virus sequences, we have applied two viral contig identification tools, VirSorter [27, 28] and VirFinder [41], to metagenomes from the Eastern Tropical North Pacific (ETNP) and other publicly available metagenomes to identify other viral contigs that encode thaumarchaeal amoC genes. In this way, we have identified 32 new probable thaumarchaeal virus contigs, representing 15 putative viral species that are distributed globally, are found in a variety of marine habitats, and appear to occupy distinct marine niches. We have also used the host prediction tool VirHostMatcher [20] to confirm Thaumarchaeota as the probable host of these viruses as well as to corroborate potential specific interactions between corresponding depth-partitioned Thaumarchaeota host and viral populations. Finally, we provide phylogenetic evidence that these viruses are probably tailed and that some of them share a common evolutionary ancestor with marine Euryarchaeota viruses.

Methods

Sample collection and metagenomic sequencing and assembly

Water samples were collected in April 2012 during cruise TN278 aboard the R/V Thompson using 10 L Niskin bottles on a 24-bottle sampling rosette. A Seabird 911 conductivity, temperature, and depth (CTD) meter and a Seabird SBE 43 Dissolved Oxygen Sensor were attached to the rosette.

DNA samples were obtained from 0.2 μm SUPOR filters from station 136 (106.543°W, 17.043°N) at 10 depths between 60 and 300 m, which included the oxycline and anoxic zones. Metagenomes from these samples have been previously published, and the sampling and processing methods are described therein [29]. Metagenomic reads and assembled contigs for individual samples can be found at GenBank BioProject PRJNA350692.

In this study, metagenomes from the 70 and 90 m samples were co-assembled with IDBA-UD [66] using default settings, and these contigs are also available under GenBank BioProject PRJNA350692 (contigs are named ETNP_CA_X). Protein-encoding genes on these contigs were predicted and annotated with Prodigal using default settings [30].

Analysis of viral contigs and delineation of viral populations

Sequence similarity searches between contigs or genes on contigs were performed with blastn or blastp using default settings. Unless noted, only significant results were considered (E-value < 1E–5, bit score ≥ 50). Structural prediction and similarity analyses were done using HHPred via its online interface (https://toolkit.tuebingen.mpg.de/#/hhpred) using standard settings [31]. Only significant results (E-value < 1E–5) were considered. Fifteen distinct viral populations were identified using average nucleotide identity (ANI) values determined from blastn results for predicted coding regions of the contigs and applying a 95% cutoff. Gene pair identities were only included in ANI averages if the alignment covered at least 50% of the query or subject gene and the percent identity was ≥70%.
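The ANI filtering rules described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the script used in the study: the GeneHit record, its field names, and both function names are hypothetical; the average is assumed to be an unweighted mean over qualifying gene pairs; and the 95% cutoff is assumed to be inclusive.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class GeneHit:
    """One blastn hit between a gene on contig A and a gene on contig B (hypothetical record)."""
    percent_identity: float  # blastn percent identity of the aligned region
    alignment_length: int    # aligned length in nucleotides
    query_length: int        # length of the query gene
    subject_length: int      # length of the subject gene


def average_nucleotide_identity(hits: Iterable[GeneHit],
                                min_identity: float = 70.0,
                                min_aligned_fraction: float = 0.5) -> Optional[float]:
    """Mean identity over gene pairs that pass the coverage and identity filters."""
    kept = []
    for hit in hits:
        # "at least 50% of the query or subject gene": covering either gene suffices.
        covered = max(hit.alignment_length / hit.query_length,
                      hit.alignment_length / hit.subject_length)
        if covered >= min_aligned_fraction and hit.percent_identity >= min_identity:
            kept.append(hit.percent_identity)
    if not kept:
        return None  # no qualifying gene pairs, so ANI is undefined
    return sum(kept) / len(kept)  # assumption: unweighted mean


def same_population(ani: Optional[float], cutoff: float = 95.0) -> bool:
    """Apply the 95% ANI cutoff (assumed inclusive) used to delineate populations."""
    return ani is not None and ani >= cutoff


example_hits = [
    GeneHit(98.0, 900, 1000, 950),   # kept
    GeneHit(96.0, 300, 500, 1000),   # kept: covers 60% of the query gene
    GeneHit(65.0, 900, 1000, 1000),  # dropped: identity below 70%
    GeneHit(99.0, 100, 1000, 1000),  # dropped: covers only 10% of either gene
]
example_ani = average_nucleotide_identity(example_hits)
```

In this toy input only the first two gene pairs pass both filters, so the two contigs would be grouped into one population only if the mean of those two identities reaches the cutoff.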

Fragment recruitment analyses

Reads from metagenomes were mapped to a collection of representative viral contigs, one from each viral species population (typically the longest contig in the species) (Table 1), using the bbsplit.sh script in the BBTools package (https://jgi.doe.gov/data-and-tools/bbtools/) with a minimum percent identity of 95% (minid = 0.95). The script bbsplit.sh assigns each read to the best matching contig. Tara Ocean metagenomes were downloaded from the European Nucleotide Archive using accession numbers provided in [32] and [21]. Read data used from Vik et al. [33], Hollibaugh et al. [34], Thrash et al. [35], and Oulas et al. [36] were downloaded from iVirus, the National Center for Biotechnology Information (NCBI) Short Read Archive (SRA), or the Joint Genome Institute’s (JGI) Integrated Microbial Genomes & Microbiomes (IMG/M) website using accessions or file names provided in those studies. amoC reads were assigned as cellular or viral by “placement” of reads on the reference amoC tree (Fig. 1) [29, 67]. This tree was constructed using the HKY + I + G model with parameters estimated with Modeltest [37] and minimum evolution as the criterion. Capsid reads were quantified by placement on the Pro-Nvie1-like capsid gene tree. Reads were aligned to amoC reference sequences in nucleotide space using PaPaRa, the Parsimony-based Phylogeny-Aware Read Alignment program [69]. Paired-end reads were combined into the same alignment using a Python script and placed as one on the tree using the Evolutionary Placement Algorithm (EPA) portion of RAxML [68]. Each placed read has a “branchlength” value, which reflects the similarity between the read and the sequence on which it is placed. Reads placed with a “branchlength” longer than 2.0 were removed as erroneous; spot testing indicated that these reads belonged to genes other than the one examined. Only a small percentage of reads (0.1%) were thus removed.
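The branch-length filter applied to placed reads can be illustrated with a short sketch. It assumes the placement output has already been parsed into (read identifier, clade label, branch length) tuples; that parsing step, the clade labels "viral" and "cellular", and the function name are hypothetical and are not taken from the study's scripts.

```python
from typing import Dict, List, Tuple

MAX_BRANCH_LENGTH = 2.0  # threshold stated in the text


def filter_placements(placements: List[Tuple[str, str, float]],
                      max_branch_length: float = MAX_BRANCH_LENGTH
                      ) -> Tuple[Dict[str, int], float]:
    """Count reads per clade after dropping placements with long branches.

    Each placement is (read_id, clade_label, branch_length).
    Returns the per-clade read counts and the fraction of reads removed.
    """
    counts: Dict[str, int] = {}
    removed = 0
    for _read_id, clade, branch_length in placements:
        if branch_length > max_branch_length:
            removed += 1  # treated as erroneous, e.g. a read from a different gene
            continue
        counts[clade] = counts.get(clade, 0) + 1
    fraction_removed = removed / len(placements) if placements else 0.0
    return counts, fraction_removed


# Toy input: one read exceeds the threshold and is discarded.
toy_placements = [
    ("read1", "viral", 0.1),
    ("read2", "cellular", 0.3),
    ("read3", "viral", 2.5),
    ("read4", "viral", 2.0),
]
toy_counts, toy_removed = filter_placements(toy_placements)
```

The per-clade counts returned here are the kind of quantity from which a viral share of total amoC reads could then be computed.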

Table 1 Amino-acid and structure similarity search results for ETNP_CA_420 predicted proteins
Fig. 1

Phylogeny of Thaumarchaeota amoC sequences from cellular Thaumarchaeota genomes and thaumarchaeal viral contigs. The tips of the tree are labeled with gene identifiers (viral contigs) or isolate or single-cell amplified genome names. Genes from contigs obtained from co-assembly of ETNP metagenomes from this study are depicted in red; genes from contigs from the Delaware and Chesapeake Bay are depicted in blue; and the locations from which all other contigs were assembled are listed after gene numbers. The tree was constructed using the HKY85 DNA substitution model with invariable sites and gamma distributed rates of evolution and heuristic search of tree space using minimum evolution as the criterion. Numbers at the nodes indicate results from bootstrap analysis (100 replicates). Bootstrap analysis supports that amoC sequences from viral contigs (“Viral amoC AMG clade”) form a phylogenetically distinct clade from cellular Thaumarchaeota sequences (“Cellular Thaumarchaeota amoC”). Genes marked with stars indicate representative contigs from putative viral species delineated by average nucleotide identity (Fig. S3) that were used for measuring population abundances in various habitats (Fig. 3).

Host prediction analyses with VirHostMatcher

For prediction of the probable host phylum of ≥10 kb thaumarchaeal viruses, nucleotide word usage scores (d2*) were calculated using VirHostMatcher for each viral contig against a database of ~5700 possible marine prokaryotic hosts that includes marine host genomes identified in [20] and metagenomically assembled genomes from [38, 39] (listed in Supplemental File 1). For each phylum of hosts in the database, we computed the difference between the mean of scores to that phylum and the mean of scores to all other phyla and normalized this difference by the standard deviation of the scores of the “all other phyla” group. The predicted host was selected as the phylum with the strongest normalized difference in mean scores, i.e., the phylum with the largest negative deviation in d2* score when compared with all other phyla (d2* is a dissimilarity measure, so lower scores indicate more similar word usage). Only phyla with six or more genomes in the database were included (n = 26 out of 41 possible phyla), representing 5607 possible host genomes.
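A minimal sketch of the phylum-level scoring just described is given below. It assumes that the d2* scores for one viral contig have already been grouped by host phylum; the function name and the input layout are hypothetical. Two further assumptions are not stated in the text: the sample standard deviation is used, and the “all other phyla” group is restricted to phyla that pass the six-genome minimum.

```python
from statistics import mean, stdev
from typing import Dict, List, Optional


def predict_host_phylum(scores_by_phylum: Dict[str, List[float]],
                        min_genomes: int = 6) -> Optional[str]:
    """Return the phylum whose mean d2* score deviates most negatively from all other phyla."""
    # Keep only phyla represented by enough genomes.
    eligible = {p: s for p, s in scores_by_phylum.items() if len(s) >= min_genomes}
    best_phylum, best_z = None, None
    for phylum, scores in eligible.items():
        others = [x for p, s in eligible.items() if p != phylum for x in s]
        if len(others) < 2:
            continue  # a standard deviation needs at least two values
        spread = stdev(others)
        if spread == 0:
            continue  # normalization is undefined without variation
        # Normalized difference in means; negative values mean lower dissimilarity.
        z = (mean(scores) - mean(others)) / spread
        if best_z is None or z < best_z:
            best_phylum, best_z = phylum, z
    return best_phylum


# Toy scores for a single viral contig (values are illustrative only).
toy_scores = {
    "Thaumarchaeota": [0.20, 0.22, 0.21, 0.19, 0.23, 0.20],
    "Proteobacteria": [0.40, 0.42, 0.38, 0.41, 0.39, 0.40],
    "Cyanobacteria": [0.35, 0.36, 0.34, 0.37, 0.33, 0.35],
    "RarePhylum": [0.01, 0.01],  # excluded: fewer than six genomes
}
toy_prediction = predict_host_phylum(toy_scores)
```

Because the comparison is made per phylum against the pooled remainder, a phylum with very few genomes could otherwise dominate by chance, which is presumably why a minimum genome count is applied.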

For the more specific prediction of whether the viral contigs represent viruses that likely infect Thaumarchaeota from the “Deep” or “Shallow” group hosts, VirHostMatcher was applied to a database of Thaumarchaeota genomes from isolates or SAGs from the “Deep” (n = 16) or “Shallow” (n = 42) groups as determined by phylogenomics in [19]. t-tests were applied to determine if there were significant differences in the VirHostMatcher score means of comparisons with the Deep and Shallow thaumarchaeal host genomes for each virus.
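The Deep versus Shallow comparison for a single viral contig can be sketched as below. The text does not state which t-test variant was used, so Welch's unequal-variance form is an assumption here, and the function name is hypothetical. Only the test statistic and approximate degrees of freedom are computed; a p-value would additionally require the t distribution, for example via scipy.stats.ttest_ind with equal_var=False.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(deep: Sequence[float], shallow: Sequence[float]) -> Tuple[float, float]:
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(deep), len(shallow)
    # Squared standard errors of the two group means.
    v1 = variance(deep) / n1
    v2 = variance(shallow) / n2
    t = (mean(deep) - mean(shallow)) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Toy scores of one viral contig against each host group (illustrative values).
toy_t, toy_df = welch_t([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

A negative statistic indicates lower mean scores against the Deep genomes than against the Shallow genomes.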

Results and discussion

In order to identify additional new thaumarchaeal virus sequences, we first analyzed prokaryotic cellular fraction (>0.22 µm) metagenomes collected from above an oxygen minimum zone (OMZ) in the ETNP where Thaumarchaeota were prevalent. Thaumarchaeal-specific amoA qPCR assays previously showed that Thaumarchaeota cells were abundant (1.9 × 10^5 to 1.0 × 10^5 copies per mL) and localized at 70–100 m [40] (Fig. S1). They represented up to 12% of prokaryotes at 70–100 m based on the metagenomic analysis of the single copy RNA polymerase gene rpoB (Fig. S1) [29]. In congruence, ammonia oxidation rates were highest (14–31 nM/d) at 70–100 m [40]. To identify viral contigs among metagenomic assemblies containing mixtures of host and viral sequences, the viral detection programs VirSorter [27, 28] and VirFinder [41] were applied. Proteins encoded on these viral contigs were then searched against thaumarchaeal genome proteins to identify viral contigs with potential thaumarchaeal AMGs. Among contigs from co-assembly of the ETNP metagenomes from 70 and 90 m, we identified a 26 kb viral contig, named ETNP_CA_420, which encodes a thaumarchaeal-like amoC gene (Figs. 1, 2), representing a probable AMG (see below). Although contig ETNP_CA_420 only had a VirSorter category III prediction result (“possible” viral contig), it had a high and statistically significant VirFinder prediction score (score: 0.91, p = 0.012, Table S1), highlighting the utility of using multiple virus prediction tools. The discovery of ETNP_CA_420 builds on the recent, similar identification of an 11.6 kb contig, GOV_bin_4552_contig-100_2, which was assembled from viral fraction metagenomes and encodes both a viral capsid gene and a thaumarchaeal amoC gene [21]. 
Note that in our subsequent analyses, we have instead used a longer 14.6 kb contig named TARA_034_DCM_0.22−1.6_scaffold218395_1 because GOV_bin_4552_contig-100_2 is a 99.98% identical subfragment of TARA_034_DCM_0.22−1.6_scaffold218395_1. The latter contig was recovered from individual sample assembly of a cellular fraction (0.22–1.6 µm) Tara Ocean sample from the Red Sea [42].

Fig. 2

Contig maps depicting predicted proteins encoded on representative thaumarchaeal viral contigs. Arrows depict the location and direction of predicted proteins on contigs, and the number 1 indicates the end with the first nucleotide position of the contig. Fill colors indicate different categories of genes, as indicated in the legend, according to top hits from searches against NCBI’s nr protein database or protein structure similarity analyses (Table 1). Asterisks denote which genes showed similarity to Ca. N. catalina SPOT01 putative viral gene NMSP_1228. The color of the trapezoids connecting genes indicates amino-acid identities between genes.

In concordance with the VirFinder prediction result, all of ETNP_CA_420’s 21 genes are encoded on the same strand (Fig. 2), a trait characteristic of viral genomes [27, 28]. More importantly, several of its predicted proteins exhibit sequence and/or structural similarity to known viral structural proteins of previously characterized viruses (Table 1). The proteins to which ETNP_CA_420’s predicted genes have similarity include those that make up the main capsid structure; a portal protein, which forms the opening through which DNA moves in and out of the capsid; and a baseplate protein, which occurs at the end of the tail in tailed viruses. We highlight genes ETNP_CA_420_6 and ETNP_CA_420_8, which exhibit similarity to the protease and capsid domains, respectively, of the combined protease/capsid protein of Pro-Nvie1, a probable provirus discovered in the genome of the soil thaumarchaeon Nitrososphaera viennensis EN76 [16] (Table 1). The protease may serve a role in capsid maturation [16]. Also of interest was gene ETNP_CA_420_21, which shows distant (≤28% protein identity) similarity to several cyanophage isolate proteins annotated as CobS, the porphyrin biosynthetic enzyme responsible for the last step in vitamin B12 synthesis (Table 1). These cyanophage proteins probably do not synthesize B12 because they are phylogenetically distant from any host protein [43]. HHPred results show that ETNP_CA_420_21 instead has structural similarity to a protein involved in carboxysomes, intracellular organelles used to concentrate CO2 around Ribulose-1,5-bisphosphate carboxylase/oxygenase (RuBisCO) [44]. In any case, gene ETNP_CA_420_21 has higher sequence similarity to cellular archaeal contigs assembled from a Yellowstone hot spring affiliated with the newly described Asgard ‘superphylum’ of archaea (Candidatus Odinarchaeota archaeon LCB_4) [45] and from an acid mine drainage system (Candidatus Microarchaeum acidiphilum ARMAN-2) [46] (Table 1). 
We suggest genes on these contigs could be viral genes integrated or horizontally transferred into the genomes of these distantly related archaea.

The amoC genes and capsid proteins of contigs ETNP_CA_420 and TARA_034_DCM_0.22−1.6_scaffold218395_1 were then used to identify additional thaumarchaeal virus sequences by similarity searches to contigs from publicly available metagenomic sequencing projects at the IMG/M database. The resulting contigs were recovered from several diverse marine habitats including other ETNP OMZ samples, low-temperature hydrothermal sediments and plume water, the Gulf of Mexico, and the Delaware and Chesapeake Bay estuaries (Figs. 1, 2 and S2, Table 2). We note that several of these contigs had no or low, category III virus prediction results via VirSorter but had significant VirFinder prediction scores, again highlighting the efficacy of using both tools in tandem to detect novel viral contigs (Table 2). The viral contigs exhibit several distinct genomic architectures (Fig. 2) that broadly belong to two larger groups defined by the type of capsid protein they possess: capsids with distant similarity to caudoviral Haloarchaea virus capsids or to the Pro-Nvie1 capsid (Fig. S2). Members of both capsid types exhibit structural similarity to capsids with HK97-like folds characteristic of the tailed virus order Caudovirales [47] (HHPred, E-value < 1E–5), suggesting that these thaumarchaeal viruses are tailed. As noted above, contig ETNP_CA_420 encodes a putative baseplate protein (Table 1), also supporting that it likely is a tailed virus. Furthermore, Pro-Nvie1 appears to be a tailed virus [16], and the halovirus-like capsids of the other thaumarchaeal viral contigs are most closely related to recently described, tailed Euryarchaeota Marine Group II viruses (Magroviruses) (Fig. S2) [48, 49]. More broadly, the phylogeny of Magrovirus, Halovirus, and Thaumarchaeota virus capsids points to a shared common ancestor among a larger group of presumably tailed viruses that infect the two archaeal phyla Euryarchaeota and Thaumarchaeota (Fig. S2). 
We also note that contig Ga0070747_1001249 from the Chesapeake Bay does not appear to encode a capsid protein. Based on this and its distinct genomic architecture from the other contigs, it could represent an interesting, distinct group of thaumarchaeal viruses.

Table 2 Information about Thaumarchaeota viral and cellular contigs that encode amoC genes (Fig. 1) or exhibit synteny to amoC-encoding viral contigs (Fig. 2)

Similarity between viral sequences found in this study and previously identified, putative thaumarchaeal viral sequences helps corroborate that the latter are indeed thaumarchaeal viral sequences. A viral fosmid, Oxic1_7, which was identified in a fjord and shares similarity to Pro-Nvie1 sequences [17], also shares similarity to proteins from thaumarchaeal viral contigs found in this study, including the putative portal protein ETNP_CA_420_3 (Table S1). Two proteins from caudoviral contigs recovered from a thaumarchaeal SAG [18] share 30–40% protein identity to thaumarchaeal viral contig sequences (Table S1). This provides additional support that thaumarchaeal viruses identified in this study are probably tailed. Several thaumarchaeal viral contigs carry a gene with similarity to gene NMSP_1228 previously identified as a putative viral gene in the genome of the cultured thaumarchaeon Ca. N. catalina SPOT01 [19] (Table S1). Although contig JGI25132J35274_1000069 assembled from a virome at 30 m in the ETNP lacks an amoC gene, it also likely represents a thaumarchaeal virus based on synteny to other amoC-encoding contigs (Fig. 2) and capsid phylogeny (Fig. S2). Note that ETNP_6_30_revised_scaffold28175_1 from [33], which was previously identified as a putative archaeal contig, is an identical subfragment of JGI25132J35274_1000069 that was obtained using the same raw data with a different assembly pipeline. Finally, host prediction using the k-mer similarity tool VirHostMatcher independently predicted Thaumarchaeota as the probable host phylum for all but two of these viral contigs that are >10 kb in length (n = 17), including four contigs that lack amoC genes (Table S2).

Using a nucleotide identity cutoff of 95% to delineate putative viral species [50], the thaumarchaeal virus contigs represent at least 15 distinct viral species (Fig. S3). These populations exhibit different abundance patterns across various marine habitats and samples, suggesting they are ecologically distinct (Fig. 3). Abundances were determined by metagenomic mapping of reads to one representative contig from each of the species (Fig. S3, Table S1). In two ETNP virome depth profiles from another study [33], our archaeal virus populations exhibit evidence of depth partitioning (Fig. 3a). There was also a marked shift in dominance between the coastal and open-ocean ETNP sites from Vik et al. (Fig. 3a) and the open-ocean ETNP site from which ETNP_CA_420 was identified (Fig. 3c). Marine Thaumarchaeota belong to two major phylogenetic groups (“Shallow” [or Water Column A] and “Deep” [or Water Column B]) according to whether they predominantly occur in the upper water column (<200 m) or at deeper mesopelagic depths (>200 m) [5, 19, 51,52,53]. Contig Ga0066372_10000192 notably dominated at 1000 m in the Vik et al. ETNP Station 2 sample, was absent in surface waters, and thus may specifically infect deep Thaumarchaeota populations. Contig Ga0066372_10000192 was also only observed in deep, mesopelagic Tara Ocean metagenomes (Fig. 3b, Fig. S4), and only this contig was significantly correlated to sample depth, as tested with Spearman correlations (p = 2E–8, ρ = 0.84). Contig Ga0066372_10000192 exhibited significantly higher nucleotide similarity to Deep Thaumarchaeota genomes than Shallow genomes using VirHostMatcher (t-test, p < 0.05, Table S2), further supporting that this thaumarchaeal viral population infects Deep group Thaumarchaeota. 
Contig Ga0070751_1001009 likewise has higher k-mer similarity to Deep group genomes and likely infects these hosts, but corresponding depth partitioning of this contig to mesopelagic samples could not be corroborated by metagenomic data as it was not significantly detected in any Tara Ocean samples. Most of the other remaining thaumarchaeal contigs that generally were more abundant in epipelagic waters (Fig. 3, S4) had significantly higher nucleotide similarity to Shallow Thaumarchaeota genomes (t-test, p < 0.05, Table S2).
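The Spearman correlation used to relate contig recruitment to sample depth can be illustrated with a small, dependency-free sketch. The function names and the toy values are hypothetical; tied values receive average ranks, and the significance test (p-value) is not reproduced here.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of the ranks of x and y (undefined if either is constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Toy example: recruitment that rises with depth (illustrative values only).
toy_depths_m = [5, 50, 200, 600, 1000]
toy_recruitment = [0, 0, 3, 10, 40]
toy_rho = spearman_rho(toy_depths_m, toy_recruitment)
```

Because the statistic depends only on ranks, it is insensitive to the large dynamic range of recruitment values across samples.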

Fig. 3

Abundance of representative Thaumarchaeota viral contigs and gene sequences in various marine samples as determined by metagenomic read mapping. a Coastal (station [stn] 2) and open-ocean (stn 6) ETNP samples from Vik et al. [33]. b Globally distributed Tara Ocean samples. c–e ETNP depth profiles from which contig ETNP_CA_420 was identified. For a–c, metagenomic reads were mapped to representative contigs from the 15 viral species. Normalized read recruitment is depicted as the number of reads mapped per kilobase of the contig per billion reads in the sample. Contigs that did not meet the mapping criteria in any samples are not depicted in the graphs. For a and c, recruitment values are only plotted if ≥ 30 reads were mapped to the respective contig. For Tara Ocean samples in b, recruitment levels are only plotted if contig coverage was ≥1 or the percentage of the contig covered was ≥75%. The following is listed for each Tara sample on the vertical axis: Tara Ocean site number, oceanic region, sampling depth, and size fraction of the sample in µm. Mesopelagic samples are in bold and blue to highlight that contig Ga0066372_10000192 was only detected in mesopelagic samples. For d, read recruitment results are shown for reads that mapped to sequences from the viral amoC AMG (“Viral amoC”) or cellular Thaumarchaeota (“Cellular amoC”) clades as defined in Fig. 1. For e, recruitment results are shown for the amoC and capsid genes on contig ETNP_CA_420.
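The recruitment normalization defined in the legend above (reads mapped per kilobase of contig per billion reads in the sample) can be written as a one-line calculation. The function name and the example numbers below are hypothetical; the 30-read minimum follows the plotting rule stated for panels a and c.

```python
from typing import Optional


def normalized_recruitment(mapped_reads: int,
                           contig_length_bp: int,
                           total_reads: int,
                           min_reads: int = 30) -> Optional[float]:
    """Reads mapped per kilobase of contig per billion reads in the sample."""
    if mapped_reads < min_reads:
        return None  # below the plotting threshold used for panels a and c
    kilobases = contig_length_bp / 1_000
    billions_of_reads = total_reads / 1_000_000_000
    return mapped_reads / kilobases / billions_of_reads


# Illustrative numbers: 260 reads on a 26 kb contig in a sample of 50 million reads.
toy_value = normalized_recruitment(260, 26_000, 50_000_000)
```

Dividing by contig length and by sequencing depth makes values comparable across contigs of different sizes and samples of different sizes.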

We also used Tara Ocean metagenomes to explore other differences in the distribution of the thaumarchaeal virus populations and thus possible differences in their ecologies. Representative thaumarchaeal virus contigs were detected in 24 samples collected from several ocean basins; at surface, deep chlorophyll maximum, and mesopelagic depths; and in both cellular and viral fraction samples (Fig. 3b). Although the limited number of samples in which these viruses were detected made it difficult to make conclusive inferences about their potential ecological differences, recruitment data allowed for some preliminary insight into their probable niches, in addition to the depth partitioning described above. The underlying assumption here is that these virus populations infect particular host populations such that viral niches reflect the particular niches of their specific hosts. We noticed that contigs Ga0070747_1001249 and JGI25132J35274_1000069 appeared to be found preferentially at low latitude sites. Only these two contigs exhibited significant Spearman correlations to latitude (p < 0.01, ρ = 0.54 and 0.68, respectively), suggesting that they may specifically infect Thaumarchaeota found in warmer, low latitude waters. The Thaumarchaeota Shallow group does contain several clades, which, based on culture studies of a limited number of representative strains, may occupy distinct niches determined by temperature [19, 54, 55]. VirHostMatcher could potentially be used to test the hypothesis that these two viral populations infect warm-adapted hosts, but the biogeographic ranges of Shallow group clades need to be better constrained and more representative genomes from these clades are needed. 
We also observed that at a single sampling site and depth (TARA_067, 5 m, Benguela Current), a different population dominated the viral ( < 0.22 µm) and cellular fractions (0.45–0.8 µm), possibly reflecting the detection of a transition from a recent infection by one viral population, observed as sequences in the viral fraction, to an active infection by another population, observed as sequences in the cellular fraction (Fig. 3b).

We observed that none of the contigs identified from the Delaware and Chesapeake Bay estuaries (Ga0070747_1001249, Ga0129342_1000209, Ga0070751_1001009, DelMOWin2010_c10015535, Ga0070745_1004049) or from hydrothermal plume water (GBIDBA_10003243, GBIDBA_10004208, GBIDBA_10005806, and GBIDBA_10128132) were detected in any of the pelagic Tara Ocean samples. This is perhaps not surprising if these viral populations specifically infect Thaumarchaeota hosts adapted to estuarine and vent plume habitats, because the Tara Ocean samples did not sample such habitats. On the other hand, contig Ga0098036_10316722 identified from pelagic waters of the South Pacific was detected in several Tara Ocean samples (Fig. 3b, S4) and is closely related to contigs recovered from low-temperature hydrothermal mat samples (Figs. 1, 2, Fig. S3), suggesting that these viruses perhaps infect hosts that occupy these two disparate habitats. This is rather unexpected given a model that viruses generally infect specific hosts and the expectation that pelagic and hydrothermal mat Thaumarchaeota are probably not closely related. The latter assumption may not be true as low-temperature hydrothermal mats at the Kolumbo Volcano site do support abundant Thaumarchaeota that are closely related (99% identity by 16S rRNA) to pelagic strain Nitrosopumilus maritimus SCM1 [56]; however, 16S rRNA can fail to resolve closely related Thaumarchaeota clades that probably are ecologically distinct [19]. There are examples of some marine cyanophage that have somewhat broad host ranges and infect multiple genera, Prochlorococcus and Synechococcus, [57] but both of these genera occupy broadly similar niches—both are pelagic and mesophilic. The curious observations of these Thaumarchaeota viruses require further investigation.

The predicted AmoC protein sequences from viral contigs have high amino-acid identity (>90%) to marine Thaumarchaeota AmoC sequences. Although it is difficult to establish the functionality of proteins by sequence analysis alone, the high degree of sequence similarity to cellular Thaumarchaeota proteins suggests that these AmoC proteins are functional. There was no significant difference in protein length between viral and host AmoC sequences (t-test, p > 0.05), nor was there any clear difference in particular amino-acid motifs used by each group. Several non-marine (soil and freshwater) Thaumarchaeota genomes of the genus Candidatus Nitrososphaera possess several copies of amoC, but copies of this gene from non-marine genomes form a separate lineage from marine host and viral sequences (Figs. S5, S6).

amoC nucleotide sequences from viral contigs, however, are quite dissimilar to marine thaumarchaeal amoC sequences (<80% identity) (Fig. S6). Note that we also recovered host contigs whose amoC sequences cluster with other host sequences and which contain genes for the other amo subunits (Fig. S7). The amoC sequences from viral contigs form a distinct phylogenetic group that we argue represents viral amoC AMGs (Fig. 1, S6). This is consistent with cyanophage photosystem psbA sequences that also form phylogenetically distinct clades from host versions of the gene [58]. This phylogenetic distinction made it possible for us to discriminate viral and cellular amoC sequences in metagenomic data. In Tara Ocean metagenomes, the ratio of viral amoC AMG to cellular sequences was significantly higher in viral fraction samples (<0.22 µm) than cellular fraction samples (average: 0.7 vs. 0.1, p < 0.05), supporting that amoC AMG clade sequences are associated with viral-sized particles. Furthermore, we found a handful of metatranscriptomic reads from the Gulf of Mexico [35] and coastal Georgia, USA (Sapelo Island) [34] that matched viral amoC and capsid genes with ≥90% identity, demonstrating active RNA expression of these viral genes.

Analysis of cellular fraction (>0.22 µm) ETNP metagenomic reads showed that viral amoC sequences surprisingly comprised 54 and 29% of total amoC reads at 70 and 90 m, respectively (Fig. 3d). The high abundance of viral amoC genes supports two non-mutually exclusive scenarios: (1) these samples have captured an active infection of Thaumarchaeota, whereby replicating viruses within cells were recovered from >0.22 µm fraction metagenomes or (2) the ETNP_CA_420 contig represents a highly prevalent integrated provirus. In either case, it represents a highly successful virus in these samples. It has been similarly reported that viral versions of the psbA gene can comprise a large portion (up to 60%) of total psbA abundance in natural communities [59]. The fraction of viral to total amoC genes was more modest in the hydrothermal vent mat samples from which contigs KVWGV2_10011101 and KVRMV2_100116932 were assembled (Santorini Caldera: 10% and Kolumbo Volcano: 20%, respectively). Viral amoC genes were detected in 20% of Tara Ocean viromes (at least 10 reads mapped, ≥95% nucleotide identity). Viral amoC genes, however, were only detected in four >0.22 µm, prokaryotic fraction Tara Oceans samples (3% of samples for which any amoC gene was detected), and viral amoC genes never exceeded 2.2% of total amoC DNA abundance (range: 0.12–2.2%). The comparable levels of recruitment for ETNP_CA_420 capsid and amoC genes measured for ETNP samples also imply that most (~75%) thaumarchaeal viruses carried an amoC gene at that location (Fig. 3e). Contigs JGI25132J35274_1000069 and Ga0066372_10000192 notably lack amoC genes. These contigs are otherwise syntenous with amoC-encoding viral contigs and have a few genes with similarity to Thaumarchaeota, supporting that they too represent thaumarchaeal viruses. Together with the recruitment data (Fig. 3e), these contigs highlight that not all thaumarchaeal viruses appear to carry amoC genes.

The discovery of widespread amoC-encoding viral sequences and in some cases high viral amoC abundances in metagenomic samples has potential implications for our understanding of N cycling in marine systems, to which Thaumarchaeota are major contributors. AMGs have been previously described for several enzymes involved in the biogeochemical cycles for most of the major elements comprising life including C, P, and S [23, 60,61,62]. Notably missing from this list were prominent N-related AMGs, but the discovery of the nitrogen regulator ntcA in cyanophage [63]; PII and ammonia transporters (amt) on viral contigs [21]; amoC AMGs (in [21] and here); and more recently nitrate reductase genes in deep-sea vent viromes [64] has expanded the known set of key biogeochemical pathways impacted by viral AMGs. Viral infection should generally limit the abundance and thus contribution of Thaumarchaeota to nitrification, but cells infected by amoC-carrying viruses presumably still can contribute to nitrification. Analogous psbA AMGs in marine cyanophage are thought to contribute to the photosynthetic functioning of infected, natural populations [24]. Further work is needed to assess what fraction of Thaumarchaeota are infected by viruses at any given time, what fraction of viruses encode amoC, and to what degree cells infected with amoC-carrying viruses have enhanced contribution to N cycling. Our initial metagenomic survey suggests that viral amoC genes are distributed globally (Table 2, Fig. 3, S4), and at certain key sites of nitrification, they may be very abundant and could have an important impact on N cycling. Directed amoC-specific assays (e.g., amplicon sequencing) may be better equipped to assess amoC contribution than metagenome approaches that often yield low coverage results for individual genes. 
These potential biogeochemical implications may also extend to marine sediments, another important location of nitrification, because amoC AMGs were recovered from sediment samples as well.

The fact that thaumarchaeal viruses carry the gene for the AmoC subunit rather than the other two subunits (A and B) gives potential insight into the biochemistry of ammonia monooxygenase. It is surmised that cyanobacterial viruses carry the psbA AMG encoding the D1 protein of the photosystem complex because this is the subunit most susceptible to damage and has a high turnover rate [59, 65]. We therefore hypothesize that AmoC likewise has a high turnover rate, such that expression of AmoC provides these viruses a selective advantage in maintaining cellular energy during infection.

The discovery of these novel thaumarchaeal virus sequences adds to the small but growing dataset of Thaumarchaeota virus sequences identified by isolation-independent approaches [17, 18, 21]. Because of the paucity of cultured archaeal viruses, particularly for marine archaea (see Introduction), these contigs help enlarge the known diversity of archaeal viruses and elucidate broader evolutionary relationships of viruses infecting Euryarchaeota and Thaumarchaeota (Fig. S2), two prominent phyla of archaea in the oceans. These new sequences will also assist in the discovery of new archaeal viruses. A case in point is the discovery of potential viral genes in Asgard superphylum archaeal sequences. This study also highlights the usefulness of analyzing cellular fraction metagenomes, not just viromes, for discovering new viral sequences, especially with the availability of robust tools like VirFinder and VirSorter to distinguish viral sequences from prokaryotic host sequences.