Genomic inference of the metabolism of cosmopolitan subsurface Archaea, Hadesarchaea

The subsurface biosphere is largely unexplored and contains a broad diversity of uncultured microbes1. Despite being one of the few prokaryotic lineages that is cosmopolitan in both the terrestrial and marine subsurface2–4, the physiological and ecological roles of SAGMEG (South-African Gold Mine Miscellaneous Euryarchaeal Group) Archaea are unknown. Here, we report the metabolic capabilities of this enigmatic group as inferred from genomic reconstructions. Four high-quality (63–90% complete) genomes were obtained from White Oak River estuary and Yellowstone National Park hot spring sediment metagenomes. Phylogenomic analyses place SAGMEG Archaea as a deeply rooting sister clade of the Thermococci, leading us to propose the name Hadesarchaea for this new Archaeal class. With an estimated genome size of around 1.5 Mbp, the genomes of Hadesarchaea are distinctly streamlined, yet metabolically versatile. They share several physiological mechanisms with strict anaerobic Euryarchaeota. Several metabolic characteristics make them successful in the subsurface, including genes involved in CO and H2 oxidation (or H2 production), with potential coupling to nitrite reduction to ammonia (DNRA). This first glimpse into the metabolic capabilities of these cosmopolitan Archaea suggests they are mediating key geochemical processes and are specialized for survival in the subsurface biosphere.

The subsurface biosphere is largely unexplored and contains a broad diversity of uncultured microbes 1 . Despite being one of the few prokaryotic lineages that is cosmopolitan in both the terrestrial and marine subsurface [2][3][4] , the physiological and ecological roles of SAGMEG (South-African Gold Mine Miscellaneous Euryarchaeal Group) Archaea are unknown. Here, we report the metabolic capabilities of this enigmatic group as inferred from genomic reconstructions. Four high-quality (63-90% complete) genomes were obtained from White Oak River estuary and Yellowstone National Park hot spring sediment metagenomes. Phylogenomic analyses place SAGMEG Archaea as a deeply rooting sister clade of the Thermococci, leading us to propose the name Hadesarchaea for this new Archaeal class. With an estimated genome size of around 1.5 Mbp, the genomes of Hadesarchaea are distinctly streamlined, yet metabolically versatile. They share several physiological mechanisms with strict anaerobic Euryarchaeota. Several metabolic characteristics make them successful in the subsurface, including genes involved in CO and H 2 oxidation (or H 2 production), with potential coupling to nitrite reduction to ammonia (DNRA). This first glimpse into the metabolic capabilities of these cosmopolitan Archaea suggests they are mediating key geochemical processes and are specialized for survival in the subsurface biosphere.
South-African Gold Mine Miscellaneous Euryarchaeal Group (SAGMEG) Archaea thrive in a broad range of anoxic subsurface environments and temperature regimes (4 to 80°C) 5 . They were first identified in the terrestrial subsurface and have since been shown to be broadly distributed in marine sediments 2,3 . To begin to resolve their ecological roles and sample a broad diversity within the group, we reconstructed genomic bins from the White Oak River (WOR, North Carolina) estuary and from Yellowstone National Park (YNP) hot spring sediments ( Supplementary  Fig. 1). Two genomic bins were reconstructed (using emergent self-organizing map (ESOM) mapping of tetranucleotide signatures, see Supplementary Information for additional details) from the WOR estuary sediment (DG-33 and DG-33-1, ∼63% complete) from the anoxic methane-rich layer (53-54 cmbsf (centimetres below surface)) 6 . Two additional genomic bins (YNP_N21 and YNP_45, 80% and 90% complete, respectively) were obtained from a microaerophilic (0.45 mg per litre O 2 ) YNP hot spring (68.8°C, pH 8.6). This spring is relatively high in sulphate (25 ppm) and low in sulphide (0.27 ppm) and nitrogen species (nitrate and nitrite were below detection, ammonium was 1.25 × 10 −4 ppm). All four genomic bins were manually curated (see Methods) to remove potential contaminant sequences (Table 1). Based on the size and completeness of the bins we estimated the average size of all these genomes to be around 1.5 Mbp. The average gene size is relatively small (716 (DG-33-1), 688 (DG-33), 821 (bin YNP_45) and 820 (YNP_N21) bp) compared to the average prokaryote (∼920 bp 7 ). These observations indicate that SAGMEG genomes have been subjected to genome streamlining, possibly as a result of nutrient limitation 8 .
Phylogenetic identification of the genomic bins based on 16S rRNA genes (excluding DG-33-1, for which no such gene was recovered) revealed that they represent distinct lineages within the SAGMEG clade (Fig. 1). Both of the YNP bins are closely related to clones that were previously recovered from Obsidian Pool at YNP 5 , while WOR bin DG-33 is similar to rRNA sequences from subseafloor sediments (Fig. 1). rRNA gene phylogeny places SAGMEG Archaea as a basal lineage within the Euryarchaeota. We were not able to resolve the branching position of SAGMEG within the Euryarchaeota solely on the basis of rRNA phylogenies. The placement of the group is also inconsistent in the literature 4,9 . Therefore, to investigate the phylogenetic position of SAGMEG within the Archaea in more detail, we constructed phylogenetic trees based on concatenated ribosomal proteins ( Supplementary Fig. 2), and more comprehensively using 48 phylogenetically conserved markers ( Fig. 2 and Supplementary Fig. 3). These analyses The number of individual gene markers, completeness, contamination and strain heterogeneity were estimated using CheckM 38 . *This potential contamination has been shown to be Archaeal and thus may be overestimated; see Methods ('Genomic assembly, binning, and annotation' section) and Supplementary Fig. 10 for details.
revealed that SAGMEG Archaea form a distinct clade within the Euryarchaeota that represents a sister group to the class Thermococci.
To reflect their affiliation with the subsurface biosphere, we propose that they be given a new class name, 'Hadesarchaea'.
The first hints into the metabolic capabilities of the Hadesarchaea came from analyses of the stable isotopic composition of Archaeal cells (Bathyarchaeota, Thermoplasmatales, SAGMEG and Lokiarchaeota) in deep-sea sediments, which indicate consistently that all members

Euryarchaeota
Hadesarchaea (SAGMEG) Figure 1 | Phylogenetic analyses of 16S rRNA genes from three of the genomic bins in relation to those from a variety of other environments. Bin DG-33-1 lacked a gene long enough to accurately resolve its phylogeny. Taxa labelled in blue are from the terrestrial subsurface, orange are from the oceanic subsurface, green are from deep-sea sediments, purple are from shallow aquatic sediments, and red are from terrestrial hot springs. The phylogeny was generated based on 1,682 unambiguous characters using the RAxML (maximum likelihood) method with gamma rate distribution model, in the ARB software package 47 . Bootstrap values were generated using UFBoot 48 with 10,000 replicates; those >95 are designated with filled circles and those >70 are shown as open circles. The numbers of sequences present in the collapsed groups are labelled on the wedges.
of this Archaeal community incorporate organic carbon, suggesting a heterotrophic lifestyle 3 . Comparative genomic analyses of the obtained genome bins reveal that Hadesarchaea share several carbon pathways with other Euryarchaeota members. We could not identify genes encoding methyl-CoM reductase, indicating that Hadesarchaea are probably not involved in methanogenesis in a conventional manner. However, they do have a variety of genes for carbon metabolism present in methanogens and other Euryarchaeal lineages. All four Hadesarchaeal genome bins lack most of the genes for the forward and reverse tricarboxylic acid (TCA) cycle (or the glycoxylate bypass) for the production of acetyl-CoA. Instead, they contain genes for the C 1 pathway (the carbon monoxide dehydrogenase pathway), which is typically used by methanogens and numerous other anaerobes for CO 2 fixation or production 10 . The YNP genomic bins (YNP_45 and YNP_N21) lack several genes for the CO dehydrogenase pathway, including those encoding formylmethanofuran-tetrahydromethanopterin N-formyltransferase ( ftr), methenyltetrahydromethanopterin cyclohydrolase (mch) and methylenetetrahydromethanopterin dehydrogenase (mtd), while the WOR bin DG-33 encodes all three (Fig. 3). However, bin YNP_N21 has a mch gene, suggesting that the absence of these genes may be a result of the incomplete draft assemblies that were obtained. Alternatively, both YNP genomic bins have D-3-phosphoglycerate dehydrogenase, phosphoserine phosphatase and glycine hydroxymethyltransferase, which may be used to generate 5,10-methylene-THMPT from glycerate 3-phosphate (as illustrated for bin YNP_45 in Fig. 3). This is referred to as the glycine pathway, and is thought to be an ancestral form of carbon fixation 11 . However, it is possible that these C 1 -pathway genes are used in reverse for CO 2 production 12 .
The most complete genome bins YNP_45 and YNP_N21 lack the key glycolytic hexokinase, glucokinase ( pfkC) and pyruvate kinase ( pyk) genes, resulting in an incomplete glycolytic pathway.  ( porA) and fructose-1,6-bisphosphatases ( fbp), indicating that these Archaea contain a complete pathway for gluconeogenesis. With the exception of bin DG-33, all of the genomic bins contain a gene encoding a ribulose biphosphate carboxylase (RubisCO) homologue (Fig. 3). To confirm that the genes are related to known RubisCOs, we generated a phylogenetic tree of these proteins ( Supplementary Fig. 4). This analysis revealed that the Hadesarchaeal RubisCOs form a distinct clade within Archaeal RubisCO type III proteins, which were recently shown to be involved in AMP metabolism in other Euryarchaea 13 . However, the genomes of previously described Archaea lacked Calvin-Benson-Bassham (CBB) cycles, which is the pathway for carbon fixation using RubisCO 13 . Interestingly, YNP_45 and YNP_N21 have near-complete CBB pathways, only lacking two CBB genes: phosphoribulokinase ( prk) and fructose-1,6-bisphosphatase II ( glpX). The ability to fix inorganic carbon opens up a new source of carbon in the deep subsurface, where dissolved organic carbon has been shown to be extremely low 14,15 but inorganic carbon is readily available 16 .
Acetate and hydrogen constitute an important link between respiration and fermentative processes in the subsurface 17 . Hydrogen is believed to be a primary substrate for microbial growth in the subsurface and in hot springs 5,18,19 . Hadesarchaea constitute a significant proportion of microbial communities in hydrogen-rich hot springs at YNP 5 , and all genome bins contain genes encoding acetyl-CoA synthetase for the conversion of acetate into acetyl-CoA. In addition, these bins also contain genes encoding several hydrogenases, including Ni-dependent hydrogenase, Ni,Fe hydrogenase and coenzyme F 420 hydrogenase. Ni,Fe hydrogenases are capable of both H 2 consumption and production, and all the Hadesarchaea genomic bins contain the genes encoding the small and large subunits, G-components, as well as maturation factors. It is possible that the Ni,Fe hydrogenases, F 420 -hydrogen dehydrogenase and cytochrome b5 genes (also present in all four genomic bins) are used to derive energy by electron transfer from H 2 , as demonstrated in other Euryarchaeota 20 . We did not identify any Fe-hydrogenase encoding genes, which are generally assumed to  Figure 3 | Diagram of metabolic capabilities of Hadesarchaea. Black arrows represent genes that were identified in bin YNP_45 and YNP_N21. Grey arrows are those that were not found in the YNP bins, but are present in WOR bin DG-33-1. DNRA, Denitrification and dissimilatory nitrite reduction to ammonia; FDH, format dehydrogenase; Hdr, membrane-type heterodisulphide hydrogenases; F 420 , coenzyme F 420 ; V, complex V ATPase; COX, carbon-monoxide dehydrogenase; Ni,FeIII, Ni,Fe-dependent hydrogenases; Nit, nitrite reductase; Hyd, cytoplasmic hydrogenases; Nuo, NADH dehydrogenases; Sdh, succinate dehydrogenases; Znu, zinc transporter; Liv, a branched amino-acid transporter; Mal and Msm, monosaccharide transporters; PST, phosphate transporter. be associated with H 2 production and have only been identified in bacteria 20 , in the Hadesarchaeal genome bins.
Interestingly, some of these hydrogenases have been demonstrated to be involved in sulphur reduction (sulphidogenesis) 21 and sulphide oxidation 22 . The YNP genomic bins YNP_45 and YNP_N21 contain phylogenetically distinct genes that are related to sulphohydrogenase (Supplementary Fig. 5). They also encode the nuoB, nuoH and ndhI genes of the NADH:quinone oxidoreductase, which are essential for transferring electrons from NADH to S 0 (ref. 21). As these two Archaeal bins are associated with a hot sulphur spring environment, it seems likely that these genes have a role in sulphur cycling. Cultured species in the phylogenetically related sister group Thermococci are commonly shown to be reducing S 0 to sulphide 23 .
Carbon monoxide is a common gas in hydrothermal environments 24 and is produced by thermal decomposition of organic matter or as a side product of some anaerobes 25 . It is therefore likely to occur widely in sediments and the subsurface. It has been estimated that after sulphate reduction, CO oxidation is the most energetically favourable anaerobic reaction in the deep terrestrial subsurface 15 . In addition to carbon monoxide (cox) genes, the YNP_45 Hadesarchaea bin also has distinct genes for catalytic nickel-containing CO dehydrogenase (CooS) and the iron sulphur subunits (CooF), suggesting that they are able to oxidize CO and H 2 O to CO 2 and H 2 as an energy source. The other bins are missing cooS, but have genes with sequence similarity to cooF present in Thermococcus spp. Phylogenetic analyses of these proteins show they are affiliated with 'clade E' (Supplementary Fig. 6) 12 and related to other methanogenic Euryarchaeota. The CooF sequences of Thermococcus spp, which also belong to this clade, have the ability to couple anaerobic CO oxidation to hydrogen production using cooS genes 26 . If the Hadesarchaea are deriving energy from the oxidation of CO, then the C 1 pathway and hydrogenases are probably producing CO 2 and H 2 27 . CO oxidation coupled to anaerobic nitrate reduction is an energetically favourable reaction 28 that has previously been demonstrated in bacteria 29 . Potentially, Hadesarchaea may be coupling CO oxidation to the reduction of nitrite to ammonia. We identified a distinct nitrite reductase in bin YNP_N21 (Supplementary Fig. 7). Electron acceptors such as nitrite are generally thought to be depleted in the subsurface 16 , but it has been shown that the deep terrestrial subsurface contains considerable concentrations of nitrate that may be derived from radiolytic sources 30 . Thus, it seems feasible that these nitrite reductases can be used to derive energy.
The current understanding of the diversity, biology and ecology of the Archaea is very limited, when considering how few of the known phyla have been cultured or genomically explored, especially considering that many of these enigmatic lineages are predominant in nature. In the current study, we have obtained the first genomic sequences of a widespread group related to Euryarchaeota 4 , here named Hadesarchaea. Inference of their metabolic capabilities indicates that they are probably deriving energy from the oxidation of carbon monoxide, which may be coupled to H 2 O or nitrite reduction in the terrestrial subsurface. They contain a variety of central carbon metabolic (C 1 pathway) genes found in methanogens, which may be used for carbon fixation. The reconstruction of these novel genomes expands our knowledge of the evolutionary histories and metabolic potential of these widespread Archaea. Their metabolic features encoded in relatively streamlined genomes make them prominent members of the deep dark subsurface biosphere.

Methods
Sample collection and processing. The estuary genomic libraries were obtained as described previously 31 6 . Each core was sectioned into 2 cm intervals.
From each site, one core was subsampled for geochemical analyses, and the other core was subsampled for DNA extractions. The sulphate, sulphide and methane profiles of these cores indicated a distinct sulphur and methane-cycling zonation. The sulphate-reducing zone (SRZ) started in sediment layers where sulphate was abundant and sulphide just started to accumulate (8-12 cm). The methaneproducing zone (MPZ) was characterized by accumulating methane (a single sample from site 1 of this zone was taken at 52-54 cm (ref. 6). DNA was extracted from these sediments using the UltraClean Mega Soil DNA Isolation Kit (MoBio), using 6 g of sediment, and stored at −80°C until use. Hot spring sediment samples from YNP were collected from a hot spring in Lower Culex Basin (from a site named as 10Y13 at 44°34.23 N, 110°47.41 W) as described previously 32 . This uncharacterized hot spring has a temperature around 70°C and pH of 8.6, and the hot spring sediments appeared light brown in colour. Sediment samples were taken using 50 ml falcon tubes, preserved in sucrose lysis buffer and kept at −20°C until further use. DNA used for the metagenomic library was extracted from ∼500 mg of sediment using cetyltrimethyl ammonium bromide (CTAB)/phenol/chloroform method 33 .
Single-cell genomics. Cell fractions were obtained from the Yellowstone hot spring sediment sample using Nycodenz density gradient centrifugation 34 , and sent to the Bigelow Laboratory for Ocean Sciences for flow sorting and amplification to generate single-cell amplified genomes (SAGs) as described previously 32 . A 384-well microtitre plate was then screened for Archaea using a LightCycler 480 qPCR instrument (Roche) with SYBR Green I Master kit and utilizing an Archaea-specific primer pair, A519F (5′-CAGCMGCCGCGGTAA-3′) and A1041R (5′-GGCCATGCACCWCCTCTC-3′). A SAG positive for the Archaeal 16S rRNA gene sequence was selected (identified by sequencing its 16S rDNA using the Sanger sequencing method), a sequencing library was prepared using NexteraXT library preparation kit (Illumina), and the library was paired-end sequenced (2 × 300 bp) on a MiSeq instrument (Illumina). 16S rRNA amplification and Sanger sequencing of the multiple displacement amplication (MDA)-amplified DNA from the cell yielded one phylotype belonging to SAGMEG. After sequencing, no contaminants were detected based on BLASTn comparison of contigs with the NCBI nt database. The number of high-quality sequences generated from the MiSeq run was 3,260,839, representing a total of 0.5 Gbp of sequence data, which were then assembled with SPAdes version 3.5.0 35 using the following parameters: -sc -careful -k 21,33,55,75,101,125.
Genomic assembly, binning and annotation. Illumina sequencing adapters from the read pairs (2 × 150 bp) were removed from both YNP and WOR libraries using Scythe (https://github.com/vsbuffalo/scythe), and low-quality reads (Q < 30) were trimmed with Sickle (https://github.com/najoshi/sickle). Reads with three or more Ns or with average quality score of less than Q20 and a length <50 bps were removed. Trimmed, screened, paired-end Illumina reads from WOR were assembled using IDBA-UD 36 with the following parameters (-pre_correction -mink 55 -maxk 95 -step 10 -seed_kmer 55). To maximize assembly, reads from different sites were co-assembled. The WOR assembly was generated from high-quality reads (378,027,948, average read length 124 bp and average insert 284 bp) of site 1 (52-54 cm) from WOR as described previously 31 .
The YNP hot spring assembly consisted of 666,428,020 high-quality paired-end Illumina reads representing a total of 86.4 Gbp of sequence data with an estimated insert size of 337 bp. Assembly was performed using the IDBA-UD assembler 36 with the following parameter: -maxk 124. Contigs of genes of particular interest were checked for chimaeras by looking for dips in coverage within read mappings. Initial binning of the assembled fragments was performed using tetra-nucleotide frequency signatures using 5 kbp fragments of the contigs 37 . The ESOMs were manually delineated and curated based on clusters within the map (as shown in Supplementary Figs 8 and 9). Coverage was enhanced by recruiting reads (from each individual library/sample) to scaffolds by BLASTN (bitscore >75), which was then normalized to the number of reads from each library.
The accuracy of the binning was then assessed by checking the genomic bins to which each of the 5 kbp subportions of the contigs were assigned. If the contig was >15 kbp then the contig was assigned to the bin that contained the majority (>50%) of its 5 kbp subportions. The completeness, contamination and strain heterogeneity of the genomes within bins were then estimated using CheckM 38 . Some of the fragments were identified as contaminants based on their phylogenetic placement, and were manually removed. Closely related variants of a lineage were retained in a bin. Binning was also manually curated based on GC content, top blast hits and mate-pairings.
Examination of fragments identified as contaminants in the YNP_N21 bin revealed that CheckM 38 might be inflating contamination levels. CheckM identifies multiple copies of genes and then calls the fragment they are on as contaminants. We believe in the case of bin YNP_N21 that gene duplications are indeed apparent in the genome. For example, ornithine carbamoyltransferase was identified to be present on four different large (12-71 kb) contigs that are all clearly Archaeal and have tetranucleotide signature binning, blast taxa matching and GC content consistent with other Hadesarchaea. This suggests that these are multiple copies of that gene in the genome, and not contaminants. CheckM also identified seven copies in the bin as a gene on five different contigs; four of these hypothetical are present on two contigs (two on each of the contigs). These sequences are considerably different (only 26-35% similarity at the protein level) and have unique hits to genes in GenBank, including hypotheticals, diguanylate cyclases and dihydroorotate dehydrogenases. The estimate that this bin has 10% contamination is thus potentially overestimated. Removal of four relatively low-coverage contigs (NODE_1 0_length_48481_cov_4.44764_ID_19, NODE_22_length_22948_cov_6.3 2857_ID_43, NODE_32_length_12588_cov_4.53992_ID_63 and NODE_3 5_length_10923_cov_3.69392_ID_69) resulted in the reduction of the double copy markers from six to one. However, examination of these contigs revealed they are definitely Archaeal and related to other Hadesarchaea bins present in this study. We therefore decided to leave them in the bin and have flagged them as potential contaminants in the public databases.
Co-assembly of the Yellowstone SAGMEG (YNP_N21) SAG and metagenomic bin was performed by first extracting metagenomic reads mapped against the YNP_N21 bin contigs using bwa 39 and samtools 40 . This resulted in an improved assembly compared with that obtained just from the SAG reads (Supplementary Table 1 and Supplementary Fig. 10). The read coverage in contigs was calculated using bedtools 41 . Extracted reads were then co-assembled with SAG reads using SPAdes (version 3.5.0) 35 with the same parameters as the single-cell genome assembly. Resulting contigs larger than 5 kbp were manually checked for contamination and removed.
Genes were called using Prodigal 42 . Initially gene functions were assigned by comparison to the KEGG database 43 . The called functions were finalized by comparisons with functional databases, including PFAM 44 and arCOG 45 . Proteins with important ecological or geochemical relevance were selected for protein alignments using MUSCLE 46 and phylogenetic comparisons with top BLAST hits in NCBI.
A total of 48 arCOGs (Archaeal cluster of orthologous genes 45 ) previously determined to be universally present in single copies 51 were identified and extracted from the four SAGMEG genome bins using PSI-BLAST 51 . The 48 reference arCOGs and identified orthologues from the genome bins were aligned using MAFFT (mafft-linsi option) 52 . Alignments were trimmed for poorly aligned regions and gaps using the BMGE tool 50 (with the following settings: -m BLOSUM30 -g 0.5) and concatenated before phylogenomic inferences were perfomed using both maximum likelihood and Bayesian methods. Maximum likelihood analysis was performed using IQ-Tree 53 with the C60+LG mixture model for 1,000 ultra-rapid bootstraps and Bayesian inference was carried out using Phylobayes 54 with the CAT+GTR+Г mixture model of amino acid evolution for about 4,000 generations until convergence (maxdiff of 0.15, burnin of 2,500 generations and subsampling every tenth tree) for two of the independent chains out of the four.
Accession numbers. The genomes supporting the results of this article are available in NCBI Genbank under the BioProjectID PRJNA298487 (LQMP00000000 and LQMQ00000000 for the YNP genomic bins), and BioProjectID PRJNA270657 (LQHI00000000 and LQHJ00000000 for the WOR genomic bins).