Unique mobile elements and scalable gene flow at the prokaryote–eukaryote boundary revealed by circularized Asgard archaea genomes

Wu, Fabai; Speth, Daan R.; Philosof, Alon; Crémière, Antoine; Narayanan, Aditi; Barco, Roman A.; Connon, Stephanie A.; Amend, Jan P.; Antoshechkin, Igor A.; Orphan, Victoria J.

doi:10.1038/s41564-021-01039-y

Article
Open access
Published: 13 January 2022

Nature Microbiology volume 7, pages 200–212 (2022)

Abstract

Eukaryotic genomes are known to have garnered innovations from both archaeal and bacterial domains but the sequence of events that led to the complex gene repertoire of eukaryotes is largely unresolved. Here, through the enrichment of hydrothermal vent microorganisms, we recovered two circularized genomes of Heimdallarchaeum species that belong to an Asgard archaea clade phylogenetically closest to eukaryotes. These genomes reveal diverse mobile elements, including an integrative viral genome that bidirectionally replicates in a circular form and aloposons, transposons that encode the 5,000 amino acid-sized proteins Otus and Ephialtes. Heimdallarchaeal mobile elements have garnered various genes from bacteria and bacteriophages, likely playing a role in shuffling functions across domains. The numbers of archaea- and bacteria-related genes follow strikingly different scaling laws in Asgard archaea, exhibiting a genome size-dependent ratio and a functional division resembling the bacteria- and archaea-derived gene repertoire across eukaryotes. Bacterial gene import has thus likely been a continuous process unaltered by eukaryogenesis and scaled up through genome expansion. Our data further highlight the importance of viewing eukaryogenesis in a pan-Asgard context, which led to the proposal of a conceptual framework, that is, the Heimdall nucleation–decentralized innovation–hierarchical import model that accounts for the emergence of eukaryotic complexity.

Main

To chronicle the emergence of evolutionary innovation is a long-standing pursuit in biology. Due to scant record of reliable microscale fossils, resolving evolutionary history at the cellular scale relies primarily on molecular comparisons across present-day life, provided that phylogenetic relatives can be well delineated. Culture-independent metagenomics has substantially expanded our access to the Earth’s diverse biomes¹, including lineages carrying genetic imprints of critical evolutionary events through deep time. The Heimdallarchaeota, previously referred to as the ancient archaea group (AAG)², are one such group and the closest known relative of eukaryotes as suggested by phylogenomics^3,4,5. Heimdallarchaeotes and their related lineages collectively called the Asgard archaea contain a sizeable repertoire of eukaryotic signature proteins (ESPs)^3,6,7. However, the genetic make-up of Heimdallarchaeotes has so far only been inferred from a few metagenome-assembled genomes (MAGs), which are fragmented and suffer from uncertainty in their completeness and accuracy^{3,7,8,9,10,11,12}. Mobile (genetic) elements, including transposons, viruses and plasmids, which are known to play dominant roles in evolution¹³, are frequently misassembled, omitted or misassigned during MAG assembly and binning¹⁴. These drawbacks propagate into uncertainties in the resolution of archaeal lineages related to eukaryotes and can obscure the drivers of evolutionary crosstalk and divergence between eukaryotes and their prokaryotic relatives.

Results

Circular Heimdallarchaeota genomes

Recovering contiguous genomes from environmental samples is notoriously challenging due to their enormous biodiversity and strain-level heterogeneity, while most known lineages have been hard to isolate due to their unresolved metabolism and/or poorly understood partner-dependent growth. We overcame these limitations by combining cultivation methods with molecular community profiling to progressively dissect environmental microbial enrichment cultures where a clonal expansion of our species of interest was accompanied by a reduction in diversity (Extended Data Fig. 1 and Methods). Using anaerobic cultivation methods, we enriched a member of the Heimdallarchaeota AAG clade from a barite-rich rock retrieved in 2017 from the Auka hydrothermal vent field (23° 57′ N, 108° 51′ W) located in the southern Pescadero Basin near the southern tip of the Gulf of California at a water depth of 3,674 m (ref. ¹⁵). While initially below detection, this rock-associated AAG phylotype emerged at 1–4% of the 16S ribosomal RNA gene relative abundance in 3 lactate-supplemented, anaerobic enrichment cultures incubated at 40 °C after 7 months (Extended Data Fig. 1, Supplementary Tables 1–3 and Supplementary Note 1). In an independent set of enrichments inoculated with sediments collected from the Auka site in 2018 (23° 53′ N, 108° 48′ W), alkane-supplemented anaerobic incubations at 37 °C additionally yielded a second AAG phylotype that increased in 16S rRNA gene relative abundance from 0.03 to 4–7% after 9 months (Supplementary Tables 4 and 5 and Supplementary Note 1).

De novo assembly^16,17,18 of Nanopore long-read and Illumina paired-end sequencing of genomic DNA recovered from these enrichments (Supplementary Table 6) resulted in complete circularized genomes of the two AAG species from the barite and sediment enrichment cultures, with genome sizes of 3.32 and 3.08 million base pairs (Mbp), respectively. The two circular AAG genomes showed 82% alignment fraction, 88% average nucleotide identity (ANI), 90% amino acid identity (AAI) and 97.9% 16S rRNA identity (Supplementary Table 7), which demarcate a clear species boundary¹⁹ within the same genus²⁰. Thus, we propose the species names Candidatus Heimdallarchaeum endolithica PR6 (endo- (Greek), within; lithos (Greek), rock) and Candidatus Heimdallarchaeum aukensis PM71 (Auka, the local vent field) denoting their environmental origins (Fig. 1a).
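To make the reasoning behind this species and genus call concrete, the following minimal sketch applies two widely cited cut-offs to a pair of genomes: roughly 95% ANI as a species boundary and roughly 65% AAI as a genus boundary. Both thresholds and the function name `classify_pair` are assumptions chosen for illustration. The criteria actually applied here are those of refs. 19 and 20, which may differ in detail.

```python
def classify_pair(ani, aai, species_ani=95.0, genus_aai=65.0):
    """Coarse relationship label from pairwise identity percentages.

    The default thresholds are commonly cited rules of thumb (assumptions),
    not values taken from this study.
    """
    if ani >= species_ani:
        return "same species"
    if aai >= genus_aai:
        return "same genus, different species"
    return "different genera"


# Values reported above for the two circular genomes: 88% ANI, 90% AAI.
print(classify_pair(ani=88.0, aai=90.0))
```

Under these assumed thresholds, the two circular genomes fall below the species cut-off but well above the genus cut-off, in line with the proposal of two species within one genus.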

**Fig. 1: Complete genomes of *Ca. Heimdallarchaeum* spp. provide insights for eukaryogenesis.**

Taxonomy and metabolism

The taxonomy of Asgard archaea is yet to reach consensus. The initial Heimdallarchaeota³, despite remaining monophyletic in all phylogenomic analyses, was proposed to either split into four phyla (Heimdall-, Gerd-, Kari-, Hodarchaeota)⁷ or alternatively grouped under a single order named the Heimdallarchaeia²¹. In this study, we collectively refer to them as ‘the Heimdall group’. Phylogenomic analyses based on 76 concatenated ribosomal proteins show that the Heimdallarchaeum spp. constitute a deeper-branching clade related to the previously described MAG AB_125 (ref. ³), well placed under ‘Heimdall’ in all proposed classification strategies (Fig. 1b and Extended Data Fig. 2). Additionally, we also identified a fragmented MAG B53_G16²² (299 contigs, 1.67 Mbp, approximately 50% complete) from the Guaymas Basin, formerly assigned under the Pacearchaeota, which we now designate as a strain of Ca. H. endolithica, with an average ANI of 97.5% compared with our PR6 strain.

Ca. Heimdallarchaeum spp. are predicted to garner energy by anaerobically oxidizing organic substrates via processes involving a partial tricarboxylic acid (TCA) cycle and, given the absence of discernible terminal electron accepting pathways, dissipating electrons via H₂ production (Extended Data Fig. 3a). They each encode one membrane-bound hydrogenase (MBH) complex and two cytosolic sulfhydrogenase complexes (SHYI and SHYII) (Fig. 1c). Hydrogen has been hypothesized to act as a syntrophic intermediate bridging archaea and bacteria before the engulfment of the mitochondrial ancestor by an (Asgard) archaeal ancestor of eukaryotes^4,23,24,25. Indeed, in the recent description of Ca. Prometheoarchaeum syntrophicum, MBH associated with unusual membrane extensions were hypothesized to facilitate cell–cell contact and hydrogen exchange with syntrophic partner bacteria²³. Following from this concept, we postulate that cytosolic hydrogen generation by SHY, as found in the Ca. Heimdallarchaeum spp., could confer a selective advantage for a hydrogen-dependent endosymbiotic strategy (Fig. 1c).

Eukaryotic signatures

One of the many challenges of resolving the relationship between archaea and eukaryotes is the curation of representative, high-quality genomes across lineages at their interface. To this end, we verified the complete marker gene coverage of the Ca. Heimdallarchaeum spp. as well as six other highly contiguous Asgard archaea genomes (Extended Data Fig. 4a, Methods and Supplementary Note 2). They include three previously described^3,23,26 and three assembled in this study from our enrichment cultures—a Lokiarchaeote that we have named Ca. Harpocratesius repetitus FW102, a Thorarchaeote FW25 and a Heimdall group Gerdarchaeote AC18 (Fig. 1d). Notably, the dual-contig assembly Ca. H. repetitus FW102, which relates to Ca. P. syntrophicum MK_D1 at the family level, contains two complete sets of 16S/23S rRNA genes, potentially relevant to their growth strategies in the environment²⁷.

These complete genomes confirmed that many of the previously described ESPs^3,6 are distributed universally across known Asgard phyla (Fig. 1d), specifically genes involved in (1) membrane remodelling (endosomal sorting complexes required for transport components VPS4/VPS22/VPS25), (2) cytoskeleton organization (actin, profilin and gelsolin (except in Odin LCB_4)), (3) protein N-linked glycosylation (OST3/STT3/ribophorin) and (4) intracellular trafficking (roadblock/LC7/dynein family and a large repertoire of small GTPases). On the other hand, enzymes involved in the synthesis of ester-linked phospholipids, which are critical for closing the ‘lipid divide’ between the Archaea and Eukaryota domains^23,26, show a mosaic distribution across the Asgard archaea lineages (Fig. 1d). For example, both Ca. Heimdallarchaeum spp. in our study lack 1-acyl-sn-glycerol-3-phosphate acyltransferase involved in the attachment of the second fatty acid chain to the glycerol backbone²⁸.

Maximum-likelihood analysis using a previously described approach based on the SR4 model^3,29 and a concatenation of a complete set of 56 single-copy markers indicates a close relationship between the Heimdall group archaea, which include the Heimdallarchaeum spp., and eukaryotes (Fig. 1d). This supports a parsimonious topology reported in multiple studies^3,5,7. We additionally produced a set of customized Asgard-specific Hidden Markov Models (HMMs) (Supplementary Data 1) that complement existing Archaea-specific HMMs along with a set of filtering parameters (Methods and Supplementary Tables 8 and 9) as resources. Maximum-likelihood analyses of a greater diversity of Asgard archaea^7,11,12,16 that were selected through the framework described above (19 of 282 evaluated MAGs shown in Extended Data Fig. 2) further verified the phylogenetic topology, placing the Heimdall group closest to eukaryotes (Extended Data Fig. 4b). We note that statistical model selection, taxonomic evenness and rooting assumptions represent ongoing debates for deep phylogeny^5,7. The circularized genomes and resources described in this study may assist with future analyses of the Asgard archaea using a broader range of statistical parameters and emerging high-quality genomes.

Abundant repetitive features

Our approach retained a substantial number of non-tandem repeats (3% of genome lengths) and tandem CRISPR or intragenic repeats (212 and 262 counts) within the circular Ca. Heimdallarchaeum spp. genomes (Fig. 2a,b). This is notably more prominent relative to the recently constructed circular genomes of Ca. P. syntrophicum²³, where no tandem repeats and only 1% of non-tandem repeats were observed.

**Fig. 2: Circular *Heimdallarchaeum* genomes reveal abundant repeats belonging to complex networks of transposases/integrases and CRISPR–Cas operons.**

Non-tandem repeats in the Ca. Heimdallarchaeum spp. overlap prominently with one of the most pervasive mechanisms of gene transfer within and between genomes, that is, a total of 11 families of transposases/integrases, 7 of which have multiplied and transposed to result in up to 27 copies within an individual genome (Fig. 2a). These and other transposases/integrases found in Asgard archaea primarily cluster with various small families within the 96,367 transposase/integrase sequences recovered from the prokaryotic Genome Taxonomy Database (GTDB)³⁰ (Fig. 2c). Despite the under-representation of archaeal sequences in public databases and in the transposase/integrase dataset in this study, they have representatives in almost all clusters. The intermingled evolutionary relationship between archaeal and bacterial transposases/integrases documented in this study can potentially be both the result of, and contributor to, the gene flow observed between these two domains^31,32,33.

The circular genomes of Ca. Heimdallarchaeum spp. contain seven CRISPR–Cas systems (Fig. 2b), including five complete operons (labelled C1–3, 5, 6), one array-free operon (C7) and one orphan array (C4) (see Extended Data Fig. 5 for the complete gene organizations). Contrasting the overall gene conservation between the two genomes, these CRISPR–Cas systems exhibit strong variability and site-specific integration (Fig. 2d). For example, C5 and C6 exhibited a complete local operon swap, while C3 and C4 were integrated immediately next to transfer RNA genes, a feature often exploited by bacteriophages³⁴ and other Heimdallarchaeal mobile elements (see examples in Fig. 3 below).

**Fig. 3: Unique Heimdallarchaeal mobile elements with viral and transposable features.**

CRISPR–Cas-guided discovery of mobile elements

We recruited a total of 1,565 Heimdall-associated CRISPR spacers in our Pescadero metagenomes constructed in this study and previously published Guaymas metagenomes (Methods). They revealed eight protospacers within four distinct mobile elements, which are hosted by Ca. Heimdallarchaeum spp. and are unrelated to any previously reported mobile elements (outlined in Fig. 3a). We named them Heimdallarchaeal mobile elements HeimM1 and HeimM2 and Heimdallarchaeal viruses HeimV1 and HeimV2, respectively.
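The spacer-to-protospacer search described above can be illustrated with a deliberately naive sketch. It scans each contig for exact matches to each spacer on both strands. This is not the procedure used in this study (see Methods), which would normally rely on an alignment tool that tolerates mismatches. The function names and the toy sequences below are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def find_protospacers(spacers, contigs):
    """Return (spacer_id, contig_id, start, strand) for every exact match."""
    hits = []
    for spacer_id, spacer in spacers.items():
        forward = spacer.upper()
        reverse = reverse_complement(forward)
        for contig_id, contig in contigs.items():
            seq = contig.upper()
            for strand, query in (("+", forward), ("-", reverse)):
                if strand == "-" and reverse == forward:
                    continue  # palindromic spacer: avoid double counting
                start = seq.find(query)
                while start != -1:
                    hits.append((spacer_id, contig_id, start, strand))
                    start = seq.find(query, start + 1)
    return hits


# Toy sequences, invented for illustration only.
toy_spacers = {"sp1": "GATTACAGGT"}
toy_contigs = {
    "c1": "TTTGATTACAGGTCCC",   # carries the spacer on the forward strand
    "c2": "GGACCTGTAATCAA",     # carries its reverse complement
}
print(find_protospacers(toy_spacers, toy_contigs))
```

Contigs that collect hits from spacers of a known host array are candidate mobile elements of that host, which is the logic behind the CRISPR–Cas-guided discovery described in this section.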

HeimM1, detected within the sediment-hosted Ca. H. aukensis, is a C2-associated small defence island encoding an efflux pump CcmA and contains a protospacer that matches a spacer at the same genomic locus in the rock-hosted Ca. H. endolithica PR6 C1 (Fig. 3b). Such a territorial dispute within the genome, as well as the site-specific integrations of CRISPR–Cas outlined above, exemplify the emerging view that defence systems are mobile elements themselves³⁵ and contribute to gene flow between habitats.

HeimM2 (8 kbp) encodes an internalin-like, leucine-rich repeat peptide and an enzyme homologous to rRNA self-splicing homing endonucleases (Fig. 3c). The latter are typically found as group I introns embedded within rRNA genes and are considered selfish elements. In this study, this gene was part of a mobile element inserted exactly between the only copy of the 16S rRNA gene and the tRNA gene ArgTCT, suggesting that it has likely been co-opted by HeimM2 for site-specific integration at this site.

The putative integrated viruses HeimV1 and HeimV2 are both found in Ca. H. endolithica. Each encodes proteins with homologues preferentially found in the viral database IMG/VR v.3³⁶ compared to the microbial genome database GTDB v.202, and viral structural proteins predicted by machine learning-based annotations (PhANNs³⁷) (Fig. 3d,e and Extended Data Fig. 6).

HeimV2 (44 kbp), integrated at the same site as HeimM2, may be a hybrid between a virus and a previously undescribed class of transposons, which we tentatively call aloposons, in reference to the twin giants Aloadae in Greek mythology. They share the following features (Fig. 3d). First, they all contain tandem genes encoding proteins 3,000–6,000 amino acids in size, which we refer to as Otus and Ephialtes, the Aloadae twins. Second, they all integrate at different tRNA sites downstream of the giant genes. Aloposon2 in Ca. H. endolithica and Aloposon3 in Ca. H. aukensis represent a highly conserved element that has transposed from one tRNA site to the other during its coevolution with its host. Third, they all encode four consecutive genes upstream of the giant genes, including a gene encoding a bacterial MinD/ParA-like AAA family ATPase. Additionally, we found tandem giant genes in two Thorarchaeota MAGs showing distant homology to the Heimdallarchaeum giant proteins, as well as many unrelated giant genes across the Asgard archaea, some of which may also be part of Asgard mobile elements (Extended Data Fig. 7).

Putative virus HeimV1 (30 kbp) is a circular element with a highly polycistronic gene arrangement and an enrichment in nucleic acid-processing enzymes, viral structural proteins and viral gene homologues (Fig. 3e). As shown in Fig. 3f, HeimV1 exists in two states. Besides the genome-integrated lysogenic state found in one of the incubations, where its sequencing read abundance was at the same level as its genomic neighbourhood, in another enrichment incubation, HeimV1 showed an anomalously high read abundance relative to the host Ca. H. endolithica, suggestive of active replication. PCR and Sanger sequencing further confirmed the circularized state of HeimV1 as well as its integration between the host transposase and tRNA genes. Furthermore, the detailed sequencing read abundance profile across HeimV1 shows the characteristic V shape of an unsynchronized, bidirectionally self-replicating population of circular DNA elements (Fig. 3f). Such a well-defined profile can only emerge if the replications in each HeimV1 circular element initiate at a defined origin of replication.
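Why a defined origin produces such a profile can be seen in a simple idealized model, sketched below. It assumes that replication progress is uniformly distributed across the population and that both forks travel at equal speed from the origin to the opposite terminus. Under these assumptions the mean copy number falls linearly from 2 at the origin to 1 at the terminus, giving a V-shaped (or inverted V-shaped, depending on where the circle is linearized) coverage curve. The origin coordinate used here is hypothetical and the model is not the analysis performed in this study.

```python
def circular_distance(pos, origin, length):
    """Shortest distance between two positions on a circle."""
    d = abs(pos - origin) % length
    return min(d, length - d)


def expected_copy_number(pos, origin, length):
    """Mean copy number at `pos` in an unsynchronized replicating population.

    Assumes uniformly distributed replication progress and two forks that
    move at equal speed from the origin to the terminus (idealized model).
    """
    half = length / 2.0
    fraction_replicated = 1.0 - circular_distance(pos, origin, length) / half
    return 1.0 + fraction_replicated


LENGTH = 30_000   # HeimV1 is about 30 kbp
ORIGIN = 12_000   # hypothetical origin coordinate, for illustration only

positions = list(range(0, LENGTH, 1_500))
profile = [expected_copy_number(p, ORIGIN, LENGTH) for p in positions]

# The coverage maximum marks the origin and the minimum marks the terminus.
origin_estimate = positions[profile.index(max(profile))]
terminus_estimate = positions[profile.index(min(profile))]
print(origin_estimate, terminus_estimate)
```

If initiation sites were instead scattered around the circle, the per-element profiles would average out and no sharp extremum would remain, which is why a well-defined profile points to a single origin.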

The mobile elements described above also influence ecosystems beyond the southern Pescadero Basin vent system. CRISPR spacers targeting HeimV1 and HeimV2 were detected in metagenomes from the Guaymas Basin²², a hydrothermal vent site 400 km northwest of the southern Pescadero Basin. The Pescadero-derived mobile element HeimM1 in Ca. H. aukensis also exists in the Ca. H. endolithica B53_G16 MAG assembled from the Guaymas Basin. Furthermore, HeimV1-related proviruses encoding tail fibre protein homologues are also found in the Heimdall group MAGs from the Gulf of Mexico in the Atlantic (Gerdarchaeota clade E44_bin34 (ref. ⁹)) and from the South China Sea (Hodarchaeota clade B3_Heim¹⁰) on the other side of the Pacific (Fig. 3e). Notably, the contig in the E44_bin34 MAG maintains the same gene synteny around the tail fibre gene as in HeimV1, albeit with only approximately 30% sequence homology. These observations indicate the expansive distribution of these mobile elements in diverse lineages of Heimdall group archaea across a large geographical range in deep sea ecosystems.

Diverse evolutionary origins of Heimdallarchaeal viruses

Phylogenetic analyses of viral genes indicate that HeimV1 and HeimV2 share their evolutionary origins with bacteriophages. As shown in Fig. 4a, the viral integrase of HeimV1 is phylogenetically most closely related to integrases found in environmental bacteriophages identified to be hosted by the phylum Bacteroidetes, along with integrases found in seven families of Bacteroidetes and other viruses with microbial hosts that are unidentified. Similarly, independent phylogenetic analyses of homologues of proteins affiliated with prophage transcriptional regulators, IbrA and IbrB, which are encoded by HeimV2 simultaneously found their closest relatives in bacteriophages or unidentified elements targeting diverse members of phylum Firmicutes (Fig. 4b and Extended Data Fig. 8).

**Fig. 4: Gene phylogeny of Heimdallarchaeal viruses and other mobile elements.**

While most viruses encoding genes related to HeimV1 and HeimV2 are unclassified, several belong to the order Caudovirales, including members of the family Siphoviridae. Well-studied members of Caudovirales are known to be tailed bacteriophages packaging double-stranded DNA, in line with the machine learning-based predictions of tail fibres in both HeimV1 and HeimV2 (>90% confidence; Fig. 3d,e).

Heimdallarchaeal viruses and other mobile elements associated with the Heimdall group archaea are predicted to have origins in both bacteria and archaea. For example, HeimV1 encodes a protein with two unknown domains flanking a full-length CTPase homologous to Noc/ParB/SpoJ-like proteins that bind DNA and regulate bacterial cell division (Fig. 4c). On the other hand, the HeimV1 methylase gene appears to have evolved from the Asgard archaea and is potentially involved in evading host detection (Fig. 4d). Phylogenetic analysis suggests that divergence of this viral methylase from its host was an ancient event that occurred before the divergence between the Heimdall and Loki group archaea, estimated to have taken place around two billion years ago³⁸.

A survey of Heimdallarchaeum-associated protospacers within the entire Pescadero/Guaymas metagenomic dataset yielded 56 total contigs belonging to the putative Heimdall group mobile elements (Supplementary Data 2). Most coding sequences (76.9%) have no apparent homology with known microorganisms and viruses, while another 13.1% have homologues in diverse bacteria (Fig. 4e), which is higher than the 8.9% archaeal fraction. This further suggests that mobile elements and viruses may play a prominent role in shaping the evolution of Heimdallarchaeota by introducing functional innovations of bacterial origin.

Asgard–eukaryote parallelism in bacterial gene import

To understand the consequence of cross-domain gene flow in the evolution of Asgard archaea, we performed protein orthology-based functional and taxonomic profiling³⁹ of the proteomes encoded by the complete genomes in this study. Functional analyses of the Asgard archaeal proteome based on clusters of orthologous groups (COGs)^39,40 revealed distinct categories of genes that are associated with different taxonomic groups (Fig. 5a). The Archaea-related proteins in Asgard archaea were predominantly represented by information processing functions, including translation (J), transcription (K) and replication and repair (L), which is similar to the key archaeal modules inherited by eukaryotes⁴¹. By contrast, the annotated bacteria-related proteins were preferentially enriched in metabolic functions, including energy production and conversion (C) and the metabolism and transport of amino acids (E), carbohydrates (G) and inorganic ions (P). Different from both the above groups, nearly half of eukaryote-related proteins within the Asgard genomes were dedicated to intracellular trafficking and secretion (U), and cytoskeleton (Z) and protein modification (O) functions.

**Fig. 5: Functional and taxonomic profiling of gene content across Asgard archaea.**

The import of bacterial genes into archaea and eukaryotes has been independently explored^31,32,41,42. In this study, we show that the inheritance of information processing from the Archaea and metabolic functions from the Bacteria domain in the Asgard archaea is very similar to the signature of the eukaryotic genome profile. Strikingly, the archaeal:bacterial gene ratio forms an inverse relation with the genome size in Asgard archaea that is quantitatively comparable with previous characterizations across eukaryotes⁴¹ (Fig. 5b). Such a quantitative agreement on their genome size dependence suggests that the bacterial import of genomic material into eukaryotes may not necessitate an independent mechanism (such as endosymbiosis⁴²) or a dramatically different selective force from their closest archaeal relatives. Instead, genome size control alone may be sufficient to account for the over-representation of bacterial genes in some eukaryotes⁴³.

Domain-specific scaling of gene flow

Different scaling laws appear to govern the fluidity of genes with different taxonomic origins within the Asgard archaea. The total number of genes with closest orthologues in Archaea was remarkably invariable at approximately 900 genes across all Asgard archaeal representatives that span a threefold difference in genome size, from 1.5 Mbp in Odin LCB_4 to 4.4 Mbp in Lokiarchaeotes (Fig. 5c). While the archaeal reference database is currently significantly smaller than the bacterial one, which likely caused an underestimation of the exact number of archaea-related genes, the trend cannot be explained by such a database bias. On the other hand, we found that genome completeness and accuracy are key to capturing this feature since it is otherwise entirely obscured in Asgard genomes of variable completeness and contamination levels (Extended Data Fig. 9). By contrast, the bacterial, eukaryotic and taxonomically unassigned fractions of the genome increased linearly with the remaining portion of the genome. These scaling properties suggest a fundamental difference in the evolutionary plasticity between conserved archaeal ‘core’ genes and other fractions of the gene content with different evolutionary origins among the Asgard archaea.
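The contrast between a flat archaeal core and a linearly growing bacterial fraction, and the inverse archaeal:bacterial ratio that follows from it, can be reproduced with synthetic numbers. In the sketch below, only the approximately 900 archaea-related genes and the 1.5 to 4.4 Mbp size range come from the text. The bacterial slope of 300 genes per Mbp, the intermediate genome sizes and the use of total genome size as the x axis are hypothetical choices made purely for illustration, not measurements from this study.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic genome sizes (Mbp); only the end points are taken from the text.
genome_sizes_mbp = [1.5, 2.5, 3.5, 4.4]

ARCHAEAL_CORE = 900.0                    # approximate value stated in the text
HYPOTHETICAL_BACTERIAL_PER_MBP = 300.0   # invented slope, illustration only

archaeal = [ARCHAEAL_CORE for _ in genome_sizes_mbp]
bacterial = [HYPOTHETICAL_BACTERIAL_PER_MBP * s for s in genome_sizes_mbp]

archaeal_slope, _ = fit_line(genome_sizes_mbp, archaeal)
bacterial_slope, _ = fit_line(genome_sizes_mbp, bacterial)

# A constant numerator over a linearly growing denominator gives a ratio
# that is inversely related to genome size.
ratios = [a / b for a, b in zip(archaeal, bacterial)]
print(archaeal_slope, bacterial_slope, ratios)
```

Applied to real gene counts, a near-zero fitted slope for one gene class and a clearly positive slope for another would be the signature of the two scaling regimes described above.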

Decentralized eukaryotic innovation

Eukaryote-related proteins (ERPs) capture present-day Asgard–Eukaryota protein orthologues that are estimated to be most closely related to each other. They include, but are not restricted to, previously investigated ESPs^3,6,7—loosely defined as eukaryotic proteins with no archaeal or bacterial homologues in the predicted last eukaryotic common ancestor (LECA)⁴⁴. Our analyses show that the scaling property of ERPs is similar to bacteria-related but not archaea-related proteins (Fig. 5c), prompting us to explore their evolutionary fluidity across Asgard archaea lineages.

Beyond the ESPs described above, which are shared by all Asgard archaea (Fig. 1d), we found diverse families of ERPs existing in only one or two of the Asgard clades examined in this study (Fig. 6a). Comparison of the circular genomes of Ca. Heimdallarchaeum spp. and the Lokiarchaeote Ca. P. syntrophicum revealed fewer than half of their ERP families being shared, notably with members of the Heimdallarchaeum harbouring fewer ERPs overall, despite their closer phylogenetic relationship with eukaryotes (Fig. 6b). Furthermore, even species related at the genus (Ca. Heimdallarchaeum spp.) or family levels (within Thorarchaeota/Lokiarchaeota) have apparent differences in their ERP pools (Fig. 6b). Such a high mobility of ERPs in the recent evolutionary history of Asgard archaea suggests that many of these genes are involved in the auxiliary but not core cellular functions. They are likely, or could have been during their evolutionary history, shuffled as part of their mobilomes. Hence, the evolutionary entanglement between the Asgard archaea and the Eukaryota must be understood in the pan-Asgard space and in the context of genome size expansion.

**Fig. 6: Distribution of ERP genes and the hypothesized HDH model for eukaryotic origin.**

Thus, our analyses collectively suggest a plausible scenario where an ancestral Heimdall group archaeon with a small genome engaged in endosymbiosis with a bacterium and established the archaeal basis of information processing in the first eukaryotic common ancestor (FECA). The remaining defining features of eukaryotes are a result of decentralized innovations across the tree of life that became hierarchically imported, most frequently and often indirectly, through Asgard archaea lineages closest to FECA, to ultimately orchestrate LECA (Fig. 6c). As such, it is possible that the acquired non-essential genes were later co-opted to serve essential functions as the archaeon–bacterium symbiont expanded its regulatory complexity. We refer to this conceptual framework as the Heimdall nucleation–decentralized innovation–hierarchical import (HDH) model for future implementation and debate.

Discussion

The contiguous and complete genomes of Asgard archaea constructed in this study allowed us to resolve the composite origins of their genetic repertoires and identify diverse, unique mobile elements as their drivers. One important facet to be considered is timescale. While the pivotal role of horizontal transfer in the diversification of Asgard archaea is evidenced by the high number of bacteria-related genes found in this study, a considerable fraction of these genes is likely now stable in their respective lineages and only a certain fraction is a part of their present-day mobilomes—the entire set of mobile elements in a genome. However, the uncharted features, such as the extraordinarily large proteins in aloposons and Asgard-specific host range of mobile elements found in this study, suggest that the Asgard archaea mobilome may still hold ancient signatures inherited around the time of eukaryogenesis. Expanding the repertoire of complete genomes in a broader Asgard archaea taxonomic range, pan-genomic analyses of the same or closely related species and molecular clock approaches will together help chronicle the horizontal transfer events across their evolutionary history. Given that the presence of bacterial genes is prevalent in both branches of the Asgard–eukaryote sisterhood, it will be particularly exciting to explore the extent to which bacterial genes have been transferred into their shared ancestors before eukaryogenesis.

Genome size variability in both eukaryotes and prokaryotes has been attributed to rapid expansion driven by mobile elements followed by gradual erosion under natural selection (such as nutrient availability)^45,46. It is thus reasonable to assume that such expansion–erosion cycles would have occurred around the time of eukaryogenesis. While the mechanism of genome expansion around eukaryogenesis is genetic, which will be further elucidated by future discoveries of more Asgard archaea mobile elements, the selection pressure for these traits is ecophysiological. In this study, we showed that the influx of genes into the Asgard archaea is highly constrained by genome size in a similar fashion as in eukaryotes. Hence, resolving the ecophysiological drivers of genome size stratification across Asgard archaea lineages may help us unlock the origin of eukaryotic genome complexity.

Etymology

Ca. H. endolithica PR6

Heimdall, watchman of the gods in Norse mythology; archaios (Greek), ancient, primitive; endo- (Greek), within; lithos (Greek), rock. Proposed classification: class Ca. Heimdallarchaeia, order Ca. Heimdallarchaeales, family Ca. Heimdallarchaeaceae, genus Ca. Heimdallarchaeum.

Ca. H. aukensis PM71

Heimdall, watchman of the gods in Norse mythology; archaios (Greek), ancient, primitive; Auka, the local hydrothermal vent field in the southern Pescadero Basin where the species originated; -sis (Greek), process or condition. Proposed classification same as above.

Ca. H. repetitus FW102

Harpocrates, Greek god of silence; archaios (Greek), ancient, primitive; repetita (Latin), repetitive (referring to the high fraction of repetitive sequences that constitute 4% of the genome). Proposed classification: class Ca. Lokiarchaeia, order Ca. Lokiarchaeales, family Ca. Prometheoarchaeaceae, genus Ca. Harpocratesius.

Methods

Hydrothermal vent rock and sediment sample collection

Rock no. NA091-R045 (source of Ca. H. endolithica PR6, Ca. H. repetitus FW102 and Thorarchaeote FW25) and rock no. NA091-R008 (source of Heimdall group Gerdarchaeote AC18) were retrieved from the Auka hydrothermal vent site situated on the margin of the southern Pescadero Basin of the Gulf of California using remotely operated vehicle Hercules during research expedition NA091 on E/V Nautilus on 2 November 2017. Local venting fluids have a measured temperature approaching 300 °C, contain hydrocarbons and hydrogen and are precipitating minerals, such as calcite and barite¹⁵. R045 was collected during dive H1658 at coordinates 23.956987786° N, 108.86227922° W at a water depth of 3,674 m, near shimmering water, a sign of locally focused hydrothermal fluid discharge. R008 was collected during dive H1657 at coordinates 23° 57′ N, 108° 52′ W at a water depth of 3,651 m. After shipboard recovery, rock samples were placed in Mylar bags prefilled with 0.2 µm filtered bottom seawater collected during the same dive, flushed with N₂ gas for 10 min, sealed and stored at 4 °C until preparation for incubations in the laboratory.

Sediment sample no. FK181031-S0193-PC3 (source of Ca. H. aukensis) was collected during the research expedition FK181031 on R/V Falkor to the southern Pescadero Basin on 14 November 2018. The sample was collected during dive S193 at the Auka hydrothermal vent site (23.954822° N, 108.863009° W, water depth of 3,657 m), near the site where rocks nos. NA091-R045 and NA091-R008 were collected in 2017. The sediment push core was extruded upwards and sectioned into discrete 3 cm depth horizons on board immediately after recovery, transferred into sterile Whirl-Pak bags and sealed in a larger Mylar bag, flushed with argon gas, heat-sealed and stored at 4 °C until use in the laboratory.

Sample collection permits for the expedition were granted by the Dirección General de Ordenamiento Pesquero y Acuícola, Comisión Nacional de Acuacultura y Pesca (Permiso de Pesca de Fomento no. PPFE/DGOPA-200/18) and the Dirección General de Geografía y Medio Ambiente, Instituto Nacional de Estadística y Geografía (authorization no. EG0122018), with the associated diplomatic note no. 18-2083 (CTC/07345/18) from the Secretaría de Relaciones Exteriores-Agencia Mexicana de Cooperación Internacional para el Desarrollo/Dirección General de Cooperación Técnica y Científica.

Artificial seawater medium recipe

Artificial seawater was prepared as described in Scheller et al.⁴⁷ with minor modifications. Briefly, 1 l of artificial seawater (ASW) medium contained 46.6 mM MgCl₂, 9.2 mM CaCl₂, 485 mM NaCl, 7 mM KCl, 20 mM Na₂SO₄, 1 mM K₂HPO₄, 2 mM NH₄Cl, 1 ml of 1,000× trace element solution, 1 ml of 1,000× vitamin solution and 0.5 mg of resazurin and was buffered by 25 mM HEPES buffer adjusted to pH 7.5. One litre of 1,000× trace element solution contained 50 mM nitrilotriacetic acid, 5 mM FeCl₃, 2.5 mM MnCl₂, 1.3 mM CoCl₂, 1.5 mM ZnCl₂, 0.32 mM H₃BO₃, 0.38 mM NiCl₂, 0.03 mM Na₂SeO₃, 0.01 mM CuCl₂, 0.21 mM Na₂MoO₄ and 0.02 mM Na₂WO₄. One litre of 1,000× vitamin solution contained 82 μM d-biotin, 45 μM folic acid, 490 μM pyridoxine, 150 μM thiamine, 410 μM nicotinic acid, 210 μM pantothenic acid, 310 μM para-aminobenzoic acid, 240 μM lipoic acid, 14 μM choline chloride and 7.4 μM vitamin B₁₂.
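
As a worked check of the recipe arithmetic, adding 1 ml of a 1,000× stock to 1 l of medium dilutes each stock component roughly 1,000-fold (ignoring the small change in volume). The sketch below applies this to the trace element concentrations listed above; the function and variable names are illustrative and are not part of the original protocol.

```python
# Final concentrations after adding 1 ml of a 1,000x stock per litre of medium.
# Stock concentrations (mM) are copied from the trace element recipe above.
TRACE_STOCK_MM = {
    "nitrilotriacetic acid": 50, "FeCl3": 5, "MnCl2": 2.5, "CoCl2": 1.3,
    "ZnCl2": 1.5, "H3BO3": 0.32, "NiCl2": 0.38, "Na2SeO3": 0.03,
    "CuCl2": 0.01, "Na2MoO4": 0.21, "Na2WO4": 0.02,
}


def final_concentration_um(stock_mm, fold=1000):
    """Convert a stock concentration in mM to the diluted concentration in uM."""
    return stock_mm * 1000 / fold  # mM -> uM, then apply the dilution factor


for name, stock_mm in TRACE_STOCK_MM.items():
    print(f"{name}: {final_concentration_um(stock_mm):g} uM")
```

Under this approximation, the final micromolar concentration of each trace element is numerically equal to its millimolar stock concentration.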

Enrichment cultivation

Rock no. NA091-R045 was anaerobically fragmented; then, approximately 5 g wet weight was crushed using a sterile agate mortar and pestle on 8 November 2018 and immediately immersed in anaerobic ASW medium in 25–125 ml butyl rubber-stoppered serum bottles supplemented with different carbon/energy sources, including lactate, H₂/CO₂, hexane and decane, and incubated in the dark at 40 °C (Extended Data Fig. 1a). The headspace for all cultures was flushed and overpressurized with N₂ gas (2 atm). For the H₂-containing cultures, the N₂ gas headspace was replaced with H₂/CO₂ at an 80:20 mixture by flushing for 1 min and subsequent equilibration at 2 atm. After 33 d of incubation, the lactate-fed first-generation culture produced 5 mM sulphide, indicating active sulphate reduction. This enrichment was mixed by gentle shaking and diluted 1:100 vol/vol into fresh anaerobic ASW medium containing the same suite of carbon/energy sources as described above (Extended Data Fig. 1b). A transfer using the liquid fraction lacking rock particles from the primary lactate enrichment was also included to enrich for members of the planktonic community alone with lactate as the carbon and energy source. This enrichment was later found to be devoid of the AAG (Heimdall) phylotype. Third- and fourth-generation cultures were set up in the following months through 1:100 dilution (Extended Data Fig. 1b). Further details of microbial community development in these enrichments are provided in Supplementary Note 1 and Supplementary Tables 1–3.

R008 was prepared as above except using 2 atm of methane in the headspace as the sole carbon source and electron donor. The culture was passaged twice using a 1:100 dilution under the same culturing conditions; the cell fraction was collected by centrifugation after a total of 22 months for metagenomic sequencing (described below).

For sediment enrichment cultivation, the top 3 cm section of the sediment core was mixed with anaerobic ASW at a 1:4 vol/vol ratio; a total of 60 ml volume each was dispensed into seven 125 ml glass serum bottles sealed with butyl rubber stoppers. The headspace was replaced by ethane (2 atm) in 2 bottles (Supplementary Table 5), while the headspace in 1 bottle was replaced by 100% N₂ gas (2 atm). The cultures were incubated at 37 °C in the dark. Further details on microbial community development are provided in Supplementary Note 1 and Supplementary Table 4.

Mineralogical analyses

The mineralogical composition of rocks NA091-R045 and R008 was characterized on a PANalytical X’Pert Pro X-Ray diffractometer. A dried rock aliquot was finely powdered using a clean agate mortar and pestle and scanned from 3 to 75° (2θ angle) at a 0.0167° step size. Mineral identification was performed with the X’Pert HighScore software v4.1 using the search and match algorithm.

DNA extraction

Combined cells with rock or sediment substrate were pelleted through centrifugation at 13,000 r.p.m. for 3 min. For amplicon sequencing, unless specified in Supplementary Table 6, DNA was extracted using the Qiagen DNeasy PowerSoil kit (catalogue no. 47014) according to the manufacturer’s instructions as described previously⁴⁸ with a minor modification, where mechanical shearing was carried out using the MP Biomedicals FastPrep-24 system (catalogue no. 116004500) at level 5.5 for 45 s. For genomic sequencing, incubated rock and sediment cultures were extracted using multiple approaches, including the Qiagen DNeasy PowerSoil kit, ZymoBIOMICS 96 MagBead DNA Kit (catalogue no. D4302; Zymo Research Corporation), Quick-DNA 96 Kit (catalogue no. D3010; Zymo Research Corporation), ZymoBIOMICS DNA Microprep Kit (catalogue no. D4301; Zymo Research Corporation) and a standard phenol/chloroform-based protocol. The list of samples and their extraction methods are provided in Supplementary Table 6.

16S rRNA gene amplicon sequencing

For amplicon (iTAG) sequencing of 16S rRNA genes, extracted DNA was amplified using primer pair 515f/806r (GTGCCAGCMGCCGCGGTAA/GGACTACHVGGGTWTCTAAT), barcoded and sequenced at Laragen using the Illumina MiSeq platform and analysed using Qiime v.1.8.0 (ref. ⁴⁹) as described previously⁴⁸. Taxonomic assignment was based on the SILVA 138 database (https://www.arb-silva.de)⁵⁰.

Full-length 16S archaeal rRNA gene sequences were amplified using the archaeal primer pair SSU1Arf/SSU1492Rngs (TCCGGTTGATCCYGCBRG/CGGNTACCTTGTKACGAC) as described by Bahram et al.⁵¹, multiplexed as instructed by PacBio and sequenced using the PacBio Sequel II at the Brigham Young University DNA Sequencing Center and then analysed using the DADA2 package v1.9.1 in R v3.6.0 as described in Callahan et al.⁵² using the SILVA 138 database for taxonomic classification. Note that in the SILVA 138 database, all Asgard archaea clades are classified under Asgardarchaeota.
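
Both primer pairs above contain degenerate IUPAC codes (for example M, H, V, W, Y, B, R, K and N), so each primer name stands for a pool of concrete oligonucleotides. The following sketch, which is illustrative and not part of the published pipeline, counts and enumerates the sequences implied by each primer; the function names are assumptions.

```python
from itertools import product

# Standard IUPAC nucleotide codes and the bases each one represents.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Primer sequences as given in the text.
PRIMERS = {
    "515f": "GTGCCAGCMGCCGCGGTAA",
    "806r": "GGACTACHVGGGTWTCTAAT",
    "SSU1Arf": "TCCGGTTGATCCYGCBRG",
    "SSU1492Rngs": "CGGNTACCTTGTKACGAC",
}


def degeneracy(primer):
    """Number of distinct non-degenerate sequences encoded by a primer."""
    count = 1
    for base in primer:
        count *= len(IUPAC[base])
    return count


def expand(primer):
    """Enumerate every non-degenerate sequence encoded by a primer."""
    return ["".join(bases) for bases in product(*(IUPAC[b] for b in primer))]


for name, seq in PRIMERS.items():
    print(name, len(seq), "nt, degeneracy", degeneracy(seq))
```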

Metagenomic sequencing

A total of 11 metagenomic sequencing runs were performed using the Illumina and Oxford Nanopore platforms, with details listed in Supplementary Table 6. For Illumina short-read sequencing, libraries were constructed using the NEBNext Ultra and Nextera Flex Library kits as specified in Supplementary Table 6. Sequencing was carried out using a HiSeq 2500 system (single-end, 100 bp) at the Caltech Genetics and Genomics Laboratory and HiSeq 4000 system at Novogene (paired-end, 150 bp). Only paired-end data were used for assembly, while all data were used for error correction. Due to the low DNA quantity obtained from the sediment incubation that yielded Ca. H. aukensis, we used multiple displacement amplification with the QIAGEN REPLI-g Midi Kit before library preparation for Nanopore sequencing. Oxford Nanopore sequencing libraries were constructed using the PCR Barcoding Kit (catalogue no. SQK-PBK004) and were sequenced on MinION flow cells FLO-MIN106. Base calling was performed with the ONT Guppy software v.3.4.5.

Genome assembly, error correction and read coverage mapping

Two different approaches were used to assemble contiguous genomes from metagenomes. For species of interest, if Nanopore sequencing yielded high read coverage and read lengths N50 > 2 kb, we obtained more contiguous genomes through de novo assembly purely based on Nanopore reads. If Nanopore sequencing did not yield a high number of reads or exhibited low read lengths, we obtained more contiguous genomes through de novo assembly first based on Illumina reads and then joined using Nanopore reads.
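
The choice between the two approaches can be sketched as a simple decision rule. The read-length threshold (N50 > 2 kb) comes from the text; the coverage threshold `min_coverage` is a hypothetical placeholder, because the text only says "high read coverage" without giving a number.

```python
def n50(lengths):
    """Length L such that reads of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def choose_strategy(read_lengths, mean_coverage, min_n50=2000, min_coverage=30):
    """Pick an assembly route; min_coverage is an assumed, illustrative value."""
    if mean_coverage >= min_coverage and n50(read_lengths) > min_n50:
        return "nanopore_de_novo"
    return "illumina_de_novo_then_nanopore_joining"


print(choose_strategy([3000] * 10, mean_coverage=50))
print(choose_strategy([1000] * 10, mean_coverage=50))
```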

For Ca. H. endolithica, Nanopore sequencing data were assembled de novo using Canu¹⁷ v.2.1, which yielded a 30 Mbp assembly, including a 3.4 Mbp contig. The approximately 40 kilobase (kb) regions at the two ends of this approximately 3.4 Mbp contig were repetitive. This repeated region was deleted at one end and the two ends were joined to result in a circular genome. The resulting genome was mapped using BamM (http://ecogenomics.github.io/BamM/, based on Burrows–Wheeler Aligner⁵³ mapping) with 150 bp Illumina paired-end reads (88× coverage on average) and 100 bp single-end reads (20× coverage). Mapped reads were then used for error correction through pilon⁵⁴ v.1.22. To account for the reduced mapping at the edges (approximately 50 bp region), the two ends of the genomic sequence were joined, read-mapped and error-corrected again using the same methods. After the genome was annotated, it was rotated such that the genomic sequence ended with tRNA (GlyCCC), which was the integration site of the putative provirus HeimV1. All sequencing reads derived from incubations of the same rock were mapped onto the final genome using BamM, which was then used for coverage calculation through bedtools (https://bedtools.readthedocs.io/en/latest/).
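
The circularization and rotation steps can be illustrated on a toy sequence. This is a minimal sketch assuming the terminal repeat is an exact match, which is a simplification: repeats of about 40 kb in long-read assemblies are generally not identical base for base and would need alignment-based detection. All function names are hypothetical.

```python
def terminal_overlap(seq, min_len=1):
    """Longest k (up to half the contig) where the first k bases equal the last k."""
    for k in range(len(seq) // 2, min_len - 1, -1):
        if seq[:k] == seq[-k:]:
            return k
    return 0


def circularize(seq, min_len=1):
    """Delete the repeated region at one end so the two ends can be joined."""
    k = terminal_overlap(seq, min_len)
    return seq[:len(seq) - k]


def rotate_to_end_with(seq, feature):
    """Rotate a circular sequence so that it ends with the given feature."""
    i = (seq + seq).find(feature)  # doubling handles features spanning the junction
    if i < 0 or i >= len(seq):
        raise ValueError("feature not found in circular sequence")
    end = (i + len(feature)) % len(seq)
    return seq[end:] + seq[:end]


contig = "GATTACA" + "CCCGGG" + "GATTACA"  # toy contig with a repeated end
genome = circularize(contig)
print(genome, rotate_to_end_with(genome, "CGG"))
```

In the workflow described above, the feature used for rotation was the tRNA (GlyCCC) gene; the three-base feature here only stands in for it.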

For Ca. H. aukensis, Illumina PE150 bp sequencing data were assembled using SPAdes¹⁸ v.3.14.1 with the ‘--meta’ option and k-mers 21,33,55,77,99. The assembly was then scaffolded using Nanopore reads through two iterations of LRScaf⁵⁵ v.1.1.10. The Ca. H. aukensis genome was joined after trimming the identical sequences at the two ends. The end-joining region was verified through PCR amplification and Sanger sequencing using the primer pair CGCTTTCTTCAAACAATATTTCTGGTG/CTTACTTTCTCTCGGTCCATTTTTCAC. Finally, a 1 kbp stretch of unresolved genomic sequence at an approximately 2.9 Mbp position was resequenced through PCR amplification and Sanger sequencing using the primers GAGTTTTTTCAATCTTATAATGCCAAACTAAAAAATAG (forward), CAGTCAGATTTGACACAATTTTGGTC (reverse) and GCTGGACTCAACCTATAACTAATAGT (reverse). The final assembly was read-mapped and error-corrected through pilon v.1.24 using 346× coverage. It was rotated as described above to place the tRNA gene GlyCCC at the end.

The metagenome containing the Lokiarchaeote Ca. H. repetitus FW102 was assembled using Canu v.2.1, as described for the Ca. H. endolithica genome, and then binned using metabat2 v.2.15 (ref. ⁵⁶) with default parameters. The bin was then used to recruit long reads using minimap2 v.2.17 and reassembled and binned again. We then used LRScaf to scaffold the contigs and used ten iterations of pilon v.1.24 to achieve error correction and resolve ambiguous bases.

The Thorarchaeote FW25 MAG was assembled using the hybrid assembly of Illumina reads and Nanopore reads using SPAdes v.3.14.1 with k-mers 21,33,55,77,99, and then binned using metabat2 v.2.15 with default parameters. The MAG bin was then used to recruit reads through MIRAbait in the MIRA v.4 package (http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html#chap_intro). These reads were then used for hybrid assembly with Nanopore long reads via SPAdes v.3.14.1 with k-mers 21,33,55,77,99. It was then binned again using metabat2 v.2.15 with default parameters to yield the final Thorarchaeote FW25 MAG.

The metagenome containing Gerdarchaeote AC18 was assembled from Illumina reads using SPAdes v.3.14.1 with k-mers 21,33,55,77,99 and then binned using metabat2 v.2.15 with default parameters. The MAG bin was then used to recruit reads through MIRAbait in the MIRA v.4 package and then reassembled and binned using SPAdes and metabat2 to yield the final Gerdarchaeote AC18 bin.

Alignment fraction, ANI and AAI

ANI and alignment fraction values, independently calculated for rRNA, tRNA and coding gene sequences were obtained using ANIcalculator⁵⁷ 2014-127, v.1.0 (https://ani.jgi.doe.gov/html/download.php?). Note that Lokiarchaeote FW102 contains 2 copies of 16S rRNA genes at 99% identity with each other, and Thorarchaeote BC has a partial 16S rRNA gene. The alignment of 16S rRNA was carried out using SINA⁵⁸ v.1.2.11. The AAI values of translated proteomes were obtained with the enveomics package v1.8.0⁵⁹. The final output is shown in Supplementary Table 7.

Genome and mobilome annotations

Gene calling was done using a combination of Prodigal v.2.6.3 and Glimmer v.3.0.2 using translation code 11 within the RASTtk⁶⁰ pipeline, now under the PATRIC package v1.032⁶¹. Translated coding sequences were annotated and domain-assigned using eggNOG mapper³⁹ v.2. The tRNA, 16S rRNA and 23S rRNA genes were identified using RNAmmer⁶² v.1.2 embedded in RASTtk. Thus far, 5S rRNA gene sequences could not be predicted through the existing HMM using various approaches. Long, non-tandem repeats were identified using RASTtk with the default cut-off of 95% identity and 100 bp. Tandem repeat sequences were identified using RASTtk, Prokka v1.14.6 and CRISPRCasTyper 1.1.4⁶³. Prokka and CRISPRCasTyper both employ MinCED (https://github.com/ctSkennerton/minced) to identify repeats and detect intragenic tandem repeats, which were manually removed from the CRISPR–Cas analyses. The Cas genes were annotated using CRISPRCasTyper.

All identified Heimdallarchaeum mobilomes were further analysed using PSI-BLAST 1.10.0⁶⁴, CDD search v3.19⁶⁵ and PhANNs webserver (version March 2021)³⁷.

Genome evaluation and HMM construction

Marker coverage was evaluated using a two-step process. First, we used the automated marker analyses via CheckM⁶⁶ v.1.1.3 with the lineage_wf option and the default HMM E value cut-off, which included the standard set of 149 archaeal single-copy markers. Next, each of the missing markers was examined with hmmer⁶⁷ v.3.3.2 using the hmmsearch option with manual inspection of alignment regions and bitscores. This rescued markers unidentified through the default cut-offs by CheckM as well as divergent variants that most likely functionally replace the genuinely missing marker. The detailed description of markers missed by CheckM can be found in Supplementary Note 2 and the final evaluation of marker presence is displayed in Extended Data Fig. 4a and Supplementary Table 15. Next, we constructed an updated HMM set to replace the CheckM set by (1) updating all HMMs to the most recent versions, (2) removing the six commonly missing or duplicated markers shown in Extended Data Fig. 4a from the list and (3) overcoming the pitfall of existing HMMs constructed using only a few sequences acquired from Euryarchaeota and Crenarchaeota. We manually constructed Asgard-specific versions based on the 282 Asgard archaea genomes. The HMMs constructed in this study are PF00832.ASG, PF00861.ASG, PF01194.ASG, PF01287.ASG, PF01667.ASG, PF03874.ASG, PF03876.ASG, PF13656.ASG, TIGR00270.ASG, TIGR00336.ASG, TIGR00442.ASG, TIGR02338.ASG and TIGR03677.ASG. The updated HMM file has been provided as a supplementary data file. The updated HMM set was used to evaluate the 282 genomes reported in this study and in the literature³,⁶,⁷,⁸,⁹,¹⁰,¹¹,¹²,¹⁶,²³,²⁶,⁶⁸,⁶⁹,⁷⁰,⁷¹,⁷²,⁷³,⁷⁴,⁷⁵,⁷⁶,⁷⁷ through (1) CheckM, which uses Prodigal for gene calling, and (2) the more up to date HMMER3.2.2 on our gene calls described above. The latter generally produced slightly higher completeness and redundancy values (Supplementary Tables 8 and 9).
For the expanded set of Asgard archaea genomes used for the phylogenomic analyses shown in Extended Data Fig. 4b, we applied the following filtering criteria: ≤100 contigs, >96% marker completeness and <8% marker redundancy. We also took the evenness of taxonomic sampling into account. The set is also shown in the Asgard archaea tree in Extended Data Fig. 2. The importance of genome quality evaluation is highlighted in Extended Data Fig. 9.
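
The numeric part of these filtering criteria can be written as a small predicate. The sketch below is illustrative only: the genome names and values are hypothetical, and the evenness of taxonomic sampling, which was also taken into account, is a judgement that this code does not capture.

```python
from dataclasses import dataclass


@dataclass
class GenomeQC:
    name: str
    contigs: int
    completeness: float  # marker completeness, percent
    redundancy: float    # marker redundancy, percent


def passes_filter(g):
    """<=100 contigs, >96% marker completeness and <8% marker redundancy."""
    return g.contigs <= 100 and g.completeness > 96 and g.redundancy < 8


# Hypothetical genomes used purely to exercise the thresholds.
genomes = [
    GenomeQC("example_closed_genome", 1, 99.3, 1.4),
    GenomeQC("example_fragmented_MAG", 240, 97.0, 2.0),
    GenomeQC("example_incomplete_MAG", 60, 92.5, 3.1),
]
selected = [g.name for g in genomes if passes_filter(g)]
print(selected)
```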

Phylogenomics

A phylogenomic tree of Asgard archaea was constructed with IQ-TREE v.2.1.2 (ref. ⁷⁸) using a partitioned analysis⁷⁹ with model selection using ModelFinder⁸⁰ and 1,000 ultrafast bootstrap replicates using UFBoot2⁸¹ on a concatenated alignment generated from MUSCLE⁸² v.3.8.1551 alignments of 76 archaeal marker genes identified in the genomes using HMMs included with anvi’o v.6.2 (ref. ⁸³). The phylogenomic tree was visualized using iTOL⁸⁴ and rooted with the TACK superphylum.

The Archaea–Eukaryota phylogenomic tree, including the Asgard genomes discussed in this study, was constructed based on the 56 Archaea–Eukaryota ribosomal proteins used by Zaremba-Niedzwiedzka et al.³ using reference sequences from the corresponding Dryad repository. In addition to the Asgard archaea identified in this study, additional sequences of the most complete genomes representing different lineages of the TACK superphylum were added to the dataset. Sequences of 56 archaeal COGs obtained from the Dryad repository were used as reference databases to retrieve homologous sequences from target genomes using BLAST⁸⁵ v.2.10.1. Each set of archaeal COG sequences were aligned using MUSCLE v.3.8.1551 and inspected and trimmed manually. Manually trimmed alignments were then further trimmed using BMGE⁸⁶, recoded to four-state SR4 using a custom script (https://github.com/dspeth/bioinfo_scripts/tree/master/phylogeny) and finally concatenated and converted to PHYLIP format using catfasta2phyml v1.1.0 (https://github.com/nylander/catfasta2phyml). The final concatenated, recoded alignment was used to calculate phylogenies using IQ-TREE v.2.1.2 (ref. ⁷⁸) using a C60 model adapted for SR4 recoded data by Zaremba-Niedzwiedzka et al.³ and 1,000 ultrafast bootstrap replicates using UFBoot. The phylogenomic tree was visualized using iTOL⁸⁴ and rooted with Euryarchaeota as the outgroup. The genomes and conserved genes used for the phylogenomic analyses are listed in Supplementary Tables 16 and 17.
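
For readers unfamiliar with SR4 recoding, the idea is to collapse the 20 amino acids into four classes so that the alignment can be analysed with four-state models. The sketch below assumes the groupings commonly attributed to Susko and Roger (AGNPST, CHWY, DEKQR, FILMV) and an arbitrary mapping of those classes onto the letters A, C, G and T; it is not the authors' script, which is available at the repository linked above and may differ in detail.

```python
# Assumed SR4 classes; the four output letters are arbitrary state labels.
SR4_GROUPS = {"A": "AGNPST", "C": "CHWY", "G": "DEKQR", "T": "FILMV"}
SR4 = {aa: state for state, group in SR4_GROUPS.items() for aa in group}


def recode_sr4(aligned_protein):
    """Recode an aligned protein sequence; gaps and unknown residues become '-'."""
    return "".join(SR4.get(residue.upper(), "-") for residue in aligned_protein)


print(recode_sr4("MKV-LX"))
```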

Discovery of Heimdallarchaeum-targeting mobile elements through CRISPR spacer targeting

Repeat sequences from the Heimdallarchaeum CRISPR arrays were used to blast against the CRISPR repeats we recruited, using CRISPRCasTyper, from multiple databases with a 95% alignment and 95% identity cut-off. The databases include GTDB v.95, our in-house assemblies from the Pescadero Basin (this study, F.W. et al. manuscript in preparation and Speth et al.⁸⁷; Supplementary Table 10, 22 sets) and published assemblies from the Guaymas Basin²² (Supplementary Table 11, 16 sets).

While no homologous CRISPR repeats were found in the entire GTDB database, we found several CRISPR arrays from the Guaymas and Pescadero assemblies with identical repeats to the Heimdallarchaeum CRISPR repeats found in this study, demonstrating the specificity of the CRISPR discovery approach. Since both the Guaymas and Pescadero CRISPR sets comprise assembled sequences that were not de-replicated, the entire CRISPR spacer collection from the recruited CRISPR arrays was de-replicated using a 100% identity cut-off. Notably, no spacer overlap was found between the Guaymas and Pescadero CRISPR sets. In total, the final de-replicated, putative Heimdallarchaeota spacerome in this study consisted of 455 spacers from the 2 original Heimdallarchaeum genomes, 578 from the Pescadero Basin assemblies and 532 from the Guaymas Basin assemblies. We note that the above set likely only represents a fraction of the true Heimdallarchaeum spacerome given that the original CRISPR repeats came from only two species.
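
De-replication at a 100% identity cut-off amounts to keeping one copy of each distinct spacer sequence. A minimal sketch follows; treating a spacer and its reverse complement as the same sequence is an assumption made here (controlled by the hypothetical `strand_insensitive` flag), not something stated in the text. The last line simply sums the three spacer counts reported above.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def dereplicate(spacers, strand_insensitive=True):
    """Keep the first occurrence of every distinct spacer (100% identity)."""
    seen = set()
    unique = []
    for spacer in spacers:
        spacer = spacer.upper()
        key = min(spacer, revcomp(spacer)) if strand_insensitive else spacer
        if key not in seen:
            seen.add(key)
            unique.append(spacer)
    return unique


toy_spacers = ["ACGTT", "acgtt", "AACGT", "GGGCC"]  # toy data, not real spacers
print(dereplicate(toy_spacers))
print(455 + 578 + 532)  # total size of the de-replicated spacer set reported above
```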

Next, to identify potential mobile genetic elements (MGEs) targeted by the Heimdallarchaeum spacerome, we used BLAST to search for spacer matches in the above three assembly datasets, the two Ca. Heimdallarchaeum genomes and various published virus databases/datasets, which are the RefSeq virus database r98⁸⁸, IMG/VR v.3 (ref. ³⁶) and the huge phage⁸⁹, giant virus⁹⁰ and Loki’s castle virus datasets⁹¹. To avoid self-matches, the CRISPR arrays containing the spacers were replaced by Ns in their respective assemblies. For the homology cut-off, we used 95% alignment and 95% identity as described previously⁹². Strikingly, no spacer matches were found from any of the viral datasets or GTDB genome database. The spacer matches to the Guaymas and Pescadero Basins metagenomes are listed in Supplementary Tables 12 and 13.
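
Two steps in this search lend themselves to a short illustration: blanking out CRISPR arrays with Ns to prevent self-matches, and applying the 95% alignment and 95% identity cut-off to each hit. The sketch assumes that "alignment" means the aligned fraction of the spacer length and that the cut-offs are inclusive; both are interpretations rather than details given in the text, and the function names are hypothetical.

```python
def mask_regions(seq, regions):
    """Replace 0-based, half-open [start, end) regions of a contig with Ns."""
    chars = list(seq)
    for start, end in regions:
        start, end = max(0, start), min(len(chars), end)
        for i in range(start, end):
            chars[i] = "N"
    return "".join(chars)


def is_spacer_match(pident, aln_len, spacer_len, min_identity=95.0, min_aligned=95.0):
    """Apply identity and aligned-fraction cut-offs to one spacer-versus-contig hit."""
    aligned_pct = 100.0 * aln_len / spacer_len
    return pident >= min_identity and aligned_pct >= min_aligned


print(mask_regions("ACGTACGTAC", [(2, 5)]))
print(is_spacer_match(pident=97.1, aln_len=35, spacer_len=36))
```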

We then de-replicated the putative MGEs/viruses identified above using BLAST, removed contigs smaller than 2.8 kb and manually examined the target gene neighbourhoods and potential self-match due to CRISPR arrays that evaded detection and blocking. These contigs, together with the ones described in Fig. 2, ultimately constitute the 56 putative Heimdallarchaeota MGEs listed in Supplementary Table 14.

Resolution of the genomic insertion and circularization of HeimV1

To capture the two different states during the life cycles of HeimV1 (Fig. 3f), we used three primer sets to amplify the sequences around the two insertion sites of HeimV1 and confirmed them using gel electrophoresis and Sanger sequencing. Set 1 amplified the region between upstream tRNA GlyCCC in the Ca. H. endolithica genome and the first coding gene of the HeimV1 (GTGAATCAATAGCTTTCACTTATAATGAG/GTGATTGTATTAAGTCTGCAACATATTC). Set 2 amplified the regions containing the transposase in the Ca. H. endolithica genome and the integrase in HeimV1 (CTTAGATATGTACGTGATAGGATCATATG/CTTCTTTCCTCTTTTTGTCTCTGCTTC). Set 3 amplified the two ends of the circular HeimV1 (CTTAGATATGTACGTGATAGGATCATATG/GTGATTGTATTAAGTCTGCAACATATTC). Each primer set amplified approximately 2 kb of target regions with set 1 and set 2 indicating the presence of the integrated state of HeimV1 and set 3 indicating the circular state.
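
The logic of the three primer sets can be summarized in a few lines. Note that set 3 reuses the first primer of set 2 and the second primer of set 1, so it can only yield a product when the two ends of HeimV1 are joined. The sketch below is illustrative; requiring both set 1 and set 2 before calling the integrated state is a conservative assumption.

```python
# Primer sequences as listed in the text, in the order given for each set.
PRIMER_SETS = {
    1: ("GTGAATCAATAGCTTTCACTTATAATGAG", "GTGATTGTATTAAGTCTGCAACATATTC"),
    2: ("CTTAGATATGTACGTGATAGGATCATATG", "CTTCTTTCCTCTTTTTGTCTCTGCTTC"),
    3: ("CTTAGATATGTACGTGATAGGATCATATG", "GTGATTGTATTAAGTCTGCAACATATTC"),
}


def infer_states(amplified_sets):
    """Map the primer sets that gave a product to the HeimV1 states they indicate."""
    amplified = set(amplified_sets)
    states = set()
    if {1, 2} <= amplified:
        states.add("integrated")
    if 3 in amplified:
        states.add("circular")
    return states


print(sorted(infer_states([1, 2, 3])))
```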

Protein clustering of integrases and transposases

Protein sequences showing integrase and transposase domains, identified using eggNOG mapper from the 8 Asgard archaea MAGs, were pooled and clustered at 90% sequence identity using cd-hit⁹³ v.4.8.1. The resulting representative sequences were used for two sequential rounds of homology searches using DIAMOND⁹⁴ v.2.0.6 against the protein sequences obtained from the GTDB v.95 genome database. A cut-off of >20% sequence identity, >85% sequence alignment and <15% length difference was used for the first round; a cut-off of >30% sequence identity, >90% sequence alignment and <10% length difference was used for the second round. The resulting protein sequences were combined with the Asgard archaea integrases/transposases originally pooled and were clustered together using 95% sequence identity with cd-hit. The resulting 96,367 representative sequences were clustered using ASM-Clust⁹⁵ with a sequence subset size of 5,000 to generate the alignment score matrix, using default values for the other settings.
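
The two rounds of homology filtering differ only in their thresholds, which can be captured as follows. The exact definitions of "sequence alignment" and "length difference" are not spelled out in the text; this sketch assumes both are expressed relative to the query length, and the function name is hypothetical.

```python
# (min identity %, min aligned %, max length difference %) for each search round.
ROUND_CUTOFFS = {1: (20.0, 85.0, 15.0), 2: (30.0, 90.0, 10.0)}


def passes_round(round_no, pident, aln_len, query_len, subject_len):
    """Apply the cut-offs for one search round to a single protein hit."""
    min_identity, min_aligned, max_length_diff = ROUND_CUTOFFS[round_no]
    aligned_pct = 100.0 * aln_len / query_len                        # assumed definition
    length_diff_pct = 100.0 * abs(query_len - subject_len) / query_len  # assumed definition
    return (pident > min_identity
            and aligned_pct > min_aligned
            and length_diff_pct < max_length_diff)


print(passes_round(1, pident=25.0, aln_len=280, query_len=300, subject_len=320))
print(passes_round(2, pident=25.0, aln_len=280, query_len=300, subject_len=320))
```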

Taxonomic profiling through protein orthologues

The taxonomic clustering and COG analyses were carried out using eggNOG mapper³⁹ with the eggNOG orthologue database v.5.0. The protein counts belonging to each taxonomic group (Archaea/Bacteria/Eukaryota/Unassigned) were extracted from the output and fitted linearly with MATLAB R2018a using the polyfit function, yielding Fig. 5c.
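
A degree-1 polynomial fit of this kind is an ordinary least-squares line. The original analysis used MATLAB's polyfit; the pure-Python sketch below substitutes the closed-form least-squares solution and runs on made-up numbers. The choice of x variable and all data values here are hypothetical placeholders, not values from the study.

```python
def linear_fit(x, y):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical example: total protein count per genome versus the number of
# proteins assigned to one taxonomic group.
total_proteins = [1000, 2000, 3000, 4000]
group_counts = [300, 500, 700, 900]
slope, intercept = linear_fit(total_proteins, group_counts)
print(slope, intercept)
```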

Since different proteins evolved at different rates, we combined the use of a single cut-off-based protein clustering approach with functional domain-based manual refinement to capture and compare ERPs across the lineages selected in this study. First, we used BLAST v.2.2.26 to evaluate the sequence homologies within the entire proteome of the eight MAGs in this study. We then used an 80% alignment length (relative to the length of the shorter protein sequence) and 0.24 alignment × identity cut-off to yield candidate protein clusters, which we then cross-referenced with the Eukaryota group in the eggNOG classification to generate 227 candidate ERP clusters. Finally, we manually examined the relatedness within and between each ERP cluster through batch searches using the conserved domain database⁶⁵. This led to the recombination of the candidate ERP clusters into the functionally distinct 135 ERP families. To align with previous work³,⁶, all small GTPases were classified as one single ERP family, constituting 291 proteins from the 8 representative Asgard archaea MAGs.
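
The pairwise cut-off used to form candidate clusters can be expressed as a predicate on each BLAST hit. This sketch assumes that "alignment × identity" means the aligned fraction of the shorter protein multiplied by the fractional identity, and that both cut-offs are inclusive; under that reading, a hit aligned over exactly 80% of the shorter protein would need about 30% identity to pass. The function name is hypothetical.

```python
def passes_cluster_cutoff(aln_len, identity_pct, len_a, len_b,
                          min_aligned=0.80, min_product=0.24):
    """Check one pairwise hit against the alignment-length and product cut-offs."""
    aligned_fraction = aln_len / min(len_a, len_b)  # relative to the shorter protein
    identity_fraction = identity_pct / 100.0
    return (aligned_fraction >= min_aligned
            and aligned_fraction * identity_fraction >= min_product)


print(passes_cluster_cutoff(aln_len=270, identity_pct=40.0, len_a=300, len_b=500))
print(passes_cluster_cutoff(aln_len=270, identity_pct=25.0, len_a=300, len_b=500))
```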

Maximum-likelihood analyses of proteins encoded by HeimV1 and HeimV2

Homology search for all peptide sequences of HeimV1 through DIAMOND⁹⁴ v.2.0.6 was carried out against the GTDB v.95, Pescadero Basin and Guaymas assemblies, RefSeq virus database⁸⁸, IMG/VR³⁶ and huge phage⁸⁹, giant virus⁹⁰ and Loki’s castle virus datasets⁹¹. The search outputs were pre-clustered with a 70% identity cut-off using cd-hit v.4.8.1 (ref. ⁹³). The representative sequences were aligned using the MAFFT v.7.475 (ref. ⁹⁶) option linsi and trimmed with trimAl v.1.4.1 (ref. ⁹⁷), option gappyout. Maximum-likelihood analyses were carried out with IQ-TREE v.2.1.12 (ref. ⁷⁸) using the LG4X model and ultrafast bootstrap with 2,000 replicates. The phylogenetic tree was visualized and prepared using iTOL⁸⁴.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

The assembled genomes and raw metagenomic sequencing reads can be found on the National Center for Biotechnology Information database under BioProject no. PRJNA721962. Source data are provided with this paper.

Code availability

The custom script for recoding of amino acid sequences to four-state SR4 can be found at https://github.com/dspeth/bioinfo_scripts/tree/master/phylogeny. Other custom scripts can be found at https://github.com/wufabai/genomics.

References

Hug, L. A. et al. A new view of the tree of life. Nat. Microbiol. 1, 16048 (2016).
Takai, K. & Horikoshi, K. Genetic diversity of archaea in deep-sea hydrothermal vent environments. Genetics 152, 1285–1297 (1999).
Zaremba-Niedzwiedzka, K. et al. Asgard archaea illuminate the origin of eukaryotic cellular complexity. Nature 541, 353–358 (2017).
Spang, A. et al. Proposal of the reverse flow model for the origin of the eukaryotic cell based on comparative analyses of Asgard archaeal metabolism. Nat. Microbiol. 4, 1138–1148 (2019).
Williams, T. A., Cox, C. J., Foster, P. G., Szöllősi, G. J. & Embley, T. M. Phylogenomics provides robust support for a two-domains tree of life. Nat. Ecol. Evol. 4, 138–147 (2020).
Spang, A. et al. Complex archaea that bridge the gap between prokaryotes and eukaryotes. Nature 521, 173–179 (2015).
Liu, Y. et al. Expanded diversity of Asgard archaea and their relationships with eukaryotes. Nature 593, 553–557 (2021).
Bulzu, P.-A. et al. Casting light on Asgardarchaeota metabolism in a sunlit microoxic niche. Nat. Microbiol. 4, 1129–1137 (2019).
Dong, X. et al. Metabolic potential of uncultured bacteria and archaea associated with petroleum seepage in deep-sea sediments. Nat. Commun. 10, 1816 (2019).
Huang, J.-M., Baker, B. J., Li, J.-T. & Wang, Y. New microbial lineages capable of carbon fixation and nutrient cycling in deep-sea sediments of the northern South China Sea. Appl. Environ. Microbiol. 85, e00523-19 (2019).
Cai, M. et al. Diverse Asgard archaea including the novel phylum Gerdarchaeota participate in organic matter degradation. Sci. China Life Sci. 63, 886–897 (2020).
Sun, J. et al. Recoding of stop codons expands the metabolic potential of two novel Asgardarchaeota lineages. ISME Commun. 1, 30 (2021).
Frost, L. S., Leplae, R., Summers, A. O. & Toussaint, A. Mobile genetic elements: the agents of open source evolution. Nat. Rev. Microbiol. 3, 722–732 (2005).
Nelson, W. C., Tully, B. J. & Mobberley, J. M. Biases in genome reconstruction from metagenomic data. PeerJ 8, e10119 (2020).
Paduan, J. B. et al. Discovery of hydrothermal vent fields on Alarcón Rise and in Southern Pescadero Basin, Gulf of California. Geochem. Geophys. Geosyst. 19, 4788–4819 (2018).
Caceres, E. F. et al. Near-complete Lokiarchaeota genomes from complex environmental samples using long and short read metagenomic analyses. Preprint at bioRxiv https://doi.org/10.1101/2019.12.17.879148 (2019).
Koren, S. et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27, 722–736 (2017).
Nurk, S., Meleshko, D., Korobeynikov, A. & Pevzner, P. A. metaSPAdes: a new versatile metagenomic assembler. Genome Res. 27, 824–834 (2017).
Jain, C., Rodriguez-R, L. M., Phillippy, A. M., Konstantinidis, K. T. & Aluru, S. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat. Commun. 9, 5114 (2018).
Barco, R. A. et al. A genus definition for bacteria and archaea based on a standard genome relatedness index. mBio 11, e02475-19 (2020).
Rinke, C. et al. A standardized archaeal taxonomy for the Genome Taxonomy Database. Nat. Microbiol. 6, 946–959 (2021).
Dombrowski, N., Teske, A. P. & Baker, B. J. Expansive microbial metabolic versatility and biodiversity in dynamic Guaymas Basin hydrothermal sediments. Nat. Commun. 9, 4999 (2018).
Imachi, H. et al. Isolation of an archaeon at the prokaryote–eukaryote interface. Nature 577, 519–525 (2020).
López-García, P. & Moreira, D. The Syntrophy hypothesis for the origin of eukaryotes revisited. Nat. Microbiol. 5, 655–667 (2020).
Sousa, F. L., Neukirchen, S., Allen, J. F., Lane, N. & Martin, W. F. Lokiarchaeon is hydrogen dependent. Nat. Microbiol. 1, 16034 (2016).
Manoharan, L. et al. Metagenomes from coastal marine sediments give insights into the ecological role and cellular features of Loki- and Thorarchaeota. mBio 10, e02039-19 (2019).
Roller, B. R. K., Stoddard, S. F. & Schmidt, T. M. Exploiting rRNA operon copy number to investigate bacterial reproductive strategies. Nat. Microbiol. 1, 16160 (2016).
Yao, J. & Rock, C. O. Phosphatidic acid synthesis in bacteria. Biochim. Biophys. Acta 1831, 495–502 (2013).
Susko, E. & Roger, A. J. On reduced amino acid alphabets for phylogenetic inference. Mol. Biol. Evol. 24, 2139–2150 (2007).
Parks, D. H. et al. A complete domain-to-species taxonomy for Bacteria and Archaea. Nat. Biotechnol. 38, 1079–1086 (2020).
López-García, P., Zivanovic, Y., Deschamps, P. & Moreira, D. Bacterial gene import and mesophilic adaptation in archaea. Nat. Rev. Microbiol. 13, 447–456 (2015).
Nelson-Sathi, S. et al. Origins of major archaeal clades correspond to gene acquisitions from bacteria. Nature 517, 77–80 (2015).
Article CAS PubMed Google Scholar
Groussin, M. et al. Gene acquisitions from bacteria at the origins of major archaeal clades are vastly overestimated. Mol. Biol. Evol. 33, 305–310 (2016).
Article CAS PubMed Google Scholar
Williams, K. P. Integration sites for genetic elements in prokaryotic tRNA and tmRNA genes: sublocation preference of integrase subfamilies. Nucleic Acids Res. 30, 866–875 (2002).
Article CAS PubMed PubMed Central Google Scholar
Koonin, E. V., Makarova, K. S., Wolf, Y. I. & Krupovic, M. Evolutionary entanglement of mobile genetic elements and host defence systems: guns for hire. Nat. Rev. Genet. 21, 119–131 (2020).
Article CAS PubMed Google Scholar
Roux, S. et al. IMG/VR v3: an integrated ecological and evolutionary framework for interrogating genomes of uncultivated viruses. Nucleic Acids Res. 49, D764–D775 (2021).
Article CAS PubMed Google Scholar
Cantu, V. A. et al. PhANNs, a fast and accurate tool and web server to classify phage structural proteins. PLoS Comput. Biol. 16, e1007845 (2020).
Article CAS PubMed PubMed Central Google Scholar
Betts, H. C. et al. Integrated genomic and fossil evidence illuminates life’s early evolution and eukaryote origin. Nat. Ecol. Evol. 2, 1556–1562 (2018).
Article PubMed PubMed Central Google Scholar
Huerta-Cepas, J. et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 47, D309–D314 (2019).
Article CAS PubMed Google Scholar
Tatusov, R. L. et al. The COG database: an updated version includes eukaryotes. BMC Bioinform. 4, 41 (2003).
Article Google Scholar
Alvarez-Ponce, D., Lopez, P., Bapteste, E. & McInerney, J. O. Gene similarity networks provide tools for understanding eukaryote origins and evolution. Proc. Natl Acad. Sci. USA 110, E1594–E1603 (2013).
Article CAS PubMed PubMed Central Google Scholar
Ku, C. et al. Endosymbiotic gene transfer from prokaryotic pangenomes: inherited chimerism in eukaryotes. Proc. Natl Acad. Sci. USA 112, 10139–10146 (2015).
Article CAS PubMed PubMed Central Google Scholar
Brueckner, J. & Martin, W. F. Bacterial genes outnumber archaeal genes in eukaryotic genomes. Genome Biol. Evol. 12, 282–292 (2020).
Article CAS PubMed PubMed Central Google Scholar
Kurland, C. G., Collins, L. J. & Penny, D. Genomics and the irreducible nature of eukaryote cells. Science 312, 1011–1014 (2006).
Article CAS PubMed Google Scholar
Giovannoni, S. J. et al. Genome streamlining in a cosmopolitan oceanic bacterium. Science 309, 1242–1245 (2005).
Article CAS PubMed Google Scholar
Kapusta, A., Suh, A. & Feschotte, C. Dynamics of genome size evolution in birds and mammals. Proc. Natl Acad. Sci. USA 114, E1460–E1469 (2017).
Article CAS PubMed PubMed Central Google Scholar
Scheller, S., Yu, H., Chadwick, G. L., McGlynn, S. E. & Orphan, V. J. Artificial electron acceptors decouple archaeal methane oxidation from sulfate reduction. Science 351, 703–707 (2016).
Article CAS PubMed Google Scholar
Mason, O. U. et al. Comparison of archaeal and bacterial diversity in methane seep carbonate nodules and host sediments, Eel River Basin and Hydrate Ridge, USA. Microb. Ecol. 70, 766–784 (2015).
Article CAS PubMed Google Scholar
Caporaso, J. G. et al. QIIME allows analysis of high-throughput community sequencing data. Nat. Methods 7, 335–336 (2010).
Article CAS PubMed PubMed Central Google Scholar
Quast, C. et al. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res. 41, D590–D596 (2013).
Article CAS PubMed Google Scholar
Bahram, M., Anslan, S., Hildebrand, F., Bork, P. & Tedersoo, L. Newly designed 16S rRNA metabarcoding primers amplify diverse and novel archaeal taxa from the environment. Environ. Microbiol. Rep. 11, 487–494 (2019).
Article PubMed Google Scholar
Callahan, B. J. et al. High-throughput amplicon sequencing of the full-length 16S rRNA gene with single-nucleotide resolution. Nucleic Acids Res. 47, e103 (2019).
Article CAS PubMed PubMed Central Google Scholar
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
Article CAS PubMed PubMed Central Google Scholar
Walker, B. J. et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS ONE 9, e112963 (2014).
Article PubMed PubMed Central Google Scholar
Qin, M. et al. LRScaf: improving draft genomes using long noisy reads. BMC Genom. 20, 955 (2019).
Article CAS Google Scholar
Kang, D. D. et al. MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ. 7, e7359 (2019).
Article PubMed PubMed Central Google Scholar
Varghese, N. J. et al. Microbial species delineation using whole genome sequences. Nucleic Acids Res. 43, 6761–6771 (2015).
Article CAS PubMed PubMed Central Google Scholar
Pruesse, E., Peplies, J. & Glöckner, F. O. SINA: accurate high-throughput multiple sequence alignment of ribosomal RNA genes. Bioinformatics 28, 1823–1829 (2012).
Article CAS PubMed PubMed Central Google Scholar
Rodriguez-R, L. M. & Konstantinidis, K. T. The enveomics collection: a toolbox for specialized analyses of microbial genomes and metagenomes. PeerJ. Prepr. 4, e1900v1 (2016).
Google Scholar
Brettin, T. et al. RASTtk: a modular and extensible implementation of the RAST algorithm for building custom annotation pipelines and annotating batches of genomes. Sci. Rep. 5, 8365 (2015).
Article PubMed PubMed Central Google Scholar
Davis, J. J. et al. The PATRIC Bioinformatics Resource Center: expanding data and analysis capabilities. Nucleic Acids Res. 48, D606–D612 (2020).
CAS PubMed Google Scholar
Lagesen, K. et al. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res. 35, 3100–3108 (2007).
Article CAS PubMed PubMed Central Google Scholar
Russel, J., Pinilla-Redondo, R., Mayo-Muñoz, D., Shah, S. A. & Sørensen, S. J. CRISPRCasTyper: automated identification, annotation, and classification of CRISPR–Cas loci. CRISPR J. 3, 462–469 (2020).
Article CAS PubMed Google Scholar
Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).
Article CAS PubMed PubMed Central Google Scholar
Lu, S. et al. CDD/SPARCLE: the conserved domain database in 2020. Nucleic Acids Res. 48, D265–D268 (2020).
Article CAS PubMed Google Scholar
Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015).
Article CAS PubMed PubMed Central Google Scholar
Eddy, S. R. Accelerated profile HMM searches. PLoS Comput. Biol. 7, e1002195 (2011).
Article CAS PubMed PubMed Central Google Scholar
Angle, J. C. et al. Methanogenesis in oxygenated soils is a substantial fraction of wetland methane emissions. Nat. Commun. 8, 1567 (2017).
Article PubMed PubMed Central Google Scholar
Nayfach, S. et al. A genomic catalog of Earth’s microbiomes. Nat. Biotechnol. 39, 499–509 (2021).
Article CAS PubMed Google Scholar
Rasigraf, O. et al. Microbial community composition and functional potential in Bothnian Sea sediments is linked to Fe and S dynamics and the quality of organic matter. Limnol. Oceanogr. 65, S113–S133 (2020).
Article CAS Google Scholar
Seitz, K. W. et al. Asgard archaea capable of anaerobic hydrocarbon cycling. Nat. Commun. 10, 1822 (2019).
Article PubMed PubMed Central Google Scholar
Seitz, K. W., Lazar, C. S., Hinrichs, K.-U., Teske, A. P. & Baker, B. J. Genomic reconstruction of a novel, deeply branched sediment archaeal phylum with pathways for acetogenesis and sulfur reduction. ISME J. 10, 1696–1705 (2016).
Article CAS PubMed PubMed Central Google Scholar
Tully, B. J., Graham, E. D. & Heidelberg, J. F. The reconstruction of 2,631 draft metagenome-assembled genomes from the global oceans. Sci. Data 5, 170203 (2018).
Article CAS PubMed PubMed Central Google Scholar
Vavourakis, C. D. et al. Metagenomes and metatranscriptomes shed new light on the microbial-mediated sulfur cycle in a Siberian soda lake. BMC Biol. 17, 69 (2019).
Article PubMed PubMed Central Google Scholar
Wong, H. L. et al. Disentangling the drivers of functional complexity at the metagenomic level in Shark Bay microbial mat microbiomes. ISME J. 12, 2619–2639 (2018).
Article CAS PubMed PubMed Central Google Scholar
Penev, P. I. et al. Supersized ribosomal RNA expansion segments in Asgard archaea. Genome Biol. Evol. 12, 1694–1710 (2020).
Article CAS PubMed PubMed Central Google Scholar
Farag, I. F., Zhao, R. & Biddle, J. F. ‘Sifarchaeota,’ a novel Asgard phylum from Costa Rican sediment capable of polysaccharide degradation and anaerobic methylotrophy. Appl. Environ. Microbiol. 87, e02584-20 (2021).
Article PubMed PubMed Central Google Scholar
Nguyen, L.-T., Schmidt, H. A., von Haeseler, A. & Minh, B. Q. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32, 268–274 (2015).
Article CAS PubMed Google Scholar
Chernomor, O., von Haeseler, A. & Minh, B. Q. Terrace aware data structure for phylogenomic inference from supermatrices. Syst. Biol. 65, 997–1008 (2016).
Article PubMed PubMed Central Google Scholar
Kalyaanamoorthy, S., Minh, B. Q., Wong, T. K. F., von Haeseler, A. & Jermiin, L. S. ModelFinder: fast model selection for accurate phylogenetic estimates. Nat. Methods 14, 587–589 (2017).
Article CAS PubMed PubMed Central Google Scholar
Hoang, D. T., Chernomor, O., von Haeseler, A., Minh, B. Q. & Vinh, L. S. UFBoot2: improving the ultrafast bootstrap approximation. Mol. Biol. Evol. 35, 518–522 (2018).
Article CAS PubMed Google Scholar
Edgar, R. C. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinform. 5, 113 (2004).
Article Google Scholar
Eren, A. M. et al. Anvi’o: an advanced analysis and visualization platform for ‘omics data. PeerJ. 3, e1319 (2015).
Article PubMed PubMed Central Google Scholar
Letunic, I. & Bork, P. Interactive Tree Of Life (iTOL) v4: recent updates and new developments. Nucleic Acids Res. 47, W256–W259 (2019).
Article CAS PubMed PubMed Central Google Scholar
Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinform. 10, 421 (2009).
Article Google Scholar
Criscuolo, A. & Gribaldo, S. BMGE (Block Mapping and Gathering with Entropy): a new software for selection of phylogenetic informative regions from multiple sequence alignments. BMC Evol. Biol. 10, 210 (2010).
Article PubMed PubMed Central Google Scholar
Speth, D. R. et al. Microbial community of recently discovered Auka vent field sheds light on vent biogeography and evolutionary history of thermophily. Preprint at bioRxiv https://doi.org/10.1101/2021.08.02.454472 (2021).
Brister, J. R., Ako-Adjei, D., Bao, Y. & Blinkova, O. NCBI viral genomes resource. Nucleic Acids Res. 43, D571–D577 (2015).
Article CAS PubMed Google Scholar
Al-Shayeb, B. et al. Clades of huge phages from across Earth’s ecosystems. Nature 578, 425–431 (2020).
Article CAS PubMed PubMed Central Google Scholar
Schulz, F. et al. Giant virus diversity and host interactions through global metagenomics. Nature 578, 432–436 (2020).
Article CAS PubMed PubMed Central Google Scholar
Bäckström, D. et al. Virus genomes from deep sea sediments expand the ocean megavirome and support independent origins of viral gigantism. mBio 10, e02497-18 (2019).
Article PubMed PubMed Central Google Scholar
Shmakov, S. A. et al. The CRISPR spacer space is dominated by sequences from species-specific mobilomes. mBio 8, e01397-17 (2017).
Article CAS PubMed PubMed Central Google Scholar
Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).
Article CAS PubMed Google Scholar
Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59–60 (2015).
Article CAS PubMed Google Scholar
Speth, D. R. & Orphan, V. J. ASM-Clust: classifying functionally diverse protein families using alignment score matrices. Preprint at bioRxiv https://doi.org/10.1101/792739 (2019).
Nakamura, T., Yamada, K. D., Tomii, K. & Katoh, K. Parallelization of MAFFT for large-scale multiple sequence alignments. Bioinformatics 34, 2490–2492 (2018).
Article CAS PubMed PubMed Central Google Scholar
Capella-Gutiérrez, S., Silla-Martínez, J. M. & Gabaldón, T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25, 1972–1973 (2009).
Article PubMed PubMed Central Google Scholar

Download references

Acknowledgements

We thank W. Fischer for critical comments on the manuscript, L. Kelly for advice on viral sequence analysis, A. Roger for discussions on phylogenetic methods and K. Makarova and E. Koonin for discussions on CRISPR–Cas systems. We thank the pilots, crew and participants on the cruises to the southern Pescadero Basin, FK181031 on R/V Falkor operated by the Schmidt Ocean Institute and NA091 on E/V Nautilus operated by the Ocean Exploration Trust, with NA091 supported by the Dalio Foundation and Woods Hole Oceanographic Institute. This research used samples provided by the Ocean Exploration Trust’s Nautilus Exploration Program, cruise NA091. We thank chief scientists S. Wankel and A. Michel for the opportunity to sail on NA091 and co-chief scientists D. Caress and R. Zierenberg for the opportunity to sail on FK181031; S. Wankel, A. Foulk, L. Marsh, R. Zierenberg and D. Cardace for assistance with shipboard processing of rock samples; and J. Magyar and S. Goffredi for shipboard processing of sediment samples. Illumina library construction and Nanopore sequencing were performed at the Millard and Muriel Jacobs Genetics and Genomics Laboratory at Caltech. F.W. was supported by the Netherlands Organisation for Scientific Research Rubicon Award no. 019.162LW.037 and a Human Frontiers Science Program Long-term Fellowship no. LT000468/2017. D.R.S. was supported by the Netherlands Organisation for Scientific Research Rubicon Award no. 019.153LW.039 and the Caltech GPS Division Texaco Postdoctoral Fellowship. J.P.A. is funded by the National Science Foundation (NSF) no. OCE-1431598. V.J.O. is a Canadian Institute for Advanced Science fellow in the Earth 4D program. This research was supported by a Caltech Center for Evolutionary Science Pilot Grant (F.W. and V.J.O.), the NOMIS Foundation (V.J.O.), the Simons Foundation Principles of Microbial Ecosystems project (V.J.O.) and the NSF Center for Dark Energy Biosphere Investigations (no. OCE-0939564, V.J.O. and J.P.A.).
The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author information

Authors and Affiliations

Division of Geological and Planetary Sciences, California Institute of Technology, Pasadena, CA, USA
Fabai Wu, Daan R. Speth, Alon Philosof, Antoine Crémière, Stephanie A. Connon & Victoria J. Orphan
Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
Fabai Wu, Daan R. Speth, Aditi Narayanan, Igor A. Antoshechkin & Victoria J. Orphan
Department of Earth Sciences, University of Southern California, Los Angeles, CA, USA
Roman A. Barco & Jan P. Amend
Department of Biological Sciences, University of Southern California, Los Angeles, CA, USA
Jan P. Amend

Contributions

F.W., D.R.S., A.C. and V.J.O. conceived the project. D.R.S. and V.J.O. collected the hydrothermal vent samples. F.W., D.R.S., A.C. and S.A.C. carried out the microbial incubations, periodic sampling, DNA extraction, sulphide analyses and amplicon sequencing. I.A.A. and A.N. prepared the Illumina sequencing libraries. I.A.A. performed the Oxford Nanopore sequencing. F.W., I.A.A. and A.P. assembled the Asgard archaea genomes. D.R.S. performed the phylogenomic analyses and protein clustering and provided overall bioinformatics platform support. R.A.B. performed the ANI/AAI analyses and taxonomic evaluation. F.W., D.R.S. and A.P. annotated the genomes. F.W. performed the PacBio HiFi 16S sequencing, protein phylogenetic analyses, marker HMM construction, comparative genomics, CRISPR/mobilome discovery and statistical analyses, and wrote the paper. V.J.O. revised the paper. D.R.S., A.P., A.C., R.A.B. and J.P.A. provided critical comments on the paper. All authors read and approved the manuscript. V.J.O. and J.P.A. supervised the work.

Corresponding authors

Correspondence to Fabai Wu or Victoria J. Orphan.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Microbiology thanks Brett Baker and the other, anonymous, reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 The emergence of Ca. Heimdallarchaeum endolithica, which belongs to the Ancient Archaea Group (AAG) of Heimdallarchaeota, in a series of incubations derived from the same rock originating from the Pescadero Basin.

a. Maximum fraction of the AAG phylotype within the first-generation incubations from the rock detected within a 13-month period. b. Maximum fraction of the AAG phylotype within the serial dilution cultures of the initial lactate-fed culture. c. Amplicon sequencing of a hypervariable region of the 16S rRNA gene showing the fraction of the AAG phylotype in second-generation lactate-fed cultures. Mixed, a mixture of rock and medium was transferred from the first-generation incubation. Planktonic, only top-layer medium was transferred. d. Reduction in community complexity over time as indicated by total operational taxonomic unit (OTU) counts (top) and the Shannon diversity index (bottom). e. Full-length 16S rRNA gene survey using universal archaeal primers showing a single abundant AAG phylotype (Ca. Heimdallarchaeum endolithica) above noise, and its 16S sequence dissimilarity (percent sequence identity difference) with other archaea in the community. Loki, a Lokiarchaeota phylotype; Thermopl, a Thermoplasmatota phylotype; Woese, a Woesearchaeia phylotype. f. Wide-field microscopy images of a large multispecies biofilm isolated from the lactate-fed second-generation incubation, which was stained using DAPI (DNA), FM1-43 (membrane lipids) and concanavalin A (extracellular matrix). Imaging was repeated twice with similar observations. In c and d, error bars indicate s.d.; n = 2 independent DNA samples extracted from the same incubation.
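For readers unfamiliar with the diversity measure in panel d, the following is a minimal sketch of how a Shannon diversity index can be computed from OTU counts. The function name and the example counts are hypothetical, and the use of the natural logarithm is an assumption; the caption does not state which log base or software was used.

```python
import math

def shannon_index(otu_counts):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over OTUs with nonzero counts.

    otu_counts: read counts per OTU (hypothetical input format).
    The natural log is assumed here; other bases only rescale H.
    """
    total = sum(otu_counts)
    proportions = [c / total for c in otu_counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)

# Hypothetical communities: an even one and one dominated by a single OTU.
even_community = [25, 25, 25, 25]
dominated_community = [97, 1, 1, 1]
print(shannon_index(even_community) > shannon_index(dominated_community))
```

A community dominated by one phylotype yields a lower index than an even community with the same number of OTUs, which is the sense in which a falling index indicates reduced complexity.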

Source data

Extended Data Fig. 2 Maximum-likelihood analyses of 282 Asgard archaea MAGs and genomes rooted using 15 TACK archaea.

The different clades are labeled in different colors, with clade names indicated in the same color. MAGs selected for detailed phylogenomic analyses are annotated, with published ones in black and those constructed in this study in bold blue. The Jord and Wukong clades do not yet have representatives passing the genome selection filter based on marker coverage and genome contiguity scores. Detailed descriptions of these genomes can be found in Supplementary Tables 8 (all Asgard archaea), 9 (selected Asgard archaea) and 16 (TACK). Markers used can be found in Supplementary Table 17.

Source data

Extended Data Fig. 3 Genome-based metabolic predictions of Ca. Heimdallarchaeum spp. and comparisons with other contiguous, near-complete Asgard archaea MAGs.

a. Illustration of the metabolic reconstruction highlighting hydrogen metabolism and the tricarboxylic acid (TCA) cycle. Abbreviations: α-KG, α-ketoglutarate; OAA, oxaloacetate; SHY, sulfhydrogenase (cytosolic hydrogenase); MBH, membrane-bound hydrogenase; lac, lactate; carb.hydr., carbohydrate; Pyr, pyruvate; PEP, phosphoenolpyruvate; Ace, acetate; Eth, ethanol. b and c. Enzymes involved in TCA cycle reactions (b) and cytosolic hydrogen evolution (c) in each representative genome/MAG of Asgard archaea.

Extended Data Fig. 4 Marker determination and phylogeny of expanded representatives of Asgard archaea.

a. Differential distributions of putatively single-copy archaeal marker genes in initially selected genomes/MAGs, which show cross-Asgard and clade-specific marker coverage features. b. Maximum-likelihood phylogeny of an expanded selection of Asgard archaea MAGs in relation to Euryarchaeota, TACK and Eukaryota. Ca. H. aukensis was omitted here to improve evenness in the taxonomic selection, owing to its close relation with Ca. H. endolithica. Detailed descriptions of the 51 genomes used in the analyses can be found in Supplementary Tables 9 (selected Asgard archaea) and 16 (TACK + eukaryotes). Markers used can be found in Supplementary Table 17. Purple indicates genomes and MAGs constructed in this study.

Source data

Extended Data Fig. 5 CRISPR/Cas systems in Ca. Heimdallarchaeum spp.

a–e. Schematic showing the gene synteny of the CRISPR/Cas systems (serial numbers and operon types are in bold pink) and their alignments between the two genomes. Genes conserved between the two genomes are labeled in various shades of blue and purple to assist visualization. Genes appearing in only one of the genomes are in yellow. Red indicates CRISPR arrays. Array sizes are indicated by the number of repeats, such as [77x]. In b, the aloposons with giant genes are also shown to illustrate their site-specific integration. f. Size distribution of spacers in each CRISPR array.

Source data

Extended Data Fig. 6 Numbers of sequences homologous to some of the proteins encoded by Heimdallarchaeal viruses HeimV1 and HeimV2.

Magenta stars indicate enrichment in the viral database. The homology search was carried out with DIAMOND v2.0.6 using an e-value cutoff of 10^-3. ASG, Asgard archaea genomes; FWA, in-house metagenomic assemblies of microbial communities in Pescadero Basin incubations; PAA, publicly available and published metagenomic assemblies of microbial communities in Guaymas Basin sediment; Vir, IMG/VR v3 viral database; Gtdb, genomic sequences from GTDB v202. See Methods and Supplementary Tables for the details of these datasets.

Source data

Extended Data Fig. 7 Giant proteins encoded by Asgard archaea.

a. Gene synteny showing (1) an additional genomic region with truncated, fragmented sequences homologous to one of the giant genes in aloposons, and (2) tandem giant genes in Thorarchaeotes that show high homology with their neighbors and are distantly related to one of the two giant genes in aloposons. b. Giant proteins larger than 3,000 amino acids encoded by selected Asgard archaea representatives. Proteins in dark grey are part of the aloposons. Functional domains identified through Conserved Domain Database (CDD) analyses are indicated on the right. Purple indicates genomes constructed in this study.

Extended Data Fig. 8 Maximum-likelihood analyses of HeimV2 IbrA-like protein.

The branch names are as follows: For viruses, serial numbers followed by viral taxonomy then followed by host taxonomy if available. For microbial genomes, serial numbers followed by taxonomy. In total, 147 proteins were included in the analyses.

Source data

Extended Data Fig. 9 Scaling property of gene flow is obscured by fragmented genomes of varying quality.

The plots show the number of Archaea-related genes in relation to the total gene counts in the Asgard archaea genomes. a. Only the eight genomes investigated in detail in this study. All genomes have fewer than 20 contigs and verified coverage of all archaeal markers. b. In addition to the genomes in a, 12 further genomes were added (in black), which contain no more than 100 contigs and meet the loosened completeness scores shown in Supplementary Table 9. Since marker redundancy differs among lineages, the contamination level is hard to assess. c. In addition to the genomes in b, all other 262 published Asgard archaea genomes were added (in green). These genomes deviate severely from the invariable relation shown in a and instead show a near-linear relation. This can be understood as follows: in either incomplete or contaminated genomes, all types of genes have an equal probability of being retained. For example, the 1.5 Mb Odinarchaeote genome contains a similar number of Archaea-related genes (~900) as a Lokiarchaeote genome of 4.4 Mb. However, if a Lokiarchaeote genome is fragmented into 300 contigs and only 1.5 Mb in total length is randomly binned into a MAG, that MAG will contain roughly 300 Archaea-related genes. Hence, the type of relation shown in a can only be captured in highly confident, complete genomes. The legend for all panels is shown in c.
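The arithmetic behind the Lokiarchaeote example can be sketched as follows. This is an illustrative calculation only, assuming that genes are spread evenly along the genome and that binning samples sequence at random; the function name is hypothetical, and the input values are the approximate figures quoted in the caption.

```python
def expected_genes_in_bin(genes_in_full_genome, full_genome_mb, binned_mb):
    """Expected gene count when a random fraction of a genome is binned.

    Assumes genes are evenly distributed, so the expected count scales
    linearly with the fraction of sequence recovered.
    """
    fraction_recovered = binned_mb / full_genome_mb
    return genes_in_full_genome * fraction_recovered

# Approximate values from the caption: ~900 Archaea-related genes in a
# 4.4 Mb Lokiarchaeote genome, of which only 1.5 Mb ends up in the MAG.
print(round(expected_genes_in_bin(900, 4.4, 1.5)))  # 307
```

Under these assumptions the fragmented Lokiarchaeote MAG would carry about one third of the Archaea-related genes found in the complete 1.5 Mb Odinarchaeote genome, despite having the same total length, which is why incomplete MAGs push the data toward a linear relation.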

Source data

Supplementary information

Supplementary Information

Supplementary Notes 1 and 2.

Reporting Summary

Peer Review Information

Supplementary Tables

Supplementary Tables 1–17.

Supplementary Data 1

Hidden Markov Models for Asgard archaea markers.

Supplementary Data 2

Sequences of mobile elements targeting Ca. Heimdallarchaeum spp.

Source data

Source Data Fig. 1

Phylogenomic trees in Newick format.

Source Data Fig. 4

Phylogenetic trees in Newick format.

Source Data Fig. 5

Numerical data.

Source Data Fig. 6

Numerical data.

Source Data Extended Data Fig. 1

Numerical data.

Source Data Extended Data Fig. 2

Phylogenetic trees in Newick format.

Source Data Extended Data Fig. 4

Phylogenetic trees in Newick format.

Source Data Extended Data Fig. 5

Numerical data.

Source Data Extended Data Fig. 6

Numerical data.

Source Data Extended Data Fig. 8

Phylogenetic trees in Newick format.

Source Data Extended Data Fig. 9

Numerical data.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Wu, F., Speth, D.R., Philosof, A. et al. Unique mobile elements and scalable gene flow at the prokaryote–eukaryote boundary revealed by circularized Asgard archaea genomes. Nat Microbiol 7, 200–212 (2022). https://doi.org/10.1038/s41564-021-01039-y


Received: 12 May 2021
Accepted: 29 November 2021
Published: 13 January 2022
Issue Date: February 2022
DOI: https://doi.org/10.1038/s41564-021-01039-y

