A catalogue of 1,167 genomes from the human gut archaeome

The human gut microbiome plays an important role in health, but its archaeal diversity remains largely unexplored. In the present study, we report the analysis of 1,167 nonredundant archaeal genomes (608 high-quality genomes) recovered from human gastrointestinal tract, sampled across 24 countries and rural and urban populations. We identified previously undescribed taxa including 3 genera, 15 species and 52 strains. Based on distinct genomic features, we justify the split of the Methanobrevibacter smithii clade into two separate species, with one represented by the previously undescribed ‘Candidatus Methanobrevibacter intestini’. Patterns derived from 28,581 protein clusters showed significant associations with sociodemographic characteristics such as age groups and lifestyle. We additionally show that archaea are characterized by specific genomic and functional adaptations to the host and carry a complex virome. Our work expands our current understanding of the human archaeome and provides a large genome catalogue for future analyses to decipher its impact on human physiology.

from the human Git reveal taxonomic expansion of the archaeome. Phylogenetic tree of genomes clustered at 99% similarity ('strains'), shown with the following characteristics (from left to right): proposed original taxa (indicated by stars on the branch of the phylogenetic tree), including ultrafast bootstrap values. Species representatives are highlighted by bold genome numbers. Isolates, representatives of unknown genera and species are indicated by a coloured dot next to the genome number. Taxonomic affiliation of representative genomes is shown at order, genus and species level. The number of genomes assigned to the strain-level taxon is shown in the grey histogram. The origin displays the origin of the samples from which this genome and its representatives could be assembled. The pie chart displays the proportion of the origins. The respective genome size of the representative genome is displayed in megabases (Mb; brown bars). There is an overview of the absence and presence of genes involved in host interactions: with bile salt hydrolases (blue; BSH) and oxygen resistance genes (green), and the presence of genomes potentially coding for adhesins/adhesin-like/'Flg_new' domain 16 proteins (orange). Genomes (strain list) were analysed using MaGe Microscope and genes were counted as present when automatic annotation was positive ('putative' annotation was counted as positive).

NaTurE MIcrOBIOlOgy
Methanocorpusculum (3; 0.3%). Methanobacterium, Haloferax and Halorubrum spp. were represented by only one genome each. Of the 1,167 genomes, 10 (0.85%) could not be assigned to any previously described genus and 98 genomes (8.3%) did not match any known species. A large proportion of genomes not matching any known species (n = 83) and genera (n = 10) were affiliated with the order Methanomassiliicoccales. Read-based community profiling revealed a fraction of 1.22% archaeal reads in representative original datasets (Supplementary Table 2a-f). Based on growth rate index analyses, we received good evidence that the major archaeal species are indeed actively replicating within their habitat (see Supplementary Information and Extended Data Fig. 2). Pan-genome analyses (Supplementary Information) revealed that the gut archaeome still remains largely undersampled.
Archaeal protein profile correlates with geographic and demographic parameters. In total, 1.8 million proteins were identified from the 1,167 genomes, 54% of which were annotated as hypothetical proteins. A protein catalogue of all 1,167 archaeal genomes was generated by clustering the genes predicted across all genomes and excluding singleton clusters, resulting in 28,581 cluster representatives (>50% amino acid identity and >80% coverage) (Extended Data Fig. 3 and Supplementary Material 1). 2,050 proteins (thereof 58% hypothetical proteins) were found to be shared among >50 genomes in our dataset, mirroring the taxonomic distance of the two most abundant orders, Methanomassiliicoccales and Methanobacteriales (Fig. 3a).
The protein catalogue had predictive potential for some metadata categories (Fig. 2 , Supplementary Tables 3 and 4, and Extended Data Figs. 4 and 5). Highest prediction accuracies were reached for the lifestyle (urban/rural) of an individual (overall accuracy = 100%). Prediction accuracies >70% were still reached for the continent, country, health status, age group or sex of an individual, whereas the body mass index (BMI) group was less suitable to build supervised learning models (prediction accuracies <70%) and achieved significance only when predictions were based on actual numerical BMI values rather than grouped BMI categories (R = 0.4, P = 2.9 × 10 −5 ). For some metadata categories such as lifestyle, sex and origin per country of an individual, predictions improved if they were based on abundances (mapped protein matrix) rather than presence/ absence (unified protein catalogue). Please refer to Supplementary Information for results on combinatory effects of multiple metadata categories, and on the association of hypothetical proteins with various metadata categories (Supplementary Tables 5 and 6).
The dataset reveals previously undescribed members of the human gastrointestinal archaeome. We obtained 20 genomes affiliated with Methanosphaera sp., including three genomes from isolates. Taxonomically, human-associated Methanosphaera genomes were affiliated to three distinct species-level clades (Extended Data Fig. 6 and Supplementary Table 7a-c). Among those, M. stadtmanae was the most commonly retrieved, with 17 genomes (14 MAGs). M. stadtmanae reads represented a fraction of 0.028% among all microbial reads, with an average fraction of 13.45% among all archaeal reads in reference datasets (for details, see Supplementary Table  2a, and also for other taxa mentioned below). Two MAGs (average nucleotide identity (ANI) 98.5%) clustered within M. cuniculi, and were retrieved from healthy Asian subjects living in an urban environment. The M. cuniculi type strain was originally isolated from the intestinal tract of a rabbit 28 and has not been reported thus far in human hosts. One additional MAG belonging to the genus Methanosphaera was binned from a gut metagenome of a diseased (colorectal cancer) European male (BMI 21, age 64 years, urban environment). This genome clustered together with RUG761, a genome recovered from cattle intestines 29 (ANI 99.0%; Extended Data Fig. 6).
The dataset of human-associated Methanomassiliicoccales consisted of 145 genomes corresponding to 12 species (Supplementary  Table 1a). The genomes were distributed into two families, most of them belonging to 'host-associated' Methanomethylophilaceae (116 genomes), the other to Methanomassiliicoccaceae ('free-living clade'; 29 genomes). Five of the candidate species corresponded to genomes previously found in human samples, comprising 81% of the Methanomassiliicoccales from the present study. These included Methanomassiliicoccales Mx06 sp. 16 (44 genomes), Methanomethylophilus alvus 15 (37 genomes) and Methanomassiliicoccus intestinalis 14 (20 genomes), being the most prevalent Methanomassiliicoccales representative in human populations 16 . Mx06 representatives were mostly present in young adults (aged 32 years (average), n = 34) from rural areas (80%; n = 40) in Oceania, Asia and Africa (65%, 13% and 7%, respectively; n = 43). Together with its high prevalence (80%) in a population of 7-to 48-year-old uncontacted Amerindians 16,30 , it appears that this species is strongly linked with nonwesternized populations. The young age of people with this species contrasts with previously reported positive correlation between age and methanogen prevalence. Several representatives of this species have the genetic potential to metabolize TMA, a bacterial metabolite involved in trimethylaminuria and suspected in cardiometabolic, cardiovascular and renal diseases. This species is part of a well-supported clade that is separated from other Methanomethylophilaceae genera (Methanomethylophilus, Methanogranum and Methanoplasma spp.) and belongs to the candidate genus 'UBA71' following the Genome Taxonomy Database (GTDB) classification (Supplementary Table 1c). We thus suggest that it represents a previously undescribed genus and species, and propose the name of 'Candidatus Methanoprimaticola hominis' gen. nov., sp. nov. (Me.tha.no.pri.ma.ti' co.la. N.L. pref. methanopertaining to methane; N.L. pl. n. Primates a zoological order; L. suff. -cola (from L. masc. or fem. n. incola) an inhabitant, dweller; N.L. fem. n. Methanoprimaticola a methane-forming dweller of primates; ho'mi.nis. L. gen. n. hominis of a human) for representatives of Mx06 (representative MAG: GUT_GENOME268463). Ca. Methanoprimaticola hominis represented 0.094% of all microbial reads (691 studies), and 0.50-69.22% of all archaeal reads in 48 of 691 analysed studies (Supplementary Table 2a).
In addition to the species previously identified through MAGs or culture approaches, we identified 6 undescribed species of Methanomassiliicoccales, represented by 24 MAGs. One of those gathers 12 MAGs and was more often found among Asian people. We propose naming it 'Ca. Methanoprimaticola macfarlanii' sp. nov. (mac.far.la'ne.i. N.L. gen. n. macfarlanei named after George T. Macfarlane; representative MAG: GUT_GENOME251929). This species represented 0.076% of all microbial reads in 691 screened studies (Supplementary Table 2a).
A number of additional archaeal taxa not yet described to be constituents of the human GIT were recovered from the MAG dataset. For details on these and other taxa (Halorubrum, Haloferax, Methanocorpuscum and Methanobacterium spp.), please refer to Supplementary Information. The M. smithii clade splits into two separate species. An overview on host association, geography, genome size and taxonomic association of known Methanobrevibacter spp. and genomes is given in Fig. 4 Table 8b). This gene is involved in molybdate transport and responsible for substrate binding 32 . In addition, among the top 25 discriminative proteins (Extended Data Fig. 7 and Supplementary Table 9b), the molybdate ABC transporter permease component, as well as the molybdate ABC transporter ATP-binding protein, were identified in 94% of all M. smithii genomes, but in none of the M. smithii_A genomes. This indicates a different pathway for molybdate acquisition in the M. smithii_A clade. The M. smithii_A genomes were further characterized by additional unique membrane/cell-wall-associated proteins, such as adhesin-like proteins, surface proteins and a number of uncharacterized membrane proteins/transporters (Extended Data Fig. 7).
Based on the extent of discriminative features, and an ANI of only 93.95% between the two representative genomes of M. smithii and M. smithii_A, we propose to rename the smithii_A clade, represented by isolate WWM1085 (GUT_GENOME143185 (ref. 33  ribosomal RNA gene sequences, which is most probably the reason for missing this clade separation previously. However, analysis of the mcrA gene revealed a consistent difference between the two clades, with an average of 2.15% difference in amino acid sequence (1.82-2.22%; Supplementary Material 4).
The human archaeome carries a complex, previously unseen virome. We identified 94 viral populations in our genome datasets (Extended Data Fig. 8 and Supplementary Table 10a-c). Of the identified proviruses, 91 viral species representatives were found to be specific for Methanobrevibacter A, and one each for Methanomassiliicoccus and Methanosphaera spp., and Methanomethylophilaceae UBA71.
Although archaeal viruses in extreme environments were discovered in the early 1970s 34,35 , little is known about nonextremophilic viruses in the highly abundant mesophilic environments, and only a few nonextremophilic archaeal viruses have been isolated so far [36][37][38][39] . To the best of our knowledge, no viruses/proviruses have been identified in the past infecting Methanomassiliicoccales and Methanobacteriales members of the human gut.  We explored the uniqueness of these 175 high-and mediumquality proviruses by comparing them with the latest comprehensive human Gut Virome Database (GVD) 3 , and the Viral Refseq Database, using the network-based viral classification tool vCon-TACT2 (ref. 40 ). However, none of the viruses clustered with any of the sequences in the databases. Due to the lack of similar archaeal viral genomes in the reference databases, the classification and further characterization of discovered archaeal viruses through metagenomic approaches remain challenging.
Taken together, these results reveal that archaeal viruses probably have a currently underestimated diversity and probable ecological importance in the human gut microbiome.
Human-associated archaea exhibit a lower proportion of bacterial genes than animal-associated archaea. The adaptation of archaea to the GIT may have been favoured by specific acquisition of genes from the resident bacterial community providing additional functions. To assess this possibility, we compared the retrieved Methanosphaera and Methanobrevibacter genomes with isolates and genomes derived from animal sources (Supplementary  Table 11). For this comparison, and to rule out false information from contaminating reads, we used only genomes from isolates and MAGs with 0% contamination.
Human-associated methanogens revealed a significantly lower proportion of genes most probably derived from bacterial origin, irrespective of whether we considered isolates only or both isolates and MAGs. Human-associated Methanobrevibacter spp. carried, on average, approximately 2.84% genes annotated as of nonarchaeal origin, which was significantly lower than the proportion of nonarchaeal genes in animal-associated Methanobrevibacter sp. (6.09%; Mann-Whitney U-test, P = 0.00308; genomes from isolates only: 6.36%). This was mainly due to a significantly increased contribution of clostridia-derived genes (specifically from Lachnospiraceae) in genomes from animals (P = 0.00116 and P < 0.00001, Mann-Whitney U-test; Extended Data Fig. 8). Lachnospiraceae representatives are mainly specialized on plant degradation. In particular, Methanobrevibacter smithii/smithii_A (Ca. M. intestini) representatives revealed a very low contribution of potentially nonarchaeal genes (2.11%; genomes from isolates only: 1.8%).
Human-associated Methanosphaera spp. carried on average a proportion of 1.45% of genes of bacterial annotation (genomes from isolates only: 0.68%). Animal-associated Methanosphaera spp., however, contained a significantly higher proportion of bacterial genes (6.74%; P = 0.00452, Mann-Whitney U-test; genomes from animal isolates only: 5.31%). The differences were mainly due to a significantly increased contribution of Bacilliand Erysipelotrichia-derived genes in genomes from animals (P = 0.000441 and 0.000509, respectively; Student's t-test; Extended Data Fig. 9). For information on Methanomassiliicoccales, please refer to Supplementary Information.
Our results indicate that adaptation towards the human host might not necessarily be reflected by a (generally) higher proportion of genes derived from the human gastrointestinal bacteriome.
Host-associated archaea are distantly related to environmental relatives. We reasoned that host-associated archaea are taxonomically and functionally distant from their environmental relatives due to the characteristics of their individual host environments.
In 16S rRNA gene-based analyses (Supplementary Table 12a,b), we found that members of genera Methanobrevibacter and Methanosphaera, as well as Ca. Methanomethylophilus belonged almost exclusively to taxa from host-associated (animal, human, plant) sources, whereas Methanocorpusculum and Nitrososphaeria spp., and Haloferaceae were more related to environmental strains (Fig. 5a) 42,43 and M. luminyensis 13,16 , widespread in various environments.
Functional and metabolic interaction of the archaeome with the gut environment. We analysed specific features that could indicate the advanced interaction of the human-associated GIT archaea with their gut environment (host and nonarchaeal microbiome; Fig. 1).
Loss of genes involved in dealing with oxidative stress is considered to be a trait of host association, because environmental strains have to face nonpermanently, strict anaerobic conditions, whereas this is not the case for strains inhabiting the GIT. We therefore analysed the presence of genes associated with oxygen resistance (catalase, superoxide dismutase, peroxiredoxin, rubredoxin and thioredoxin 44 ). Catalase was detected in some Methanomassiliicoccales (mainly Methanomassiliicoccus representatives) and Haloarchaea, and in Methanobrevibacter arboriphilus and Methanobacterium spp. The presence of a superoxide dismutase was rarely detected, namely in members of Haloferax and Halorubrum spp. None of the Methanobrevibacter representatives, except M. arboriphilus, carried the peroxiredoxin gene. In contrast, thioredoxin and rubredoxin were detected in most of the genomes (Fig. 1).
Additional functions of interest are adhesins and bile salt hydrolases (that is, choloylglycine hydrolase (CGH)). Adhesins or adhesin-like proteins were widely observed (  Table 13). For example, different from all Methanomassiliicoccales, all human gut Methanobrevibacter spp. have the genetic potential for formate and H 2 /CO 2 utilization. However, 83% of all methanogenic MAGs (including Methanobacteriales and Methanomassiliicoccales) have the mtaABC genes, providing the genetic potential to use methanol. The two dominant Methanobrevibacter spp. carry mtaABC genes, whereas four species that are rarely present do not carry these genes, strongly suggesting that methanol utilization might provide a selective advantage in the human gut. However, the condition under which Methanobrevibacter sp. uses methanol and whether it is a methanogenic substrate or enters an anabolic pathway remains to be elucidated.
The two dominant Methanobrevibacter sp. also display the genetic potential to use alcohols (probably secondary alcohols and ethanol) as electron donors for methanogenesis. One of the Methanosphaera spp. may also have the genetic capacity to reduce methanol with ethanol for methanogenesis as described earlier 45 , but this species was encountered only once in our analyses, and M. stadtmanae cannot perform this pathway.
The majority (11/13) of the GIT-associated species of Methanomassiliicoccales code for the MttBC methyltransferase and corrinoid protein needed for methanogenesis from TMA. This capacity would allow them to decrease the concentrations of this molecule produced by gut microbiota and involved in cardiovascular diseases 16,21 . The presence of the mttBC genes was detected in a larger proportion of the Methanomassiliicoccales MAGs originating from Europe and North America (~60%) with respect to Africa and Asia (~40%) or Oceania (17%) (Extended Data Fig. 10). These variations may reflect different TMA-production capacity by bacteria in the microbiota across these populations and diet habits. One of the two species of Methanomethylophilaceae lacking TMA-utilization capacity (Ca. Methanoprimatia macfarlanii) also lacks MtbBC and MtmBC methyltransferases and corrinoid proteins for dimethylamine and monomethylamine utilization, respectively. However, several strains of this species have the genes encoding the synthesis of pyrrolysine (pylSBCD), a proteinogenic amino acid (UAG codon encoded) quite exclusive to methylamine-specific methyltransferases 46,47 . The absence of detection of the methylamine-specific methyltransferases in these MAGs, including MttBC for TMA utilization, is thus probably due to genome incompleteness. The other species lacking methylamine methyltransferase, corresponding to Methanomassiliicoccales Mx02 (ref. 16 ), also lack any other genes known to be involved in methyl-compound utilization or in any alternative methanogenesis pathways (Supplementary Table 13

Discussion
Our work adds original information on the biology of the GIT archaeome, by characterizing a collection of 1,167 nonredundant archaeal genomes. We were able to make initial associations between the diversity of gut-associated archaea with several demographic and geographic patterns. However, many geographic locations remain undersampled to date.
As our genome collection is based on public datasets processed for the analysis of the bacterial component of the microbiome, a large number of archaeal species requiring specialized methods for cell lysis and DNA extraction 6 may be missing. Moreover, sequencing stool samples is not necessarily representative of the complete diversity of species in the intestines, because some archaea have been shown to form biofilms and stick to the epithelium 49 . Besides, as several taxa in our collection are represented by only single genome representatives, additional conspecific strains will be needed to allow profound analyses. Thus, we are far from capturing the entire diversity of the GIT archaeome.   The overall observed percentage of archaea present in the human gut microbiomes (~1.2%, Supplementary Table 2a) is in agreement with recently reported average percentages based on 16S rRNA gene and shotgun metagenomic information 18 . The abundance of methanogenic archaea in the human gut is highly variable and represented by two physiological types of humans, namely methane emitters (>5 p.p.m. methane in breath, ~20% of the western population; 2% archaeal signatures in overall microbial GIT community) and nonemitters (0.002% archaeal signatures), exhaling negligible amounts of this gas. The effects of these striking differences of highand low-methane emitters on host physiology are largely unclear to date, but are considered to be relevant to health and disease 18 .
The presented genome collection and the catalogue of 1.8 million putative proteins can now serve as a unique source to generate hypotheses to be addressed in future studies. This includes aspects on: (1) the archaeal physiology and metabolism; (2) the detailed comparison and differentiation of free-living, animal-and human host-associated archaea (see also ref. 50,51 ), including the aspect of HGT; (3) the interaction with the bacterial microbiome and the virome; and (4) the type of archaeal cross-talk with the human host. Moreover, considering that only 9 of 27 archaeal species detected in the human gut metagenomes had a cultured representative, the provided resource can serve as a starting point for targeted cultivation of previously uncultivated members of the archaeome and their virome.
Due to missing metadata and limited statistical power, it is challenging to establish significant associations between the archaeal genomic diversity and human lifestyles or diseases herein. Thus, experimentally driven, well-designed studies will ultimately elucidate the impact of archaea on human health 9 . Moreover, incorporating both transcriptomics and proteomics data will further reinforce the genomic predictions and improve our understanding of the regulation of archaeal physiology and host adaptation. Future efforts should also seek to extend the dataset beyond the gastrointestinal environment, to other human body sites and hosts.
Overall, our work contributes substantially to the understanding of the microbiome of the human GIT as a complex multi-domain bacterial, archaeal, fungal and viral network [52][53][54][55][56] . All microbial puzzle pieces have co-evolved and adapted together within the gut ecosystem, so study of these dynamic multi-kingdom interactions holistically will provide crucial insights into the role of the gut microbiome in health.

Methods
A resource summary is provided in Supplementary Table 14.
Dataset description. To explore the diversity of archaea in human gastrointestinal samples, we compiled publicly available genomes from four recent collections of MAGs 2,25-27,57 . Briefly, the Unified Human Gastrointestinal Genome (UHGG) collection (data access June 2020, https://www.ebi.ac.uk/metagenomics/ genomes) holds published, nonredundant MAGs and isolates, collected from public repositories and associated metadata information (see ref. 2 for more details). No statistical methods were used to predetermine sample sizes. We additionally included published genomes from cultured archaea available in the National Center for Biotechnology Information (NCBI) 58 , Pathosystems Resource Integration Center (PATRIC) 59 and Integrated Microbial Genomes and Microbiomes (IMG/M) 60 repositories.
Genomes were compared using Mash v.2.1 (ref. 61 ) and, for genomes that were estimated to be identical and had a Mash distance of 0, only one was selected. In addition, we included genomes of 'Ca. Methanomethylophilus alvus' 15 42 ) to complete the dataset. Those genomes were assigned a genome accession no. (GUT_GENOME286998, GUT_GENOME287001, GUT_GENOME287002, GUT_GENOME287004), as given in Supplementary Table 1a. This brought the total number of genomes used for the analysis in the present study to 1,167. Data collection and analysis were not performed blind to the conditions of the experiments.
Genome quality and taxonomic classification. The completeness of the nonredundant 1,167 genomes was evaluated by CheckM v.1.0.11 (ref. 62 ) and only genomes that were >50% complete and had <5% contamination were selected (following the protocol from ref. 2 ; Extended Data Figs. 1 and 2a-c). This procedure yielded 1,167 nonchimeric 63 (clade separation score (CSS) = 0; Supplementary  Fig. 1), and species (<95% ANI similarity, >75% genome completeness; Supplementary Table 1c; 27  genomes). For this, the best quality genome (genome completeness, minimal contamination, strain heterogeneity and assembly continuity based on the N50 value) from each cluster was selected as representative or, whenever an isolate was available, it was preferred and used for further analysis.
Read mapping was performed with Bowtie2 (ref. 64 ) for the genomes that had original raw reads available and were post-processed using samtools 65 . Strain heterogeneity within each MAG was computed using the script 'polymut.py' from the CMseq tool (https://github.com/SegataLab/cmseq). Alignment files were used together with the parameters --minqual 30 and --cov 10, following the method description in refs. 2,25 . A threshold of ≤0.5% indicates heterogeneity of assembly and the higher likelihood of one strain present per assembly. GUNC 63 was used to detect chimerism in all 1,167 genomes and resulted in a CSS of 0 for all genomes (Supplementary Table 1e). A CSS closer to a value of 0 indicates that a genome is free of contamination and all genes are assigned to the same taxonomy, whereas a CSS score closer to 1 indicates chimerism. The CSS, taken together with the contamination thresholds from CheckM, demonstrated that our 1,167 genomes were not chimeric in nature.
DRep v.2.0.0 (ref. 66 ) was used to dereplicate the complete dataset at 95% and 99% ANI values. The 95% ANI values were selected to separate between species boundaries (n = 27) 67 . A cut-off of 99% was selected for strain delineation, provided that a stable number of clusters for MAGs >75% complete had <5% contamination (n = 98; Extended Data Fig. 3a). Lower thresholds did not affect the number of strains recovered. The resulting strain and species representatives are given in Supplementary Table 1a-c.
All genomes were taxonomically annotated following the procedure given in ref. 2 . The taxonomic assignment was performed using the GTDB Toolkit v.0.3.1 (database release 04-RS89) 68 and default parameters that utilize a set of 122 marker genes to identify archaeal MAGs. Previously undescribed species and genera were defined when no taxonomic information was assigned for all members of a species cluster and their species representatives based on the GTDB database. The methodology is detailed in Supplementary Fig. 1.
The protein catalogue was generated by combining all predicted CDSs (total number 1,790,493) derived from the 1,167 nonredundant archaeal genomes. MMseqs2 linclust 73 was used to cluster the concatenated proteins dataset using the options '--cov-mode 1 -c 0.8' (minimum coverage threshold of 80% the length of the shortest sequence) and '--kmer-per-seq 80' . Proteins were clustered at different percentage identities and the number of unique proteins resulting per clustering for each taxonomic family was computed and visualized (Extended Data Fig. 3b). To reduce the risk of contaminants, the proteins were filtered to remove all nonclustered proteins. This gave a total of 28,581 proteins clustering at 50% identity (Supplementary Material 1) visualized using the library pheatmap 74 in R. MMseqs2 using the 'easy-search' was additionally used for aligning the 28,581 proteins to UniRef 50 (ref. 75 ) (date of download January 2021) to verify predicted proteins that resulted in 13,254 (46.37%) proteins with a hit.
In addition to the protein catalogue, the various species and strain subsets of the total 1,167 archaeal genomes (Supplementary Table 1b,c) were submitted to MaGe MicroScope (Microbial Genome Annotation & Analysis Platform 76 ), for detailed analyses of genomic synteny, and the detection of bile salt hydrolases, oxygen resistance genes and adhesins, following the automated annotation of MaGe (Supplementary Table 1f).

Relative abundance of archaea in human metagenomes. Raw read datasets (691)
were obtained from studies of the human gut microbiome, out of which 691 (of 1,167) medium-or high-quality archaeal MAGs were assembled. The remainder was not made public by their original submitters (Supplementary Table 1a).
We mapped raw reads to the 27 reference archaeal species representatives using Bowtie2-align 64 and post-processed using samtools 65 . The generated sorted mapping files were used to calculate the breadth of coverage. Breadth of coverage was calculated by dividing the total number of bases covered (using samtools mpileup) by the length of the reference genome. To get the percentage coverage breadth we multiplied the resulting number by 100.
For measuring the relative abundance of the 27 archaeal species in the different metagenomics datasets we used CoverM (https://github.com/wwood/CoverM) and the relative_abundance calculation method (Supplementary Table 2f).
Reads were additionally mapped using Kraken v.2.1.2 (ref. 77 ) (with default settings) against (1) a custom database of the UHGG catalogue available from the MGnify FTP site (http://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_ genomes/human-gut/v1.0/uhgg_kraken2-db) and (2) a customized database of the 27 archaeal species representatives in our dataset because we supplemented the initial resource with additional isolates (Data description). Results were processed using Bracken v.2.5.3 (ref. 78 ) using both read lengths 100 and 250 to estimate the relative abundance of domain-, family-and species-level taxa (Supplementary  Table 2b-d). We did not observe differences in the output values of the analysis between read lengths 100 and 250.
Protein abundance estimation. To avoid estimations based on potential false negatives derived from sample processing or genome binning, all raw reads were aligned on the unified archaeal protein catalogue using DIAMOND BLASTx 79 . The hits were counted and the result was transformed into a matrix of the number of hits for each protein per study using the pandas library 80 . This resulted in a mapped protein matrix used for further statistical analysis to minimize the risk for sample or batch effects in our dataset (Supplementary Material 2).
Besides genomic information (genome length, number of contigs, N50, GC content, genome completeness, genome contamination, and number of rRNAs and transfer RNAs), 11 metadata categories (numerical 2, categorical 9) could be considered for the dataset. Information about the geographic origin was available for 1,063 genomes (91% of the dataset covered countries from maximum to minimum: the USA, Israel, Spain, Sweden, Fiji, UK, Austria, Denmark, the Netherlands, France, China, Peru, Germany, Madagascar, United Republic of Tanzania, Australia, Canada, Ireland, Italy, Russia, El Salvador, Iceland, Mongolia, Norway, on five continents; Supplementary Table 1d and Fig. 2a).
To overcome biases introduced by potential residual MAGs contamination issues, we focused our analyses on patterns observed in two or more genomes, unless stated otherwise. In addition, we explored protein diversity patterns and their functional characterization among isolated genomes to corroborate those observed in MAGs. Finally, to avoid estimations based on potential false negatives derived from sample processing or genome binning, raw reads were mapped on the unified archaeal protein catalogue (Supplementary Material 1) as a reference to generate a mapped protein matrix (Supplementary Material 2), which minimized the risk for sample or batch effects in our dataset.
Supervised classification and regressions with RandomForest were applied to predict respective metadata categories from the unified archaeal protein catalogue and the mapped protein matrix with the q2-sample-classifier plugin 81 . To reduce the risk of overfitting, the matrices were downsampled to a minimum of 50 genomes for each tested metadata category, as recommended by scikit-learn 0.24.1 (ref. 82 ). First subsets of each metadata category were created from the entire protein matrix and randomly split into a training set and a test set with the proportions 80%:20%. By using K-fold cross-validation, the training set served as a learning model to predict class probabilities with settings for optimized feature selection and parameter tuning. In the end, model accuracy was determined by comparing the predicted values between the training and test datasets.
Pan-genome analysis. Pan-genome analysis was performed using Panaroo 83 in 'strict' mode because it accounts for potential annotation errors, fragment assemblies and contaminated genomes to recover an accurate pan-genome. Pan-genome analysis was performed for archaeal genomes of the same families and the same genus. We used Heaps' law (η = κ × n − α) to estimate whether we had an open or a closed pan-genome 84 , This analysis was carried out in the R package 'micropan' 85 using a default permutation value of 100, where η is the predicted number of genes for a particular number of genomes (n), and κ (intercept parameter) and α (decay parameter) are the constants used to fit the curve after the genomes have been ordered in a random way. An open pan-genome is indicated by α < 1 whereas a closed pan-genome is indicated by α > 1.
Estimation of growth rates. Growth rates were estimated using GRiD 86 in the multiplex mode (minimum coverage = 1 and reassignment of ambiguous reads) by a customized GRiD database based on the created subset of high-quality archaeal genomes on species level. As the original raw reads were not available for each representative genome and the remaining read sets were not made publically available, growth rate estimates covered 131 metagenomic read sets (70% of all archaeal genomes grouped at strain level).
In-depth taxonomic and clustering analyses of the various genera. ANI distances and tree matrices were calculated using the online resources of the enveomics platform 87 , MaGe 76 , as well as Microbial Genomes Atlas (MiGA) 88 . Dendrograms, built on the ANI tree matrix, were annotated using the iTOL tool (Interactive Tree Of Life) 89 , and processed using InkScape. For specific considerations involving additional genomes from animals, a subselection of the archaeal genomes was reanalysed together with the additional genomes following the same settings as described for the protein catalogue procedures above (respective datasets are given in the Supplementary Table 12).
McrA genes were extracted via MaGe, hosting all strain-level genomes (Supplementary Material 4). McrA genes were aligned using MegaX 90 , and a maximum likelihood tree was calculated (default settings).
Bacterial and archaeal BSH genes were derived from ref. 91 and supplemented with BSH genes from genomes in the present study. Sequences were cropped and a tree was calculated using the MEGA-X Maximum Likelihood Phylogeny Reconstruction. The tree was annotated using the iTOL tool 89 .
Initial HGT analysis. Representative genomes from isolates and MAGs with 0% contamination according to CheckM results were selected for these analyses (Methanosphaera spp.: 8 from humans, 7 from animals; Methanobrevibacter spp.: 30 from humans, 11 from animals). A list with full details is provided in Supplementary Table 11. Genomes from animals were obtained from NCBI (ncbi. nlm.nih.go/genome), representing all available high-quality genomes (isolates, MAGs) of the respective genus at the time point of analysis (2020; Supplementary Table 11). The selected genomes were further characterized as previously mentioned using eggNOG-mapper v.2.0.0 (ref. 70 ) and the previously mentioned databases (Genome annotation and protein catalogue). Annotated genes were sorted according to their taxonomic affiliation (eggNOG output information: 'best_tax_level'), and the proportion of archaeal and bacterial genes was calculated for all genomes and genera. Data were visualized using Krona 92 .

Detection of virulence and resistance genes.
To predict potential virulence genes in all 1,167 archaeal genomes, ABRicate v.0.5 (https://github.com/tseemann/ abricate) was used to profile the following databases: CARD 93 , Resfinder 94 , PlasmidFinder 95 , ARG-ANNOT 96 , EcOH 97 and MEGARes 2.0 (ref. 98 ), as well as NCBI AMRFinderPlus 99 . As ABRicate is solely based on DNA sequences, blastX searches using DIAMOND 79 was used to complement results from ABRicate on the level of protein sequences in the virulence factor database (VFDB v.20191122) 100,101 and CARD together with the Resistance Gene Identifier 102 .
Specific groups of proteins and genes involved in human interaction were investigated according to available annotations from MaGe 76 and eggNOG-mapper 70 .

Viral identification, quality estimation and comparisons to viral databases.
To assess the presence of prophages, VirSorter2 (ref. 103 ) was used to scan all MAGs. CheckV 104 was used to estimate completeness and assess the quality of VirSorter2-predicted viruses. To ensure that we overcame possible contamination issues that could potentially result from the binning process, we selected proviruses flanked within archaeal contigs for this analysis. VirSorter2 tends to overestimate provirus boundaries (https://github.com/jiarong/VirSorter2), therefore CheckV is recommended to apply a quality control check and remove false positives. CheckV looks for host-virus boundaries based on differences in GC content and gene annotation in a sliding window approach. Proviruses (detected by VirSorter2 followed by CheckV and CheckV on a separate run) that had a quality assignment of medium quality (50-90% completeness) of high quality (>90% completeness), or were complete, were considered for further analysis. Quality assignments by CheckV are based on Minimum Information about an Uncultivated Virus (MIUViG) standards 105 . It is worth mentioning that proviruses detected by Virsorter2 followed by CheckV were detected by running CheckV independently. The selected proviruses were subsequently clustered with MMseqs2 using the 'linclust' function with the same parameters previously specified and MMseqs function 'result2repseq' to select a viral cluster representative.
We identified 94 viral populations in our genome datasets. This number is the result of clustering 45 high-quality (>90% completeness) and 130 medium-quality (50-90% completeness) archaeal proviruses, flanked within archaeal contigs, at 95% identity and 80% coverage, where one to a maximum of two proviruses were identified per host. The selected cut-off is commonly used for viral species 3,105-110 definition (Extended Data Fig. 8 and Supplementary Table 10a-c).
Open reading frames of viral populations with the previously specified MIUViG quality were used as input for vConTACT2 (ref. 40 ) including Viral RefSeq genome (v.97). VconTACT2 is used to affiliate a family or a genus rank group to viral populations and thus to determine taxonomic diversity.
A recent study was published by Gregory et al. 3 where a human GVD harbours 33,242 viral populations, including 0.1% archaeal viruses resulting from 2,697 gut metagenomes in 32 studies. This dataset was used as a reference database to scan the identified viral scaffolds using MMseqs2 'easy-linclust' function at 50, 80, 90 and 95% identity.
Comparison to environmental archaea. For considerations based on 16S rRNA genes, 16S rRNA genes of representative genomes were extracted using Metaxa2 (ref. 111 ) (n = 314; not all 16S rRNA genes could be recovered). This dataset was supplemented with data from amplicon sequencing studies and clone sequences from archaeal signatures from human gastrointestinal samples (dataset described in ref. 9 ; n = 381 in total). These sequences were aligned and classified using the SILVA rRNA database 112 . More specifically, the retrieved 16S rRNA genes were subjected to the ACT tool (alignment, classification and tree service) 113 , using the following parameters: basic alignment parameters: 'removed'; search and classify, minimum identity with query sequence: '0.95'; number of neighbours per query sequence: '10'; compute tree; workflow: 'Denovo including neighbours' and default parameters; and advanced tree computation parameters, positional variability filter: 'none' , domain: 'archaea' . Unclassified sequences were removed from the dataset. Via SILVA SINA, ten next neighbours were selected, and information on their isolation source was gathered through NCBI (Supplementary Material 3 and Supplementary Table  12a; the final dataset contained 566 sequences). Grouping was performed at the genus/species level, and information on the percentage of host-associated archaea in all groups was displayed as a circle packing plot (RawGraphs online tool, https://app.rawgraphs.io).
For genome-based analyses, a set of 623 archaeal MAGs identified from environmental and gastrointestinal samples (for example, rumen, guinea-pigs and baboon faeces) was used as a reference dataset for comparison to the set of archaea isolated from the human gut microbiome 2,114 . All environmental genomes used were >50% complete, and also up to 90% complete, with <5% contamination as well. To estimate the pairwise ANI distance between environmental archaeal genome dataset (Supplementary Table 12) and the archaeal genomes from the human gut microbiome, we used fastANI 67 , a tool that effectively discriminated intra-and interspecies boundaries for >90,000 prokaryotic genomes.

Metabolic interaction of the archaeome with the gastrointestinal environment.
Proteins involved in methanogenesis were searched in all genomes using customized Hidden Markov Model profiles (threshold e-value 10 −5 ) implemented in Macsyfinder 115 . This allowed us to determine the presence of enzymatic complexes on the basis of the presence of all or most subunits. The presence in the 26 methanogenic species was first evaluated based on the representative genome (which are the most complete/less contaminated). If most of the MAGs in a species have an enzyme, then this enzyme was considered to be present in the species, even if absent from the best representative genome.

Functional interaction of the archaeome with the gastrointestinal environment.
Specific functions were searched for ('search by keywords'-function) in MaGe 76 . Presence and absence information was used for tree annotation through iTOL 89 . The backbone tree was based on ANI similarity as described above.
Tools used for data visualization. Principal coordinate analyses (PCoAs) and other graphic displays based on the unified archaeal protein matrix were calculated and visualized in Qiime2 (ref. 116 ) and Calypso 117 . Venn diagrams were created with creately (https://creately.com). Alluvial plots, circle packing plots and contour plots were generated with RAWGraphs (https://app.rawgraphs.io). Strip charts were created with Calypso. Dendrograms, based on the ANI tree matrix, were annotated using the iTOL tool. All figure panels were created using InkScape.
Quantification and statistical analysis. All statistical analyses were conducted using R, Qiime2 116 , Calypso 117 and MaAsLin2 (ref. 118 ). Where applicable, data distribution was tested using Shapiro-Wilk normality tests. Statistical significance was determined by nonparametric tests including Spearman's rank correlations, PERMANOVA and Wilcoxon's rank-sum tests for pairwise analysis, Mann-Whitney U-tests for unpaired data and Kruskal-Wallis tests if the significance had to be determined for all groups. Significance was considered at an α < 0.05 after 999 permutations. P values were corrected for multi-hypothesis testing using the false discovery rate. To control for potential batch effects resulting from different isolation methods, DNA extraction protocols, assembly methods and/or sampling depth, etc., the study accession was set as a random effect in MaAsLin2 analysis. In addition, linear mixed effect models 81 were calculated to test whether Bray-Curtis distances and α diversity (Shannon's diversity index) of the mapped archaeal protein matrix changed over age, BMI, genome completeness or growth rate (GRiD), and in response to the use of antibiotics, geography (continent or country), disease, sex, health status or lifestyle in the dataset.
Reporting Summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
All the recovered genomes are available for bulk download in an archived folder 'archaea_gut-genomes.tar.gz' in generic feature format at http://ftp.ebi.ac.uk/pub/ databases/metagenomics/genome_sets. All used genomes and metagenomes in the present study are publicly available on NCBI and MGnify resource. Accession no. details and paper references of used genomes and metagenomes are summarized in Supplementary Table 1a-f.

Code availability
The present study did not generate code, and mentioned tools used for the data analysis were applied with default parameters unless specified otherwise. Fig. 1 | Methodology. Flow chart covering the major analysis steps of the study. Colored boxes show the source data (green), main input for the analysis (magenta), downstream analysis (red) and the taxonomic analysis of the presented data set (yellow). Different steps are connected by arrows highlighting a selection of used bioinformatic tools for each step. For details on the genomes, software and databases used, please refer to Supplementary Table 1 and Supplementary Table 14.

Statistics
For all statistical analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods section.

n/a Confirmed
The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly The statistical test(s) used AND whether they are one-or two-sided Only common tests should be described solely by name; describe more complex techniques in the Methods section.
A description of all covariates tested A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons A full description of the statistical parameters including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals) For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted

Software and code
Policy information about availability of computer code Data collection Data collection is described in "Dataset description"

Data analysis
All the other considered genomes and metagenomes are publicly available in NCBI, and referenced. This study did not generate code, mentioned tools used for the data analysis were applied with default parameters unless specified otherwise. All software and algorithms used are listed in the key resources table.
For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors and reviewers. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information.

Data
Policy information about availability of data All manuscripts must include a data availability statement. This statement should provide the following information, where applicable: -Accession codes, unique identifiers, or web links for publicly available datasets -A list of figures that have associated raw data -A description of any restrictions on data availability All the recovered genomes are available at [http://ftp.ebi.ac.uk/pub/databases/metagenomics/genome_sets/archaea_gut-genomes.tar.gz]. All the other considered genomes and metagenomes are publicly available in NCBI, and referenced.