Metagenomic analysis reveals unexplored diversity of archaeal virome in the human gut

Li, Ran; Wang, Yongming; Hu, Han; Tan, Yan; Ma, Yingfei

doi:10.1038/s41467-022-35735-y

Article
Open access
Published: 29 December 2022

Ran Li^1,2^na1,
Yongming Wang¹^na1,
Han Hu³,
Yan Tan³ &
…
Yingfei Ma ORCID: orcid.org/0000-0002-2563-5390¹

Nature Communications volume 13, Article number: 7978 (2022) Cite this article

Subjects

Metagenomics

Matters Arising to this article was published on 17 July 2024

Abstract

The human gut microbiome has been extensively explored, while the archaeal viruses remain largely unknown. Here, we present a comprehensive analysis of the archaeal viruses from the human gut metagenomes and the existing virus collections using the CRISPR spacer and viral signature-based approach. This results in 1279 viral species, of which 95.2% infect Methanobrevibacter_A, 56.5% share high identity (>95%) with the archaeal proviruses, 37.2% have a host range across archaeal species, and 55.7% are highly prevalent in the human population (>1%). A methanogenic archaeal virus-specific gene for pseudomurein endoisopeptidase (PeiW) frequently occurs in the viral sequences (n = 150). Analysis of 33 Caudoviricetes viruses with a complete genome frequently identifies genes (integrase, n = 29; mazE, n = 10) regulating the viral lysogenic-lytic cycle, implying the dominance of temperate viruses in the archaeal virome. Together, our work uncovers the unexplored diversity of archaeal viruses, revealing a novel facet of the human gut microbiome.

Introduction

The human gut microbiome is closely linked with human health¹. In addition to the predominant bacterial component, non-bacterial members of the gut microbiota (archaea, fungi, and viruses) are known to play important roles in microbiome dynamics and in human physiology, immunity, and disease². Archaea are also among the commensal microorganisms inhabiting other organ systems of the human body, where they are regularly detected in the respiratory tract, the oral cavity, and the skin³. Nevertheless, human-associated archaea are often overlooked, since they are relatively low in abundance compared to bacteria and are mostly unculturable. As such, culture-independent methods, such as next-generation sequencing, can help capture their identity and allow a broad assessment of the human archaeome as well as the archaeal virome.

Microbial viruses exert control over the composition and metabolism of microbial communities. The dynamics of bacterial viruses in the human gut have been studied in detail so far^4,5, while few studies report the detection of the human gut archaeal viruses^6,7. Viruses infecting archaea are notoriously diverse both in terms of their genome sequences and virion structures^8,9. Most archaeal viruses have been thus far isolated from hyperthermophilic or halophilic hosts, with only a handful of virus species described for methanogenic and ammonia-oxidizing archaea¹⁰. Recent exhaustive metagenomic surveys aided the discovery of novel archaeal viruses from multiple ecosystems, including the ocean, fresh water, hot spring, and soil habitats⁸. In human feces, smacoviruses were once thought to infect eukaryotes. Recently, they were found to infect the methanogenic archaeon Candidatus Methanomassiliicoccus intestinalis using a CRISPR spacer-based host prediction method^11,12. Archaeal viruses in the human gut remain highly enigmatic. Analysis of the CRISPR-Cas systems encoded by archaea revealed that 90% of all sequenced archaeal genomes hold CRISPR loci, implying a rich archaeal virome in this ecosystem¹³.

The knowledge gap on archaeal viruses is fostered by the lack of their genome entries in public databases, missing conserved marker genes for viruses⁷. Only 250 archaeal viruses that infect 23 host genera have been described and publicly available to date¹⁴. These archaeal viruses are greatly diverse and the encoded proteins display very low levels of sequence homology to those in the public database¹⁵. Prokaryotes harbor CRISPRs to foster immunity against viruses and other invasive genetic elements, making it possible to uncover the associations between viruses and their hosts¹⁶. Indeed, the approach of matching the CRISPR spacers from a known organism to viruses for assigning a virus discovered by metagenomics to a host is highly reliable¹⁷. When viral genomic data can be linked to a specific host organism, it becomes possible to uncover novel viruses and study how they interact with their hosts within various ecosystems.

Here, we harness spacer sequences from the archaeal CRISPR-Cas systems and viral signatures to search for archaeal viruses in the human gut. First, we performed large-scale identification of archaeal genomic contigs from 2971 metagenomes derived from previously published studies (Supplementary information). Then, we obtained the spacers from the identified archaeal genomic contigs and the 1162 archaeal genomes of UHGG (Unified Human Gastrointestinal Genome)¹⁸. Based on the archaeal spacer collection and the signatures of protein homology present in the archaeal viruses, we established a pipeline for archaeal viral detection and obtained 1279 archaeal viral species in the human gut. This effort will contribute to a better characterization of archaeal viruses and their archaeal hosts in the human gut and provide a complementary view of the human gut microbiome.

Results

The human gut carries a complex, previously unexplored virome

To perform a comprehensive search for human gut archaeal viruses, we first constructed a Human Gut Associated Archaeal Spacer Database (HGASDB) comprising 13,021 nonredundant CRISPR spacers recruited from the identified archaeal genomic contigs and the 1162 archaeal genomes of UHGG (Supplementary Figs. 1–3 and Supplementary Information)¹⁸. These spacers were derived from the contigs and genomes of different archaeal lineages, with the genus Methanobrevibacter_A contributing the largest number of spacers (89.82%). In particular, 8962 spacers were derived from Methanobrevibacter_A smithii, 2549 spacers from Methanobrevibacter_A smithii_A, and 185 spacers from three other species (Methanobrevibacter_A woesei, Methanobrevibacter_A oralis, and Methanobrevibacter_A millerae) (Supplementary Fig. 2d and Supplementary Data 1–5). A small number (n = 1325; 10.18%) of spacers were derived from other archaeal genera. We then identified 16,234 sequences that matched these spacers in the 2971 assembled total community metagenomic datasets and the publicly available human gut virus collections (Fig. 1a). After filtering out archaeal and bacterial genomic contamination and sequences that did not encode viral signatures (i.e., hallmark genes of known archaeal viruses) (Supplementary Fig. 4, see Methods), the remaining sequences were clustered (95% identity over 85% of the sequence) into 1279 nonredundant viral species, and the longest sequence within each species was selected as the representative in the Human Gut Archaeal Virome Database (HGAVD) for further analysis. In particular, 1080 archaeal viral representative sequences in HGAVD were detected from the assembled metagenomic datasets and 199 from other publicly available human gut virus collections (89 from IMG/VR¹⁹, 92 from GPD²⁰, 14 from GVD⁷, 2 from HGV⁴, 1 from EVP²¹, and 1 from GL-UVAB²²). CheckV²³ analysis classified 12% of the sequences as complete (3%) or high-quality (9%) genomes (Fig. 1b and Supplementary Data 6).
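
The species-level clustering rule (95% identity over 85% of the sequence, longest sequence kept as representative) can be illustrated with a minimal sketch. It assumes that pairwise identity and coverage values have already been computed by an alignment tool; the function name, the input layout, and the greedy longest-first strategy are illustrative assumptions and not necessarily the procedure used in this study.

```python
from typing import Dict, List, Tuple

def cluster_species(
    lengths: Dict[str, int],
    pairs: Dict[Tuple[str, str], Tuple[float, float]],
    min_identity: float = 95.0,
    min_coverage: float = 85.0,
) -> Dict[str, List[str]]:
    """Greedy longest-first clustering (hypothetical helper).

    lengths: sequence id -> length in bp.
    pairs: (query, representative) -> (percent identity, percent of query covered).
    Returns representative id -> member ids (representative included).
    """
    clusters: Dict[str, List[str]] = {}
    # Visit the longest sequences first so each representative is the
    # longest member of its cluster.
    for seq in sorted(lengths, key=lambda s: (-lengths[s], s)):
        for rep in clusters:
            identity, coverage = pairs.get((seq, rep), (0.0, 0.0))
            if identity >= min_identity and coverage >= min_coverage:
                clusters[rep].append(seq)
                break
        else:
            # No existing representative is close enough: start a new species.
            clusters[seq] = [seq]
    return clusters

# Toy input: v2 joins v1; v3 is similar to v1 but covered over only 40%.
lengths = {"v1": 40000, "v2": 38000, "v3": 25000}
pairs = {("v2", "v1"): (97.2, 90.0), ("v3", "v1"): (96.0, 40.0)}
species = cluster_species(lengths, pairs)
```

In this toy input, `v3` stays a separate species because it passes the identity threshold but not the coverage threshold.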

**Fig. 1: Identification of archaeal viruses from the human gut.**

To further explore the extent to which the HGAVD viral species were homologous to the known archaeal viruses in the RefSeq database (v201) (built-in database of vConTACT2) and thereby taxonomically classify these viruses, we constructed gene-sharing networks with vConTACT2, in which viral clusters (VCs) approximate genus-level taxonomy²⁴. Using the sequences from the archaeal viral genomes in RefSeq and the 1279 archaeal viral species, this analysis clustered 735 HGAVD species into 61 VCs, classified 391 viral species as outliers (contigs assigned to a VC but sharing fewer similar proteins than the bulk of the cluster), and classified 153 viral species as singletons (sequences that did not cluster with any other sequences). Only 2 VCs included a known reference viral sequence (one each). This suggests that the majority of the VCs derived from the human gut likely represent viral genera that are novel relative to the viruses in RefSeq (Supplementary Data 7). Moreover, in agreement with previous gut virome studies^20,25, the majority (68.4%) of the HGAVD viral species could not be taxonomically classified into any known viral order. Less than half of the species (n = 404, 31.6%) were taxonomically classified, specifically into the Caudoviricetes class (n = 389) (tailed viruses), the Cremevirales order (n = 13), and the Haloruvirales order (n = 2) (Fig. 1c). The Cremevirales viruses were predicted to infect M. intestinalis and Methanomassiliicoccus_A intestinalis, the Haloruvirales viruses were predicted to infect Haloferax massiliensis, and most (305/389 = 78.4%) of the Caudoviricetes species in HGAVD were connected to the host Methanobrevibacter_A smithii.

We further compared the HGAVD viruses to those of the publicly available virus collections (detailed in Methods) (Fig. 1d; Supplementary Data 8, and Supplementary Fig. 5). First, we aligned the HGAVD species with the 85 nonredundant proviruses derived from 557 (50–100% completeness) of the 1162 gut archaeal genomes in UHGG¹⁸, resulting in 56.5% (n = 723) of the 1279 species sharing identity >95% with those proviruses. The MGV (Metagenomic Gut Virus) catalog²⁵ is the newest human gut viral database and contains extensive viral genomic diversity, including 102 sequences assigned to archaeal viruses. The vConTACT2 network analysis clustered the HGAVD viruses into 68 VCs, while the 102 MGV archaeal virus sequences were clustered into 15 VCs, and 37 proviruses derived from the archaeal genomes in UHGG were clustered into only 9 VCs, reflecting the greater genus-level diversity of gut archaeal virus taxa represented by HGAVD compared with other virus collections. We found that a majority of the HGAVD viral species (n = 1097; 86%) were not clustered with any viral genomes from other collections (Fig. 1d), while a majority of the 37 archaeal proviruses (78.4%) and the MGV archaeal viral sequences (83.3%) were grouped with the HGAVD viruses, indicating that HGAVD represents most of the archaeal viruses in other gut virus collections. Taken together, HGAVD considerably expands the known archaeal viral diversity in the human gut.

Archaeal viruses are highly prevalent in the human gut

We estimated the abundance of the HGAVD viral species in the human gut samples by metagenomic read recruitment (Supplementary Data 9) and accordingly performed the principal coordinate analysis (PCoA). No significant difference in the human gut archaeal viral composition was observed between male and female sex (ANOSIM, r = 0.002, p = 0.306) or according to BMI distribution (ANOSIM, r = 0.011, p = 0.201) (Supplementary Fig. 6). Nevertheless, when the analysis was stratified by country, we observed that the diversity of these archaeal viruses was distinct in the samples of different locations. In particular, the archaeal viral communities between the Tanzanian and the populations from China, America, and the UK displayed significant differences, respectively (ANOSIM, R > 0.7, p < 0.001; Fig. 2a and Supplementary Data 10).
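
For readers unfamiliar with the ANOSIM statistic reported above, the following minimal sketch computes R from a precomputed dissimilarity matrix and group labels, using R = (mean between-group rank − mean within-group rank) / (n(n − 1)/4). The choice of dissimilarity measure and the permutation test that yields the p value are not shown; the function names and toy data are illustrative assumptions.

```python
from itertools import combinations
from typing import List, Sequence

def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def anosim_r(dist: Sequence[Sequence[float]], groups: Sequence[str]) -> float:
    """ANOSIM R statistic from a square dissimilarity matrix and group labels."""
    n = len(groups)
    index_pairs = list(combinations(range(n), 2))
    ranks = average_ranks([dist[i][j] for i, j in index_pairs])
    within = [r for r, (i, j) in zip(ranks, index_pairs) if groups[i] == groups[j]]
    between = [r for r, (i, j) in zip(ranks, index_pairs) if groups[i] != groups[j]]
    mean_within = sum(within) / len(within)
    mean_between = sum(between) / len(between)
    return (mean_between - mean_within) / (n * (n - 1) / 4)

# Toy example: two perfectly separated groups along one axis.
coords = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
labels = ["A", "A", "A", "B", "B", "B"]
toy_dist = [[abs(a - b) for b in coords] for a in coords]
r_stat = anosim_r(toy_dist, labels)
```

An R near 0 indicates that between-group dissimilarities are no larger than within-group ones, while an R approaching 1 indicates strongly separated groups, as in the toy example.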

**Fig. 2: Protein clustering network and global distribution of the HGAVD viruses in the human gut.**

Based on the abundance determined by read mapping, we further investigated the prevalence of these viruses among the human populations. The prevalence of 7 archaeal viral species was >10% across the human populations. These viruses belonged to 7 different VCs (Fig. 2b and Supplementary Data 7). All 7 viral species were predicted to infect Methanobrevibacter_A smithii and had a higher prevalence in Asian, European, and American populations than in the African population. Moreover, 712 archaeal viral species were prevalent in >1% of the human population. Remarkably, the virus IMG|UGV-GENOME-0271153, a putative medium-quality viral genome (40.51 kbp, CheckV²³), had the highest prevalence (72.16%) among the human populations and was predicted to infect Methanobrevibacter_A smithii. This virus genome encodes 46 genes, 8 of which were predicted to encode functional proteins of Caudoviricetes viruses (Fig. 2c and Supplementary Data 11a). Furthermore, all the viral sequences (23–55 kbp in length) in the same VC as this virus had Methanobrevibacter_A smithii as host (Fig. 1d) and were derived from samples from the United Kingdom, Sweden, Austria, the United States, China, Spain, and Madagascar, further suggesting the wide distribution of this virus among the global population. In particular, another highly prevalent Caudoviricetes virus (10.7%), IMG|UGV-GENOME-0263128, encoding 51 genes, was detected more frequently in the African population than IMG|UGV-GENOME-0271153 (Fig. 2b). The viral sequences in the VC containing IMG|UGV-GENOME-0263128 were 19 kbp to 56 kbp in size and were predicted to infect Methanobrevibacter_A smithii and Methanobrevibacter_A smithii_A (Fig. 1d). These two highly prevalent viruses are likely temperate because an integrase gene was detected on the genome of the virus (IMG|UGV-GENOME-0263128) or on the genomes of other viruses within the same VC (IMG|UGV-GENOME-0271153) (Fig. 2c and Supplementary Data 11b).
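
As a rough illustration of how prevalence can be derived from read-mapping abundances, the sketch below counts the share of samples in which each virus is detected. The detection cutoff (`min_abundance`), the data layout, and the virus names are hypothetical; the criteria actually applied are those described in the Methods and Supplementary Data 9.

```python
from typing import Dict, List

def prevalence(abundance: Dict[str, List[float]], min_abundance: float = 0.0) -> Dict[str, float]:
    """Percent of samples in which each virus exceeds the detection cutoff."""
    result = {}
    for virus, per_sample in abundance.items():
        detected = sum(1 for value in per_sample if value > min_abundance)
        result[virus] = 100.0 * detected / len(per_sample)
    return result

# Toy abundance table: one list entry per sample (values are made up).
toy = {
    "virus_a": [0.0, 2.5, 1.0, 0.0],
    "virus_b": [0.0, 0.0, 0.0, 0.3],
}
prev = prevalence(toy)
# Viruses above a 10% prevalence threshold, mirroring the cutoff used in the text.
common = sorted(v for v, p in prev.items() if p > 10.0)
```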

It is worth mentioning that 13 smacovirus species were identified and were clustered into 3 VCs with lengths ranging from 2.0 to 2.5 kbp in HGAVD, reflecting the diversity of these small viruses in the human gut. Smacovirus in the order of Cremevirales has a small circular single-stranded DNA genome and had been identified in fecal samples (both feces and rectal swabs) of various animals^12,26. These HGAVD smacoviruses were targeted by 7 spacers derived from the archaeal genomes in UHGG and they were predicted to infect Methanomassiliicoccus intestinalis or Methanomassiliicoccus_A intestinalis. Compared with the cohort of Asia and America, the prevalence of smacovirus was higher in African and European populations (Fig. 2d).

Viruses infecting Methanobrevibacter_A smithii are a major component of the archaeal virome in the human gut

To accurately investigate the diverse virus-host interactions, we screened the CRISPR spacers present in the archaeal genomes of UHGG for matches to the HGAVD viral sequences. As expected, a majority (n = 1217; 95.2%) of the viral species were connected to the genus Methanobrevibacter_A, which is dominant in the human gut archaeome (Fig. 3a). We then measured viral diversity by determining the number of VCs for each archaeal genus, revealing that the genus Methanobrevibacter_A harbored a significantly higher viral diversity than other archaeal genera (Fig. 3b), with 51 VCs assigned to this genus. Among the 51 VCs, 47 VCs were specific to Methanobrevibacter_A smithii, only 17 VCs were specific to Methanobrevibacter_A smithii_A, and 13 VCs were linked to both archaeal species, indicating that archaeal viruses can infect hosts across species. To show this in detail, we constructed a host-virus network by matching the HGAVD viruses with the CRISPR spacers derived from the UHGG archaeal genomes, which indicated that approximately one-third of HGAVD viral species had a broad host range (Fig. 3c). Namely, 464 viral species had a host range spanning 2 archaeal species (Methanobrevibacter_A smithii and Methanobrevibacter_A smithii_A) and 12 viral species had a host range across 3 archaeal species (Methanobrevibacter_A smithii, Methanobrevibacter_A smithii_A, and Methanobrevibacter_A woesei). These analyses provide a comprehensive blueprint of archaeal virus-mediated gene flow networks in the human gut microbiome.
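
The host-range tally described above can be sketched as follows, assuming spacer-to-virus matches have already been reduced to (virus, host species) pairs. The function name, input format, and toy records are illustrative assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Set, Tuple

def host_ranges(matches: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Collect the set of host species whose spacers match each virus."""
    hosts = defaultdict(set)
    for virus, host_species in matches:
        hosts[virus].add(host_species)
    return dict(hosts)

# Toy spacer matches: (virus id, species of the genome carrying the spacer).
toy_matches = [
    ("virus_1", "Methanobrevibacter_A smithii"),
    ("virus_1", "Methanobrevibacter_A smithii"),  # repeated matches collapse
    ("virus_1", "Methanobrevibacter_A smithii_A"),
    ("virus_2", "Methanobrevibacter_A smithii"),
    ("virus_3", "Methanobrevibacter_A smithii"),
    ("virus_3", "Methanobrevibacter_A smithii_A"),
    ("virus_3", "Methanobrevibacter_A woesei"),
]
ranges = host_ranges(toy_matches)
# Number of viruses per host-range size (1, 2, or 3 host species here).
range_sizes = Counter(len(species) for species in ranges.values())
```

Viruses whose host set contains more than one species are the ones counted as having a broad host range.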

**Fig. 3: Archaeal viral host assignment and host range determination.**

To further show the diversity of the tailed archaeal viruses, we searched for the large subunit terminases (LSTs) (the marker gene for the Caudoviricetes viruses) in the HGAVD archaeal viral sequences and the closely related reference archaeal viruses (RefSeq database, v201) using the Pfam database, resulting in 85 LSTs derived from HGAVD viruses belonging to at least 10 VCs and 6 homologs from 6 reference archaeal viral genomes. These HGAVD LSTs were detected with 5 different Pfam domains. The majority (68/85 = 80%) of LSTs were encoded by the HGAVD viruses infecting the species Methanobrevibacter_A smithii, with 33 belonging to the Terminase_6 (PF03237) domain, 31 to Terminase_3 (PF04466), 3 to Terminase_6C (PF17289), and 1 to Terminase_1 (PF03354). Phylogenetic analysis of these LSTs (Fig. 3d) revealed four large gut archaeal viral clades infecting the species Methanobrevibacter_A smithii. Clades I and II, which lack reference viruses, can be defined as novel clades and include the largest number of HGAVD archaeal viruses. Clades III and IV contained reference viruses that belong to the families Druskaviridae and Leisingerviridae, respectively, in the Caudoviricetes class. In conclusion, the LST phylogeny expanded the diversity of the archaeal viruses that infect Methanobrevibacter_A smithii and suggested new archaeal viral taxa in the human gut.

Archaeal virus genomes encode an extensive functional repertoire

The functional potential of human gut archaea has been extensively studied⁶. HGAVD enables us to explore the functional potential of the archaeal virome in the human gut. To do this, we identified 97,208 protein-coding genes on the representative sequences of these 1279 viral species. Overall, 40% (n = 39,268) of the viral genes did not have significant matches (cutoff: e-value < 1e-5, score > 50) in the Pfam (v32) database and were not assigned to any biological functions. Only 10.8% and 17.4% of these genes had hits in pVOG²⁷ and PHROG²⁸, respectively, indicating that remarkably little is known about the functional potential of human gut archaeal viruses (Fig. 4a and Supplementary Fig. 7).
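
The annotation cutoff quoted above (e-value < 1e-5, score > 50) can be applied to tabular HMMER output with a short filter. The sketch assumes the whitespace-delimited `--tblout` layout of hmmsearch, in which the first six columns are target name, target accession, query name, query accession, full-sequence E-value, and full-sequence score; the toy lines and the best-hit-per-protein rule are illustrative assumptions.

```python
from typing import Dict, Iterable, Tuple

def annotated_proteins(
    tblout_lines: Iterable[str],
    max_evalue: float = 1e-5,
    min_score: float = 50.0,
) -> Dict[str, Tuple[str, float]]:
    """Keep the best-scoring family per protein among hits passing both cutoffs."""
    best: Dict[str, Tuple[str, float]] = {}
    for line in tblout_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split()
        protein, family = fields[0], fields[2]
        evalue, score = float(fields[4]), float(fields[5])
        if evalue < max_evalue and score > min_score:
            if protein not in best or score > best[protein][1]:
                best[protein] = (family, score)
    return best

# Toy lines with made-up values; only the first six columns are used.
toy_lines = [
    "# target name  accession  query name  accession  E-value  score  bias",
    "prot_1 - Terminase_6 PF03237 1.2e-40 140.3 0.1",
    "prot_1 - Terminase_3 PF04466 3.0e-12 60.2 0.0",
    "prot_2 - HNH PF01844 2.0e-04 55.0 0.0",
    "prot_3 - Phage_portal PF04860 1.0e-09 35.0 0.2",
]
hits = annotated_proteins(toy_lines)
unannotated_fraction = 1 - len(hits) / 3  # 3 proteins in the toy set
```

Here `prot_2` fails the e-value cutoff and `prot_3` fails the score cutoff, so only `prot_1` counts as annotated.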

**Fig. 4: Functional landscape of the HGAVD viruses.**

The viruses of Methanobrevibacter_A smithii contained the most functional diversity with proteins homologous to 1,034 different kinds of tailed-virus-specific proteins in the Pfam database (only the proteins assigned biological function were taken into consideration), such as prohead protein, baseplate J, portal protein, tail fibers, and terminase large subunit, whereas other archaeal viruses lacked some of these genes (Fig. 4b and Supplementary Data 12). For example, except for the viruses infecting Methanobrevibacter_A smithii, the remainder had no proteins annotated for lysis-related functions. In particular, the genes encoding HNH endonuclease were observed on the viral genomes of both Methanobrevibacter_A smithii and Methanobrevibacter_A woesei. This protein potentially cleaves DNA into genome-length units during packaging and may operate in concert with their terminase large subunit and portal proteins²⁹.

The representative sequences of 36 archaeal viral species in HGAVD were measured as complete genomes by CheckV²³. They were clustered into 7 different VCs and taxonomically classified to Caudoviricetes (n = 33, 6 VCs) and Cremevirales (n = 3, 1 VC). Analysis of these whole viral genomes in the class Caudoviricetes (Supplementary Data 13) resulted in an interesting finding that a gene encoding the protein homologous to pseudomurein endoisopeptidase (PeiW) frequently occurred on many viral genomes (n = 23). The prototype PeiW is found in the archaeal prophage psiM100 as an autolytic enzyme produced by the thermophilic methanoarchaeon Methanothermobacter wolfeii to cleave pseudomurein cell-wall sacculi of archaeal methanogens³⁰. The phylogenetic analysis of PeiW revealed that except for the viruses of M. wolfeii, other archaeal viruses also were the carrier of peiW, such as the viruses of Methanobrevibacter_A smithii and Methanobrevibacter olleyae (Supplementary Fig. 8). When extending this analysis to all HGAVD viruses, 150 viruses encoded the genes of PeiW (Supplementary Fig. 9), suggesting the importance of this gene for the archaeal viruses in infecting methanogenic archaea.

In the analysis of these complete Caudoviricetes viral genomes, 29 of 33 encoded genes for the phage integrase protein. However, CheckV²³ predicted only 9 of these genomes as proviruses, whereas 20 were not flanked by host DNA. In particular, we observed that 10 genomes infecting Methanobrevibacter_A smithii or M. olleyae encoded proteins belonging to the antitoxin MazE superfamily. The toxin-antitoxin system on a temperate virus acts as an addiction system, preventing the host from curing itself of the provirus³¹. Accordingly, the presence of the antitoxin MazE protein on the HGAVD archaeal viruses might highlight an arms race between the gut archaea and their viruses. Further, we performed a phylogenetic analysis based on the MazE antitoxin protein sequences detected in these viral genomes. The phylogenetic tree (Supplementary Fig. 10) shows that the viruses predicted to infect Methanobrevibacter_A smithii and M. olleyae were separated into different clades. We also performed a comparative genomic analysis of the representative sequences selected for each VC of the complete HGAVD sequences (Fig. 4c); these genomes were divergent in sequence, and most of their genes encoded hypothetical proteins. Moreover, the CheckV result that only 9 genomes were predicted as proviruses and 20 were not flanked by host DNA²³ implies that most of the archaeal viruses detected in this study were likely undergoing a lytic replication cycle. Overall, the analysis of these complete HGAVD viral genomes implied that temperate archaeal viruses were dominant in the human gut, similar to the human gut bacterial phages^32,33.

Discussion

In this study, taking advantage of the metagenomic sequencing data, we conducted a comprehensive analysis of the human-associated archaeal viruses recovered from the human gut metagenomes collected worldwide, showing that the archaeal viruses were widespread in the human gut ecosystem. The results obtained in this study based on the metagenomic sequencing datasets were well-complemented with the previous study of 1167 nonredundant archaeal genomes⁶. Based on the Minimum Information about an Uncultivated Virus Genome (MIUViG) standards³⁴, we report the archaeal viruses related to virus origin, genome quality, functional annotation, taxonomic classification, biogeographic distribution, and host prediction. We also estimated that the average fraction of these HGAVD viruses in the human gut virome was around 0.50% (Supplementary Data 14). It has been estimated that around 1.2% of all anaerobes are human-associated archaea⁶. While the ratio of microbe:viruses is around 1:1-10 in the human gut³⁵, our estimation of the fraction of the HGAVD viruses in the human gut virome implied that a considerable proportion of the archaeal viruses still remain unexplored.

To date, compared to the bacterial phages, fewer archaeal viral genomes derived from the human gut have been available. In the database GVD, 24 viral populations (equivalent to species in this study) were predicted as archaeal viruses⁷; the study of the gut archaeome reported 94 proviruses derived from the archaeal genomes⁶. These large-scale gut virus collections were built using several popular bioinformatic tools, such as VirSorter³⁶ v1.0.3 and VirFinder³⁷ v1.1. In this study, the CRISPR spacer-based method, which has been widely used for linking viral and host genomes in various studies, has a better recall for the identification of previously unknown archaeal viruses^17,38,39. In particular, analysis of previous studies indicated that more than 90% of archaeal genomes harbor the CRISPR system, compared to 50% of the human gut bacterial genomes¹³. In this study, CRISPR loci were identified in 53% of the human gut archaeal genomes (including MAGs) and 80% of the isolated human gut archaeal genomes. Our stringent workflow showed a high sensitivity in identifying genome fragments for diverse gut viruses. This was evident from the detection of smacoviruses, which are very small (2.5 kbp) and low in abundance in the human gut microbiome. In particular, we did not detect plasmid signatures in the HGAVD viral sequences using PlasForest⁴⁰, and two sequences encoded both transposase genes and viral signatures.

While some non-viral mobile elements, such as transposons and plasmids, can also perfectly match the spacers, these sequences were largely excluded by our workflow and were not included in the HGAVD database (Fig. 1a). In total, 847 sequences that matched the spacers did not encode genes homologous to the viral hallmark genes, 2 of which were identified as plasmid sequences, suggesting these sequences were likely derived from transposons or plasmids. Notwithstanding this, some of these excluded sequences that matched the spacers likely also represent additional families of as-yet-unidentified viruses. These novel viruses could not be identified by metagenomic approaches due to the lack of knowledge and must be determined by establishing a culture-dependent method. The isolated archaeal viruses may in turn improve the bioinformatic methods for identifying archaeal viruses and help recover more novel archaeal viruses.

Taken together, in this study, we conducted a comprehensive metagenomic data mining of the archaea and the archaeal viruses in the human gut. The results revealed the diversity of the archaeal viruses and the archaea in the human gut. The considerable diversity of previously unexplored archaeal viruses and the novel viral species in HGAVD help fill the gaps in this field and expand the known human gut archaeal viruses. Our data, together with the bacteria and bacterial phages, will provide a complementary view of the human gut virome and thus help us better understand the human gut ecosystem.

Methods

Collection of metagenomic sequencing data sets used for this study

Here, we collected and curated 12 human microbial metagenomic datasets consisting of 2971 human metagenomes from 1904 individuals across rural and urban populations from 13 countries (Supplementary Data 1, publicly available as of January 2021). Sequencing reads of the human gut metagenomes and the associated metadata were obtained from their respective hosting databases (e.g. SRA, iVirus, or MG-RAST). Reads were then assembled using SPAdes v3.10.0⁴¹ with the option ‘--meta’. The assembled contig sequences of five body sites (including the gastrointestinal tract, mouth, airways, skin, and vagina) were directly downloaded from the HMP Data Portal (https://portal.hmpdacc.org/)⁴². All the sequencing data were downloaded from online repositories or links provided in the original publications. We did not include any studies that required additional ethics committee approvals or authorizations for access.

Detection of Archaeal genome contigs in the metagenomic sequencing datasets

Genes were predicted on the assembled contig sequences using Prodigal v2.6.3 (-p meta option)⁴³. The resulting protein sequences were aligned to the Genome Taxonomy Database R95 (GTDB, R95)⁴⁴ using DIAMOND (options: --evalue 1e-3 --min-score 50)⁴⁵. According to the GTDB taxonomy system, the taxonomy of each protein was assigned based on the top hit in the database at each taxonomic rank (phylum, order, family, genus, and species). Subsequently, archaeal contigs were screened based on the following criteria⁴⁶: (i) the number of encoded proteins with hits to archaeal genomes > the number of encoded proteins with hits to bacterial genomes; and (ii) the number of encoded proteins with hits to archaeal genomes ≥ 5 (Supplementary Fig. 2a). In summary, we detected 17,830 archaeal contigs from the whole gut metagenomes and 33 archaeal contigs from other body sites (23 from the oral cavity, 5 from the skin, and 5 from the vagina) (taxonomic information of these 33 archaeal contigs is listed in Supplementary Data 15). The taxonomy of an identified archaeal contig was assigned to the taxon supported by the largest number of proteins on the contig. Then all curated gut archaeal contigs sharing identity ≥95% and coverage ≥85% were dereplicated by CD-HIT v4.6⁴⁷. Using this clustering strategy, we finally obtained 2948 nonredundant archaeal genome fragments with length > 3 kbp for subsequent analysis.
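
The two screening criteria for archaeal contigs can be expressed as a small predicate. The sketch assumes each protein on a contig has already been labeled with the domain of its top GTDB hit; the function name and the label strings are illustrative assumptions.

```python
from collections import Counter
from typing import Iterable

def is_archaeal_contig(hit_domains: Iterable[str], min_archaeal_hits: int = 5) -> bool:
    """Apply criteria (i) and (ii): more archaeal than bacterial hits, and >= 5 archaeal hits.

    hit_domains holds one label per protein with a database hit,
    e.g. "Archaea" or "Bacteria" (label strings are an assumption).
    """
    counts = Counter(hit_domains)
    archaeal = counts["Archaea"]
    bacterial = counts["Bacteria"]
    return archaeal > bacterial and archaeal >= min_archaeal_hits

# Toy contigs: top-hit domain for each protein with a hit.
contig_a = ["Archaea"] * 6 + ["Bacteria"] * 2   # passes both criteria
contig_b = ["Archaea"] * 4 + ["Bacteria"] * 1   # fails criterion (ii)
contig_c = ["Archaea"] * 5 + ["Bacteria"] * 7   # fails criterion (i)
calls = [is_archaeal_contig(c) for c in (contig_a, contig_b, contig_c)]
```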

Construction of phylogenetic tree for archaeal genomes

To compare these archaeal contig sequences to the known archaeal genomes derived from the human gut, the 17,830 archaeal contigs were mapped to 1162 species-level gut archaeal genomes derived from the UHGG¹⁸ using BLASTn (e-value ≤ 1e-5, coverage ≥ 0.5)⁴⁸. UHGG contains 286,997 genomes, representing 4644 species of Bacteria and Archaea from the human gut that are taxonomically annotated using GTDB-Tk v0.3.1 (GTDB R89). Taxonomy of these genomes was assigned using GTDB-Tk v0.3.3⁴⁹ based on the Genome Taxonomy Database R202 (GTDB, http://gtdb.ecogenomic.org) taxonomy. We evaluated the quality of the genomes with CheckM⁵⁰ v1.0.11 using the ‘lineage_wf’ workflow. The results were further refined using a maximum-likelihood phylogeny inferred from a concatenation of 122 archaeal marker genes produced by GTDB-Tk. The archaeal tree was built using RAxML v8⁵¹, called as follows: raxmlHPC-HYBRID -f a -n result -s ge input -c 25 -N 100 -p 12345 -m PROTCATLG -x 12345. Newick tree output files were visualized with iTOL v6⁵² (https://itol.embl.de/).

Establishment of Human Gut Associated Archaeal Spacer Database (HGASDB)

The CRISPR spacer sequences were derived from two sources: (i) the 17,830 gut archaeal contigs detected from the gut metagenomes, and (ii) the 1162 species-level archaeal genomes from the UHGG catalogue. Spacer sequences were predicted using the CRISPR Recognition Tool v1.1 (CRT)⁵³ with default parameters. In total, 19,055 and 6553 CRISPR spacer sequences were predicted from the 1162 UHGG archaeal genomes and the 17,830 gut archaeal contigs, respectively. Redundant spacer sequences were dereplicated using CD-HIT (parameters: -c = 1, -aS = 1, -aL = 1, -g = 1), resulting in 13,021 nonredundant CRISPR spacer sequences.
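
With identity and both alignment-coverage thresholds set to 1, dereplication reduces in effect to removing identical spacers. The sketch below is a simplified stand-in for that step and not a reimplementation of CD-HIT; whether reverse-complement spacers are merged is not stated in the text, so it is exposed as the hypothetical `merge_strands` option.

```python
from typing import Iterable, List

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def dereplicate_spacers(spacers: Iterable[str], merge_strands: bool = False) -> List[str]:
    """Keep the first occurrence of each distinct spacer, preserving input order."""
    seen = set()
    unique = []
    for spacer in spacers:
        seq = spacer.upper()
        # Optionally treat a spacer and its reverse complement as the same sequence.
        key = min(seq, reverse_complement(seq)) if merge_strands else seq
        if key not in seen:
            seen.add(key)
            unique.append(seq)
    return unique

# Toy spacers: a case variant and a reverse-complement duplicate of the first entry.
toy_spacers = ["ACGTTG", "acgttg", "CAACGT", "GGGTTT"]
same_strand = dereplicate_spacers(toy_spacers)
both_strands = dereplicate_spacers(toy_spacers, merge_strands=True)
```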

Collection of reference archaeal viral genomes

We collected a database for 202 Archaeal Viral Genomes as a reference from 3 sources:

(i) 97 reference archaeal viral genomes available in NCBI RefSeq as of December 2020.
(ii) 102 archaeal virus genomes provided in the study of Iranzo et al.⁵⁴. The 59 genomes duplicated in (i) were removed. In addition, 16 genomes were labeled as “Proviruses” by Iranzo et al. However, the sequences of these proviruses were not provided by the authors, so we used VirSorter³⁶ to predict the proviruses from the 16 genomes. By this means, 14 proviruses were extracted from 14 genomes. Taken together, we obtained 41 archaeal virus genomes from this source.
(iii) To complete the archaeal viral dataset, we included the genome of Methanobacterium virus Drs3⁵⁵, 43 new putative archaeal virus genomes identified from two depth profiles in the Eastern Tropical North Pacific (ETNP) oxygen minimum zone⁵⁶, 24 unknown archaeal viral populations detected by GVD⁷, and 8 genomes of smacoviruses that were found to infect Archaea¹¹.

In total, the final archaeal virus database consisted of 202 archaeal viral genomes or fragments.

Selection of hallmark genes for archaeal viruses

First, we predicted genes from the 202 archaeal virus genomes using Prodigal v2.6.3 (default parameters) and obtained 21,985 proteins encoded by these genes. Functional annotations were then assigned to the proteins using the hmmsearch command in HMMER3 (e-value cutoff set to 1e-5)⁵⁷ against the Pfam v32 database⁵⁸, a custom comprehensive viral HMM database including viral protein families (VPF) from JGI Earth’s virome project²¹, and the Virus Orthologous Groups (VOG) database (release 202, http://vogdb.org) containing orthologous groups of numerous viruses. The database of archaeal viral hallmark genes was then composed of the following four parts (Supplementary Fig. 4):

(1)
Exclusive archaeal viral proteins based on the annotations in the Pfam database
(i)
We collected 35 genomes of archaeal isolates from the UHGG catalogue and annotated each protein encoded by these genomes against the Pfam database. Proteins (n = 1523) whose Pfam homologs occurred only in the 202 archaeal viral genomes, and not in the isolate genomes, were selected as hallmark genes.
(ii)
Proteins encoded by the archaeal virus genomes or the 35 isolated archaeal genomes whose Pfam annotations included any of the keywords portal, terminase, spike, capsid, sheath, tail, coat, virion, lysin, holin, baseplate, lysozyme, head, fiber, whisker, neck, lysis, tapemeasure or structural (n = 164) were added to the collection of hallmark genes for archaeal viruses.
(2)
To include the proviruses in the archaeal genomes, we collected 11 proviruses predicted from the 35 isolated archaeal genomes in UHGG by CheckV²³ v0.6.0, and the 249 proteins predicted from these proviruses were added to the collection of hallmark genes for archaeal viruses.
(3)
The 5907 archaeal virus proteins with the best hit to the members of the VOG database were selected.
(4)
The 3368 archaeal virus proteins with the best hit to the members of the VPF database were selected.

After combining and de-replicating the proteins from these four sources, in total, 8485 proteins were selected as the hallmark genes for archaeal viruses.
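The keyword rule in part (1)(ii) can be sketched as a simple text match over Pfam descriptions. The keyword list is taken from the text; the case-insensitive substring matching is an assumption, because the exact matching rule is not described. Substring matching is permissive (for example, "head" would also match "forkhead"), so a stricter word-boundary rule may be closer to what was intended.

```python
import re

# Keywords listed in the text for structural and lysis-related proteins.
STRUCTURAL_KEYWORDS = (
    "portal", "terminase", "spike", "capsid", "sheath", "tail", "coat",
    "virion", "lysin", "holin", "baseplate", "lysozyme", "head", "fiber",
    "whisker", "neck", "lysis", "tapemeasure", "structural",
)

# Assumption: case-insensitive substring match anywhere in the description.
_KEYWORD_PATTERN = re.compile("|".join(STRUCTURAL_KEYWORDS), re.IGNORECASE)


def has_structural_keyword(description: str) -> bool:
    """True if a Pfam description contains any of the listed keywords."""
    return _KEYWORD_PATTERN.search(description) is not None


def select_hallmark_proteins(annotations: dict) -> set:
    """Return protein IDs whose annotation matches a keyword.

    `annotations` maps a protein ID to its Pfam description (hypothetical
    input structure used only for this illustration).
    """
    return {pid for pid, desc in annotations.items() if has_structural_keyword(desc)}
```

The four parts would then be pooled and dereplicated, which is how 1523 + 164 + 249 + 5907 + 3368 candidate proteins reduce to the 8485 reported above.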

Development of archaeal viral detection workflow

To perform a comprehensive search for human gut archaeal viruses, sequences for archaeal virus detection were derived from two sources: (1) the assembled contigs of the metagenomic sequencing data described above; and (2) viral genomes identified in published viral databases (Fig. 1a), including 125,842 partial DNA viral genomes obtained from the Earth’s Virome (hereafter ‘EVP’)²¹, 57,721 viral contigs from the Human Gut Virome database (HGV)⁴, 195,698 viral contigs from Uncultured Viral Database of Archaeal and Bacteria (hereafter ‘GL-UVAB’)²², 33,243 viral sequences obtained from GVD⁷, 142,809 nonredundant phage genomes from GPD²⁰ and 2,332,702 viral genomes from IMG/VR v3¹⁹. To identify archaeal viral sequences from these data, we developed a viral detection workflow as follows:

(1)
All the assembled metagenomic contigs were searched against HGASDB using blastn from the BLAST+ package v2.2.31 (e-value < 1e-5), and the 16,234 contigs that matched the spacers were assigned as archaeal virus candidate I. These contigs were further dereplicated using CD-HIT v4.6 with the parameters “-aS 0.85 -c 0.95”. Multiple reports⁷,³⁴ have shown that >95% average nucleotide identity (ANI) is a suitable threshold for defining a set of closely related, discrete ‘viral groups’; follow-on studies suggest that this cut-off establishes populations that are largely concordant with a biologically relevant “viral species” definition⁵⁹. This clustering strategy resulted in 2238 viral species (each represented by the longest contig within the species) in archaeal virus candidate I.
(2)
To remove potential bacterial genome contamination, sequences of archaeal virus candidate I were queried against 16,234 isolated bacterial genomes from the UHGG collection using blastn. The cutoffs for these searches were a minimum identity of 50%, a minimum query coverage of 80% and a maximum e-value of 10^-5. Accordingly, 10 contigs were filtered out from candidate I and 2228 viral species remained for candidate II.
(3)
To remove contamination from archaeal genomes, sequences of archaeal virus candidate II were searched with blastn against 35 isolated archaeal genomes from the UHGG collection. The cutoffs for these searches were a minimum identity of 50%, a minimum query coverage of 100% and a maximum e-value of 10^-5. Thus, 102 contigs were removed from candidate II and 2126 viral species remained for candidate III.
(4)
Protein sequences derived from the contigs in candidate III were compared with the protein sequences of the archaeal viral hallmark genes (identified in “Selection of hallmark genes for archaeal viruses”) using DIAMOND. Contigs containing best hits with a maximum e-value of 10^-5 were retained. Finally, 1279 viral species were retained for the Human Gut Archaeal Virome Database (HGAVD).
(5)
For these viral species, CheckV was used to detect provirus boundaries, remove contamination from host-derived sequences, and estimate completeness. This recently developed tool classifies each sequence into one of five quality tiers: complete, high quality (>90% completeness), medium quality (50–90% completeness), low quality (0–50% completeness) or undetermined quality (no completeness estimate available). As a result, 12% of the sequences were classified as complete genomes (3%) or high quality (9%) (Fig. 1b and Supplementary Data 6). In addition, we applied VirSorter (categories 1–6)³⁶, VirFinder (score ≥ 0.7 and p < 0.05)³⁷, VirSorter2 v2.2.3 (categories 1–6)⁶⁰ and DeepVirFinder v1.0 (score ≥ 0.9 and p < 0.05)⁶¹ to the sequences in HGAVD; in total, 537 HGAVD sequences (Supplementary Data 6) were classified as viral by these tools.
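Two of the decision rules in this workflow lend themselves to a short illustration: the contamination filter used in steps (2) and (3), and the quality tiers described in step (5). The sketch below encodes the thresholds as quoted in the text. The function names are hypothetical, and the handling of values that fall exactly on a tier boundary is an assumption; CheckV itself determines the "complete" tier from additional genome evidence, which is represented here only as a boolean flag.

```python
from typing import Optional


def is_contaminant_hit(pident: float, qcov: float, evalue: float, min_qcov: float) -> bool:
    """Contamination rule from steps (2) and (3).

    min_qcov is 80 for the bacterial screen and 100 for the archaeal screen;
    identity and coverage are percentages.
    """
    return pident >= 50.0 and qcov >= min_qcov and evalue <= 1e-5


def quality_tier(completeness: Optional[float], complete: bool = False) -> str:
    """Map a completeness estimate (percent) to the five tiers in the text.

    Assumption: exactly 90% and exactly 50% are treated as medium quality.
    """
    if complete:
        return "complete"
    if completeness is None:
        return "undetermined quality"
    if completeness > 90:
        return "high quality"
    if completeness >= 50:
        return "medium quality"
    return "low quality"


# Bookkeeping for the candidate sets reported in steps (1) to (3).
candidate_i = 2238
candidate_ii = candidate_i - 10      # bacterial contamination removed
candidate_iii = candidate_ii - 102   # archaeal contamination removed
```

The counts reproduce the 2228 and 2126 viral species reported for candidates II and III.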

Taxonomic classification of gut archaeal viruses

Two complementary approaches were used for the taxonomic classification of the 1279 archaeal viral species. First, for the 1279 representative contigs of these archaeal viral species, genes were predicted using Prodigal v2.6.3 with the -p meta option. These predicted genes were then used to cluster the 1279 archaeal viral contigs with the prokaryotic viral RefSeq v201 into viral clusters (VCs) using vConTACT v2.0²⁴ with default parameters (the RefSeq genomes were supplied by the built-in database of vConTACT2). We then leveraged the taxonomic information provided by the viral RefSeq to taxonomically classify the contigs in these VCs. For example, if one contig in a VC is classified to the Caudoviricetes class, the remaining contigs in this VC are also assigned to the Caudoviricetes class.
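The propagation rule in the first approach can be sketched as follows. This is an illustrative sketch rather than part of vConTACT2: the input dictionaries are hypothetical structures, and skipping clusters whose reference members disagree on the class is an assumption, since the text does not say how conflicts were resolved.

```python
from collections import defaultdict
from typing import Dict, Optional


def propagate_vc_taxonomy(
    contig_to_vc: Dict[str, Optional[str]],
    reference_taxonomy: Dict[str, str],
) -> Dict[str, str]:
    """Assign each contig the class of the reference genomes in its VC.

    contig_to_vc: contig ID -> viral cluster ID (None if unclustered).
    reference_taxonomy: contig ID -> class, for reference genomes only.
    """
    classes_per_vc = defaultdict(set)
    for contig, vc in contig_to_vc.items():
        if vc is not None and contig in reference_taxonomy:
            classes_per_vc[vc].add(reference_taxonomy[contig])

    assigned = {}
    for contig, vc in contig_to_vc.items():
        if contig in reference_taxonomy:
            assigned[contig] = reference_taxonomy[contig]
        elif vc is not None and len(classes_per_vc[vc]) == 1:
            # Assumption: propagate only when the VC has one unambiguous class.
            assigned[contig] = next(iter(classes_per_vc[vc]))
    return assigned
```

Contigs in clusters without any classified reference member stay unassigned under this rule, which is why a second, profile-based approach is useful.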

Second, we used taxonomically informative profiles from the VOG database (http://vogdb.org) and the eggNOG (v5.0) database⁶² to identify viruses likely to be members of the Caudoviricetes class. Specifically, we first selected the VOGs whose annotations contained the keywords portal, terminase, spike, capsid, sheath, tail, base plate, fiber or tape measure, and named them Hallmark VOGs. The predicted proteins from the archaeal viral contigs were then compared to the VOG HMM profiles and the eggNOG database using hmmsearch v3.2.1 and eggNOG-mapper v2.0.0⁶³, respectively, with the minimum score set to 40 and the maximum E-value set to 1e-5. If a viral contig encoded genes with hits to the Hallmark VOGs, or to eggNOGs whose annotations contained the keywords above, the contig was classified into the Caudoviricetes class (Fig. 1c and Supplementary Data 6 for Order/Class-level taxonomy).
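The second approach reduces to a filter over profile hits followed by a per-contig decision. A minimal sketch is given below, assuming the hmmsearch results have already been parsed into records; the `HmmHit` structure and the profile identifiers in the usage note are hypothetical and serve only to illustrate the thresholds quoted in the text.

```python
from typing import Iterable, NamedTuple, Set

MIN_SCORE = 40.0    # minimum bit score quoted in the text
MAX_EVALUE = 1e-5   # maximum E-value quoted in the text


class HmmHit(NamedTuple):
    contig: str     # contig encoding the query protein
    profile: str    # matched VOG (or eggNOG) identifier
    score: float
    evalue: float


def caudoviricetes_contigs(hits: Iterable[HmmHit], hallmark_profiles: Set[str]) -> Set[str]:
    """Contigs with at least one passing hit to a hallmark profile."""
    return {
        hit.contig
        for hit in hits
        if hit.profile in hallmark_profiles
        and hit.score >= MIN_SCORE
        and hit.evalue <= MAX_EVALUE
    }
```

For example, with `hallmark_profiles = {"VOG_terminase"}` (a made-up identifier), a contig whose protein hits that profile with score 55 and E-value 1e-12 would be classified, while a hit with score 30 would not.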

Comparison of the viral species to other gut viral databases

A comparison between HGAVD and the viruses in publicly available databases derived from the gut microbiome was performed based on the following databases:

(i)
Metagenomic Gut Virus (MGV) catalogue²⁵, the newest gut virus collection, contains 189,680 viral draft genomes estimated to be >50% complete and representing 54,118 candidate viral species. The protein sequences of the representative archaeal viral contigs were used as queries in a BLAST search against the MGV database with an e-value threshold of ≤ 1e-3. Only the MGV sequences encoding at least one protein with a hit to those of the archaeal viral contigs were retained for network analysis (11,827/189,680 = 0.06).
(ii)
Proviruses detected from 1162 gut archaeal genomes. 118 proviruses were predicted by CheckV from the 557 archaeal genome contigs in UHGG with a quality assignment of medium quality (50–90% completeness) and high quality (>90% completeness) or were complete. These proviruses were then clustered at 95% identity and 80% coverage, resulting in 85 nonredundant viral species. We further clustered the 85 proviruses with the viruses in HGAVD. Only the 37 proviruses sharing identity ≤ 95% with the 1279 viral contigs in HGAVD were considered for further analysis.
(iii)
The Prokaryotic Viral Refseq (V201) Database supplied by vConTACT2.

Estimation of the relative abundance of viruses and hosts

First, we mapped all reads of the metagenomic sequencing data to the identified archaeal contigs and archaeal viral contigs using SOAP2⁶⁴ v2.21; only contigs with >30% breadth of coverage were counted. Second, the number of reads mapped to each archaeal genome contig and archaeal viral contig was normalized by the total number of reads in each sample; the normalized value thereby represents the relative abundance of the contig in the sample.
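The normalization described above can be written out in a few lines. The sketch assumes that per-contig read counts and breadth-of-coverage values (as fractions between 0 and 1) have already been extracted from the alignments; the function and argument names are hypothetical.

```python
from typing import Dict


def relative_abundance(
    read_counts: Dict[str, int],
    breadth: Dict[str, float],
    total_reads: int,
    min_breadth: float = 0.30,
) -> Dict[str, float]:
    """Relative abundance per contig in one sample.

    Only contigs with breadth of coverage above min_breadth are counted;
    each count is divided by the total number of reads in the sample.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return {
        contig: count / total_reads
        for contig, count in read_counts.items()
        if breadth.get(contig, 0.0) > min_breadth
    }
```

A contig with 50 mapped reads and 80% breadth in a sample of 1000 reads would receive a relative abundance of 0.05, whereas a contig with 10% breadth would be dropped regardless of its read count.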

Estimation of the fraction of HGAVD viruses in the human gut virome

To explore the fraction of archaeal viruses in the human gut virome, we mapped raw reads from the 1904 samples to the 33,218 non-archaeal viral sequences derived from the GVD database⁷ and to the HGAVD sequences using SOAP2. The abundance of these viruses in each sample was calculated as described in the Methods subsection “Estimation of the relative abundance of viruses and hosts.” We then summed the abundances of archaeal viruses and bacterial viruses separately and calculated the relative abundance of archaeal viruses in the human gut virome for each sample. The average fraction of archaeal viruses in the human gut virome was estimated as the mean of these per-sample percentages (average: 0.50%).
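The per-sample fraction and its average can be sketched as below. Two assumptions are made for illustration: the fraction is taken as archaeal viral abundance divided by the sum of archaeal and bacterial viral abundance, and samples with no viral signal at all are skipped when averaging. Neither detail is spelled out in the text.

```python
from typing import Iterable, Optional, Tuple


def archaeal_viral_fraction(archaeal: float, bacterial: float) -> Optional[float]:
    """Fraction of the viral abundance in one sample that is archaeal."""
    total = archaeal + bacterial
    if total <= 0:
        return None  # no viral signal in this sample
    return archaeal / total


def mean_archaeal_fraction(samples: Iterable[Tuple[float, float]]) -> float:
    """Mean of the per-sample fractions over (archaeal, bacterial) pairs."""
    fractions = [
        f
        for f in (archaeal_viral_fraction(a, b) for a, b in samples)
        if f is not None
    ]
    if not fractions:
        raise ValueError("no sample with viral abundance")
    return sum(fractions) / len(fractions)
```

Averaging per-sample fractions, as described in the text, weights every sample equally; pooling all reads first and then dividing would instead weight deeply sequenced samples more heavily.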

Statistical analyses

All statistical analyses were performed in R version 4.0.5. Bray–Curtis dissimilarity matrices were calculated using the VEGAN⁶⁵ function vegdist, and principal coordinate analysis (PCoA) was performed on them using the pcoa function in the APE package. Significant differences (p) and the degree of separation (Global R) between groups were tested by analysis of similarities (ANOSIM) using the VEGAN function anosim with 999 permutations. Global R typically ranges between 0 and 1, with Global R = 0 indicating no separation and Global R = 1 indicating complete separation.
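To make the two quantities concrete, the sketch below computes Bray–Curtis dissimilarities and the ANOSIM R statistic in plain Python. It is a didactic re-derivation under the standard definitions, not the vegan implementation used in the study: R is the difference between the mean rank of between-group and within-group dissimilarities, divided by N(N − 1)/4 for N samples, and the permutation p-value shuffles the group labels. The function names and the fixed random seed are choices made for this illustration.

```python
import random
from itertools import combinations
from typing import List, Sequence


def bray_curtis(x: Sequence[float], y: Sequence[float]) -> float:
    """Bray-Curtis dissimilarity between two abundance profiles."""
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(a + b for a, b in zip(x, y))
    return num / den if den else 0.0


def _average_ranks(values: List[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def anosim_r(profiles: Sequence[Sequence[float]], groups: Sequence[str]) -> float:
    """ANOSIM R for samples (rows of profiles) labelled by groups."""
    n = len(profiles)
    pairs = list(combinations(range(n), 2))
    ranks = _average_ranks([bray_curtis(profiles[i], profiles[j]) for i, j in pairs])
    within = [r for r, (i, j) in zip(ranks, pairs) if groups[i] == groups[j]]
    between = [r for r, (i, j) in zip(ranks, pairs) if groups[i] != groups[j]]
    return (sum(between) / len(between) - sum(within) / len(within)) / (n * (n - 1) / 4)


def anosim_p(profiles, groups, permutations: int = 999, seed: int = 0) -> float:
    """Permutation p-value: share of shuffled labelings with R >= observed."""
    rng = random.Random(seed)
    observed = anosim_r(profiles, groups)
    labels = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(labels)
        if anosim_r(profiles, labels) >= observed:
            hits += 1
    return (hits + 1) / (permutations + 1)
```

With very few samples the number of distinct labelings is small, so the p-value cannot become arbitrarily low even when R is at its maximum.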

Virus-host prediction

Host-virus interactions were resolved by searching for CRISPR spacer sequences from the hosts in the viral contigs. To accurately identify gut archaeal viruses with a broad host range, we specifically predicted CRISPR spacers from the 1162 archaeal genomes in the UHGG database¹⁸ based on the following criteria: (i) CRISPR arrays were identified on archaeal genomes longer than 10 kb using CRT⁵³; (ii) to minimize spurious predictions, arrays with fewer than three spacers were dropped; (iii) CRISPR spacers were longer than 25 bp. The retained CRISPR spacers were aligned with the archaeal viral contigs using BLASTn to identify spacers present in the viral contigs, and matches with 100% identity were selected (settings: -task blastn-short, -gapopen 10, -gapextend 2, -penalty 1, -word_size 7, -perc_identity 100).
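The three spacer criteria and the notion of a broad host range can be sketched as follows. The data structures are hypothetical, the 10 kb limit is applied to the length of the sequence carrying the array, and "broad host range" is taken here to mean spacer matches from more than one host species; these are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

MIN_SEQUENCE_LEN = 10_000   # criterion (i): longer than 10 kb
MIN_SPACERS_PER_ARRAY = 3   # criterion (ii): drop arrays with fewer spacers
MIN_SPACER_LEN = 25         # criterion (iii): spacers longer than 25 bp


def retained_spacers(sequence_length: int, arrays: List[List[str]]) -> List[str]:
    """Spacers that pass all three criteria for one host sequence.

    arrays: one list of spacer sequences per predicted CRISPR array.
    """
    if sequence_length <= MIN_SEQUENCE_LEN:
        return []
    kept = []
    for spacers in arrays:
        if len(spacers) < MIN_SPACERS_PER_ARRAY:
            continue
        kept.extend(s for s in spacers if len(s) > MIN_SPACER_LEN)
    return kept


def broad_host_range(matches: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Viruses targeted by spacers from more than one host species.

    matches: (virus ID, host species) pairs from 100%-identity spacer hits.
    """
    hosts = defaultdict(set)
    for virus, species in matches:
        hosts[virus].add(species)
    return {virus: sp for virus, sp in hosts.items() if len(sp) > 1}
```

Requiring exact spacer matches keeps the host assignments conservative, at the cost of missing viruses whose protospacers have since mutated.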

Phylogenetic tree analysis of genes

To construct the phylogenetic trees for large terminase subunit, PeiW and MazE-antitoxin, amino acid sequences were aligned using the MUSCLE algorithm⁶⁶ included in MEGA X⁶⁷. The maximum-likelihood phylogenetic tree was constructed using IQ-TREE v1.6.12⁶⁸ with the automatic optimal model selection. The final consensus tree was visualized and beautified in iTOL⁵².

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The annotated nucleotide sequences of archaeal viruses (FASTA + GFF) generated in this study and the archaeal viral hallmark genes, together with a metadata file describing the origin of each contig, its taxonomy (including VC), host prediction information and completeness score, are available at https://doi.org/10.6084/m9.figshare.21152404.v3. The accession codes for the sequencing data used in this study are provided in Supplementary Data 1. Source data are provided with this paper.

Code availability

The present study did not generate code, and mentioned tools used for the data analysis were applied with default parameters unless specified otherwise.

References

Borrel, G., Brugere, J. F., Gribaldo, S., Schmitz, R. A. & Moissl-Eichinger, C. The host-associated archaeome. Nat. Rev. Microbiol 18, 622–636 (2020).
Coker, O. O., Wu, W. K. K., Wong, S. H., Sung, J. J. Y. & Yu, J. Altered gut archaea composition and interaction with bacteria are associated with colorectal cancer. Gastroenterology 159, 1459–1470.e1455 (2020).
Koskinen, K. et al. First insights into the diverse human archaeome: Specific detection of Archaea in the gastrointestinal tract, lung, and nose and on skin. mBio 8, e00824–00817 (2017).
Shkoporov, A. N. et al. The human gut virome is highly diverse, stable, and individual specific. Cell Host Microbe 26, 527–541.e525 (2019).
Clooney, A. G. et al. Whole-virome analysis sheds light on viral dark matter in inflammatory bowel disease. Cell Host Microbe 26, 764–778.e765 (2019).
Chibani, C. M. et al. A catalogue of 1,167 genomes from the human gut archaeome. Nat. Microbiol. 7, 48–61 (2022).
Gregory, A. C. et al. The gut virome database reveals age-dependent patterns of virome diversity in the human gut. Cell Host Microbe 28, 724–740 e728 (2020).
Krupovic, M., Cvirkaite-Krupovic, V., Iranzo, J., Prangishvili, D. & Koonin, E. V. Viruses of archaea: Structural, functional, environmental and evolutionary genomics. Virus Res. 244, 181–193 (2018).
Wirth, J. & Young, M. The intriguing world of archaeal viruses. PLoS Pathog. 16, e1008574 (2020).
Prangishvili, D. et al. The enigmatic archaeal virosphere. Nat. Rev. Microbiol 15, 724–739 (2017).
Diez-Villasenor, C. & Rodriguez-Valera, F. CRISPR analysis suggests that small circular single-stranded DNA smacoviruses infect Archaea instead of humans. Nat. Commun. 10, 294 (2019).
Krupovic, M. et al. Cressdnaviricota: A virus phylum unifying seven families of rep-encoding viruses with single-stranded, circular DNA genomes. J. Virol. 94, e00582–20 (2020).
Sorek, R., Lawrence, C. M. & Wiedenheft, B. CRISPR-mediated adaptive immune systems in bacteria and archaea. Annu Rev. Biochem. 82, 237–266 (2013).
Dion, M. B. et al. Streamlining CRISPR spacer-based bacterial host predictions to decipher the viral dark matter. Nucleic Acids Res. 49, 3127–3138 (2021).
Dellas, N., Snyder, J. C., Bolduc, B. & Young, M. J. Archaeal Viruses: Diversity, Replication, and Structure. Annu Rev. Virol. 1, 399–426 (2014).
Edwards, R. A., McNair, K., Faust, K., Raes, J. & Dutilh, B. E. Computational approaches to predict bacteriophage-host relationships. FEMS Microbiol. Rev. 40, 258–272 (2016).
Rahlff, J. et al. Lytic archaeal viruses infect abundant primary producers in Earth’s crust. Nat. Commun. 12, 4642 (2021).
Almeida, A. et al. A unified catalog of 204,938 reference genomes from the human gut microbiome. Nat. Biotechnol. 39, 105–114 (2021).
Roux, S. et al. IMG/VR v3: an integrated ecological and evolutionary framework for interrogating genomes of uncultivated viruses. Nucleic Acids Res. 49, D764–D775 (2021).
Camarillo-Guerrero, L. F., Almeida, A., Rangel-Pineros, G., Finn, R. D. & Lawley, T. D. Massive expansion of human gut bacteriophage diversity. Cell 184, 1098–1109.e1099 (2021).
Paez-Espino, D. et al. Uncovering Earth’s virome. Nature 536, 425–430 (2016).
Coutinho, F. H., Edwards, R. A. & Rodriguez-Valera, F. Charting the diversity of uncultured viruses of Archaea and Bacteria. BMC Biol. 17, 109 (2019).
Nayfach, S. et al. CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nat. Biotechnol. 39, 578–585 (2021).
Bin Jang, H. et al. Taxonomic assignment of uncultivated prokaryotic virus genomes is enabled by gene-sharing networks. Nat. Biotechnol. 37, 632–639 (2019).
Nayfach, S. et al. Metagenomic compendium of 189,680 DNA viruses from the human gut microbiome. Nat. Microbiol 6, 960–970 (2021).
Varsani, A. & Krupovic, M. Smacoviridae: a new family of animal-associated single-stranded DNA viruses. Arch. Virol. 163, 2005–2015 (2018).
Grazziotin, A. L., Koonin, E. V. & Kristensen, D. M. Prokaryotic Virus Orthologous Groups (pVOGs): A resource for comparative genomics and protein family annotation. Nucleic Acids Res. 45, D491–D498 (2017).
Terzian, P. et al. PHROG: families of prokaryotic virus proteins clustered using remote homology. NAR Genom. Bioinform. 3, lqab067 (2021).
Kala, S. et al. HNH proteins are a widespread component of phage DNA packaging machines. Proc. Natl Acad. Sci. USA 111, 6022–6027 (2014).
Luo, Y., Pfister, P., Leisinger, T. & Wasserfallen, A. Pseudomurein endoisopeptidases PeiW and PeiP, two moderately related members of a novel family of proteases produced in Methanothermobacter strains. FEMS Microbiol. Lett. 208, 47–51 (2002).
Chen, B. et al. ORF4 of the temperate archaeal virus SNJ1 governs the lysis-lysogeny switch and superinfection immunity. J. Virol. 94, e00841–00820 (2020).
Canchaya, C., Fournous, G. & Brussow, H. The impact of prophages on bacterial chromosomes. Mol. Microbiol. 53, 9–18 (2004).
Rambo, I. M., Langwig, M. V., Leao, P., De Anda, V. & Baker, B. J. Genomes of six viruses that infect Asgard archaea from deep-sea sediments. Nat. Microbiol. 7, 953–961 (2022).
Roux, S. et al. Minimum Information about an Uncultivated Virus Genome (MIUViG). Nat. Biotechnol. 37, 29–37 (2019).
Reyes, A., Semenkovich, N. P., Whiteson, K., Rohwer, F. & Gordon, J. I. Going viral: next-generation sequencing applied to phage populations in the human gut. Nat. Rev. Microbiol 10, 607–617 (2012).
Roux, S., Enault, F., Hurwitz, B. L. & Sullivan, M. B. VirSorter: Mining viral signal from microbial genomic data. PeerJ. 3, e985 (2015).
Ren, J., Ahlgren, N. A., Lu, Y. Y., Fuhrman, J. A. & Sun, F. VirFinder: A novel k-mer based tool for identifying viral sequences from assembled metagenomic data. Microbiome 5, 69 (2017).
Jian, H. et al. Diversity and distribution of viruses inhabiting the deepest ocean on Earth. ISME J. 15, 3094–3110 (2021).
Li, Z. et al. Deep sea sediments associated with cold seeps are a subsurface reservoir of viral diversity. ISME J. 15, 2366–2378 (2021).
Pradier, L., Tissot, T., Fiston-Lavier, A. S. & Bedhomme, S. PlasForest: a homology-based random forest classifier for plasmid detection in genomic datasets. BMC Bioinforma. 22, 349 (2021).
Bankevich, A. et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput Biol. 19, 455–477 (2012).
Turnbaugh, P. J. et al. The human microbiome project. Nature 449, 804–810 (2007).
Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinforma. 11, 1–11 (2010).
Parks, D. H. et al. A complete domain-to-species taxonomy for Bacteria and Archaea. Nat. Biotechnol. 38, 1079–1086 (2020).
Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. methods 12, 59–60 (2015).
Parks, D. H. et al. Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nat. Microbiol. 2, 1533–1542 (2017).
Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152 (2012).
Johnson, M. et al. NCBI BLAST: a better web interface. Nucleic Acids Res. 36, W5–W9 (2008).
Chaumeil, P.A., Mussig, A.J., Hugenholtz, P. & Parks, D.H. GTDB-Tk: A toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics 36, 1925–1927, (2019).
Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015).
Stamatakis, A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30, 1312–1313 (2014).
Letunic, I. & Bork, P. Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Res. 49, W293–W296 (2021).
Bland, C. et al. CRISPR recognition tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats. BMC Bioinforma. 8, 209 (2007).
Iranzo, J., Koonin, E. V., Prangishvili, D. & Krupovic, M. Bipartite network analysis of the archaeal virosphere: Evolutionary connections between viruses and capsidless mobile elements. J. Virol. 90, 11043–11055 (2016).
Wolf, S. et al. Characterization of the lytic archaeal virus Drs3 infecting Methanobacterium formicicum. Arch. Virol. 164, 667–674 (2019).
Vik, D. R. et al. Putative archaeal viruses from the mesopelagic ocean. PeerJ. 5, e3428 (2017).
Eddy, S. R. Accelerated Profile HMM Searches. PLoS Comput Biol. 7, e1002195 (2011).
El-Gebali, S. et al. The Pfam protein families database in 2019. Nucleic Acids Res. 47, D427–D432 (2019).
Bobay, L. M. & Ochman, H. Biological species in the viral world. Proc. Natl Acad. Sci. USA 115, 6040–6045 (2018).
Guo, J. et al. VirSorter2: A multi-classifier, expert-guided approach to detect diverse DNA and RNA viruses. Microbiome 9, 37 (2021).
Ren, J. et al. Identifying viruses from metagenomic data using deep learning. Quant. Biol. 8, 64–77 (2020).
Huerta-Cepas, J. et al. eggNOG 4.5: A hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res. 44, D286–D293 (2016).
Huerta-Cepas, J. et al. Fast genome-wide functional annotation through orthology assignment by eggNOG-Mapper. Mol. Biol. Evol. 34, 2115–2122 (2017).
Li, R. et al. SOAP2: An improved ultrafast tool for short read alignment. Bioinformatics 25, 1966–1967 (2009).
Oksanen, J. et al. vegan: community ecology package. R package version 2.5-6. https://cran.r-project.org/web/packages/vegan/index.html (2019).
Edgar, R. C. MUSCLE: A multiple sequence alignment method with reduced time and space complexity. BMC Bioinforma. 5, 113 (2004).
Kumar, S., Stecher, G., Li, M., Knyaz, C. & Tamura, K. MEGA X: Molecular evolutionary genetics analysis across computing platforms. Mol. Biol. Evol. 35, 1547–1549 (2018).
Nguyen, L. T., Schmidt, H. A., von Haeseler, A. & Minh, B. Q. IQ-TREE: A fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32, 268–274 (2015).

Acknowledgements

This work received support from the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDB29050500); Guangdong Provincial Key Laboratory of Synthetic Genomics (2019B030301006); Shenzhen Key Laboratory of Synthetic Genomics (ZDSYS201802061806209); and Shenzhen Institute of Synthetic Biology Scientific Research Program (Grant no. JCHZ20200001).

Author information

These authors contributed equally: Ran Li, Yongming Wang.

Authors and Affiliations

Shenzhen Key Laboratory of Synthetic Genomics, Guangdong Provincial Key Laboratory of Synthetic Genomics, CAS Key Laboratory of Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China
Ran Li, Yongming Wang & Yingfei Ma
University of Chinese Academy of Sciences, Beijing, 100049, China
Ran Li
Xbiome, Scientific Research Building, Tsinghua High-Tech Park, Shenzhen, China
Han Hu & Yan Tan

Authors

Ran Li
Yongming Wang
Han Hu
Yan Tan
Yingfei Ma

Contributions

Y.M., Y.W., and R.L. designed the study. R.L. and Y.W. performed the metagenomic analysis. H.H. and Y.T. provided suggestions. Y.M., R.L., and Y.W. contributed to the scientific discussion and preparation of the manuscript.

Corresponding author

Correspondence to Yingfei Ma.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Francisco Rodriguez-Valera and the other, anonymous, reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Peer Review File; Description of Additional Supplementary Files; Supplementary Data 1–15; Reporting Summary

Source data

Source Data file

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Li, R., Wang, Y., Hu, H. et al. Metagenomic analysis reveals unexplored diversity of archaeal virome in the human gut. Nat Commun 13, 7978 (2022). https://doi.org/10.1038/s41467-022-35735-y

Received: 17 May 2022
Accepted: 19 December 2022
Published: 29 December 2022
DOI: https://doi.org/10.1038/s41467-022-35735-y
