Introduction

Bacteriophages are the second most common biological entities in the human gut microbiome, after their bacterial targets1. As obligate parasites of bacteria, phages play a key role in regulating the structure and function of the gut microbiome in health and disease2. For example, phage community alterations have been linked to diseases such as inflammatory bowel disease3,4, metabolic syndrome5, and malnutrition6. Individual human gut phages have further been found to mediate diet-induced bacterial lysis7, bacterial bile acid metabolism8, and gut colonization9. But while phages are evidently crucial to the functioning of the gut microbiome, their study at higher taxonomic levels was long restricted to morphology- rather than phylogeny-based taxonomy.

This knowledge gap is partly due to the absence of a universal viral marker gene akin to ribosomal RNA in cellular organisms. Bacteriophage taxonomy was originally based on nucleic acid composition (with the majority being dsDNA viruses) and viral particle morphology10, with the bulk of known bacteriophage diversity being classified as Siphoviridae, Myoviridae, or Podoviridae. These families, as well as the order Caudovirales, proved to be so genomically diverse that they were recently abolished and reclassified in the class Caudoviricetes11,12. A striking example of the discovery of a higher taxonomic lineage was the identification of the highly prevalent and abundant human gut phage crAssphage13 and its relatives14 in metagenomics data. These viruses are now accepted as the phylogenetics-based Crassvirales order by the International Committee on Taxonomy of Viruses (ICTV). But while the description of the Crassvirales is a step forward in the taxonomy of the human gut virome, many other important lineages remain unclassified.

We recently identified the putative phage family Ca. Heliusviridae in a study of gut virome perturbations associated with metabolic syndrome, a collection of risk factors for cardiometabolic disease5. The expansive family was detected in over 90% of 196 study participants and may thus be part of a persistent core of human gut bacteriophage lineages15,16. But as our previous study focused on a single cohort, it remains unknown how widespread these phages are in the general population, as well as in other environments. Furthermore, their relation to human health beyond metabolic syndrome is yet to be elucidated. Their further adoption in virome studies necessitates a robust taxonomic classification based on comprehensive phylogenetic analysis of sequences from diverse sources17,18,19,20.

Here, we present a comprehensive study of heliusvirus genomic and phylogenetic diversity. Comparisons of member phages derived from several large metagenomic databases show both the distinctness of this family from known ICTV taxonomic lineages, and their phylogenetic structure. We further provide evidence that increasing ecological richness of this phage family in the gut is linked to both urbanized lifestyles and several (cardiometabolic) diseases.

Results

Heliusviruses are ubiquitous in bacteriophage databases

We previously reported an expansive and widespread clade of human gut phages, tentatively named the family Ca. Heliusviridae, after the HELIUS (Healthy Life In an Urban Setting) cohort that we studied5. These phages shared nine marker genes: four structural proteins (terminase large subunit or TerL, portal protein, major capsid protein, and a head maturation protease), three replication-related ones (DNA polymerase I, helicase, and nuclease), and two proteins of unknown function (Fig. 1a)5. To study these phages more extensively, we analyzed 842,163 contigs of over 30 kbp from seven phage databases with profile hidden Markov models (HMMs) of all nine marker genes (Fig. 1b). From these, we selected 249,366 complete phage genomes, as predicted by CheckV21. Among the complete genomes, 33,356 had a hit against the TerL marker gene (bit-score ≥50), of which 25,699 remained after removing identical genomes. The TerL, which encodes an ATP-driven molecular motor involved in genome translocation into the viral capsid, is highly conserved among Caudoviricetes phages and consequently frequently used as a marker gene12,14. A phylogenetic tree of TerL genes from these 25,699 genomes revealed a distinct clade of 1308 genomes, most of which had high similarity to the heliusvirus TerL (bit-score >700) and hits against all nine marker genes (Fig. 1c, Supplementary Fig. 1a). The TerL genes from the genomes in this clade also formed a monophyletic group when compared to TerL genes from ICTV-recognized viruses (Fig. 1d, Supplementary Fig. 1b). The highest TerL bit-score of an ICTV-recognized virus was 515.4 against Dragolirvirus dragolir (Gochnauervirinae, MG727697). Manual curation to remove potential chimeric sequences (e.g., with multiple non-consecutive TerL genes) left 1032 heliusvirus genomes for further study.
Phage-specific gene annotation with pharokka22 refined the function of several marker genes: the DNA polymerase I was revealed to instead be an exonuclease, while the ones of unknown function encode a DNA polymerase and ssDNA-binding protein (Fig. 1a). Presence of Hc1- and Ad1-type head-completion proteins indicated these phages have siphovirus morphology23, as does the fact that their closest relatives in the Gochnauervirinae24 and Orlajensenviridae11 are siphoviruses. We next determined which genes were most common among these phages by forming clusters of homologous proteins (i.e., protein clusters). The only universal protein cluster was the TerL, and the other previously described marker genes were all among the 11 most common protein clusters (Fig. 1e, Supplementary Data 2). Genes with hits against the HMMs of the major capsid protein and helicase were spread out over multiple protein clusters. Pairwise BLASTp searches of these proteins and comparisons of their hmmsearch bit-scores showed that these disparate protein clusters were highly distinct, indicating that in rare cases phages have acquired alternate versions of these genes (Supplementary Fig. 2). These proteins were clustered together, with 27/32 and 33/70 of the smaller PCs for the major capsid protein and helicase belonging to the same clade. A single genome also had a ssDNA-binding protein that was distinct from other heliusviruses. Additional highly common protein clusters were a DNA methyltransferase, likely used to defend against host restriction/modification systems25, and a DNA invertase, which in some phages is involved with host-range switching26.

Fig. 1: Presence of heliusvirus marker genes is linked to TerL homology.

a Annotation of NatCom_991. The nine marker genes are denoted above the genome line, with re-annotated functions in parentheses. Other annotated genes are below the genome line. Full annotation is reported in Supplementary Data 1. b Overview of the procedure by which genomes were selected from seven phage databases. c A midpoint rooted approximate maximum-likelihood tree of all 25,699 non-redundant TerL sequences from complete phage genomes. Validation trees are in Supplementary Fig. 1a. d A rooted approximate maximum-likelihood tree of heliusvirus TerL genes and their closest relatives from ICTV lineages. The tree is trimmed; for the full tree with all ICTV-derived proteins, see Supplementary Fig. 1b. e Prevalence of the most common protein clusters. Red denotes marker genes, gray others. Source data are provided as a Source Data file.

Overall, all nine marker genes were found in 827/1032 (80.1%) heliusvirus genomes and 1029/1032 (99.8%) had a TerL hmm-hit with a bit-score above 700 (Supplementary Data 3). Among the 24,391 non-heliusvirus genomes in Fig. 1c, 14 genomes contained all nine marker genes and 5 had an hmm-hit against the TerL with a bit-score over 700. Thus, the nine originally identified marker genes are very common, though not universal, among heliusviruses.
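The two criteria summarized above (presence of all nine marker genes and a TerL bit-score above 700) can be illustrated with a minimal sketch. The hit table, the genome names, and the marker labels are hypothetical, and the minimum bit-score of 50 is here assumed to apply to every marker, although the text states it only for the TerL.

```python
# Marker labels follow the re-annotated functions; the exact strings are hypothetical.
MARKERS = [
    "TerL", "portal", "major_capsid", "head_maturation_protease",
    "exonuclease", "helicase", "endonuclease", "DNA_polymerase", "ssDNA_binding",
]

def summarize_markers(hits, markers=MARKERS, min_hit=50.0, terl_cutoff=700.0):
    """Summarize marker-gene presence per genome.

    hits maps genome -> {marker: best hmmsearch bit-score}.
    min_hit is an assumed presence threshold for all markers.
    """
    summary = {}
    for genome, scores in hits.items():
        present = [m for m in markers if scores.get(m, 0.0) >= min_hit]
        summary[genome] = {
            "n_markers": len(present),
            "all_markers": len(present) == len(markers),
            "strong_terl": scores.get("TerL", 0.0) > terl_cutoff,
        }
    return summary

# Hypothetical example: one genome with all markers, one with a weaker TerL hit.
example_hits = {
    "genome_A": {**{m: 120.0 for m in MARKERS}, "TerL": 815.2},
    "genome_B": {"TerL": 515.4, "portal": 80.0},
}
result = summarize_markers(example_hits)
```

In this sketch a genome can satisfy either criterion independently, which mirrors the observation that the nine markers are common but not universal among heliusviruses.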

Sub-taxa with distinct genomic characteristics reflect ecologically divergent heliusvirus lineages

We next sought to establish more detailed phylogenies of these phages. We previously divided the heliusviruses into alpha, beta, and gamma subgroups. In an approximate maximum-likelihood tree of TerL genes, we again discerned three large well-supported clades (Shimodaira–Hasegawa-like approximate likelihood ratio test27 ≥80 and ultrafast bootstrap approximation28 ≥95, Fig. 2a), but these did not completely equate to the previous putative groups. The largest clade contained all but one of the partial genomes from the previously proposed gamma-group, but groups alpha and beta were spread out across the other two clades. This reflects the two main clades in group beta that were evident in the earlier analysis5, while their different architecture in the TerL-based tree versus the earlier concatenated tree may reflect horizontal gene transfer of at least some of the core genes among these groups. Furthermore, large subclades in all three clades contained no representative of the earlier groups, and neither did a number of smaller clades across the tree which were not part of the three main clades. In the face of this expanded biodiversity, we set out to draw more detailed taxonomic boundaries to combine related heliusvirus lineages, allowing for ecological and evolutionary interpretations.

Fig. 2: The Ca. Heliusvirales are a large order of gut phages with distinct subgroups.

a A rooted approximate maximum-likelihood tree of Ca. Heliusvirales TerL genes. The outgroup at the extreme left of the tree is Dragolirvirus dragolir. The symbols for the presence of the marker genes are, from inside to outside: the exonuclease, ssDNA binding protein, DNA polymerase, endonuclease, helicase, TerL, portal protein, head maturation protease, and major capsid protein. b Chart depicting the pairwise percent of shared genes among the three main families (top) and their subfamilies (bottom). The n reflects the number of genomes in the lineage. For an overview of all families, see Supplementary Fig. 2. Box plots show the median (middle line), 25th, and 75th percentile (box), with the 25th percentile minus and the 75th percentile plus 1.5 times the interquartile range (whiskers), and outliers (single points). c vContact2 protein-sharing network of all sequences in (a). Node edge color denotes family, while fill denotes subfamily, both according to the same colors as used in (a). Source data are provided as a Source Data file.

The TerL tree also made it evident that these phages are better classified at the order level. The reasoning behind this is as follows: the three largest clades were divisible into smaller clades that at minimum shared 5–57% of their protein content, too low for them to be considered different viral genera (Fig. 2b, Supplementary Fig. 3)29. In line with the viral taxonomic hierarchy30, we classify these clades as subfamilies. The higher-level clades are then necessarily families, and the entirety of the tree an order or suborder. Since suborders are strictly optional levels, and there is no clear distinction between them and orders30, we tentatively reclassify the phages studied here at the order level. Consequently, we renamed the Ca. Heliusviridae as the Candidatus viral order Heliusvirales. Because Helios was the ancient Greek solar deity, we named families after various solar deities. The three largest families are the Ca. Utuviridae (Mesopotamian), Ca. Aineviridae (Irish), and Ca. Hathorviridae (Egyptian). Several smaller well-supported and high-level clades were also placed at the family rank due to their very low (<15%) shared protein content (Supplementary Fig. 3). The subfamilies within the three largest families mostly had a median shared protein content of 25% (Fig. 2b). All subfamilies were named after words for sun in languages from around the world (Fig. 2a, Supplementary Data 4).
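The shared protein content used to set these rank boundaries can be sketched as follows. The exact formula is not given in the text, so the definition below (shared protein clusters divided by the number of clusters in the smaller genome) is an assumption, as are the protein cluster labels.

```python
from itertools import product
from statistics import median

def percent_shared(pcs_a, pcs_b):
    """Percent of protein clusters (PCs) shared between two genomes.

    Assumed definition: shared PCs relative to the genome with fewer PCs.
    """
    a, b = set(pcs_a), set(pcs_b)
    if not a or not b:
        return 0.0
    return 100.0 * len(a & b) / min(len(a), len(b))

def median_shared_between(lineage_x, lineage_y):
    """Median pairwise shared protein content between two lineages."""
    values = [percent_shared(x, y) for x, y in product(lineage_x, lineage_y)]
    return median(values)

# Hypothetical PC sets for two small lineages.
lineage_1 = [{"PC1", "PC2", "PC3", "PC4"}, {"PC1", "PC2", "PC5", "PC6"}]
lineage_2 = [{"PC1", "PC7", "PC8", "PC9"}]
between = median_shared_between(lineage_1, lineage_2)
```

Under this assumed definition, lineages whose median pairwise value falls below a chosen threshold (for example the <15% mentioned for the smaller families) would be placed at separate ranks.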

For a group of phages to be deemed a defined lineage, they need to be a monophyletic group that is distinct from established taxonomy29. To establish this, we performed proteomic-based analyses on the 1032 curated genomes against phages classified by the ICTV. In a genomic protein-sharing network, our genomes formed a cluster separate from RefSeq phages (Supplementary Fig. 4a). This was confirmed in a phage proteomic tree, where they also formed a separate monophyletic clade distinct from ICTV-recognized sequences (Supplementary Fig. 4b). Together, these results show that the Ca. Heliusvirales are a cohesive lineage that is distinct from the current ICTV-ratified families.

A genomic network of the Ca. Heliusvirales showed that most subfamilies are highly similar in protein content (Fig. 2c), a notion supported by conservation of the marker genes and gene synteny, as well as further shared genes between subfamilies, from pairwise comparison between the proposed families and subfamilies (Fig. 3). Nevertheless, several subfamilies formed distinct clusters in the genomic network, most evidently the Ca. Mehrvirinae (Ca. Aineviridae) and Ca. Nitavirinae (Ca. Utuviridae), and to a lesser extent the Ca. Mataharivirinae (Ca. Hathorviridae) (Fig. 2c). This indicates that they are distinct in protein content, which is reflected by the environments in which they were found and their predicted bacterial hosts (Fig. 4a, b), as well as their divergent GC content when compared to other members of their family (Supplementary Fig. 5).

Fig. 3: Pairwise whole-genome comparisons of genomes from each of the families and subfamilies, with vertical lines showing tBLASTx similarity.

The tree on the left is a collapsed version of the one in Fig. 2a. Genes are colored according to their function, and homologs of the nine originally identified marker genes have dashed red outlines. For the names of the genomes depicted, see Supplementary Data 4. Source data are provided as a Source Data file.

Fig. 4: High diversity in environmental origin and predicted host species among Ca. Heliusvirales genomes.

The 1032 complete HQ Ca. Heliusvirales genomes by subfamily according to their (a) environmental origins, and (b) iPHoP-based bacterial host predictions. Detailed results can be found in Supplementary Fig. 6 and Supplementary Data 5. c Pathway assignment of putative auxiliary metabolic genes (AMGs) and morons encoded by the Ca. Heliusvirales. For brevity, the 20 pathways that were most commonly found were selected. Source data are provided as a Source Data file.

Out of the 1032 curated genomes, no host was predicted for 490. Of the 542 Ca. Heliusvirales with a predicted host, 507 (93.4%) had a predicted host within the Firmicutes or one of its sub-phyla in the genome taxonomy database (GTDB). The most commonly predicted host families were the Lachnospiraceae (164 phages, 30.3% of predictions), Streptococcaceae (97 phages, 17.9%), and Acutalibacteraceae (38 phages, 7.0%). As to environmental origin, most of the complete Ca. Heliusvirales genomes originated from human gut samples (645/1032 or 62.5%), followed by the intestinal tracts of other animals, mostly mammals (194 genomes or 18.6%). These findings are consistent with the observation that Ca. Heliusvirales are prevalent in the human gut, as the top ten most commonly predicted host genera included several Lachnospiraceae bacteria that are typical of the human gut (e.g., Blautia, Ruminococcus, Dorea, Coprococcus, and Faecalimonas, see Supplementary Data 5), and this predicted host family was the most common for human gut-derived Ca. Heliusvirales genomes (Supplementary Fig. 6).

Several subfamilies were distinct in their environmental origins or predicted hosts. The Ca. Mehrvirinae were enriched in phages from the human oral cavity (hypergeometric test, p = 3.19 × 10−20), an environmental preference also reflected in their most commonly predicted hosts: the facultatively anaerobic genus Streptococcus (family Streptococcaceae). The distinctness of the Ca. Nitavirinae similarly reflects their predicted hosts in the Actinomycetaceae rather than the Firmicutes infected by other Ca. Heliusvirales. The Lachnospiraceae phages in the Ca. Mataharivirinae were enriched in human gut-derived phages (hypergeometric test, p = 1.28 × 10−20). Within the Lachnospiraceae, the most commonly predicted Ca. Mataharivirinae hosts were the genera Coprococcus, Agathobacter, TF01-11, and Dorea_A, which are all common human gut bacteria (Supplementary Fig. 7). Finally, the Ca. Zonviridae, although less distinct in the genomic network (Fig. 2c), were peculiar in their predicted hosts. These were in the Negativicutes, with their most commonly predicted hosts being in the Acidaminococcaceae and Megasphaeraceae families, which unlike other Firmicutes possess diderm cell envelopes31. Unlike the recently defined Crassvirales, which are all thought to infect hosts within the Bacteroidetes phylum32, the Ca. Heliusvirales thus seem to infect a wide variety of hosts across multiple phyla.
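The enrichment tests reported above can be illustrated with a one-sided hypergeometric tail probability. This is a minimal sketch using only the standard library; the counts in the example are hypothetical and are not the counts underlying the reported p values.

```python
from math import comb

def hypergeom_enrichment_p(population, successes, draws, observed):
    """P(X >= observed) for X ~ Hypergeometric(population, successes, draws).

    population: all genomes considered
    successes:  genomes from the environment of interest
    draws:      genomes in the subfamily being tested
    observed:   subfamily genomes that come from that environment
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return tail / total

# Hypothetical counts: 20 genomes, 5 oral, a subfamily of 4 with 3 oral members.
p_value = hypergeom_enrichment_p(population=20, successes=5, draws=4, observed=3)
```

A small tail probability indicates that the subfamily contains more genomes from the environment than expected if subfamily membership and environment were independent.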

Almost all complete Ca. Heliusvirales genomes (97.9%) carried identifiable genes involved with integration and excision, indicating that they are temperate phages (Supplementary Fig. 8). While the remainder of the complete genomes might belong to obligately virulent phages, they might also contain divergent integration genes that evaded annotation or have other chronic lifestyles. The only lineage in which fewer than 75% of genomes were identifiably temperate was the Ca. Willkavirinae, which contained 5 genomes, indicating that they are rare. Thus, while Ca. Heliusvirales phages are predominantly temperate, eco-evolutionary pressures on phage lifestyle contributed to the divergence of some lineages contained within it.

Putative auxiliary metabolic genes were common among Ca. Heliusvirales, being present in 78.1% of genomes (Fig. 4c), although we found no clear family- or subfamily-specific auxiliary metabolic genes. Many were part of cysteine and methionine metabolism, with the most common genes being DNA (cytosine-5)-methyltransferase 1 (K00558, present in 67% of genomes) and S-adenosylmethionine synthetase (K00789, present in 58.9% of genomes).

Helius-phages are prevalent in the human gut and human-associated since ancient times

Across our datasets, we found Ca. Heliusvirales TerL genes on every continent (Fig. 5a), indicating that they are widespread. Only one country had human gut-associated samples without any detected Ca. Heliusvirales: El Salvador (n = 1 sample). Indeed, analysis of 7166 human gut metagenomes of 5441 individuals from 38 studies33 found Ca. Heliusvirales TerL genes in 4467 individuals (82.1%, Fig. 5b). Outside the gut, prevalence of Ca. Heliusvirales was markedly lower: we detected their presence in 18/54 skin samples, 17/61 vaginal samples, and 115/312 oral samples. This is not due to differences in read depth, as Ca. Heliusvirales were less prevalent in oral samples (Fisher exact test p = 3.3 × 10−14), while those had a higher average read depth (Wilcoxon signed-rank test p = 3.3 × 10−10) than gut samples. While the Ca. Heliusvirales thus occur in various body sites, they are especially prevalent in the gut. This is in line with the finding that their most commonly predicted hosts are in the Firmicutes (e.g., the Lachnospiraceae), which is one of the main bacterial phyla in the human gut.

Fig. 5: The Ca. Heliusvirales are widespread and ancient.

a Map showing in which countries Ca. Heliusvirales TerL-containing sequences were found. Data pertains to the seven bacteriophage databases described in the methods and samples assembled earlier by Pasolli et al.33. Red dots show the locations of ancient gut microbiomes that were analyzed for Ca. Heliusvirales TerL presence, with their dates, gene-, and sample-counts. b Correlation between prevalence of Ca. Heliusvirales TerL genes and number of sequenced bases among individuals of 47 gut metagenome studies. The red dashed line is the median prevalence. Correlation according to Spearman’s rank correlation coefficient. c Tip-dated phylogenetic tree of selected Ca. Hathorviridae sequences from modern and ancient samples. Bold sequences are from ancient samples and have validated DNA degradation patterns. Bar plots and values denote median and 95% confidence intervals. A phylogenetic tree of all Ca. Hathorviridae sequences is shown in Supplementary Fig. 8. d Boxplots of the number of Ca. Heliusvirales TerL genes found per person, per billion sequenced bases, with jittered points per study. The plot includes 13 ancient, 51 hunter gatherer, and 33 urban sequences. Significance is according to nested linear mixed-effect models where the study was used as a random effect. Box plots show the median (middle line), 25th, and 75th percentile (box), with the 25th percentile minus and the 75th percentile plus 1.5 times the interquartile range (whiskers). Comparisons of richness in urbanized versus hunter-gatherer populations within each study can be found in Supplementary Fig. 9a. e Non-metric multidimensional scaling (NMDS) on unweighted UniFrac distances, colored by lifestyle of the human host. f Prevalence of Ca. Heliusvirales subfamilies in ancient peoples, hunter-gatherers, and urbanized peoples, with significance according to two-sided Fisher exact tests for subfamilies with significant values. Columns represent individuals, with gray color meaning that no Ca. Heliusvirales TerL genes were identified. All p values were adjusted for multiple testing with the Benjamini-Hochberg procedure, and are denoted as follows: *≤0.05, **≤0.01, ***≤0.001. Source data are provided as a Source Data file.

We hypothesized that the high Ca. Heliusvirales prevalence reflects their ancient association with the human lineage. To explore this, we re-analyzed whole metagenome shotgun sequencing datasets from paleofeces in Austria34 and North America35. Among the North-American samples, which were from three different locations across the USA and Mexico, and were dated between 10 and 920 CE, we found 34 Ca. Heliusvirales TerL genes across all eight of the studied samples (Fig. 5a). Among the four Austrian samples, we found three TerL sequences in one sample, which was dated to 652–544 BCE. We furthermore analyzed gut and small-intestinal samples from an ancient European individual who lived around 3200 BCE36, but found no clear Ca. Heliusvirales sequences, although four genes with hmmsearch bit-scores of between 600 and 700 against the TerL were present. Considering that ICTV-recognized phage genomes with TerL bit-scores of between 500 and 600 are close relatives to the Ca. Heliusvirales (Fig. 1d), these ancient sequences could also represent such distant relatives, although this cannot be certain without more complete genome assemblies.

The presence of Ca. Heliusvirales in multiple pre-colonial North Americans implies that they were part of the human gut microbiome before human migration to the Americas (about 15,000 years ago37). We confirmed this hypothesis with a time-measured phylogenetic tree of the two main Ca. Hathorviridae subfamilies, which include the strongly human gut-associated Ca. Mataharivirinae. A tree built with an optimized relaxed clock of ancient North-American sequences and a selection of modern human gut sequences showed that both the Ca. Mataharivirinae and Ca. Ravirinae started diversifying between about 300,000 and 350,000 years ago, albeit with large 95% highest posterior density (HPD) intervals (Ca. Mataharivirinae: 350,844 years ago (95% HPD 15,000–981,000), Ca. Ravirinae: 309,521 years ago (95% HPD 9200–894,000), Fig. 5c). This is around the time that Homo sapiens first emerged38, suggesting that these phages have been a part of the human gut ecosystem since our distant past.

Ancient gut microbiomes have similar levels of bacterial diversity to those of modern populations consuming non-urbanized diets35, which are more diverse than those consuming urbanized diets39,40. By extension, this is likely true for the gut viruses as well. To further explore the relation between ancient, non-urbanized, and urbanized Ca. Heliusvirales, we compared them to samples derived from urbanized and non-urbanized populations. To minimize inter-study batch effects, we selected two studies that included both hunter-gatherers (from Tanzania or Peru) and urbanized people (from Italy or the USA)39,40. Within both studies, we observed a significantly higher Ca. Heliusvirales richness among urbanized people (Wilcoxon signed-rank test, Benjamini-Hochberg-adjusted p = 0.026 and 0.01, Supplementary Fig. 10a). Combining the values of modern urbanized and hunter-gatherer populations with ancient ones showed significantly higher Ca. Heliusvirales richness in urbanized people than both hunter-gatherers (linear mixed-effect model, Benjamini-Hochberg-adjusted p = 1.2 × 10−4) and ancients (linear mixed-effect model, Benjamini-Hochberg-adjusted p = 2.9 × 10−5), but not between hunter-gatherers and ancients (linear mixed-effect model, Benjamini-Hochberg-adjusted p = 0.22, Fig. 5d). This interestingly is the reverse of overall gut bacterial richness, which decreases with urbanization41. Abundance, measured as the fraction of reads that mapped to Ca. Heliusvirales TerL genes, was not significantly different between the three populations (Supplementary Fig. 10b). Ca. Heliusvirales thus seem to diversify in tandem with urbanization.
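The Benjamini-Hochberg adjustment applied to these p values, together with the star notation used in the figure legends, can be sketched as follows. This is a generic standard-library implementation for illustration and not necessarily the software used in the analysis; the input p values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

def stars(p):
    """Map an adjusted p value to the notation *<=0.05, **<=0.01, ***<=0.001."""
    if p <= 0.001:
        return "***"
    if p <= 0.01:
        return "**"
    if p <= 0.05:
        return "*"
    return ""

raw = [0.01, 0.04, 0.03, 0.005]   # hypothetical raw p values
adj = benjamini_hochberg(raw)
labels = [stars(p) for p in adj]
```

The adjustment never lowers a p value, so a comparison that is not significant before correction cannot become significant afterwards.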

We next analyzed Ca. Heliusvirales β-diversity among urbanized, hunter-gatherer, and ancient samples with a phylogenetic tree of TerL genes from these samples and those from complete genomes as depicted in Fig. 2a. Non-metric multidimensional scaling (NMDS) on unweighted UniFrac distances intriguingly showed that all three populations (urban, hunter-gatherers, ancients) had distinct Ca. Heliusvirales populations (PERMANOVA q = 0.001, Fig. 5e). Among ancient samples, the small Ca. Xiheviridae family was markedly more prevalent than among both modern hunter-gatherers and urban populations (Fisher’s exact test, Benjamini-Hochberg-adjusted p < 0.05, Fig. 5f). The Ca. Ghrianvirinae (Ca. Utuviridae) were meanwhile much more prevalent among both ancient and modern hunter-gatherers than among urbanized populations. In general, the Ca. Heliusvirales subfamily diversity was lowest among ancient samples, while modern hunter-gatherers and urban populations were similar (Supplementary Fig. 10c). This indicates that modernity is linked to an expansion of Ca. Heliusvirales that is not only explained by lifestyle. Our results could reflect that predicted Ca. Heliusvirales hosts in the Lachnospiraceae thrived in the urbanized gut microbiome.
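The unweighted UniFrac distance underlying this ordination is the fraction of branch length in a phylogenetic tree that leads to sequences from only one of two samples. A minimal sketch is given below; the tree encoding (each node mapped to its parent and branch length) and the toy tree are hypothetical, and a dedicated package would normally be used on the real TerL tree.

```python
def branches_to_root(tip, parent_of):
    """Collect the branches on the path from a tip to the root.

    A branch is identified by its child node; the root has parent None.
    """
    branches = set()
    node = tip
    while parent_of[node][0] is not None:
        branches.add(node)
        node = parent_of[node][0]
    return branches

def unweighted_unifrac(tips_a, tips_b, parent_of):
    """Unique branch length divided by total branch length covered by both samples."""
    cover_a = set().union(*(branches_to_root(t, parent_of) for t in tips_a))
    cover_b = set().union(*(branches_to_root(t, parent_of) for t in tips_b))
    covered = cover_a | cover_b
    if not covered:
        return 0.0
    unique = cover_a ^ cover_b
    total_length = sum(parent_of[n][1] for n in covered)
    unique_length = sum(parent_of[n][1] for n in unique)
    return unique_length / total_length

# Toy tree: node -> (parent, branch length).
toy_tree = {
    "root": (None, 0.0),
    "n1": ("root", 1.0), "n2": ("root", 1.0),
    "t1": ("n1", 0.5), "t2": ("n1", 0.5), "t3": ("n2", 2.0),
}
distance = unweighted_unifrac({"t1", "t2"}, {"t2", "t3"}, toy_tree)
```

Because only presence or absence of each TerL sequence is used, this distance is insensitive to abundance, which is consistent with comparing samples of differing sequencing depth by composition alone.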

Helius-phage richness is associated with disease

Our previous research5 focused on gut virome alterations in metabolic syndrome (MetS), and found an association between members of the Ca. Heliusvirales and this set of cardiometabolic risk factors. To study a wider range of illnesses, we now identified Ca. Heliusvirales sequences in fourteen studies of diverse human-derived samples33 by searching for the TerL marker gene. Within-study analyses identified significantly altered Ca. Heliusvirales richness in four out of twelve illnesses (Wilcoxon signed-rank test Benjamini-Hochberg adjusted p < 0.05): type 1 diabetes (T1D), type 2 diabetes (T2D), inflammatory bowel disease (IBD), and liver cirrhosis (Fig. 6a). In the former three, richness was elevated when compared to healthy controls, while in liver cirrhosis it was decreased.
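Richness in these comparisons is expressed in the figure legends as the number of Ca. Heliusvirales TerL genes per person per billion sequenced bases. A minimal sketch of that normalization follows; the sample tuples are hypothetical, and the grouping by disease status is only an illustration of how per-group medians could be obtained before testing.

```python
from statistics import median

def richness_per_gbp(n_terl_genes, sequenced_bases):
    """TerL gene count normalized to one billion sequenced bases."""
    if sequenced_bases <= 0:
        raise ValueError("sequenced_bases must be positive")
    return n_terl_genes * 1e9 / sequenced_bases

def median_richness_by_group(samples):
    """samples: iterable of (group, n_terl_genes, sequenced_bases)."""
    groups = {}
    for group, n_genes, n_bases in samples:
        groups.setdefault(group, []).append(richness_per_gbp(n_genes, n_bases))
    return {group: median(values) for group, values in groups.items()}

# Hypothetical samples: (status, TerL genes detected, bases sequenced).
samples = [
    ("control", 4, 2_000_000_000),
    ("control", 6, 3_000_000_000),
    ("case", 9, 3_000_000_000),
    ("case", 10, 2_000_000_000),
]
medians = median_richness_by_group(samples)
```

Normalizing by sequenced bases keeps deeply sequenced samples from appearing richer simply because more TerL genes could be assembled from them.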

Fig. 6: Ca. Heliusvirales phages are associated with a number of diseases.

a Four studies in which a significant difference was found in the number of Ca. Heliusvirales TerL genes per person per billion sequenced bases. Significance is according to Wilcoxon signed-rank tests. Box plots show the median (middle line), 25th, and 75th percentile (box), with the 25th percentile minus and the 75th percentile plus 1.5 times the interquartile range (whiskers). Sample counts are: LiJ_2014, 10 controls and 31 diseased; NielsenHB_2014, 236 controls and 82 diseased; QinJ_2012, 174 controls and 170 diseased; QinN_2014, 114 controls and 123 diseased. b Non-metric multidimensional scaling (NMDS) on unweighted UniFrac distances, colored by disease status. c The difference in prevalence between controls and people with disease. All p values were according to two-sided Fisher exact tests and adjusted for multiple testing with the Benjamini-Hochberg procedure. Source data are provided as a Source Data file.

Besides differences in Ca. Heliusvirales richness, the populations of these phages also differed in each of the four studies (unweighted UniFrac, PERMANOVA Benjamini-Hochberg adjusted p < 0.05, Fig. 6b). These population-wide differences between healthy controls and people with T1D, T2D, IBD, and cirrhosis expressed themselves in distinct patterns of prevalence at the subfamily level (Fig. 6c). Similar to differences between ancient, modern hunter-gatherer, and modern urbanized populations (Fig. 5f), the Ca. Xiheviridae phages were more prevalent in T1D and T2D patients than in the corresponding healthy individuals (Fisher’s exact test, Benjamini-Hochberg adjusted p < 0.05), though not in IBD or cirrhosis. The Ca. Ghrianvirinae, which was less prevalent in modern urbanized populations than in either modern hunter-gatherer or ancient ones, was more prevalent in T2D patients than in controls, but not in the other diseases. Besides these two, the Ca. Tilevirinae was significantly more prevalent in T1D and T2D than in controls, and significantly less so in IBD and cirrhosis. The observation that the same Ca. Heliusvirales lineages differed in prevalence both between urbanized and non-urbanized populations and between T1D/T2D patients and healthy controls may represent a viral complement to earlier findings that certain bacterial clades are distinctly associated with increased urbanization42.

Discussion

Elucidating higher-order structures among gut phage lineages is essential for gaining a deeper understanding of the viromes in the human gut and beyond. This process is underway with the re-evaluation of viral taxonomy along genomic lines29. Recently, the prevalent Crassvirales phage order has undergone the arduous process of classification32,43, and here we add the Ca. Heliusvirales order as a second genome-based classification of phages that are widespread in the human gut. For this, we used a combination of marker gene phylogenies, genomic analyses, and proteomics-based approaches. As the TerL marker gene is conserved among the Duplodnaviria, to which dsDNA bacteriophages belong44, we considered the phylogeny built with this gene to be the most reliable method for defining Ca. Heliusvirales taxonomy. Our usage of the TerL gene to define taxonomy is akin to the Crassvirales11, but unlike the concatenated phylogeny of marker genes used to define the Herelleviridae12. The absence of any other lineage-specific marker gene beside the TerL gene among the Ca. Heliusvirales precludes a concatenated alignment phylogeny approach. We recognize that homology search tools including MMseqs2 may under-cluster the protein families. Using a more sensitive tool would result in larger, more widely shared protein families and could increase the genomic coherence among the Ca. Heliusvirales subclades.

While some ICTV-recognized phage lineages, such as the Gochnauervirinae, Hendrixvirinae, and Orlajensenviridae, are closely related to the Ca. Heliusvirales in phylogenetic analyses of the TerL gene (Fig. 1d and Supplementary Fig. 1), they were highly distinct from the Ca. Heliusvirales in protein-content based analysis (Supplementary Fig. 3a) and the whole genome tree (Supplementary Fig. 3b). Perhaps these phages will together form a yet-to-be-established higher taxonomic clade.

The Ca. Heliusvirales order is highly diverse in environmental origin, geographic distribution, and predicted bacterial hosts. While we found Ca. Heliusvirales phages in terrestrial, marine, and gut microbiomes of both cattle and wild animals, the majority of genomes derived from human gut samples. At face value this may be attributed to anthropocentric sampling bias, but this is unlikely because the IMG/VR database included in our analysis contains about six times more aquatic and terrestrial phages than human-derived ones45. Furthermore, non-animal-associated Ca. Heliusvirales phages are largely constrained to specific taxonomic families, indicating environmental differentiation.

The Ca. Heliusvirales were found to carry a large number of putative auxiliary metabolic genes. The most common ones were involved in cysteine and methionine metabolism. These genes may aid the phage in countering bacterial restriction-modification defenses25,46. As cysteine and methionine are both sulfur-containing amino acids and other sulfur metabolism genes were the next most common category, another possibility is that the phages are involved in reprogramming bacterial sulfur metabolism for additional energy production during phage particle formation47.

We identified human gut-associated Ca. Heliusvirales phages in countries from every permanently inhabited continent, and in about 80% of the population. Thus, while no isolate of this bacteriophage order exists to our knowledge, their high prevalence across metagenomes clearly indicates them to be core members of the human gut ecosystem. On the one hand, this percentage could be an underestimation, as false negatives falling below the detection limit can never be ruled out, especially in samples with shallow sequencing depth. But on the other hand, identification of Ca. Heliusvirales contigs from assembled metagenomic datasets based on the presence of the TerL could have led to selection of some false positive sequences, e.g. due to chimeric assemblies or the presence of TerL sequences in defective prophages. As chimeric contigs are rare48, we consider it unlikely that this significantly impacted our results.

In either case, the Ca. Heliusvirales are core members of the human gut ecosystem. This has been the case since the deep past of modern humans, as evidenced by the presence of these phages in palaeofaeces and our accompanying evolutionary studies. Presence in palaeofaeces was not universal, as only one out of four European samples contained Ca. Heliusvirales. This might reflect either widespread absence of these phages from ancient Europeans, or incomplete assembly and DNA degradation. Three observations support the latter hypothesis. Firstly, the Ca. Heliusvirales were universally present in North American samples dated to the period between the split of American and Afro-Eurasian populations (about 25,000 years ago37) and the re-establishment of contact four centuries ago. Secondly, we dated the diversification of the Ca. Mataharivirinae and Ca. Ravirinae subfamilies roughly to the emergence of Homo sapiens in Africa around 300,000 years ago38. Thirdly, we find the Ca. Heliusvirales almost universally present in modern human populations. While the combination of these facts might indicate that these phages became unusually rare in ancient Europeans, but not North Americans, and that they subsequently became more common over the past millennium, it seems more likely that Ca. Heliusvirales absence from the ancient European samples was due to differences in DNA degradation owing to environmental circumstances when compared to North America49, or due to technical differences in sample handling and sequencing.

As stated, the Ca. Mataharivirinae and Ca. Ravirinae subfamilies diversified roughly 300,000 years ago, albeit with large confidence intervals due to limited ancient sequence data. Thus, these bacteriophage subfamilies are notably more ancient than the human-associated bacterial species Methanobrevibacter smithii35 and M. oralis50 which diverged 85,000 and 126,000 years ago, respectively. We hypothesize that these phages diversified in response to microbiome alterations influenced by changing human geography and diet, both main drivers of human microbiome composition51,52. Earlier studies showed co-linearity between human- and great ape-associated Crassvirales genomes53 and estimated a diversification of these phages in the past several centuries54. This latter estimate of much more recent diversification was, however, noted for its uncertainty and focused only on crAssphage rather than higher-order taxa, such as the subfamily-level analysis presented here.

We also found that higher Ca. Heliusvirales richness is associated with increased urbanization. This is remarkable, as microbiome diversity as a whole is lower with increasing urbanization39,40. Many lineages have disappeared from modern urbanized populations, but some have also expanded. The observation that Ca. Heliusvirales have expanded could reflect that their bacterial hosts have expanded in urbanized microbiomes. Indeed, Hadza hunter-gatherers from Tanzania and Matses hunter-gatherers from Peru have lower levels of Lachnospiraceae bacteria such as Blautia and Ruminococcus than people in urbanized environments40,55.

Besides correlations with urbanization and modernity, we found correlations between Ca. Heliusvirales and several disease states. While the causative direction underlying these correlations is uncertain, these disease linkages are interesting in light of the predicted Lachnospiraceae hosts of many Ca. Heliusvirales, because this bacterial family includes species involved in short-chain fatty acid production (e.g., Roseburia intestinalis, Anaerobutyricum hallii, Coprococcus eutactus, and Blautia wexlerae, Supplementary Fig. 5)56, which are often associated with the healthy gut. Indeed, increased Lachnospiraceae abundance in T1D57 and T2D patients58 has been reported, though a study of in vitro incubations in a relatively small cohort (n = 26) linked IBD to decreased Lachnospiraceae abundance59. For the former two diseases, this could explain the greater richness and prevalence of Ca. Heliusvirales phages among such patients. The associations at the order rank notably run counter to ASV-level analysis of the entire gut virome, which was found to be unchanged in T1D60, T2D61, and IBD4. This underscores the value of analyzing higher taxonomic ranks in the context of the gut virome.

While the Ca. Heliusvirales as a whole are near universally present in the human gut, no single viral genome in this putative order is. The linkages to lifestyle and disease demonstrated here are thus only evident when studying higher viral taxonomic levels. In line with the recent reassessment of viral taxonomy along genomic lines11, this study thus shows how future definition of additional viral orders and classes can provide key insights into the ecology and disease-associations of the human gut virome.

Methods

Data

All bioinformatic tools were used with default settings, unless explicitly stated otherwise. Statistical analyses were done in R v4.2.1. All p-values were adjusted for multiple testing with the Benjamini-Hochberg method.
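As an illustration of the multiple-testing correction mentioned above, the following is a minimal Python sketch of the Benjamini-Hochberg adjustment. It is not the code used in the study, which relied on R; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Visit p-values from largest to smallest so that a running minimum
    # keeps the adjusted values monotone.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for steps_from_top, idx in enumerate(order):
        rank = m - steps_from_top  # 1-based rank in ascending order
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values, for illustration only.
example_adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Each raw p-value is multiplied by the number of tests divided by its ascending rank, and the running minimum ensures that a smaller raw p-value never receives a larger adjusted value.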

The primary dataset of Heliusviridae marker genes was derived from 298 contigs that were used in our earlier study5, which examined the gut virome in the general population-based multi-ethnic Healthy Life in an Urban Setting (HELIUS) cohort62. Contig datasets that were used to identify Heliusviridae consisted of 63,481 genomes from this previous study. This included all viral contigs before removing redundant ones. Further data sources consisted of four recently constructed databases of gut bacteriophages: the gut phage database (GPD, 142,809 genomes)17, cenote-taker 2 human virome database (CHVD, 93,860 genomes)19, gut virome database (GVD, 33,242 genomes)18, and metagenomic gut virus catalog (MGV, 189,680 genomes)63. Further contigs were derived from the fourth version of the integrated microbial genomes viral dataset (IMG/VR, 15,722,824 genomes)20 and bacterial viruses from the NCBI viral reference sequence dataset (RefSeq, 4220 genomes) release 216. For studies of urbanization and disease links, metagenome assemblies of 7166 human gut metagenomes of 5441 individuals from 38 studies were obtained from Pasolli et al.33.

Identification of putative Heliusviridae sequences

Since previous findings showed Heliusviridae genomes to be around 50,000–100,000 bp long, we first selected contigs of at least 30,000 bp without any ambiguous bases from all datasets. This resulted in an effective search space of 842,163 contigs. Open reading frames (ORFs) were then predicted using prodigal v2.6.364 (option -p meta). The contigs were analyzed for completeness with checkv v1.0.121 (using database version 1.4), and we selected 249,366 contigs that had 100% completeness, no warnings, and were not recognized as prophages.
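The length and ambiguity pre-filter described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the pipeline used in the study: the function names are hypothetical, and it assumes that any character other than A, C, G, or T counts as an ambiguous base.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def passes_prefilter(sequence, min_length=30_000):
    """True if the contig is long enough and holds only A, C, G, T.

    Assumption: every other character (N, R, Y, ...) is treated as ambiguous.
    """
    seq = sequence.upper()
    return len(seq) >= min_length and set(seq) <= set("ACGT")


def prefilter_contigs(lines, min_length=30_000):
    """Return the (header, sequence) pairs that pass the pre-filter."""
    return [(h, s) for h, s in read_fasta(lines) if passes_prefilter(s, min_length)]
```

In practice the input would be an open FASTA file handle, for example `prefilter_contigs(open("contigs.fasta"))`, where the file name is a placeholder.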

For each of the nine marker genes, a profile hidden Markov model (HMM) was constructed from protein sequences derived from our previous study using hmmbuild v3.365. The resulting profile HMMs were used to search against predicted ORFs using hmmsearch v3.365. The 33,356 complete contigs with an hmmsearch hit (bit-score ≥50) against the TerL gene were selected. Duplicates were removed with dedupe from BBTools v38.84 (option minidentity=100), resulting in a non-redundant dataset of 25,699 contigs.
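The bit-score threshold applied to the hmmsearch results could be implemented along the following lines. This is a sketch under stated assumptions, not the study's code: it assumes the hits were written in the tabular `--tblout` format, that the full-sequence score (sixth column) is the score being thresholded, and that ORF identifiers follow prodigal's `<contig>_<index>` naming. The function names are hypothetical.

```python
def select_hits(tblout_lines, min_bitscore=50.0):
    """Map each target protein to its best full-sequence bit-score.

    Only hits with a score of at least min_bitscore are kept.
    """
    best = {}
    for line in tblout_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        target, score = fields[0], float(fields[5])
        if score >= min_bitscore:
            best[target] = max(score, best.get(target, score))
    return best


def contigs_with_hits(hits):
    """Collapse ORF identifiers to contig identifiers.

    Assumption: ORFs are named '<contig>_<index>', as in prodigal output.
    """
    return {target.rsplit("_", 1)[0] for target in hits}
```

The resulting set of contig identifiers can then be intersected with the set of complete contigs before deduplication.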

The TerL genes from these genomes were aligned with MAFFT v4.753 (options --maxiterate, --localpair), after which positions with more than 90% gaps were trimmed with trimal v1.4.rev1566 (option -gt 0.1). Finally, a tree was built using VeryFastTree v4.0.367. Tree topology was confirmed by randomly selecting 20 sequences from the Heliusviridae branch and 180 from the rest of the tree. These were aligned and trimmed as described above, and a tree was built with IQTree v2.2.0.368, using ModelFinder69, 1000 replicates of the ultrafast bootstrap approximation28, and the SH-like approximate likelihood ratio test. IQTree performed ten independent tree searches and selected the one with the best log-likelihood score (options -B 1000 -alrt 1000 --runs 10). The trees were visualized with the interactive tree of life (iTOL) webtool70. This confirmation was repeated on five different subsets of sequences.
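To make the trimming criterion concrete, the sketch below removes alignment columns in which more than 90% of sequences carry a gap, which is the behaviour we understand the gap threshold of 0.1 to produce. It is an illustration in plain Python, not a replacement for trimal; the function name and the dictionary-based alignment representation are assumptions.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.9):
    """Drop columns whose gap fraction exceeds max_gap_fraction.

    alignment: dict mapping sequence name to an aligned sequence;
    all sequences must have the same length and '-' marks a gap.
    """
    names = list(alignment)
    rows = [alignment[name] for name in names]
    if not rows:
        return {}
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("aligned sequences must have equal length")
    keep = [
        col
        for col in range(length)
        if sum(row[col] == "-" for row in rows) / len(rows) <= max_gap_fraction
    ]
    return {name: "".join(row[c] for c in keep) for name, row in zip(names, rows)}
```

Columns that are gaps in nearly all sequences carry little phylogenetic signal, so removing them shortens the alignment without discarding informative positions.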

Ca. Heliusvirales curation and annotation

Putative Ca. Heliusvirales genomes were manually curated to discard ones with multiple predicted TerL, portal, or major capsid proteins, or with marker genes whose length deviated by more than 50% from the median length. For sequences with split marker gene sequences, gene prediction was repeated with alternative coding tables 4 and 15, but no instance was found where this improved annotation and thus these sequences were also discarded. This left 1032 high-quality (HQ) Ca. Heliusvirales genomes for subsequent analyses.

HQ genomes were fully annotated with pharokka v1.2.122 (option -m), and subsequently phold v0.1.4 (https://github.com/gbouras13/phold). ORFs were grouped in protein clusters using mmseqs2 v14.7e28471 (options --min-seq-id 0.3, -c 0.5, --cov-mode 1). GC contents were derived from pharokka output, and temperate phages were defined as those with at least one ORF with a predicted function in the “integration and excision” category. Potential auxiliary metabolic genes encoded by the genomes were identified with VIBRANT v1.2.172.

A tree for defining Ca. Heliusvirales family- and subfamily-level taxonomy was constructed from TerL sequences of HQ genomes, with the Dragonlirvirus dragonlir TerL (AUS03408) as outgroup. Alignment and trimming were done with MAFFT and trimal, and the tree was constructed with IQTree using the same settings as described above. The tree was visualized with the interactive tree of life (iTOL) webtool70.

Viral sequence clustering and proteomic analysis

Separation between our phages and those in the NCBI RefSeq database was analyzed with a tree built with VeryFastTree v4.0.367 as described above, ViPTreeGen v1.1.373, and vContact2 v0.11.374. The vContact2 network was visualized with Cytoscape v3.9.175, while the ViPTreeGen tree was visualized with the interactive tree of life (iTOL) webtool70. To analyze the grouping of genomes among the high-quality contigs, we used them to construct a second vContact2 network.

Heliusvirales hosts

We predicted the hosts infected by the Heliusvirales with iPHoP v1.3.276, where host predictions with a score of at least 90 were considered valid. This resulted in genus-level host predictions for 796 phages.

Identification of Ca. Heliusvirales contigs in metagenomic samples

To identify Ca. Heliusvirales contigs in the metagenomic assemblies used in analyses of ancient, urban, hunter-gatherer, and diseased samples, proteins were predicted with prodigal as described above. An hmmsearch using the profile HMM of the Ca. Heliusvirales TerL gene constructed from our previous study was then run against all proteins. Sequences with an hmmsearch hit against the TerL (bit-score ≥50) were then selected and used to construct a phylogenetic tree together with TerL sequences from all HQ Ca. Heliusvirales and ICTV-derived genomes. These trees were constructed using VeryFastTree v4.0.367. Trees were rooted at the midpoint with the phangorn v2.11.1 R package77, and the most recent common ancestor node of all Ca. Heliusvirales was determined and extracted with the getMRCA and extract.clade functions from the ape v5.7-1 R package78. Trees were visually inspected to confirm that all Ca. Heliusvirales were in the same clade. In a single case, that of the HeQ_2017 dataset from Pasolli et al., the Ca. Heliusvirales were at the root of the tree, and a reverse selection was performed based on the sequences that were in non-Ca. Heliusvirales clades.

Analysis of ancient samples

For analysis of ancient metagenomes, reads were downloaded from two separate studies34,35. These were trimmed with fastp v.0.23.279 (option --detect_adapter_for_pe) and error-corrected with tadpole from the bbmap package v38.90 (https://jgi.doe.gov/data-and-tools/bbtools, options mode=correct, ecc=t, prefilter=2). Error-corrected reads were assembled with metaSPAdes v3.15.580. The ancient origins of assembled contigs were determined by mapping reads to contigs of ≥1000 bp with bowtie2 v2.4.281, and determining C-to-T degradation at the 5’ end of sequencing reads with PyDamage v0.7082. Contigs with a predicted accuracy of over 0.5 were considered of ancient origins.

Heliusvirales prevalence and relation to urbanization

To determine Heliusvirales prevalence, we first gathered geographic and environmental metadata from all genomes used to build the main phylogenetic tree. Second, contigs obtained from Pasolli et al.33 were analyzed for the presence of Heliusvirales TerL genes. This was done as described above under “Identification of Ca. Heliusvirales contigs in metagenomic samples”. The same approach was used to identify Heliusvirales TerL genes among ancient populations. Enrichment of phages from certain environments among selected subfamilies was calculated with the phyper function from base R.
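The enrichment calculation rests on the hypergeometric distribution. As a hedged illustration, the Python sketch below computes the upper-tail probability of observing at least a given number of genomes from one environment within a subfamily. The function and argument names are hypothetical, and the choice of the inclusive upper tail (at least k, rather than more than k) is an assumption, since the exact phyper arguments are not stated here.

```python
from math import comb


def hypergeom_upper_tail(k_observed, n_success_pop, n_pop, n_draws):
    """P(X >= k_observed) for a hypergeometric random variable X.

    n_pop:         total number of genomes
    n_success_pop: genomes from the environment of interest
    n_draws:       genomes in the subfamily being tested
    k_observed:    genomes in the subfamily from that environment
    """
    total = comb(n_pop, n_draws)
    upper = min(n_draws, n_success_pop)
    tail = sum(
        comb(n_success_pop, k) * comb(n_pop - n_success_pop, n_draws - k)
        for k in range(k_observed, upper + 1)
    )
    return tail / total
```

A small probability indicates that the subfamily contains more genomes from that environment than expected if genomes were drawn at random from the full dataset.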

To analyze the relation of Heliusvirales richness to urbanization, richness was calculated by dividing the number of identified TerL genes by the number of reads in billions. Significance testing of richness differences after combining studies was done with linear mixed effects models where the study was entered as a random effect, using the lme function from the nlme R package v3.1-162 and Anova function from the rstatix R package v0.7.2. The same approach was used to calculate significant richness differences among pooled disease-related cohorts.
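The depth normalization described above amounts to a single division. The following minimal Python sketch shows it; the function name and the example numbers are hypothetical.

```python
def normalized_richness(n_terl_genes, n_reads):
    """Number of identified TerL genes per billion sequencing reads."""
    if n_reads <= 0:
        raise ValueError("n_reads must be positive")
    return n_terl_genes * 1e9 / n_reads


# Hypothetical sample: 12 TerL genes identified in 40 million reads.
example_richness = normalized_richness(12, 40_000_000)
```

Scaling by sequencing depth makes richness comparable between samples that were sequenced to different depths.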

Assignment of TerL sequences from ancient- and disease-related samples was done using the trees by which the classification had been done. Each sequence was assigned to the same taxonomy as its closest neighbor of which the taxonomy was known.
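One way to implement nearest-neighbor assignment is sketched below in Python. It is an illustration under stated assumptions and not the study's code: the tree is given as a hypothetical list of edges with branch lengths, and "closest" is interpreted as the smallest summed branch length (patristic distance) to a tip of known taxonomy.

```python
def nearest_labelled_tip(edges, query, taxonomy):
    """Return (tip, taxon) for the labelled tip closest to the query tip.

    edges:    list of (node_a, node_b, branch_length) describing a tree
    taxonomy: dict mapping tips of known taxonomy to their taxon
    Assumption: distance is the sum of branch lengths along the path.
    """
    adjacency = {}
    for a, b, length in edges:
        adjacency.setdefault(a, []).append((b, length))
        adjacency.setdefault(b, []).append((a, length))

    best_tip, best_dist = None, float("inf")
    # Depth-first walk from the query; a tree has no cycles, so tracking
    # the parent node is enough to avoid revisiting nodes.
    stack = [(query, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        if node != query and node in taxonomy and dist < best_dist:
            best_tip, best_dist = node, dist
        for neighbour, length in adjacency.get(node, []):
            if neighbour != parent:
                stack.append((neighbour, node, dist + length))
    return best_tip, taxonomy.get(best_tip)
```

If no labelled tip is reachable from the query, the function returns `(None, None)`, so unassigned sequences can be handled explicitly.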

For the ancient, hunter-gatherer, and urban populations, taxonomic assignment at the family level was used to calculate weighted UniFrac distances, from which a non-metric multidimensional scaling (NMDS) was constructed. Permutational ANOVA significance levels were calculated with the adonis2 function in the vegan R package v2.6-4. To determine significant differences in subfamily prevalence between populations with different lifestyles, Fisher exact tests were calculated using the fisher.test function from base R.
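For readers unfamiliar with the Fisher exact test, the sketch below computes a two-sided p-value for a 2x2 table, for example presence or absence of a subfamily in two populations. It is a plain Python illustration and not the R function used in the study. The two-sided definition used here, summing the probabilities of all tables no more likely than the observed one, is the convention we understand fisher.test to follow, and the small tolerance is an assumption to guard against floating-point ties.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability of a table with x in the top-left cell,
        # given fixed row and column totals.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    total = sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-7)
    )
    return min(1.0, total)
```

Here `a` and `b` would be the numbers of individuals with and without the subfamily in one population, and `c` and `d` the same counts in the other.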

Timed tree construction

To reconstruct an evolutionary timed tree of the Ca. Hathorviridae, we first constructed a regular tree on members of this family. For this, we selected the TerL genes of all Ca. Hathorviridae from both the complete genomes and the ancient samples. These were used to build a phylogenetic tree with IQTree as described above. From this tree, we selected thirteen complete genomes and all three ancient genome fragments that combined reflected the architecture of the Ca. Hathorviridae. All ORFs from these genomes were then clustered with mmseqs2 v14.7e28471, after which the six protein clusters (PCs) that were universally present were selected. This number was relatively low because the ancient sequences were genome fragments 15,076, 19,664, and 67,881 bp long. The six universal PCs were a TerL, head maturation protease, major head protein, portal protein, a hypothetical protein, and terminase small subunit. For each protein, an alignment was made with MAFFT v4.753 (options --maxiterate, --localpair)83, from which a timed tree was inferred using BEAST v2.7.384. BEAST was run with the following models: birth-death model, coalescent constant population, coalescent exponential population, coalescent Bayesian skyline, and coalescent extended Bayesian skyline, each with both strict and optimized relaxed clock models. Analyses were run for 30 million iterations, at which point all models had converged and effective sample sizes were above 200. The coalescent constant population model with a relaxed clock was the best-fitting model. A full accounting of tree construction with the various evolutionary models is in Supplementary Data 6. BEAST was run with OBAMA v1.1.185 for amino-acid model averaging.
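Selecting protein clusters that are present in every selected genome is a set operation. The Python sketch below illustrates it; the function name and the input layout (a dictionary from cluster identifier to genome and protein pairs) are assumptions and do not reflect the mmseqs2 output format directly.

```python
def universal_clusters(cluster_members, genomes):
    """Return the identifiers of clusters found in every genome.

    cluster_members: dict mapping cluster id to an iterable of
                     (genome_id, protein_id) pairs
    genomes:         iterable of all genome ids that must be represented
    """
    required = set(genomes)
    return sorted(
        cluster_id
        for cluster_id, members in cluster_members.items()
        if required <= {genome for genome, _protein in members}
    )
```

Because the ancient sequences are genome fragments, any cluster missing from a single fragment is excluded, which explains why few clusters pass this filter.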

Statistics & reproducibility

No statistical method was used to predetermine sample size. No data were excluded from the analyses. The experiments were not randomized. The investigators were not blinded to allocation during experiments and outcome assessment.