Extensive gut virome variation and its associations with host and environmental factors in a population-level cohort

Indigenous bacteriophage communities (virome) in the human gut have a huge impact on the structure and function of gut bacterial communities (bacteriome), but virome variation at a population scale is not fully investigated yet. Here, we analyse the gut dsDNA virome in the Japanese 4D cohort of 4198 deeply phenotyped individuals. By assembling metagenomic reads, we discover thousands of high-quality phage genomes including previously uncharacterised phage clades with different bacterial hosts than known major ones. The distribution of host bacteria is a strong determinant for the distribution of phages in the gut, and virome diversity is highly correlated with anti-viral defence mechanisms of the bacteriome, such as CRISPR-Cas and restriction-modification systems. We identify 97 various intrinsic/extrinsic factors that significantly affect the virome structure, including age, sex, lifestyle, and diet, most of which showed consistent associations with both phages and their predicted bacterial hosts. Among the metadata categories, disease and medication have the strongest effects on the virome structure. Overall, these results present a basis to understand the symbiotic communities of bacteria and their viruses in the human gut, which will facilitate the medical and industrial applications of indigenous viruses.


Introduction
Large numbers of diverse bacterial viruses (phages) reside in the human gut, and the indigenous phage community (virome or phageome) greatly affects the structure and function of the bacterial community (bacteriome) by lysing bacterial cells and facilitating horizontal gene transfer [1][2][3][4] . The most dominant members of the gut virome are double-stranded DNA (dsDNA) phages, and their number is estimated to be comparable to that of bacteria 1 .
Recent viral metagenomic studies have identified highly abundant and prevalent clades among human gut dsDNA phages [5][6][7][8][9][10][11] , such as crAss-like phages 6,7 including crAssphage 5 , Lak phage 8 , and Gubaphage 9 /Flandersviridae 10 . The genome sequences of these phages provide insights into their functional potential, unique biology, and ecology in the human gut 7,8 . Other studies targeting the virome structure have revealed high inter-individual diversity 12,13 , differences between populations 14 , and associations with several host factors 15,16 . Furthermore, altered virome structure has been associated with various diseases, such as inflammatory bowel disease, type 2 diabetes, and colorectal cancer [17][18][19] , suggesting a possible role for the altered virome in these diseases. However, these studies were conducted in relatively small-scale cohorts (n < 1,000), or used heterogeneous samples collected from independent studies using different methodologies, constraints limiting the understanding of the virome variation and its associations with various host and environmental factors. Moreover, the DNA amplification step frequently used in the previous studies could preferentially amplify certain types of viruses (e.g. circular single-stranded DNA phages) and systematically bias the viral profile 1 . To date, there have been no population-level studies that have collected and processed faecal samples uniformly without DNA amplification in order to analyse gut virome variations.
In this study, we present a large-scale analysis of viral profiles obtained from whole (bulk) gut metagenomes uniformly collected from 4,198 deeply phenotyped individuals. A newly developed pipeline revealed thousands of high-quality dsDNA phage genomes, including highly abundant but uncharacterised phage clades in the gut. Further comparative analysis of the virome with the bacteriome showed high correlations of their diversities and associations between the defence mechanisms of the bacteriome and virome diversity. Finally, a comprehensive association analysis, with various host and environmental factors, uncovered a large number of intrinsic and extrinsic factors significantly associated with the structure of the virome.

Construction of a phage genome catalogue
We collected faecal samples from 4,198 Table 1) 20 . Intrinsic and extrinsic factors (e.g. age, sex, body mass index [BMI], lifestyle, dietary habits, diseases, and medications) were exhaustively collected from these individuals through self-report questionnaires, face-to-face interviews, and medical records (Supplementary Table 2).
To establish a dsDNA phage genome catalogue using the whole metagenomes, we developed a pipeline in which candidate phage genomes were identified by detecting phage genome-specific signatures, such as the presence of virus hallmark genes and the absence of bacterial essential genes (Methods, Extended Data Fig. 1a, Supplementary Table 3). Since the majority of reads in whole metagenomes derives from non-phage entities, such as bacterial chromosomes and plasmids 21 , we used relatively strict criteria and designed the pipeline to reduce the false positives as much as possible (Methods). Comparison between our pipeline and other virus-detection tools [22][23][24][25][26][27][28] showed that the true positive rate of our pipeline (81.5%) was comparable with other pipelines' (40.2%-99.7%), whereas the false positive rate (0.4%) was considerably lower than others' (2.7%-54.2%) (Extended Data Fig. 1b, c). These results suggest that our pipeline, with the lowest false positive rate, was likely to be appropriate to construct a high-quality phage catalogue with minimum contamination by non-phage sequences.
Applying the pipeline to the 4,198 whole metagenomes, we identified 1,125 complete and 3,584 draft (>70% completeness) phage genomes 29 . Quality assessment by checkV revealed that 2,819 (59.9%), 1,836 (39.0%), and 54 (1.1%) of the genomes were complete/high-quality, medium-quality, and low-quality, respectively (Extended Data Fig.   1d, Supplementary Table 4). To assess to what extent this catalogue covered the dsDNA phages in the human gut, we sequenced virus-like particles (VLPs) in 24 faecal samples (Methods, Supplementary Table 5) and mapped the VLP reads to the catalogue, confirming that more than half of the VLP reads (57.1%), on average, were mapped to the catalogue.
Clustering of the 4,709 phage sequences with >95% sequence similarity generated 1,347 viral operational taxonomic units (vOTUs) 29 (corresponding to the species level) (Fig. was composed of 461 genomes and represented a cluster of crAssphage, the most prevalent and abundant phage in the human gut 5 . Comparative analysis of the vOTUs with known phage genomes in RefSeq and databases recently published 9,15,30,31 revealed that only 0.67%-44.6% of the sequences we found were aligned with known sequences (>95% identity) (Extended Data Fig. 3a-e). 667 of the vOTUs were not aligned to any known genomes in the databases, suggesting that they are novel genomes first identified in this study (Extended Data Fig. 3f).
To clarify the more distant relationship of the vOTUs, we further clustered vOTUs based on the proportion of shared proteins (>20%), providing a total of 223 viral clusters (VCs) corresponding to the family or order level 7,37 (Extended Data Fig. 2b). Rarefaction analysis showed that the number of VCs, but not the number of vOTUs, was saturated at the number of individuals in this study (n = 4,198) (Fig. 1e).

Identification of novel phage clades abundant in the human gut
Owing to the lack of a high-quality genome catalogue of human gut phages until recently, the major clades of phages in the human gut are still largely unknown, other than a few groups, such as crAss-like phages 5,7,8,10,38,39 . To explore the major and abundant phage clades in the human gut, we quantified the abundance of each vOTU/VC by mapping the metagenomic reads of the 4,198 individuals to the catalogue (Methods). On average, 1.8 ± 0.8% (mean ± s.d.) of the whole metagenomic reads were mapped to the catalogue, and 115 ± 40 vOTUs and 47 ± 12 VCs were detected from an individual. The accuracy of viral quantification based on whole metagenomic reads was confirmed by the comparison of viral profiles obtained from the VLPs with those from whole metagenomes, which showed a high correlation between the two profiles (average Spearman correlation [rs] = 0.74 and 0.76 at the vOTU and VC levels, respectively, Extended Data Fig. 4a). Additional cluster analysis showed clusters pairing each VLP with the whole metagenomes for the 24 samples (Extended Data Fig. 4b).
The most abundant VC in this cohort was a crAssphage-containing VC (VC_19), which included crAssphage and crAss-like phages (candidate genera I, III, IV, and IX). Interestingly, we also identified several VCs whose abundance was at the same order of magnitude as VC_19 (Fig. 2a). Particularly, 9 VCs other than VC_19 were highly abundant (reads per kilobase million [RPKM] >1) and prevalent (>100 genomes) across the 4,198 whole gut metagenomes in this cohort. Of these abundant VCs, VC_6 was a group of crAss-like families (candidate genera VII, VIII, and X), and the phages in VC_1 showed similarities to the recently proposed new clade of Gubaphage/Flandersviridae 9,10 . The other seven abundant VCs (VC_2, 24, 12, 15, 3, 44, and 18) did not show any similarity to known phages in RefSeq, suggesting that they were novel phage clades abundant and prevalent in the human gut. They were detected in 24.6%-90.2% of the individuals and were predicted to infect major gut species such as   Table 7).
Phylogenetic analysis based on large terminases, portal proteins, and major capsid proteins revealed that most of the 7 VCs were monophyletic and significantly different from those of the known reference phages in RefSeq (Fig. 2c, Extended Data Fig. 6). In addition, a comparison of genomic similarities among the phages (Methods) revealed that most of the phages in the same VC were clustered together and distinct from the phages in the other VCs ( Fig. 2b).
To investigate whether the VCs newly identified in this study exist as virus particles in the human gut, we explored them in the VLP dataset. We found that all of the 10 abundant VCs were detected in the 24 VLP dataset, and the relative abundance of these VCs was, on average, 4.5-1,115 times more abundant in the VLP metagenomes than in whole metagenomes prepared from the same faecal samples (Extended Data Fig. 7). We named the seven novel VCs according to the names of the cities in which the research institutes of the authors are located (VC_2, "Toyamaviridae"; VC_24, "Konodaiviridae"; VC_12, "Shinjukuviridae"; VC_15, "Okuboviridae"; VC_3, "Tsurumiviridae"; VC_44, "Suehiroviridae"; and VC_18, "Umezonoviridae").

Close interactions between the gut virome and bacteriome
It has been suggested that phages in the human gut largely affect the structure of the bacteriome through bacterial lysis and integration as prophages 1-3 , but the relationships between phages and bacteria in the gut environment are not well characterised. To explore this, we compared viral and bacterial profiles of the 4,198 individuals and found significant positive correlations for α-diversity (Shannon diversity) and β-diversity (Bray-Curtis distance) between them (rs = 0.73 and 0.46, respectively; Fig. 3a and b). We also found that the β-diversity of the virome was significantly higher than that of the bacteriome (Fig. 3b, Extended Data Fig. 7).
To further explore associations, we next examined one-to-one correlations between the relative abundance of each phage and its predicted host at the genus level (Methods). We found a positive correlation with an average rs of 0.18 between them (Fig. 3d), suggesting that phages and host bacterial species co-occur rather than being mutually exclusive in the human gut.
Among the genera examined, Megamonas, Escherichia, Prevotella, and Lactobacillus showed relatively higher correlation with the phages than other genera (Fig. 3c). These high correlations could be explained by the higher proportion of specialist phages that infect these genera as compared to other genera, such as Clostridium, Ruminococcus, and Tyzzerella, which are often infected by generalist phages (Fig. 3c). Indeed, the specialist phages showed a significantly higher correlation with their hosts (rs = 0.27 on average) than the generalist phages (rs = 0.08) (P < 2.2e-16, Fig. 3d). The higher correlation for specialist phages than generalist phages was reproduced with other correlation indexes such as Pearson correlation and Maximal information coefficient, but was not present using Bray-Curtis dissimilarity (Extended Data   Fig. 9a). Phage lifestyle did not affect the phage-host correlation so much, but lytic phages showed slightly but significantly stronger correlation (rs = 0.17) than temperate phages (rs = 0.15) (P = 0.025, Extended Data Fig. 9b).
Among the diverse genes encoded in the human gut bacteriome, the CRISPR-Cas, restriction-modification (RM), and abortive infection (Abi) systems are well-known defence mechanisms of prokaryotic species protecting against mobile genetic elements, including phage infection 40 . To explore associations between such antiviral genes and phages in the community, we quantified these genes and assessed their associations with the virome structure (Methods, Supplementary Table 8). We found that Shannon diversity of the virome was significantly higher in samples with abundant defence mechanisms than in those with low abundances of all three systems (Fig. 3e). Additionally, significant positive associations were also identified for various subtypes in the CRISPR-Cas and RM systems (Fig. 3f). Other genes such as integrase and spore germination proteins were also positively correlated with virome diversity (Supplementary Table 9).

Comprehensive identification of host and environmental factors associated with the virome
To investigate how the gut virome structure is associated with host factors, we conducted an association analysis with age and sex, which are strong determinants of the gut bacteriome structure 41 . Age showed a significant positive correlation with virome diversity (rs = 0.20, Pvalue < 2.2e-16, Fig. 4a), consistent with a positive correlation between age and bacteriome diversity (Extended Data Fig. 8a). At the level of the host bacteria, age had significant positive correlations with Proteobacteria phages and host unknown phages, but a negative correlation with Actinobacteria phages (rs = 0.10, 0.12 and -0.07, respectively, P-values < 0.05, Fig. 4b Fig. 8c, Supplementary Table 10). Sex showed significant associations with 68 vOTUs and 24 VCs (Fig. 4d, Supplementary Tables 10 and 11). Males had significantly higher abundances of phages predicted to infect Prevotella and Megamonas, while females had higher abundances of Faecalibacterium-and Ruminococcus-related phages, possibly reflecting differences in the bacteriome between males and females (Extended Data Fig. 8e).
To further explore the comprehensive relationships, we next performed association analysis between the viral profiles and 232 host/environmental factors exhaustively collected from the 4,198 individuals (Supplementary Table 2). Redundancy analysis showed that these factors explained 0.6% of the total variance in the virome at the vOTU level (Fig. 5a), which was substantially lower than the value that the same factors explained in the gut bacteriome at the species and genus levels (4.9% and 10.0%, respectively; Fig. 5a). The metadata categories associated most strongly with gut virome variation were clinical factors, such as medication and disease (explained variance = 0.5% and 0.3%, respectively). Permutational analysis of variance revealed that 97 of the 232 factors were significantly associated with virome variation (false discovery rate [FDR] <0.05), among which age showed the strongest association with the virome (Fig. 5b, d). The factors significantly associated with the virome included various diseases (e.g. HIV infection, inflammatory bowel disease, past history of gastrointestinal resection), medications (e.g. osmotically acting laxatives, antiviral drugs, and alphaglucosidase inhibitors), and diets (e.g. fruits, dairy products, and milk), which included numerous factors not identified previously 14,42 . The variation explained by each factor was highly correlated between the virome and bacteriome (rs = 0.87, P-value < 2.2e-16, Fig. 5c).
The vOTU of crAssphage (vOTU_974) showed no significant association with any metadata in this cohort, suggesting the presence of still unknown host, environmental, or ecological factors that explain the variation of this most abundant phage in the human gut.
Among the 97 factors with statistical significance, the majority showed strong associations with both the virome and bacteriome (Fig. 5d). By contrast, proton pump inhibitors, which showed the strongest effect on the gut bacteriome, had a relatively moderate effect on the virome (Fig. 5d). A large number of bacterial species in the oral cavity reach the gut following the administration of proton pump inhibitors, but these species are less transcriptionally active 43 and might have fewer interactions with gut phages 44 . Furthermore, antimicrobial drugs, such as cephalosporins, macrolides, and sulphonamides, which had large effects on the bacteriome, also had a moderate effect on the virome (Extended Data Fig. 8f), which might be explained by the absence of phage-originated molecular targets for antimicrobial drugs. Thus, ecological and biological differences between the virome and

Discussion
In the present study, we have performed a large-scale analysis for viral profiles of deeply phenotyped individuals (n = 4,198) and shown extensive virome variation, and its association with the bacteriome and numerous host and environmental factors. This study is, to the best of our knowledge, the largest single cohort analysis for the human gut virome. The analysis uncovered prevalent and abundant viral clades, interactions between the virome and bacterial anti-viral genes, and clinical factors strongly associated with the virome, expanding our knowledge of the human gut virome.
We uncovered prevalent but previously uncharacterized dsDNA phage clades in the human gut (Fig. 2). Although recent large-scale studies have been uncovering numerous phage genomes in the human gut 9,10,15,31 , our result suggests that the human gut phages are still under explored and further efforts are needed to construct a more comprehensive phage catalogue to capture the whole genomic diversity of the gut phages. Some of the novel clades identified in this study could infect Firmicutes and Actinobacteria (VC_2, 15, 3, 44, and 18), which is in contrast to recently proposed large clades, such as crAssphage, crAss-like phage, and Gubaphage/Flandersviridae, all of which could infect Bacteroidetes 6,7,9,10 . Firmicutes is one of the major taxa in the human gut and includes clinically important species 45 . Actinobacteria, such as Bifidobacterium, include species used for probiotics and are quite abundant in the guts of infants 46 . The novel phage clades identified in this study may play important roles in the bacteriome by interacting with these major bacterial species and regulating their population dynamics.
Comparative analysis showed a significant positive correlation between the diversity of virome and bacteriome (Fig. 3a), supporting previous findings 16,47 . Interestingly, at the level of individual phages and hosts, relative abundances were also positively correlated and they co-occurred (Fig. 3c). Since phages kill their hosts, a negative correlation between phages and their hosts would be expected. Actually, previous studies observed negative correlations between them in a time-series dataset using a mouse model 48,49 . At the same time, however, phages have a limited host range and cannot live without their hosts 50,51 (Supplementary Table 5), suggesting that their symbiotic relationship results in a positive correlation at a more global scale (e.g. among different individuals and environments). Our findings of the positive correlation suggest that, at the population level, the distribution of host bacteria is a dominant factor governing the distribution of phages in the human gut. Furthermore, we found that prokaryotic defence systems, such as the CRISPR-Cas, RM, and Abi systems, had significant associations with virome diversity (Fig. 3d). These associations suggest that anti-phage systems are important factors in shaping the virome through regulation of phage infection.
Alternatively, given that numerous gut microbes are infected and lysed by phages 49 , the ability to protect against phage infection may also have a role in determining the structure and diversity of the bacteriome. However, our results based on the cross-sectional design cannot rule out the possibility that the defence systems were simply correlated with other microbiome properties (e.g. other gene functions, microbial density, and spatial structure) that actually cause the increase in virome diversity.
The exploration for factors that shape the gut virome has been conducted only on smaller-scale studies [14][15][16] . Our association analysis using exhaustively collected metadata revealed numerous intrinsic/extrinsic factors significantly associated with the gut virome, which included previously unidentified factors, especially in disease and medication (Fig. 5).
Strong associations with the virome were observed for host age, disease, and medication, in accordance with associations between these factors and the bacteriome (Fig. 4b). Additionally, the predicted hosts of the associated phages were mostly consistent with the microbial species associated with the factor (Fig. 4 and 5, Extended Data Fig. 10), again supporting strong interactions between phages and their bacterial hosts. However, the variation in the virome explained by these metadata was only 0.6%, significantly lower than the variations in the bacteriome (5%-10%) that was explained by factors in this cohort (Fig. 5a) as well as other cohorts 52,53 . This was possibly owing to substantially higher inter-individual variation in the virome compared to that of the bacteriome (Fig. 3b), which might be driven by various viral and bacterial ecological factors, such as the rapid evolutionary rate of phages 54 , strain-level diversity of bacterial species among individuals 55 , or the associated variability of the defence systems 56 . More detailed analysis and understanding of inter-individual diversity of the virome is needed, since its variation has potential impacts on the success or failure of faecal microbiota transplantation and phage therapy 57 .
In summary, our population-level analysis of the human gut virome uncovered its All authors have read and approved the submitted version of the manuscript.

Competing interests
The authors declare no competing interests. Correspondence and requests for material should be addressed to Naoyoshi Nagata (nnagata_ncgm@yahoo.co.jp) or Suguru Nishijima (nishijima.suguru@gmail.com).

Sample collection and metagenomic sequencing
The Japanese 4D (Disease, Drug, Diet, Daily life) microbiome project supplying the metagenomic data used in this study is a prospective multi-centre hospital-based registry in the Tokyo metropolitan area. Participant entry began in January 2015 and is currently ongoing.
We included Japanese individuals with colonoscopic diagnosis: either having undergone colonoscopy within the last 3 years or planning to undergo colonoscopy for cancer screening or for other reasons. Exclusion criteria were as follows: 1) suspected acute infectious disease; 2) acute gastrointestinal bleeding; 3) hearing loss; 4) inability to write or understand written documents; or 5) limited ability to perform activities of daily living. We also avoided collecting samples within 1 month of administering bowel preparation for colonoscopy because it has a profound effect on the gut microbiome and metabolome 58 . The protocol for the Japanese 4D Participants collected faecal samples using a Cary-Blair medium-containing tube 59 at home, and the samples were refrigerated for up to 2 days before the hospital visit. Immediately after participants arrived at the hospital, their faecal samples were frozen at −80 °C until DNA extraction. Health professionals checked that the amount of stool was sufficient for analysis.
In total, 4,198 faecal samples were collected and sequenced as described in detail previously 20 Table 1). To explore the viral profiles of VLPs and whole metagenomes from the same samples, we collected additional faecal samples from 24 individuals in the same manner as described earlier.

Metadata collection
Details for metadata collection were described previously 20  to ensure that there were no omissions or discrepancies with the self-reported data. Electronic medical records were also checked to identify medications used. Drug use was defined as oral or self-injected administration within the previous month. All medications with pharmaceutical brand names were grouped according to the WHO's Anatomical Therapeutic Chemical classification system (4th level) 65 . In total, 232 metadata were assessed and used in this study. Whole metagenomic DNA was also prepared from the same faecal samples (10 to 250 mg faeces) with an enzymatic lysis method as described previously 66 . Libraries were constructed from 100 ng whole metagenomic DNA and sequenced by NovaSeq using the same method as for VLP DNA.

Identification of phages in the metagenomic data
To construct a high-quality double-stranded DNA (dsDNA) phage catalogue with minimum contamination of bacterial chromosome and plasmid sequences, we developed a custom pipeline and applied it to the 4,198 whole gut metagenomes as described below. The metagenomic reads of each individual were assembled into contigs using the MEGAHIT assembler (v1.2.9) 67 . The circularity of the assembled contigs (>10 kb) was assessed using the check_circularity.pl script, included in the sprai assembler package (https://spraidoc.readthedocs.io/en/latest/index.html), by modifying the threshold for terminal redundancy as follows: >97% identity and >130 bp. Encoded genes in the contigs were predicted by MetaGeneMark (3.38) 68 . Assembled contigs were defined as phages if they passed all of the following six criteria. 1. A genome size threshold was applied, and contigs less than 10 Kb were excluded, as typical dsDNA phages have genomes larger than >10 Kb 69 .
2. Viral-specific k-mer patterns were checked by DeepVirFinder 22  First, we screened complete phage genomes from the circular contigs using these six criteria (Extended Data Fig. 1a). To identify phage genomes that were not complete, but were of high or medium quality, we next screened possible phage contigs in the linear contigs. We aligned genes identified in the linear contigs to gene sets obtained from the complete phage genomes identified in this study (n = 1,125) and the IMG/VR database (n = 13,628). The alignment was performed using DIAMOND with the more-sensitive option and e-value < 1E-5 as a threshold.
Contigs were defined as possible phage contigs if >40% of the genes were aligned to genes from a complete phage genome and the size of the contig was >70% and <120% of the complete genome. For these possible phage contigs, the above five criteria were applied, and those that did not pass were excluded. Finally, checkV was used to screen for excess host bacterial genomes and exclude linear contigs defined as low quality or having >10% contamination.
To evaluate the performance of this custom pipeline, we applied the pipeline to Taxonomy annotation of phages was performed with a voting approach described previously 16 with minor modifications. First, the protein sequences of each phage were aligned to viral proteins detected from phage genomes in RefSeq (n = 2,609, in April 2020) using DIAMOND with the more-sensitive option. Then, the best-hit taxonomy of each protein (family levels) was counted, and the most common taxonomy was assigned to the phage if >20% of proteins in the phage were aligned to the same taxonomy.
Phage lifestyles (i.e. lytic or temperate) were predicted by BACPHLIP 79 and alignments to reference bacterial genomes in the RefSeq. Phages were defined as temperate if the BACPHLIP score was >0.8 or the phage genome was aligned to any reference genomes with >1,000 bp alignment length with >95% identity.

Host prediction
Bacterial and archaeal genomes were downloaded from the RefSeq database (in April 2019).
To reduce the redundancy of genomes from closely related strains in the same species (e.g. Escherichia coli), 10 genomes were selected randomly for species with more than 10 genomes, and other genomes were excluded from the dataset. The reference dataset consisted of 33,215 bacterial and 822 archaeal genomes.
Host prediction of the identified phages was performed using CRISPR spacers 80 .
CRISPR spacers were predicted from the reference microbial genomes and assembled contigs (>10,000 bp) from the 4,198 metagenomic datasets using PILER-CR (1.06) 81 . Short (<25 bp) or long (>100 bp) spacers were discarded. In total, 679,323 and 283,619 spacers were identified from the reference microbial genomes and assembled contigs, respectively. Taxonomy information was assigned to the assembled contigs if they were aligned to the microbial reference genomes with >90% identity and >70% length coverage thresholds using MiniMap2 82 . The CRISPR spacers were mapped to the phage genomes using BLASTN with the option for short sequences: -a20 -m9 -e1 -G10 -E2 -q1 -W7 -F F 80

Quantification of viral abundance and analysis of the virome profile
To quantify the viral abundances in each sample, metagenomic reads were mapped to VHGs (Supplementary Table 3

Phylogenetic analysis of novel VCs
To construct phylogenetic trees for the vOTUs and reference genomes, protein sequences of large terminases, portal proteins, and major capsid proteins (Supplementary Table 3 KOs involved in prokaryotic defence mechanisms, such as CRISPR-Cas and RM (Supplementary Table 8) 56 , were collected, and their total relative abundance in each system was calculated. Since functional annotation for the Abi system is not included in the KEGG database, we collected genes with the eggNOG annotations 'abortive phage infection' and 'abortive phage resistance' and calculated total abundance. The 4,198 individuals were classified into three groups (high, middle, and low) based on tertiles of total abundance, and Shannon diversity of the virome was compared among the three groups by the Wilcoxon ranksum test.

Phage-host correlation analysis
To explore the phage-host association in the community, Spearman correlations between relative abundances of vOTUs and microbial species at the genus level were evaluated. If the vOTU was predicted to infect more than one genus (i.e. generalist phage), the correlation was calculated for every predicted host. If a phage-host pair was absent (0 abundance) in a sample, the sample was excluded from the correlation analysis. vOTUs with average relative abundance > 0.01% (n = 865) and genera with average relative abundance > 0.5% (n = 32) were included in the analysis.

Analysis of VLPs and whole metagenomes from 24 faecal samples
Quality filtering of sequenced reads from the 24 VLPs and whole metagenomes was performed using fastp (version 0.20.1) 92 with the default parameters. Contamination with human (hg38) or phiX genomes was excluded by mapping the reads to the genomes using Bowtie2.
To exclude bacterial DNA contamination in the VLP dataset, we performed further filtering. First, the VLP reads were assembled into contigs using MEGAHIT. Contigs were defined as viral contigs if they were predicted as viruses by DeepVirFinder (p-value <0.05) and did not encode rRNA and marker genes checked by barrnap and fetchMG, respectively.
Then, VLP reads were mapped to the viral contigs using Bowtie2, and those not mapped to the contigs were excluded from the VLP dataset. Viral profiles at the vOTU and VC levels for the de-contaminated VLP and whole metagenomic datasets were obtained with the same methodology for the 4,198-subject metagenomic dataset described earlier.

Association analysis between the virome and various host/environmental factors
The association between each vOTU/VC and age/sex was assessed by multivariate regression analysis considering the effects of other covariates 20 . Briefly, the relative abundance of each vOTU/VC was log10-transformed, and single linear-regression analysis was performed using the transformed abundance as a response variable and metadata as an explanatory variable.
This single linear-regression analysis was performed for age, sex, and other metadata (n = 230, Table 2), and metadata significantly associated with the vOTU/VC were determined (FDR <0.05) by taking into account the total number of single regression analyses (number of vOTU/VCs multiplied by number of metadata). For vOTU/VCs that showed significant association with age/sex, multiple regression analysis was performed using the transformed abundance of vOTUs/VCs as a response variable and all metadata that showed significant associations in the single regression analysis as explanatory variables. To exclude confounding factors, stepwise variable selection was performed based on Akaike's information criterion with the step function. Age/sex was defined as significantly associated with the vOTU/VCs if they remained in the model with a p-value <0.05. All regression models were constructed using the glm2 function in the glm2 package. In total, 390 vOTUs and 112 VCs, whose average relative abundances in the 4,198 metagenomic dataset were >0.05% and >0.1%, respectively, were included in the analysis. For visualization, individuals younger than 20 years (n = 2) and older than 80 years (n = 6) were excluded due to the low numbers of such individuals ( Fig. 5a and b).

Supplementary
The total variance of the virome and bacteriome (relative abundance data) explained by each metadata category was assessed with stepwise redundancy analysis using the ordriR2step function in the vegan package 93 . To investigate the associations between the virome/bacteriome and each single metadata item, permutational analysis of variance was performed using the adonis function in the vegan package based on the Bray-Curtis distance with 10,000 permutations. P-values were corrected for multiple comparisons by the Benjamini-Hochberg method 94 .

Data availability
Sequence statistics of the 4,198 individuals and cohort-level summaries of the metadata are available in Supplementary Table 1 4,198). Shadows of the lines represent 95% confidence intervals. f, Number of vOTUs, genome size, taxonomy, the ratio of specialist and generalist phages, and the ratio of lytic and lysogenic phages for each genus. If a vOTU was predicted to infect more than one genus, it was distributed to all the predicted genera. Thirty-three genera predicted to be infected with at least 5 vOTUs are shown in the figure.

Abundance (RPKM)
Extended Data Fig. 4 | Consistencies of the viral profiles obtained from VLPs and whole metagenomes a, Boxplot showing Spearman correlations of viral profiles between VLPs and whole metagenomes from the same fecal sample (n = 24). b, Heatmap representing viral profiles for the samples at the VC level. Red and blue colors depict high and low abundance of the VC (log 10 -transformed), respectively. Hierarchical clustering was created using the ward.D2 method based on correlations as distance.

Extended Data Fig. 5 | Genome structures of the seven novel major VCs
Genome structures of the seven newly identified VCs. For each VC, the representative genome of the vOTU with the highest number of contigs was selected. From inner to outer, the circles represent GC skew (green for positive and orange for negative), GC content, hit to CRISPR spacer, strand of encoded genes (blue and yellow for positive and negative strand, respectively), and functional category of the genes annotated by the eggNOG mapper. Annotation of the VHGs is shown on the genes.

Extended Data Fig. 6 | Phylogenetic relationship of the abundant VCs
Phylogenetic tree of the 10 VCs as well as reference phages in the RefSeq database based on portal (a) and major capsid proteins (b). Circles on the edges show vOTUs belonging to the VC, and edges without circles represent reference genomes in the RefSeq. Numbers in parentheses indicate candidate genera of crAss-like phages. Portal and major capsid proteins were not detected in VC_44, so the VC is not included among the trees.