A new genomic blueprint of the human gut microbiota

The composition of the human gut microbiota is linked to health and disease, but knowledge of individual microbial species is needed to decipher their biological roles. Despite extensive culturing and sequencing efforts, the complete bacterial repertoire of the human gut microbiota remains undefined. Here we identify 1,952 uncultured candidate bacterial species by reconstructing 92,143 metagenome-assembled genomes from 11,850 human gut microbiomes. These uncultured genomes substantially expand the known species repertoire of the collective human gut microbiota, with a 281% increase in phylogenetic diversity. Although the newly identified species are less prevalent in well-studied populations compared to reference isolate genomes, they improve classification of understudied African and South American samples by more than 200%. These candidate species encode hundreds of newly identified biosynthetic gene clusters and possess a distinctive functional capacity that might explain their elusive nature. Our work expands the known diversity of uncultured gut bacteria, which provides unprecedented resolution for taxonomic and functional characterization of the intestinal microbiota.

The composition of the human gut microbiota is linked to health and disease, but knowledge of individual microbial species is needed to decipher their biological roles. Despite extensive culturing and sequencing efforts, the complete bacterial repertoire of the human gut microbiota remains undefined. Here we identify 1,952 uncultured candidate bacterial species by reconstructing 92,143 metagenome-assembled genomes from 11,850 human gut microbiomes. These uncultured genomes substantially expand the known species repertoire of the collective human gut microbiota, with a 281% increase in phylogenetic diversity. Although the newly identified species are less prevalent in well-studied populations compared to reference isolate genomes, they improve classification of understudied African and South American samples by more than 200%. These candidate species encode hundreds of newly identified biosynthetic gene clusters and possess a distinctive functional capacity that might explain their elusive nature. Our work expands the known diversity of uncultured gut bacteria, which provides unprecedented resolution for taxonomic and functional characterization of the intestinal microbiota.
For the past decade, studies of the human gut microbiota have shown that the interplay between microbes and host is associated with various phenotypes of medical importance 1,2 . Shotgun metagenomic analysis methods can infer both taxonomic and functional information from complex microbial communities, guiding phenotypic studies aimed at understanding their potential roles in human health and disease. However, various strategies used for analysis of metagenomic datasets rely on high-quality reference databases 3 . This highlights the need for extensive and well-characterized collections of reference genomes, such as those from the Human Microbiome Project (HMP) 4,5 and the Human Gastrointestinal Bacteria Genome Collection (HGG) [6][7][8] . Despite a new wave of culturing efforts, there is still a substantial but undetermined degree of unclassified microbial diversity within the gut ecosystem 6,[8][9][10][11] . Whereas these unknown community members may have eluded current culturing strategies for a variety of reasons (for example, owing to lack of nutrients in growth media or their low abundance in the gut), they are likely to perform important biological roles that remain undiscovered. Thus, having access to a comprehensive catalogue of representative genomes and isolates from the intestinal microbiota is essential to gain new mechanistic insights.
Culture-independent and reference-free approaches have proved to be successful strategies for species discovery and characterization [12][13][14][15][16] . The most common approach is to perform de novo assembly of shotgun metagenomic reads into contig sequences and place them into different bins on the basis of sequence coverage and tetranucleotide frequency 15 -a process that enables the recovery of potential genomes, termed metagenome-assembled genomes (MAGs). Several studies have applied these methods to reconstruct large numbers of MAGs 13,[17][18][19] , one of the most prominent being the recovery of thousands of genomes revealing new insights into the tree of life 16 .
Here we generated and classified a set of 92,143 MAGs from 11,850 human gut metagenome assemblies to expand our understanding of gut-associated microbiome diversity. We discovered 1,952 uncultured bacterial species and investigated their association with specific geographical backgrounds, as well as their unique functional capacity. This enabled new insights into which species and functions within this uncharacterized bacterial community might have underappreciated roles in the human gut environment.

Large-scale discovery of uncultured species
To perform a comprehensive characterization of the human gastrointestinal microbiota, we retrieved 13,133 human gut metagenomic datasets from 75 different studies (Supplementary Table 1 and Extended Data Fig. 1). Samples were collected mainly from North America (n = 6,869, 52%) or Europe (n = 4,716, 36%), reflecting a geographical bias in current human gut microbiome studies. The majority of datasets with available metadata were from diseased patients (n = 4,323, 33%) and adults (n = 3,053, 23%).
Following assembly with SPAdes 20,21 , 11,850 of the 13,133 metagenome assemblies produced contigs that could undergo genomic binning by MetaBAT 15 , generating a total of 242,836 bins. The quality of each bin was evaluated with CheckM 22 according to the level of genome completeness and contamination (Extended Data Fig. 2). On the basis of these metrics, 40,029 MAGs with more than 90% completeness and less than 5% contamination were obtained (hereafter referred to as 'near-complete' 16 ). We also generated 65,671 medium-quality 23 MAGs (at least 50% completeness and less than 10% contamination), 52,347 of which had a quality score 16 (QS) above 50 (defined as completeness -(5 × contamination)). The robustness of our MAGs was evaluated with two independent assembly/binning methodologies 24,25 (see Supplementary Discussion and Extended Data Fig. 3), which showed the MAGs to be highly reproducible, independent of the method used for assembly or binning.
As CheckM is unable to evaluate non-prokaryotic genomes, we investigated separately how many of our bins represented known eukaryotes or viral sequences (see Supplementary Discussion and Supplementary  Table 2). However, for the main set of analyses, we focused on the 39,891 near-complete MAGs that CheckM resolved to bacterial lineages

Species characterization and distribution
Having identified 2,068 MGS in the human gut, we sought to determine their taxonomic classification and extend the analysis to more comprehensive reference databases. By complementing the phylogenetic inference method of CheckM with protein searches against the UniProt Knowledgebase (UniProtKB) 29 , we attempted to assign the most likely taxonomic lineage to each MGS. This approach, which utilizes both multiple marker genes and protein-level matches, is similar to those used by various analysis tools [30][31][32] and provides a more reliable method for taxonomic assignment compared to traditional single-marker gene classifications (for example, based on the 16S rRNA gene). Using a species-level threshold 26,33 (at least 60% of the proteins with at least 96% amino acid identity), we found that 94% of the MGS (n = 1,952) did not match any isolate genome within UniProtKB, and therefore represent uncultured candidate species. Of these 1,952 unclassified MGS (UMGS), 74% correspond to entirely 'novel' genomes as of August 2018 (see Supplementary Discussion and Supplementary Table 4). We were able to assign 98% and 94% of the UMGS at the phylum and class levels, respectively, and 91% to a known order (Fig. 2a). Interestingly, 26% of the UMGS were unassigned at the family level, while almost half (40%) could not be classified to a known genus, meaning that a substantial portion of the UMGS may belong to new families and/or genera. The three most frequently assigned families were Coriobacteriaceae (20.6%), Ruminococcaceae (9.9%) and Peptostreptococcaceae (7.4%), whereas the top genera were Collinsella (17.7%), Clostridium (7.3%) and Prevotella (4.4%). These data suggest that despite being known colonizers of the intestinal microbiota, these clades still contain considerable uncultured diversity. The Clostridium genus has been acknowledged as highly polyphyletic, with recent phylogenetic estimates suggesting that this group may span 121 genera belonging to 29 families 34 . Therefore, the detection of many uncultured species assigned to this genus may reflect current taxonomic limitations rather than a biological signal.
In order to determine the prevalence and abundance of the uncultured candidate species within each gut microbiome, we compared the raw reads from the original 13,133 metagenomic datasets to the UMGS collection. Prevalence was estimated by how many samples each genome was found in by taking into account the level of genome coverage, mean read depth and evenness (Extended Data Fig. 7). Half of the UMGS were found in at least 12 metagenomic samples (Extended Data Fig. 7c). The most frequently observed UMGS belong to the family Ruminococcaceae and the Faecalibacterium genus, and include mostly members from the Clostridia class (Fig. 2b).
To place these uncultured species in context with the known bacterial colonizers of the human gut, we then positioned the UMGS within the gut-specific species from the HR database, hereafter referred to as the human gut reference (HGR). A maximum-likelihood phylogeny of the 1,952 UMGS and the 553 HGR genomes was built on the basis of the 40 marker genes extracted with specI 32 (Fig. 3a). Phylogenetic analysis showed that the UMGS genomes expand the known diversity of the human gut bacterial lineages by 281%, on the basis of total branch lengths, with the largest increase within the Firmicutes phylum (Fig. 3b)  Blue and red dots in the second layer denote genomes that were found in at least one sample from all six continents analysed (Africa, Asia, Europe, North America, South America and Oceania), or exclusively detected in non-European, non-North American samples, respectively. Green bars in the outermost layer represent the prevalence of the genome among the 13,133 metagenomic datasets. b, Level of increase in phylogenetic diversity provided by the UMGS, relative to the complete diversity per phylum (left) and represented as absolute total branch lengths (right). The number of HGR and UMGS genomes assigned to each phylum is depicted in brackets (HGR/UMGS).

Article reSeArcH
to correspond to rarer or more difficult-to-culture clades from the human gut, as none had a representative isolate genome in the HGR database.
Subsequently, we correlated the prevalence and abundance of each UMGS and HGR genome with the geographical origin of the sample to infer any associations (Fig. 4). We investigated how many samples from the different continents each species was found at a relative abundance of more than 0.01% (Fig. 4a). In the majority of the sampled populations, the UMGS were less prevalent than the HGR genomes, a possible indication of why they have not been detected in previous genomic studies. However, the UMGS were more frequent, compared to the HGR genomes, among understudied samples from Africa and South America with non-Western lifestyles (Fig. 4a). This was particularly evident for a subset of 75 and 120 UMGS that were present at an abundance of more than 0.01% in more than 20% of the samples from Africa and South America, respectively (Fig. 4b). This was only the case for 6 and 16 HGR genomes, respectively, suggesting that some of our newly identified UMGS better represent the gut diversity present in the small number of samples from these two underrepresented populations.
To further evaluate the improvements provided by the UMGS for classification of the full metagenomic datasets, we assessed the percentage of reads that we were able to assign to HR, RefSeq and our UMGS dataset. With all the available genomes (HR, RefSeq, plus all UMGS), we observed a median classification of 72.8% (IQR = 65-81.1%). This represents an improvement of 23% over the use of a database comprising just HR, and of 17% over a combined set with HR and RefSeq. As the UMGS collection comprises over three times the number of gut species present in the HR database, this modest increase again suggests that the majority of these uncultured organisms are present at a lower abundance in most samples, compared to the gut isolate genomes.
After partitioning the data according to geographical origin, the small number of datasets from Africa (n = 21) and South America (n = 36) saw an improvement in read assignment of 215% and 278%, respectively (Fig. 4c). This confirms that some UMGS are much more abundant in these specific gut communities. In order to deduce how much diversity might remain undetected, we built an accumulation curve based on the number of UMGS retrieved as a function of the number of samples obtained from each continent (Fig. 4d). European and North American populations showed the greatest coverage, trending towards a saturation point. Conversely, in samples outside North America and Europe, new uncultured species are still detected at a consistent rate. These results underscore the importance of sampling underrepresented regions to continue to uncover the global diversity of the human gut microbiota.

A distinctive functional repertoire
With access to 2,505 human gut species (1,952 UMGS and 553 HGR), we performed a comprehensive and in-depth functional characterization of the collective gut bacterial population. Using antiSMASH 35 , we screened for the presence of secondary metabolite biosynthetic gene clusters (BGCs) encoded within both the UMGS and HGR (Supplementary Table 5). We detected over 200 BGCs coding for sactipeptides, nonribosomal peptide synthetases (NRPSs) and bacteriocins (Extended Data Fig. 8a). Notably, 85% and 70% of the total BGCs detected in the UMGS and the HGR, respectively, represented novel clusters (that is, without a positive match in the Minimum Information about a Biosynthetic Gene (MIBiG) cluster database; Extended Data Fig. 8b). This suggests the potential presence of many undiscovered natural compounds produced by the intestinal microbiota with possible antimicrobial and/or biotechnological applications for future study.
We next applied complementary approaches to identify the most distinguishing traits between the UMGS and HGR genomes. First, from the predicted protein-coding sequences, we used InterProScan 36 to generate annotations that were translated to 1,199 Genome Properties 37,38 (GPs) and 115 metagenomics Gene Ontology 39,40 (GO) slim terms-a summarized classification of GO annotations from metagenomic data 41 . Each GP-a functional attribute predicted to be encoded in a genome-was determined to be present, partially present or absent, depending on the number of proteins that were detected to be involved in that property. In parallel, we used GhostKOALA 42 to generate KEGG Orthology (KO) annotations to track the differential abundance of specific functional categories across the UMGS and HGR sets. Globally, by analysing the repertoire of GPs according to the taxonomic composition, we observed a good separation by phylum (ANOSIM R = 0.42, P < 0.001), with the Bacteroidetes and Proteobacteria taxa in particular displaying very distinctive functional profiles (Fig. 5a). We further investigated the separation between the UMGS and HGR genomes within each phylum, which revealed a strong differentiation among Actinobacteria, Firmicutes, Proteobacteria and Tenericutes (ANOSIM R ≥ 0.30, Extended Data Fig. 9a). In particular, we detected 182, 207, 115 and 68 GPs particularly enriched in the UMGS genomes from Actinobacteria, Firmicutes, Proteobacteria and Tenericutes, respectively (χ 2 test, adjusted P < 0.05), with only eight functions enriched within the Bacteroidetes group. Properties involved in iron metabolism and transport were among the 21 functions consistently enriched in the UMGS across these four most distinctive phyla (Extended Data  Table 1).
Subsequently, by assessing the frequency of the GO and KO annotations, we were able to apply a quantitative approach to compare the HGR and UMGS functional repertoires. In general, KEGG pathways involved in carbohydrate metabolism were the most differentially abundant between the UMGS and HGR genomes, indicating distinct metabolic affinities between the cultured and uncultured species (Extended Data Fig. 9b). In the case of GO terms, less abundant genes Article reSeArcH (Wilcoxon rank-sum test, adjusted P < 0.05) within the UMGS were particularly associated with antioxidant and redox functions (Fig. 5b), indicative of lower tolerance to reactive oxygen species. If the UMGS correspond to strict anaerobes more sensitive to ambient oxygen, they are likely to be more difficult to isolate and culture. Conversely, in accordance with the GP results, we also observed an enrichment of genes coding for iron-sulfur and ion binding among the UMGS genomes, in addition to a variety of other functions. In anoxic conditions, the ferrous form of iron (Fe 2+ ) that favours both sulfur and nitrogen ligands is most abundant 43 . An enrichment of iron-sulfur binding genes again suggests the UMGS may be better adapted to specific niches of the gastrointestinal tract with particularly low oxygen tension or high iron concentration, both of which generate high levels of ferrous ions in their environment 43 . Overall, these data show that the uncultured species described here carry specific functions that could explain their elusive nature, while raising awareness of biological traits underrepresented in current reference genome collections derived from pure bacterial cultures.

Discussion
The human gut microbiota is one of the most studied microbial environments, but technical and practical constraints hinder our ability to isolate and sequence every constituent species. Metagenomic methods provide access to the uncultured microbial diversity, and here we have used these approaches to uncover 1,952 uncultured candidate bacterial species. Almost half of these putative species could not be classified at the genus level, suggesting that a substantial degree of bacterial diversity remains uncultured. This resource further expands and complements a recent study investigating the unexplored diversity of body-wide human microbiomes 44 .
As a result of our work, we now have representative genomes of 92,143 MAGs reconstructed from human gut assemblies and are able to classify 73% of the underlying read data. Nevertheless, both culturing and de novo analysis methods are inherently biased towards the most abundant organisms, meaning consistently less abundant species may still be missed. Furthermore, geographical regions such as Africa and South America are severely underrepresented in current studies. Therefore, expanding this analysis to large cohorts worldwide will be imperative for obtaining a complete overview of the human intestinal microbiota landscape. In addition, our work focused mainly on the study of bacterial genomes owing to the availability of more comprehensive reference databases and well-established standards and tools. However, as also shown here, metagenome assemblies generated from the gut microbiota include a wide range of other organisms such as archaea, eukaryotes and viruses that warrant a more thorough investigation.
Having access to comprehensive collections of bacterial genomes provides the ability to perform precise and computationally efficient reference-based genome analysis to achieve a detailed classification of microbial ecosystem composition. Our research is aimed at generating Article reSeArcH high-quality reference genomes, from pure cultures to MAGs, which will serve as a blueprint for metagenomic analysis of the human microbiota. The ability to leverage almost 2,000 additional species in future association and mechanistic studies will bring unprecedented power to investigate the impact of the microbiota in human health and disease.

Online content
Any methods, additional references, Nature Research reporting summaries, source data, statements of data availability and associated accession codes are available at https://doi.org/10.1038/s41586-019-0965-1.

MEthOdS
No statistical methods were used to predetermine sample size. The experiments were not randomized. The investigators were not blinded to allocation during experiments and outcome assessment. Metagenomic datasets. We extracted 13,133 sequencing runs classified as human gut metagenomes in the European Nucleotide Archive (ENA), encompassing 75 different studies (Supplementary Table 1). Metadata (location, age, health status and antibiotic usage) for each individual sampled was retrieved through the ENA API with the mg-toolkit (https://pypi.org/project/mg-toolkit/) and further curated by inspecting the publications linked to each project when available. Samples were classified as having been obtained from healthy individuals only if explicitly stated in their original study. De novo assembly and binning. Raw reads from each run were first assembled with SPAdes v.3.10.0 20 with option --meta 21 . Thereafter, MetaBAT 2 15 (v.2.12.1) was used to bin the assemblies using a minimum contig length threshold of 2,000 bp (option --minContig 2000) and default parameters. Depth of coverage required for the binning was inferred by mapping the raw reads back to their assemblies with BWA-MEM v.0.7.16 45  and 23S rRNAs. Total alignment length was inferred by the sum of all non-overlapping hits. Each gene was considered present if more than 80% of the expected sequence length was contained in the MAG. Transfer RNAs (tRNAs) were identified with tRNAscan-s.e. v.2.0 49 using the bacterial tRNA model (option -B) and default parameters. Classification into high-and medium-quality MAGs was based on the criteria defined by the minimum information about a metagenomeassembled genome (MIMAG) standards 23 (high: >90% completeness and <5% contamination, presence of 5S, 16S and 23S rRNA genes, and at least 18 tRNAs; medium: ≥ 50% completeness and <10% contamination). Given that only 240 of the MAGs with >90% completeness and <5% contamination passed the MIMAG thresholds regarding the presence of rRNA and tRNA genes due to known issues relating to the assembly of rRNA regions 16,50 , we refer to our highest quality MAGs as 'near complete' 16 instead. VirFinder v.1.1 51 was used to predict the presence of viral contigs within the 13,133 human gut assemblies generated with SPAdes. This tool uses a k-mer-based, machine-learning approach to detect distinguishing signatures between virus and host (prokaryotic) sequences. Expected P values for the presence of viral sequences were calculated for each contig with ≥5 kb length and subsequently corrected for multiple testing using the Benjamini-Hochberg method with a FDR threshold of 10%. Assignment of MAGs to reference databases. Four reference databases were used to classify the set of MAGs recovered from the human gut assemblies: HR, RefSeq, GenBank and a collection of MAGs from public datasets. HR comprised a total of 2,468 high-quality genomes (>90% completeness, <5% contamination) retrieved from both the HMP catalogue (https://www.hmpdacc.org/catalog/) and the HGG 8 . From the RefSeq database, we used all the complete bacterial genomes available (n = 8,778) as of January 2018. In the case of GenBank, a total of 153,359 bacterial and 4,053 eukaryotic genomes (3,456 fungal and 597 protozoan genomes) deposited as of August 2018 were considered. Lastly, we surveyed 18,227 MAGs from the largest datasets publicly available as of August 2018 13,[16][17][18][19] , including those deposited in the Integrated Microbial Genomes and Microbiomes (IMG/M) database 52 . For each database, the function 'mash sketch' from Mash v.2.0 53 was used to convert the reference genomes into a MinHash sketch with default k-mer and sketch sizes. Then, the Mash distance between each MAG and the set of references was calculated with 'mash dist' to find the best match (that is, the reference genome with the lowest Mash distance). Subsequently, each MAG and its closest relative were aligned with dnadiff v.1.3 from MUMmer 3.23 54 to compare each pair of genomes with regard to the fraction of the MAG aligned (aligned query, AQ) and ANI. Genome dereplication. To dereplicate the collection of unclassified bacterial MAGs (AQ <60% or ANI <95% against the target references), high-level similarity clusters were first generated with Mash 53 . In brief, a MinHash sketch was created for these genomes to perform an all-against-all comparison. Then, a hierarchical clustering was built from the Mash distance relationships and individual clusters were defined at a cut-off of 0.2. Each cluster was subsequently dereplicated with dRep v.2.2.2 55 to extract the MAGs displaying the best quality and representing individual metagenomic species (MGS). dRep was run with options -pa 0.9 (primary cluster at 90%), -sa 0.95 (secondary cluster at 95%), -cm larger (coverage method: larger), -con 5 (contamination threshold of 5%). For the near-complete MAGs, the -nc parameter was set to 0.60 (coverage threshold of 60%), whereas for the medium-quality MAGs with a QS >50 this was changed to 0.30 (coverage threshold of 30%). The 2,468 HR genomes were also dereplicated into 956 representative species with dRep, using the criteria defined above for the near-complete MAGs. These included 553 species collected specifically from the human gut, referred to as HGR. Phylogenetic and taxonomic analyses. Genes were predicted using prodigal v.2.6.3 56 (default single mode) and 40 universal core marker genes from each genome were extracted using specI v.1.0 32 . Phylogenetic trees were built by concatenating and aligning the marker genes with MUSCLE v.3.8.31. Marker genes absent only from specific genomes were kept in the alignment as missing data. Maximum-likelihood trees were constructed using RAxML v.8.1.15 57 with option -m PROTGAMMAAUTO. All phylogenetic trees were visualized in iTOL 58 . Phylogenetic diversity was quantified by the sum of branch lengths using the phytools R package 59 .
Taxonomic classification of each MGS was performed with both CheckM and UniProtKB 29 . First, the function tree_qa from CheckM was used to infer the approximate phylogenetic placement of the MGS genome within the CheckM internal reference tree (which comprised 2,052 finished and 3,604 draft genomes). Those classified at least at the class rank were then compared with the taxonomic assignment deduced from protein alignments against UniProtKB (release 2018_04) using the blastp function of DIAMOND v.0.9.17.118 60 . A positive hit at the species level was inferred if ≥60% of the proteins had ≥80% of the sequence aligned with an amino acid identity of ≥96%, based on previously reported thresholds 26,33 . Genomes within UniProtKB were presumed to represent cultured species if labelled with a full species name lacking any of the following terms: uncultured, sp. or bacterium. For those MGS without an assigned species (UMGS), a genuslevel boundary was set with the following criteria, as previously defined 61 : at least 50% of the proteins with an e value less than 1 × 10 −5 , a sequence identity of more than 40% and a query coverage above 50%. In case the taxon predicted with UniProt was missing from the CheckM reference database, the full lineage was manually inspected to determine the most likely annotation. Owing to possible mislabelling of the UniProt entries, the CheckM taxonomic lineage was kept if there were incongruences between both classifications. Lastly, the positioning of the UMGS genomes within the HGR phylogenetic tree was used to resolve further inconsistencies or misclassifications. Technical reproducibility and cluster quality. A random subset of 1,000 metagenomes (Supplementary Table 1) was tested with two additional approaches to assess the reproducibility of the MAGs generated here. With one of the methods, metagenomes were assembled with MEGAHIT v.1.1.3 24 and subsequently binned with MetaBAT 2, MetaBAT 1 and MaxBin v.2.2.4 62 . A refinement step was then performed using the bin_refinement module from MetaWRAP v.1.0 25 to combine and improve the results generated by the three binners. The second method involved a modified co-assembly approach, in which individual assemblies from the same study were first merged and dereplicated with CD-HIT v.4.7 63 (cd-hit-est with option -c 0.99 defining a sequence identity threshold of 99%). Metagenomic datasets were then mapped to their merged, non-redundant assembly with BWA-MEM to obtain co-abundance information for binning with MetaBAT 2 (with option --minContig 2000). The resulting MAGs with a QS >50 obtained with each method were compared to the MAGs recovered with our main pipeline (individual assembly with SPAdes, plus binning with MetaBAT 2) for the same 1,000 datasets, using the combined Mash and MUMmer workflow described above.
To further assess the level of potential contamination of the MGS reported, we analysed the quality of the Mash clusters containing each MGS using the Matthews Correlation Coefficient (MCC). First, CompareM v.0.0.23 (https://github.com/ dparks1134/CompareM) was used to analyse the average amino acid identity (AAI) of the specI marker genes within and between Mash clusters. To be able to estimate the MCCs, true positives, false negatives, false positives and true negatives were determined based on three different AAI thresholds: 90%, 95% and 97%. For each pairwise comparison, we considered a true positive when both MAGs belonged to the same cluster and had an AAI equal to or above the threshold; false negatives if they belonged to the same cluster, but the AAI was below the threshold; false positives when the genomes were included in different clusters, but their AAI was equal to or above the threshold; and true negatives corresponded to genomes from different clusters with an AAI below the threshold. Thereafter, MCCs were calculated with the mcc function from the mltools 64 R package. Possible values range from −1 to 1, with 1 indicating perfect agreement between the Mash clustering and the marker genes AAI. Functional characterization. Functional prediction analyses were carried out for the 1,952 UMGS and the dereplicated set of 553 HGR genomes. Predicted genes were first functionally characterized with InterProScan v.5.27-66.0 36 with options -goterms and -pa. The presence of microbial BGCs was inferred with antiSMASH 4 35 , using option --knowclusterblast to determine the number of BGCs that matched the MIBiG repository. GO 39,40 annotations were deduced for each gene based on the InterPro (IPR) entries, and translated to GPs 37,38 using the assign_genome_properties.pl script present in http://github.com/ebi-pf-team/ genome-properties. GhostKOALA 42 was used to generate KO annotations of the protein-coding sequences. Differential abundance analysis of GO slim and KO term frequencies between the UMGS and HGR genomes was performed with the compositional data analysis tool ALDEx2 65 . Because we were evaluating genomes with differing lengths and degrees of completeness, this method was used to take into account discrepancies in total gene counts. The aldex.clr function was used with 128 Monte Carlo instances sampled from a Dirichlet distribution to generate a distribution of probabilities for each GO slim/KO term consistent with the observed data. These were subsequently converted to distributions of log ratios to account for the compositional nature of the data. The aldex.effect function was used to calculate the expected value of the difference between distributions of each group (median log 2 difference), the expected value of the pooled group variance (median log 2 dispersion) and the standardized effect sizes on the abundance difference of each GO/KO classification. The effect-size measure used is similar in concept to Cohen's d but is calculated on the distributions themselves rather than on the summary statistics of those distributions, resulting in metrics that are relatively robust and efficient 66 . Lastly, the aldex.ttest was used to perform non-parametric Wilcoxon rank-sum tests on the GO/KO frequencies between the two test groups (UMGS and HGR). GPs, classified as 'yes' , 'no' and 'partial' were converted to 2, 0 and 1, respectively, and those more prevalent specifically among the UMGS genomes were detected with a two-tailed χ 2 test. The expected P values from all the statistical tests were corrected for multiple testing with the Benjamini-Hochberg method. A PCA was carried out on the GP distributions of the HGR and UMGS genomes, using the FactorMineR 67 package. Separation according to phylum and genome type was assessed with the ANOSIM test based on the Gower distances between the GP profiles. Species prevalence and abundance. Read classification of the 13,133 human gut metagenomic datasets was performed with sourmash v.2.0.0a4 68 against the HR, RefSeq and UMGS genome collections. Signature files were generated for both the reference (FASTA) and query (FASTQ) files, with 'sourmash compute --scaled 1000 -k 31 --track-abundance' . For each set of references, a lowest common ancestor database was created ('sourmash lca index --scaled 1000 -k 31'), with each genome representing a unique species lineage. Raw reads were then compared with 'sourmash lca gather' against each database. Species prevalence and abundance was determined with BWA-MEM, where species presence was inferred by assessing the level of genome coverage, mean read depth and depth evenness. First, we calculated depth and variation penalty scores corresponding to the missing coverage (100% − genome coverage) multiplied by either the log(mean depth) or the depth coefficient of variation (defined as the standard deviation of read depth divided by the mean), respectively. These metrics allowed us to gauge both coverage and depth simultaneously, as genomes that have a high mean depth (or high depth variation) but are not well covered are less likely to be present in the sample than those that have the same level of coverage with lower read depth. Thresholds for determining genome presence were set at a minimum coverage of at least 60%, and both depth and variation penalty scores at a maximum of the 99th percentile (Extended Data Fig. 7). Relative abundance of each species was determined by the proportion of uniquely mapped and correctly paired reads (filtered using 'samtools view -q 1 -f 2') out of the total read count. Accumulation curves based on the number of UMGS detected per geographical region were bootstrapped ten times at each sampling interval. Asymptotic regressions were performed using the SSasymp and nls functions from the R stats package 69 . Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this paper. Code availability. Custom scripts used to generate data and figures are available at https://github.com/Finn-Lab/MGS-gut.

Data availability
The UMGS genomes generated in this work were deposited in ENA, under the study accession ERP108418. The 92,143 MAGs with QS >50, as well as the quantification results from BWA and sourmash, all phylogenetic trees and the functional analysis results with InterProScan, GP and GhostKOALA are available at ftp://ftp. ebi.ac.uk/pub/databases/metagenomics/umgs_analyses/.

Statistics
For all statistical analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods section.

n/a Confirmed
The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly The statistical test(s) used AND whether they are one-or two-sided Only common tests should be described solely by name; describe more complex techniques in the Methods section.
A description of all covariates tested A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons A full description of the statistical parameters including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals) For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors/reviewers. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information.

Data
Policy information about availability of data All manuscripts must include a data availability statement. This statement should provide the following information, where applicable: -Accession codes, unique identifiers, or web links for publicly available datasets -A list of figures that have associated raw data -A description of any restrictions on data availability The UMGS genomes generated in this work were deposited in ENA, under the study accession ERP108418. The 92,143 MAGs with QS > 50, as well as the quantification results from BWA and sourmash, all phylogenetic trees and the functional analysis results with InterProScan, GP and GhostKOALA are available in the following public FTP: ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/umgs_analyses/.