A unified sequence catalogue of over 280,000 genomes obtained from the human gut microbiome

Comprehensive reference data is essential for accurate taxonomic and functional characterization of the human gut microbiome. Here we present the Unified Human Gastrointestinal Genome (UHGG) collection, a resource combining 286,997 genomes representing 4,644 prokaryotic species from the human gut. These genomes contain over 625 million protein sequences used to generate the Unified Human Gastrointestinal Protein (UHGP) catalogue, a collection that more than doubles the number of gut protein clusters over the Integrated Gene Catalogue. We find that a large portion of the human gut microbiome remains to be fully explored, with over 70% of the UHGG species lacking cultured representatives, and 40% of the UHGP missing meaningful functional annotations. Intra-species genomic variation analyses revealed a large reservoir of accessory genes and single-nucleotide variants, many of which were specific to individual human populations. These freely available genomic resources should greatly facilitate investigations into the human gut microbiome.

7 collection (486/698; 70%), with only 70 (10%) being common to all three collections 140 (Supplementary Fig. 3d). This may be due to varied geographical sampling between the 141 collections (Asia, Europe and North America) or highlight the stochastic nature of culture-142 based studies. 143 144 By calculating the number of genomes contained within each cultured and uncultured species, 145 we found that species containing isolate genomes represented the largest clusters, while those 146 exclusively encompassing MAGs tended to be the rarest, as discussed previously 16,17,19 .

Most gut microbial species still lack isolate genomes 159
We found that 3,750 (81%) of the species in the UHGG did not have a representative in any of 160 the human gut culture databases. To extend the search to isolate genomes from other 161 environments or lacking information on their isolation source, we compared the UHGG 162 catalogue to all NCBI RefSeq isolate genomes. We identified an additional set of 438 species 163 8 closely matching cultured genomes, leaving 3,312 (71%) of UHGG species as uncultured 164 (Supplementary Table 2). 165 166 The phylogenetic distribution of the 4,616 bacterial (Fig. 3a) and the 28 archaeal species 167 ( Supplementary Fig. 6) revealed that uncultured species exclusively represented 66% and 31% 168 of the phylogenetic diversity of Bacteria and Archaea, respectively, with several phyla lacking 169 cultured representatives (Fig. 3b). The four largest monophyletic groups lacking cultured 170 genomes were the 4C28d-15 order (167 species, recently proposed as the novel order 171 Comantemales ord. nov. 29 ; Fig. 3c), order RF39 (139 species), family CAG-272 (88 species), 172 and order Gastranaerophilales (67 species). While none have been successfully cultured, 173 several have been described in the literature, including RF39 16 and Gastranaerophilales 174 (previously classified as a lineage in the Melainabacteria 30 ) which are characterized by highly 175 reduced genomes with numerous auxotrophies. This analysis suggests that, despite recent 176 culture-based studies 11,12,18,20 , much of the diversity in the gut microbiome remains uncultured, 177 including several large and prevalent clades. 178

179
The UHGP expands the protein universe in the human gut microbiome 180 Metagenomic approaches have the ability to leverage gene content information not only for 181 more precise taxonomic analysis, but to also predict the functional capacity of individual 182 species of interest compared to marker gene-based methods (e.g. relying solely on the 16S 183 rRNA gene or a limited number of diagnostic genes). We built the Unified Human 184 Gastrointestinal Protein (UHGP) catalogue with a total of 625,251,941 full-length protein 185 sequences predicted from the 286,997 genomes here analysed. These were clustered at 50% 186 (UHGP-50), 90% (UHGP-90), 95% (UHGP-90) and 100% (UHGP-100) amino acid identity, 187 generating between 5 to 171 million protein clusters ( Fig. 1c and Fig. 4a).

9
To determine how comprehensive the UHGP was when compared to existing human gut gene 189 catalogues, we combined the UHGP-90 (n = 13,910,025 protein clusters) together with the 190 Integrated Gene Catalogue 5 , a collection of 9.9 million genes from 1,267 gut metagenome 191 assemblies, which we grouped into 7,063,981 protein clusters at 90% protein identity (referred 192 to as IGC-90). Nearly all samples used to generate the IGC were also included in the UHGP 193 catalogue (except for 59 transcriptome datasets), but the latter was generated from a larger and 194 more geographically diverse metagenomic dataset (including samples from Africa, South 195 America and Oceania). The UHGP-90 and IGC-90 resulted in a combined set of 15.2 million 196 protein clusters, with an overlap of 5.8 million sequences (Fig. 4b). This revealed that 81% of 197 the IGC is represented in the UHGP catalogue, with the missing 19% likely representing 198 fragments of prokaryotic genomes <50% complete, viral or eukaryotic sequences, plasmids or 199 other sequences not binned into MAGs. Most notably though, the UHGP provided an increase 200 of 115% coverage of the gut microbiome protein space over the IGC. As the UHGP was 201 generated from individual genomes and not from their original unbinned metagenome 202 assemblies, our catalogue also has the advantage of providing a direct link between each gene 203 cluster and its genome of origin. This ultimately allows combining individual genes with their 204 genomic context for an integrated study of the gut microbiome. 205 206 Functional capacity of the human gut microbiota 207 We used the eggNOG 31 , InterPro 32 , COG 33 and KEGG 34 annotation schemes to capture the full 208 breadth of functions within the UHGP. However, we found that 42.6% of the UHGP-100 was 209 poorly characterized, as 28.1% lacked a match to any database and a further 14.5% only had a 210 match to a COG with no known function (Fig. 4c). Based on the distribution of COG functions, 211 the most highly represented categories were related to amino acid transport and metabolism, 212 cell wall/membrane/envelope biogenesis and transcription. 213 We further leveraged the set of 625 million proteins derived from the human gut genomes to 214 explore the functional diversity within each of the UHGG species. Protein sequences from all 215 conspecific genomes were clustered at a 90% amino acid identity to generate a pan-genome 216 for each species. Analysis of the functional capacity of the UHGG species pan-genomes 217 identified a total of 363 KEGG modules encoded by at least one species (Supplementary Fig. 218 7a and Supplementary Table 4). Most conserved modules were related to ribosomal structure, 219 glycolysis, inosine monophosphate biosynthesis, gluconeogenesis, and the shikimate pathway 220 -all representing essential bacterial functions. However, we found that for certain phyla such 221 as Myxococcota, Bdellovibrionota, Thermoplasmatota, Patescibacteria and 222 Verrucomicrobiota, a substantial proportion of the species pan-genomes remained poorly 223 characterized ( Supplementary Fig. 7b). At the same time, species belonging to the clades 224

Patterns of intra-species genomic diversity 231
With the protein annotations and pan-genomes inferred for each of the UHGG species, we 232 explored their intra-species core and accessory gene repertoire. Only near-complete genomes 233 (³90% completeness) and species with at least 10 independent conspecific genomes were 234 analysed. The overall pattern of gene frequency within each of the 781 species here considered 235 showed a distinctive bimodal distribution ( Supplementary Fig. 8), with most genes classified 236 as either core or rare (i.e. present in ³90% or <10% of conspecific genomes). We analysed the 237 pan-genome size per species in relation to the number of conspecific genomes to look for 238 11 differences in intra-species gene richness. We observed distinct patterns across different gut 239 phyla, with species from various Firmicutes clades showing the highest rates of gene gain (Fig.  240   5a). There was a wide variation in the proportion of core genes between species even among 241 clades with more than 1,000 genomes (Fig. 5b), with a median core genome proportion 242 (percentage of core genes out of all genes in the representative genome) estimated at 66% (IQR 243 = 59.6-73.9%). 244

245
To distinguish the functions encoded in the core and accessory genes, we analysed their 246 associated annotations. Core genes were well covered, with a median of 96%, 94%, 92% and 247 69% of the genes assigned with an eggNOG, InterPro, COG and KEGG annotation, 248 respectively ( Fig. 5c). However, the accessory genes had a significantly higher proportion of 249 unknown functions (P <0.001), with a median of 21% of the genes (IQR = 16.7-27.3%) lacking 250 a match in any of the databases considered. Thereafter, we investigated the functions encoded 251 by the core and accessory genes on the basis of the COG functional categories. Genes classified 252 as core were significantly associated (adjusted P <0.001) with key metabolic functions 253 involved in nucleotide, amino acid and lipid metabolism, as well as other housekeeping 254 functions (e.g. related to translation and ribosomal structure, Fig. 5d). In contrast, accessory 255 genes had a much greater proportion of COGs without a known function, and of genes involved 256 in replication and recombination which are typically found in mobile genetic elements (MGEs,257 Fig. 5d). A significant number of accessory genes were related to defence mechanisms, which 258 encompass not only general mechanisms of antimicrobial resistance (AMR) such as ABC 259 transporter efflux pumps, but also targeted systems towards invading MGEs (e.g. CRISPR-Cas 260 and restriction modification systems against bacteriophages). These results highlight the 261 potential of this resource to better understand the dynamics of chromosomally encoded AMR 262 within the gut and decipher to what extent the microbiome may be a source of both known and 263 novel resistance mechanisms. 264

265
We next investigated intra-species single nucleotide variants (SNVs) within the UHGG 266 species. We generated a catalogue consisting of 249,435,699 SNVs from 2,489 species with 267 three or more conspecific genomes (Fig. 6a). For context, a previously published catalogue 268 contained 10.3 million single nucleotide polymorphisms from 101 gut microbiome species 35 . 269 Of note, more than 85% of these SNVs were exclusively detected in MAGs, whereas only 2.2% 270 were exclusive to isolate genomes (Fig. 6b). We found the overall pairwise SNV density 271 between MAGs to be higher than that observed between isolate genomes (Fig. 6c). Next, we 272 assigned the detected SNVs to the continent of origin of each genome and observed that 36% 273 of the SNVs were continent exclusive. Notably, genomes with a European origin contributed 274 to the most exclusive SNVs (Fig. 6d). However, genomes from Africa contributed over three 275 times more variation on average than European or North American genomes. Pairwise SNV 276 analysis also supported a higher cross-continent SNV density, especially between genomes 277 from Africa and Europe (Fig. 6e). Our results suggest there is a high strain variability between 278 continents and that a considerable level of diversity remains to be discovered, especially from 279 underrepresented regions such as Africa, South America and Oceania. 280 281

Resource implementation 282
Both the UHGG and UHGP catalogues are available as part of a new genome layer within the 283 MGnify 36 website, where summary statistics of each species cluster and their functional 284 annotations can be interactively explored and downloaded (see 'Data availability' section for 285 more details). We have also generated a BItsliced Genomic Signature Index (BIGSI) 37 of the 286 UHGG, which will allow users to interactively screen for the presence of small sequence 287 13 fragments (<5 kb) in this collection. As new genomes from the human gut microbiome are 288 generated and made publicly available, we plan to periodically update the resource with newly 289 discovered species or by replacing uncultured reference genomes with better quality versions. 290 291

292
We have generated a unified sequence catalogue representing 286,997 genomes and over 625 293 million protein sequences of the human gut microbiome. Of the 4,644 species contained in the 294 UHGG, 71% lack a cultured representative, meaning the majority of microbial diversity in the 295 catalogue remains to be experimentally characterized. During preparation of our manuscript, a 296 new collection of almost 4,000 cultured genomes from 106 gut species was released 38 , which 297 will be incorporated in future versions of the resource. As 96% of these genomes were reported 298 to have a species representative in the culture collections here included, we do not anticipate 299 this dataset to provide a substantial increase in the number of species discovered. Nevertheless, 300 our analyses suggest additional uncultured species from the human gut microbiome are yet to 301 be discovered, highlighting the importance and need for culture-based studies. Furthermore, 302 given the sampling bias towards populations from China, Europe and the United States, we 303 expect that many underrepresented regions still contain substantial uncultured diversity. 304 305 By comparing recently published large datasets of uncultured genomes 16,17,19 , we were able to 306 assess the reproducibility of the results from each study. We show that despite the different 307 assembly, binning and refinement procedures employed in the three studies, almost all of the 308 same species and near-identical strains were recovered independently when using a consistent 309 sample set. These results further increase confidence in the use of metagenome-assembled 310 genomes for the characterization of uncultured microbial diversity. 311 312 With the establishment of this massive sequence catalogue, it is evident that a large portion of 313 the species and functional diversity within the human gut microbiome remains uncharacterized. 314 Moreover, our knowledge of the intra-species diversity of many species is still limited due to 315 the presence of a small number of conspecific genomes. Having this combined resource can 316 help guide future studies and prioritize targets for further experimental validation. Using the 317 UHGG or UHGP, the community can now screen for the prevalence and abundance of 318 species/genes in a large panel of intestinal samples and in specific clinical contexts. By 319 pinpointing particular taxonomic groups with biomedical relevance, more targeted approaches 320 could be developed to improve our understanding of their role in the human gut. The functional 321 predictions generated for the species pan-genomes could also be leveraged to develop new 322 culturing strategies for isolation of candidate species. Target-enrichment methods such as 323 single-cell 39 and/or bait-capture hybridization 40 approaches could also be applied. Being able 324 to enrich for specific groups of interest, even without culturing, could allow recovery of better-325 quality versions of MAGs and improve the analysis derived from genome sequence data alone. 326 Given the large uncultured diversity still remaining in the human gut microbiome, having a 327 high-quality catalogue of all currently known species substantially enhances the resolution and 328 accuracy of metagenome-based studies. Therefore, the presented genome and protein catalogue 329 represents a key step towards a hypothesis-driven, mechanistic understanding of the human gut set, we employed an iterative approach where random chunks of 50,000 genomes were 499 clustered independently. The selected representatives from each chunk were combined and 500 subsequently clustered, reducing the final computational load. To ensure the best quality 501 genome was selected as the species representative in each iteration, a score was calculated for 502 each genome based on the following formula: 503 Score = CMP -5 ´ CNT + 0.5 ´ log(N50) 504 where CMP represents the completeness level, CNT the estimated contamination and N50 the 505 assembly contiguity characterized by the minimum contig size in which half of the total 506 genome sequence is contained. The genome with the highest score was chosen as the species 507 representative, with cultured genomes prioritized over uncultured genomes (i.e. if a MAG had 508 a higher score than an isolate genome, the latter would still be chosen as the representative). 509 510

Evaluating methods reproducibility 511
The species clusters inferred here were compared with those previously generated in the human 512 gut MAG studies 16,17,19 from a common set of genomes. Similarity between species clusterings 513 was estimated using the Adjusted Rand Index (ARI) computed in the Scikit-learn python 514 package 46 . This metric considers both the number of clusters and cluster membership to 515 compute a similarity score ranging from 0 to 1. 516 23 Conspecific genomes recovered in the same metagenomic samples but in different studies were 517 compared with FastANI v1.1 25 with default parameters to obtain both the maximum aligned 518 fraction and ANI for each pairwise comparison. 519 520

Inferring cultured status 521
To determine their cultured status, the UHGG species representatives were searched against 522 NCBI RefSeq release 93 after excluding uncultured genomes (i.e. metagenome-assembled or 523 single-cell amplified genomes). Genome alignments were performed in two stages: (1) Mash 524 v2.1 was used as an initial screen (using the function 'mash dist') to identify the most similar 525 RefSeq genome to each of the UHGG species, and (2)  the genome clustering and taxonomic classification procedures, genomes were assigned at the 550 species level if the ANI to the closest GTDB-Tk species representative genome was ³95% and 551 the alignment fraction ³30%. In this taxonomy scheme, genera and species names with an 552 alphabetic suffix indicate taxa that are polyphyletic or needed to be subdivided based on 553 taxonomic rank normalization according to the current GTDB reference tree. The lineage 554 containing the type strain retains the unsuffixed (valid) name and all other lineages are given 555 alphabetic suffixes, indicating they are placeholder names that need to be replaced in due 556 course. Taxon names above the rank of genus appended with an alphabetic suffix indicate 557 groups that are not monophyletic in the GTDB reference tree, but for which there exists 558 alternative evidence that they are monophyletic groups. We also generated NCBI taxonomy 559 annotations for each species-level genome based on its placement in the GTDB tree, using the 560 'gtdb_to_ncbi_majority_vote.py' script available in the GTDB-Tk repository 561 (https://github.com/Ecogenomics/GTDBTk/tree/stable/scripts). 562 563 Maximum-likelihood trees were generated de novo using the protein sequence alignments 564 produced by the GTDB-Tk: we used IQ-TREE v1.6.11 to build a phylogenetic tree of the 4,616 565 bacterial and 28 archaeal species. The best-fit model was automatically selected by 566 25 'ModelFinder' on the basis of the Bayesian Information Criterion (BIC) score. The LG+F+R10 567 model was chosen for building the bacterial tree, while the LG+F+R4 model was used for the 568 archaeal phylogeny. Trees were visualized and annotated with the Interactive Tree Of Life 569 (iTOL) v4.4.2 48 . Phylogenetic diversity (PD) was estimated by the sum of branch lengths, with 570 the amount that was exclusive to uncultured species calculated as PDtotal -PDcultured. Uncultured 571 monophyletic groups were defined as nodes in the tree containing child leaves exclusively 572 comprised of uncultured genomes. 573