New insights from uncultivated genomes of the global human gut microbiome

The genome sequences of many species of the human gut microbiome remain unknown, largely owing to challenges in cultivating microorganisms under laboratory conditions. Here we address this problem by reconstructing 60,664 draft prokaryotic genomes from 3,810 faecal metagenomes, from geographically and phenotypically diverse humans. These genomes provide reference points for 2,058 newly identified species-level operational taxonomic units (OTUs), which represents a 50% increase over the previously known phylogenetic diversity of sequenced gut bacteria. On average, the newly identified OTUs comprise 33% of richness and 28% of species abundance per individual, and are enriched in humans from rural populations. A meta-analysis of clinical gut-microbiome studies pinpointed numerous disease associations for the newly identified OTUs, which have the potential to improve predictive models. Finally, our analysis revealed that uncultured gut species have undergone genome reduction that has resulted in the loss of certain biosynthetic pathways, which may offer clues for improving cultivation strategies in the future.

The gut microbiome has myriad important roles in human health and disease 1 . Microbial reference genomes are essential resources for understanding the functional role of specific organisms in the microbiome, and for quantifying their abundance from metagenomes 2 . However, an estimated 40-50% of human gut species lack a reference genome 3,4 . Although considerable efforts have been made to culture and sequence members of the gut microbiome 5-7 , many microorganisms have not been grown under laboratory conditions to date and lack a sequenced genome-despite being prevalent in humans 8 .
Recent advances in experimental technologies have begun to close this gap: some studies 6,7 have used microbial culturomics to isolate and sequence hundreds of previously uncultured organisms in the human gut, and others have performed single-cell genome sequencing 9 . In contrast to experimental approaches, metagenome binning is a computational approach that can be used to obtain genomes directly from samples without isolation or culturing. Sequencing reads are first assembled into contigs, which are then binned into metagenomeassembled genomes (MAGs) on the basis of nucleotide frequency, abundance and/or co-variation of abundance across a group of samples 10 . This process is performed either for individual metagenomes 11 or multiple co-assembled metagenomes 12 . MAGs are subsequently evaluated for various indicators of genome quality, including estimated completeness and contamination, the presence of marker genes and overall contiguity [13][14][15] .
MAGs were first assembled from a low-complexity acid-mine drainage community 16 but-with advances in sequencing technology and computational methods-MAGs have now been recovered from a myriad of environments including the global ocean 17 , cow rumen 12 , aquifer systems 18 and others 11 . These uncultured genomes have expanded the tree of life by revealing novel lineages in diverse environments, as well as unusual biology 11,19 . Despite the growing number of publicly available human gut metagenomes, there has not been any large-scale assembly of MAGs from the gut microbiome. Nielsen et al. 20 were the first to recover MAGs from gut metagenomes, and similar concepts have been developed and applied to other individual studies 21 . We hypothesized that human gut MAGs systematically recovered from public metagenomes could substantially increase the diversity of species with a sequenced genome, and shed light on the biology of uncultivated organisms in the gut microbiome.

Reconstructing genomes from global gut metagenomes
To recover genomes for novel human gut lineages, we performed metagenomic assembly and binning on 3,810 globally distributed samples from phenotypically and demographically diverse human subjects, using a pipeline developed for this study (Fig. 1a, b, Supplementary Tables 1-5). MAG quality was improved further by using a pipeline that we developed to identify and remove incorrectly binned contigs (Fig. 1c, Extended Data Fig. 1, Supplementary Tables 6, 7, Methods). We performed single-sample assembly and binning (rather than coassembly) to preserve strain variation between human hosts, and because co-assembly was not computationally feasible for our large dataset. On the basis of a subset of samples, our pipeline produced 1.8× more non-redundant high-quality MAGs compared to co-assembly, and 3.3× more than a previous study 20 that used abundance co-variation across samples (Extended Data Fig. 2).
Our pipeline yielded 60,664 MAGs that met or exceeded the mediumquality 'Minimum information about a metagenome-assembled genome' (MIMAG) standard 14 which we refer to as the global human gut MAG (HGM) dataset (Fig. 1b, Supplementary Table 8). The MAGs form 43,737 clusters at an average-nucleotide-identity threshold of 99%, which indicates that most of the MAGs are unique. The vast majority of MAGs displayed >98% DNA identity within the same species and <98% identity between species at individual marker genes, which suggests that they are not chimeric (Extended Data Fig. 3g-l). A subset of 24,345 high-quality MAGs was estimated to be near-complete; MAGs in this subset had minimal contamination, high contiguity and were of a similar length to isolate genomes of the same species (Fig. 1b, d, Extended Data Fig. 3f). Only 14.5% of these MAGs were classified as being of high quality by the MIMAG standard, largely owing to the absence of a full complement of rRNA genes-these are challenging Article reSeArcH to assemble from metagenomes 22 and are often absent from otherwise near-complete MAGs 11 .
Despite the large number of recovered genomes, we identified several challenges to recovering MAGs from human gut metagenomes. First, by mapping reads back to each MAG and quantifying single nucleotide polymorphisms, we confirmed that strain diversity results in highly fragmented MAGs 15 (Fig. 1e). Second, we found that reliably assembling a MAG required at least 10-20× read depth (Fig. 1f), which indicates that MAGs were only assembled for the most-abundant taxa in each community 23 . MAG assembly was particularly challenging for some phyla-such as Bacteroidetes (Fig. 1f)-and for metagenomes with high community diversity (Extended Data Fig. 4a, b). Despite these challenges, our results indicate that thousands of partial and near-complete genomes can be reconstructed from individual human gut metagenomes using standard pipelines for assembly and binning.

MAGs represent thousands of unknown species
To explore whether the HGM dataset represented novel taxa, we clustered the 60,664 MAGs plus 145,917 non-redundant reference genomes into species-level operational taxonomic units (OTUs) on the basis of 95% average nucleotide identity (Fig. 2a, Supplementary Tables 9, 10). Although the species concept for prokaryotes is controversial 24 , our operational definition is commonly used 3,4 and is considered to be a gold standard 25 . We found that our species-level OTUs were consistent with taxonomic annotations from other databases, and that they were robust to genome incompleteness and contamination (Extended Data Figs. 3a-c, 5).
Our procedure yielded a total of 23,790 species-level OTUs, which included 4,558 OTUs from the human gut microbiome (Fig. 2a, c, Extended Data Fig. 6a, b, Supplementary Table 10). We formed the Integrated Gut Genomes Database, which contains the 156,478 genomes that comprise the human gut OTUs and includes 2,058 newly identified OTUs that are comprised exclusively of 10,368 MAGs (Fig. 2e). In support of novelty of the newly identified OTUs, 96% of them were not classified at the species level according to the Genome Taxonomy Database 26 (Supplementary Table 10), and 69% of them had <90% average nucleotide identity to any OTU that contained a reference genome.
A considerable number of MAGs were not taxonomically classified at or above the genus rank (n = 3,215) (Supplementary Table 10). To identify the novel clades represented by these MAGs, we constructed a phylogeny of all MAGs and reference genomes, and clustered them on the basis of rank-specific phylogenetic distance cut-offs (Fig. 2b, Extended Data Fig. 3d, e). This revealed 360 genus-level OTUs, 15 family-level OTUs and 2 order-level OTUs that were previously unknown (Fig. 2e). A collector's curve revealed saturation of OTUs at or above the genus rank, but not for species (Extended Data Fig. 6c). Together, MAGs from the newly identified OTUs represented 70.9% of the total phylogenetic diversity of sequenced gut bacteria, and a 50% increase compared to reference genomes alone (Fig. 2d).
The newly identified OTUs were broadly distributed across taxonomic groups (Fig. 3)  Sample sizes are indicated in parentheses, and pin colour indicates the majority age group and lifestyle (infants, ≤3 years old; adults, ≥18 years old). Several locations are represented by multiple studies; several studies were conducted in multiple locations. b, Computational pipeline for assembling MAGs. c, Pipeline for identifying and removing incorrectly binned contigs. d, Quality metrics across low-(n = 101,651), medium-(med., n = 36,319) and high-quality (n = 24,345) MAGs. e, Barriers to MAG recovery. Single nucleotide polymorphisms (SNPs) were called for MAGs with sufficient read depth (n = 17,671), and compared with N50. Red line is from a Spearman correlation (ρ = −0.61). f, At least 10-20× depth is required to assemble a MAG, but assembly rates vary between taxa. AB, Actinobacteria; AR, Archaea; BD, Bacteroidetes; FR, Firmicutes; VM, Verrucomicrobia; PR, Proteobacteria; SP, Spirochaetes. Sequencing read depth was estimated using IGGsearch (see Methods), and curves were fit using logistic regression. For box plots, the middle line denotes the median; the box denotes the interquartile range (IQR); and the whiskers denote 1.5× IQR.
Article reSeArcH of assembling this phylum (Figs. 1f, 3). By contrast, almost no newly identified OTUs were found in Archaea even though MAGs were easily assembled (Fig. 1f), which suggests that most of the abundant human gut Archaea already have a sequenced genome. Several large clades within Cyanobacteria (Melainabacteria class) and Clostridia were not represented by any high-quality genome, which may be explained by genome reduction or unknown factors that interfere with genome assembly (Extended Data Fig. 7a). Overall, these results indicate that the HGM dataset has greatly expanded the known genomic diversity of bacteria across the tree of life in the human gut.

Distribution of species in the human population
To quantify the abundance of newly identified OTUs in the gut microbiome, we developed a tool named 'IGGsearch' , which uses a strategy similar to that of MetaPhlAn2 27 . IGGsearch rapidly estimates the abundance of all 23,790 species-level OTUs, by aligning metagenomic reads to a database of single-copy, species-specific genes that have been identified from MAGs and reference genomes ( Supplementary  Fig. 1, Methods). A number of other tools exist for metagenomic taxonomic profiling, but none of these contains the MAGs from this study. Using benchmark datasets, we found that IGGsearch accurately quantifies OTU abundance, as well as presence versus absence of OTUs ( Supplementary Fig. 2, Supplementary Tables 11, 12).
Using IGGsearch profiling, we found that the newly identified species-level OTUs accounted for 33.4% of richness and 27.7% of relative abundance per sample from healthy individuals (Extended Data Fig. 4c) and were commonly detected in samples from which no MAG was recovered (Extended Data Fig. 4e). These results were consistent in metagenomes that were not used for assembly or binning (Supplementary Table 13). The newly identified species-level OTUs were particularly abundant in healthy adults from rural populations (Tanzania, Peru, Mongolia, Fiji and El Salvador) but were notably rare in infants from Europe and the United States (Fig. 2f, Extended Data Fig. 4d), which may reflect biases in previous genome sequencing efforts. Microbial communities with high diversity were enriched for newly identified OTUs, although no difference was observed between the microbiomes of healthy individuals and individuals with disease (Extended Data Fig. 4f, Supplementary Tables 13, 14).
Together, these results reveal that the uncultured OTUs discovered in this study comprise a considerable fraction of the healthy human gut microbiome, and that they are more common in non-Western populations.

Association of gut species with human diseases
Human gut microbiota have been linked to a myriad of diseases: disease associations with the microbiome can be leveraged to understand disease aetiology, for clinical diagnosis or for building predictive models 1,21 . We hypothesized that IGGsearch would be able to identify associations with human diseases among the 2,058 species-level OTUs discovered in this study. To address this question, we performed metagenome-wide association of 4,558 species-level OTUs from the Integrated Gut Genomes Database versus disease status for ten clinical microbiome studies (including six that were not used for MAG recovery) (Supplementary Tables 15, 16,  Methods).
Overall, we identified 2,283 associations between species and disease (at a false discovery rate of <1%) that included an even balance of case-enriched and control-enriched OTUs (Extended Data Table 1). Nearly 40% of the disease associations corresponded to the newly identified OTUs, including many of the most significant associations (Fig. 4). For example, the most significant association for ankylosing spondylitis (an inflammatory arthritis that affects the spine and large joints) was with a newly identified species in the Negativicutes class (OTU 14148, adjusted P = 5.3 × 10 −28 ), which was strongly depleted in patients relative to healthy controls and eight orders of magnitude more significant than any previously known species.
To contextualize these results, we estimated microbial species abundances in the same datasets using three other commonly used tools-MIDAS 4 , mOTU 3 and MetaPhlAn2 27 -along with the reference databases that are distributed with each tool. After applying the same statistical procedure to each set of abundance profiles, we identified 716, 404 and 326 disease associations using each respective tool (Extended Data Table 1), which is nearly fivefold fewer than we identified using IGGsearch, on average. Additionally, we used abundance data from each tool to build random-forest machine learning models to predict disease status. We found that IGGsearch abundance profiles T a n z a n ia   Article reSeArcH yielded the most-predictive model (or equivalent) for eight of the ten diseases, with considerable improvements for colorectal cancer, cardiovascular disease, type 2 diabetes and rheumatoid arthritis (Extended Data Table 1). More work is needed to understand how associated species relate to disease aetiology, and whether these results can be replicated in other human populations.  7  43  24  25  11  18  98  1  2  7  1  5  1  1  1  3  2  31  3  1  1  5  1  25  8  5  3  2  1  1  2  4  1   3  2  1  2 4 New family OTUs ( )

Fig. 3 | Newly identified gut species are broadly distributed across taxonomic groups.
Order-level clades with ≥10 human gut specieslevel OTUs or that were detected in ≥10% of metagenomes from healthy individuals. Taxonomic labels are based on the Genome Taxonomy Database (GTDB). Red labels indicate orders represented exclusively by MAGs (whether in the current study or from previous studies).
Pie charts indicate the prevalence of orders across metagenomes from healthy individuals. Grey bars indicate the number of gut species-level OTUs per order, and the green bars indicate the percentage of OTUs that are newly identified in this study. Red stars and purple triangles indicate the number of newly identified genus-level and family-level OTUs, respectively.

Genome reduction of uncultured gut bacteria
Previous MAG studies of environmental communities have uncovered large uncultured lineages with unusual genomic properties, including reduced genomes, slow replication rates and the absence of conserved genes 19,28 . We found that the human gut also contains a number of large lineages that are exclusively represented by MAGs (Fig. 2e, Extended Data Fig. 7b). To elucidate biological properties of these organisms, we performed a comparative genomic analysis between cultured and uncultured species-level OTUs from the gut (Supplementary Table 17, Methods). Notably, uncultured OTUs tended to have significantly reduced genomes, a finding that was consistent across all of the major phyla and classes that we tested-including Actinobacteria, Bacilli, Clostridia, Bacteroidetes and Proteobacteria (Fig. 5a). Previous studies have identified difficult-to-culture taxa with reduced genomes-including TM7 29 and Melainabacteria 30 -but to our knowledge, this is the first time genome reduction has been identified as a broadly shared feature of uncultivated bacteria from the gut microbiome. Other genomic features-including estimated replication rates, coding density and GC content-did not consistently differ between cultured and uncultured OTUs (Extended Data Fig. 8, Supplementary Table 18).
Given the overall pattern of genome reduction, we identified functions that were frequently missing from uncultured OTUs using phylogenetic logistic regression and annotations from the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (Methods). Overall, we found 1,492 KEGG orthology groups (21.5% of total) that significantly differed between groups at a false discovery rate of <1%most of which were depleted from uncultivated OTUs (Fig. 5b). These patterns were consistent between MAGs and isolate genomes of the same species, and were not affected by the database used for functional annotation (Extended Data Fig. 9a, b). Among our top hits, we found functions related to the maintenance of osmotic pressure and protection against oxidative stress (Extended Data Fig. 9c), which may indicate that uncultivated bacteria are less viable after transfer to a culture medium or are more sensitive to oxygen exposure outside of the host 6 .
The above patterns were best exemplified by RF39, which is an uncultivated order within the class Bacilli that has a highly reduced genome and numerous auxotrophies (Fig. 5c). Little has been published regarding this group, despite the fact that RF39 has previously been detected in MAG studies 11,20 and that it was found in a large proportion of metagenomes analysed in our study (Fig. 3). Numerous highly conserved metabolic pathways-including those for biosynthesis of fatty acids, several amino acids and vitamins-were entirely missing from nearly all RF39 genomes. The complete loss of the fatty acid biosynthesis pathway was notable, because fatty acids are integral components of cellular membranes and are considered to be a housekeeping capacity of cells. These organisms may incorporate exogenous fatty acids into membrane phospholipids using a mechanism that has recently been described in Firmicutes 31 .

Discussion
Here we have illustrated that it is possible to use large-scale metagenomic assembly and binning to recover thousands of genomes for previously unknown members of the human gut microbiome. We generated the Integrated Gut Genomes Database and the IGGsearch tool as resources to drive further discoveries in human microbiome science. During the review of this manuscript, several studies were published that generated many new human gut genomes from metagenomes 32,33 and cultivated isolates 34,35 . In the future, these genomes could be integrated with the Integrated Gut Genomes Database to provide an updated catalogue of genomes from the gut microbiome.
Although we recovered thousands of MAGs, we also identified several challenges-including low species abundance, high strain diversity and low recovery rates for some phyla (such as Bacteroidetes). Future efforts to recover MAGs from the gut microbiome may benefit from  Article reSeArcH alternative approaches that target these hard-to-assemble organisms. Likewise, we found that adults from non-Western countries were a major source of previously unknown diversity, which indicates that future metagenome studies should focus on human populations outside of Europe, the United States and China.
One of the most surprising results from our study was that the majority of microbial diversity in the human gut is not currently represented by cultured isolates, which are important for numerous applications in basic research and biotechnology. In the future, MAGs from this study could be used to improve culture conditions or identify novel growth factors for uncultured human gut species. For example, menaquinone and fatty acids have been shown to promote the growth of uncultured bacteria 36,37 and both pathways were missing from many uncultured OTUs from this study (Supplementary Table 19). Furthermore, we found that uncultivated bacteria have undergone considerable genome reduction, which may be an adaptive process that results from use of public goods (as outlined in the Black Queen hypothesis 38 ); more work is needed to explore this question.

Online content
Any methods, additional references, Nature Research reporting summaries, source data, statements of data availability and associated accession codes are available at https://doi.org/10.1038/s41586-019-1058-x.

MEthOdS
Publicly available human gut metagenomes. We downloaded 11,523 sequencing runs for publicly available human gut metagenomes from the NCBI SRA 39 . These data correspond to 3,810 samples, 15 studies 9,21,40-51 (and https://olive.broadinstitute.org/projects/infant_gut_flora_and_antibiotics) and >181 billion sequencing reads with an average length of 100 bp (Supplementary Tables 1, 2). Sequencing metadata were obtained from the SRAdb relational database 52 and host metadata were obtained from either the NCBI BioSample database 53 or from supplementary datasets linked to publications (Supplementary Table 3). No metadata were available online or upon request from the Fiji cohort 9 ; these individuals were treated as healthy adults from a rural population. Metagenome assembly and binning. We co-assembled the 11,523 sequencing runs for each of the 3,810 biological samples using MegaHIT v.  Table 6). Our logic was that strains of the same species should share homology between most contigs, and that contigs that fail this condition (that is, are present in one genome but absent in the other) probably represent contamination. For each input MAG, we used Mash v.2.0 59 to find at least five closely related, near-complete genomes in the database (>95% estimated completeness, <5% estimated contamination, Mash distance ≤ 0.05, P ≤ 0.001), and then used BLASTN to align contigs between each MAG and all target genomes. Contigs in the MAG that failed to align at ≥70% nucleotide identity over ≥25% length to any of the closely related genomes were flagged for removal.

Refinement of MAGs on the basis of taxonomic annotation of contigs.
We identified and removed taxonomically discordant contigs from MAGs using two complementary approaches (Supplementary Table 6). The first approach performs taxonomic annotation on the basis of universal single-copy marker genes. Hidden Markov models for marker-gene families were downloaded from the PhyEco database 60 , and searched against MAGs with HMMER v.3.1b2 61 . A subset of 100 (for Archaea) or 88 (for Bacteria) gene families was used. Marker genes found in MAGs were then aligned against a reference database of taxonomically annotated marker genes from reference genomes using BLASTP. For each gene, we transferred the taxonomy of the best hit in the reference database at the appropriate rank on the basis of the percentage of amino acid identity cut-offs specific to each gene family at each rank. We then taxonomically annotated each MAG on the basis of the consensus taxonomy of marker genes at the lowest rank, such that >70% of marker genes were annotated. Contigs were flagged for removal if they (1) contained a taxonomically discordant marker gene, and (2) lacked a concordant marker gene. The second approach for taxonomic refinement is similar to the first, except that 855,764 clade-specific prokaryotic marker genes from the MetaPhlAn2 27 database were used for taxonomic annotation after excluding 'pseudo markers' that are not unique to a clade.

Refinement of MAGs on the basis of outlier nucleotide composition and sequencing read depth.
Using an approach similar to a previously published method 11 , we identified and removed contigs from MAGs with either (1) outlier GC content, (2) outlier tetranucleotide frequency or (3) outlier sequencing read depth (Supplementary Table 6). We used principal component analysis to reduce the tetranucleotide frequency dimensionality down to the first principal component (tetranucleotide frequency PC1). For each MAG, we then measured the average GC content, average tetranucleotide frequency PC1 and average sequencing read depth. Contigs were flagged for removal if they deviated from these averages beyond cut-offs selected to minimize reduction in completeness (Supplementary Table 6). Validation of MAG refinement pipeline. We simulated 1,000 human gut MAGs to validate our overall MAG refinement strategy (Supplementary Table 7). Each simulated MAG contained two genomes: one 'host' genome (representing the target genome) and one 'donor' genome (representing the contaminating genome). All 102 genomes used in simulations were isolated from the human gut, and were estimated to have >95% completeness, <1% contamination and <25 contigs. MAGs were simulated with completeness (mean = 61.9%), contamination (mean = 10.0%) and N50 (mean = 35.8 kb) on the basis of randomly sampled MAGs from the HGM dataset. MAGs were dropped in cases in which contamination exceeded completeness, and thus the host genome was in the minority. The refinement pipeline was applied to each simulated MAG and-to evaluate the pipeline-we quantified the overall reduction in completeness and contamination (Extended Data Fig. 1a, b). Application of refinement strategies to the HGM dataset. We applied each of the refinement approaches described above to the MAGs (Extended Data Fig. 2c, Supplementary Table 6). In rare cases, these approaches may erroneously flag a large proportion of a MAG. To avoid this, we applied a particular approach to a MAG only if it resulted in ≤25% reduction in total length. The five approaches combined removed 5,251,859 contigs (7.13% of total) and 20,821.2 Mb (6.70% of total) from the MAGs. After removing potential contaminants, we were left with 152,279 MAGs with a total length ≥100 kb and 10,036 individual contigs longer than 100 kb that were either unbinned or removed during decontamination. These long contigs were included with other MAGs, which brought the total number to 162,315. Quality assessment of MAGs. CheckM v.1.0.7 13 was used to estimate completeness and contamination of the 162,315 recovered MAGs (Supplementary Table 6); CheckM is based on the copy-number of lineage-specific single-copy genes. Additional statistics were obtained for each genome, including the contig N50, number of contigs, average contig length, contig read-depth, and number of tRNA and rRNA genes. tRNAs were identified using tRNAscan-s.e. v.1.3.1 62 and rRNA genes using Barrnap v.0.9-dev 63 with options '-reject 0.01 -evalue 1e-3' . We identified 60,664 MAGs that met the MIMAG medium-quality criteria of ≥50% completeness with ≤10% contamination 14 . For analyses that required near-complete genomes, we used a subset of 24,345 high-quality MAGs that were ≥90% complete, ≤5% contaminated, with an N50 ≥ 10 kb, an average contig length ≥5 kb, ≤500 contigs and ≥90% of contigs with ≥5× read-depth. Estimation of SNP density. Read mapping and SNP calling were performed to assess the genetic diversity of each MAG (Supplementary Table 5). Bowtie 2 v.2.3.4 64 was used to construct a database of MAGs for each sample, and to align metagenomic reads. Reads with low mapping and sequence quality were discarded (quality scores <20 and <30, respectively), and we counted the occurrence of nucleotides with quality ≥30 across each MAG. To compare SNPs between MAGs sequenced to different depths, we down-sampled each MAG to 40 mapped reads per site. MAGs with at least 200,000 sites of ≥40× depth were retained for analysis. A SNP was called if at least two reads matched the alternative allele at a genomic site. SNP density was calculated as the number of SNPs per kilobase. Reference genomes used for comparison. We downloaded 201,102 publicly available bacterial and archaeal reference genomes from the Integrated Microbial Genomes (IMG; https://img.jgi.doe.gov/) 65 (n = 61,713) and Pathosystems Resource Integration Center (PATRIC; https://www.patricbrc.org/) 66 (n = 139,389) databases, on 16 January 2018. These included genomes from 2 human gut culturomics studies 6,7 and 16,525 previously published MAGs, including a previous MAG study from the human gut 20 and nearly 8,000 MAGs assembled from SRA metagenomes 11 . To remove redundancy within and between databases, we used Mash 59 with default parameters to cluster genomes with a Mash distance of 0.0, which are expected to be identical. This resulted in 153,900 non-redundant reference genomes, of which 127,419 were classified as high quality, 18,498 as medium quality and another 7,983 as low quality (Supplementary Table 9). Species-level clustering of reference genomes and MAGs. Using an approach similar to a previously published method 67 , we clustered the 60,664 MAGs and 145,917 reference genomes meeting or exceeding the MIMAG mediumquality standard into species-level OTUs on the basis of 95% whole-genome ANI (Supplementary Table 10). We first performed single-linkage clustering of genomes on the basis of a Mash ANI of 99%, which resulted in 79,675 clusters that can be confidently assigned to the same species-level OTU. Mash is extremely fast, although it can underestimate ANI for incomplete genomes 67 . To address this, we used the ANIcalculator v.1.0 68 to compute whole-genome-based ANI (gANI) between the 99%-identity clusters, and required that at least 20% of genes were aligned. The 20% cut-off was chosen to minimize the negative effects of incomplete genomes, and to avoid the formation of spurious OTUs (Extended Data Fig. 5a).
To increase computational efficiency, we calculated gANI only for genome pairs with >90% Mash ANI. Genomes were clustered into OTUs using average-linkage hierarchical clustering with a 95% gANI cut-off using the package MC-UPGMA v.1.0.0 69 , which yielded 23,790 OTUs. All OTUs were taxonomically annotated using the tool GTDBTk v.0.0.6 (release 80, www.github.com/Ecogenomics/GtdbTk), which produces standardized taxonomic labels that are based on those used in the Genome Taxonomy Database 26 . Additionally, we constructed pan-genomes on the basis of clustering all genes within each OTU, using VSEARCH v.2.4.3 70 with 90% DNA identity and 50% alignment cut-offs (maximum 500 genomes per OTU). Human gut OTUs were identified from the set of 23,790 OTUs on the basis of (1) containing a MAG from the HGM dataset, (2) being detected by IGGsearch (see 'Development of IGGsearch for metagenomic profiling of species-level OTUs') in at least 1 of 3,810 metagenomes used for MAG recovery or (3) containing a genome isolated from the human gut (Extended Data Fig. 6a, b, Supplementary Table 10). A total of 4,558 species-level OTUs were annotated as being found in the human gut, on the basis of a combination of the three criteria. Phylogenetic analysis of MAGs and reference genomes. We constructed phylogenetic trees of MAGs and reference genomes using concatenated alignments of conserved, single-copy marker-gene families from the PhyEco database 60 for Bacteria (n = 88 genes) and Archaea (n = 100 genes). Individual marker genes were identified using HMMER v.3.1b2 with bit-score cut-offs that are specific to gene family. For computational efficiency, genomes were collapsed down to species-level OTUs, which were represented as individual leaves in the phylogenetic tree. To reduce the effect of contamination, taxonomically discordant marker genes were removed, as described in 'Refinement of MAGs on the basis of taxonomic annotation of contigs' . FAMSA v.1.2.5 71 was used to construct protein-based multiple sequence alignments for each gene family. Columns with >15% gaps were removed, alignments were concatenated and sequences with >70% gaps were removed (n = 39). FastTree2 v.2.1.10 72 was used to build a maximum likelihood phylogeny for Bacteria and Archaea with default options. All trees were visualized using iTOL v.3 73 . To quantify the gain in phylogenetic diversity from the HGM dataset, we computed the total branch length of two subtrees: a tree of all 4,558 gut OTUs (PD Gut ) and a tree of 2,500 gut OTUs with reference genomes (PD RefGut ). The percentage gain in phylogenetic diversity was computed as: 100 × (PD Gut − PD RefGut )/PD RefGut . To identify OTUs for higher-rank groups, we performed average-linkage hierarchical clustering of phylogenetic distances, which was implemented in R (Supplementary Table 10). Rank-specific cut-offs were identified by maximizing similarity to the Genome Taxonomy Database for reference genomes (Extended Data Fig. 3d, e).

Development of IGGsearch for metagenomic profiling of species-level OTUs.
Using an approach similar to MetaPhlAn2 27 , we developed an accurate and efficient tool for quantifying the abundance of species-level OTUs from unassembled metagenomes. First, we identified marker genes for each OTU ( Supplementary  Fig. 1a). Up to 300 genes from the pan-genome of each OTU were selected with the maximum intra-OTU frequency and minimum inter-OTU frequency. The intra-OTU frequency was computed as the fraction of genomes within an OTU in which a gene was found at 90% DNA identity. The inter-OTU frequency was determined on the basis of DNA alignments (using HS-BLASTN v.0.0.5 74 ) between each gene and the pan-genomes of other OTUs, and accounts for (1) the number of other pan-genomes in which the gene is found, (2) the frequency of the gene in each pan-genome and (3) the percentage of identity of each alignment. For computational reasons, genes were first aligned within each phylum, and only the 300 top-scoring candidates per OTU were subsequently checked for uniqueness between phyla. A total of 6,198,663 marker genes were identified for 23,790 OTUs.
A large number of OTUs contained just a single genome, which made it difficult to accurately predict conserved genes. To refine our marker-gene set, we used abundance co-variation information, which is a common strategy for binning genetic regions from the same species and has previously been applied 3,10,20,21,56 . Specifically, we performed read-mapping of the 3,810 metagenomic samples to the database of 6,198,663 marker genes using Bowtie 2 v.2.3.4 and quantified the read depth of each gene in each sample. We used average linkage clustering to group genes from each OTU into co-variance groups on the basis of Pearson correlations of read depth across samples ( Supplementary Fig. 1b). After applying a correlation threshold of 0.90, we selected the largest cluster of genes for the final marker-gene set. This procedure removed 55,132 genes for 1,402 OTUs that were present in ≥10 samples with ≥1× coverage.
IGGsearch is a command-line tool that uses Bowtie 2 to map metagenomic reads to the database of marker genes and quantify species-level OTUs. Read alignments are removed with low percentage of identity (minimum = 95%), alignment coverage (minimum = 70% of read) and base quality (minimum = 20). For each metagenomic sample, OTU relative abundance is estimated by taking the average read depth across marker genes and normalizing these values to 1.0 across all OTUs. Species presence is determined on the basis of the percentage of marker genes with at least one mapped read.
The sensitivity and specificity of IGGsearch was evaluated on two benchmark datasets. First, we benchmarked IGGsearch on the CAMI challenge dataset (https:// data.cami-challenge.org/participate) (Supplementary Tables 11, 12, Supplementary  Fig. 2a). Second, we benchmarked IGGsearch on simulated gut metagenomes that contained between 500,000 and 50,000,000 paired-end reads, read length of 100 bp, Illumina-style sequencing error and 1 genome from each of 100 randomly selected gut species-level OTUs ( Supplementary Fig. 2b). On the basis of these benchmarks, we called OTUs present when at least 15% of their marker genes were detected, which gave a good balance between sensitivity and specificity. Metagenome-wide association of species abundance with disease. We used IGGsearch species profiles to identify species-level OTUs associated with disease for ten previously published studies, including colorectal cancer 43 , type 2 diabetes 21,44 , rheumatoid arthritis 42 , Parkinson's disease 75 , atherosclerotic cardiovascular disease 76 , ankylosing spondylitis 77 , non-alcoholic fatty liver disease 78 , liver cirrhosis 79 and obesity 80 (Extended Data Table 1, Supplementary Tables 15, 16). To identify species-disease associations, we compared the relative abundances of species for the 4,558 human gut species-levels OTUs between cases and healthy controls using the Wilcoxon rank-sum test. Non-gut OTUs were excluded to reduce the effect of multiple hypothesis testing. For each disease, P values were corrected for multiple hypothesis tests using the Benjamini-Hochberg procedure. We performed the same statistical procedure using species profiles from three other tools: MIDAS v.1.3.0 4 , MetaPhAn2 v.2.7.7 27 and mOTU v.1.1.1 3 . All tools were run with default parameters and the distributed reference data. To prevent confounding signals owing to disease treatment, we excluded 100 individuals taking drugs that affect microbiome composition, including metformin in patients with type 2 diabetes 21,44 , acarbose, atorvastatin, fondaparinux and metoprolol in patients with atherosclerotic cardiovascular disease 76 , and antirheumatic drugs in patients with rheumatoid arthritis 42 . Machine learning models for disease prediction. We constructed random forest models implemented in the scikit-learn Python package (https://scikitlearn.org) to predict disease state from species abundance profiles generated with IGGsearch, MIDAS, mOTU and MetaPhlAn2 (Extended Data Table 1). For IGGsearch, we included all 23,790 species OTUs and allowed the random forest model to choose the most-predictive OTUs. Random forest models were implemented in the scikit-learn package v.0.19.1 81 and were optimized for each of the four tools for each of the ten diseases. Specifically, we tested 1,000 random combinations of parameter values for (1) the number of trees in the forest, (2) the number of features to consider at each split, (3) the maximum number of levels in each tree, (4) the minimum number of samples to split a node, (5) the minimum number of samples at each leaf and (6) whether to use bootstrapping during model training. To avoid overfitting, each model was evaluated using tenfold cross-validation and the combination of parameters that yielded the best receiver operating curve (ROC) area under the curve (AUC) was selected. To obtain robust estimates of model performance, all models were re-run 100 times and ROC AUC values were averaged across runs.

Identifying genomic features and auxotrophies of uncultured gut bacteria.
We selected a subset of 504 human gut species-level OTUs from bacteria for comparative genomic analysis between cultured and uncultured organisms (Supplementary Table 17). OTUs with <5% prevalence in human gut metagenomes were excluded, because rare organisms may be amenable to cultivation but not yet sampled. Uncultivated OTUs were defined as those that contain only MAGs (either from the current study or previous studies, n = 271) and cultivated OTUs as those that contain at least one isolate genome (n = 233). We based all comparative analysis between OTUs using 24,345 high-quality MAGs from the HGM dataset, which was done (1) to avoid biases that result from a comparison of MAGs to isolate genomes (which differ in assembly quality) and (2) to avoid issues arising from low completeness among MAGs in the medium-quality tier.
We compared several broad genomic features between groups, including estimated genome size, GC content, coding density and estimated replication rate. Estimated genome size was corrected for completeness and contamination using: Ĝ = G × 100/Ĉ − (G × T/100), in which Ĝ is the estimated genome size of a MAG, G is the observed genome size, Ĉ is the estimated percentage completeness and T is the estimated percentage contamination. Replication rate was estimated with iRep v.1.10 28 for MAGs with >5× read-depth, which is based on differences in sequencing depth between the origin and terminus of replication. Genomic features were averaged across all high-quality MAGs for each OTU, and then compared between OTUs using the Wilcoxon rank-sum test (Supplementary  Table 18).
To identify potential auxotrophies, we compared the prevalence of genes, modules and pathways from the KEGG database (release 77.1) 82 between cultivated and uncultivated OTUs. Proteins from high-quality MAGs were annotated on the basis of amino acid alignments to KEGG using LAST v.828 83 , and assigned to the KEGG orthology group with lowest value of E < 1 × 10 −5 . Next, we computed the fraction of MAGs per OTU that contained each KEGG orthology group, and compared these values between OTUs using the Ives-Garland test implemented in the phylolm R package v.2.6 84 . The Ives-Garland test performs logistic regression while controlling for differences in phylogeny between groups, and has previously been applied to microbiome data 85 . This analysis was repeated for modules and pathways from the KEGG database. P values were corrected for multiple hypothesis tests using the Benjamini-Hochberg procedure (Supplementary Table 19). This same analysis was performed for functions from the TIGRFAM database (release 15.0) 86 (Extended Data Fig. 9a). Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this paper.

Data availability
Representative MAGs for the 2,058 new species have been deposited in the European Nucleotide Archive (ENA) under accession PRJEB31003 (Supplementary  Table 20). The entire HGM dataset, phylogenomic trees and related metadata are available at https://github.com/snayfach/IGGdb.

Reporting Summary
Nature Research wishes to improve the reproducibility of the work that we publish. This form provides structure for consistency and transparency in reporting. For further information on Nature Research policies, see Authors & Referees and the Editorial Policy Checklist.

Statistics
For all statistical analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods section.

n/a Confirmed
The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly The statistical test(s) used AND whether they are one-or two-sided Only common tests should be described solely by name; describe more complex techniques in the Methods section.
A description of all covariates tested A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons A full description of the statistical parameters including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals) For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted Data Policy information about availability of data All manuscripts must include a data availability statement. This statement should provide the following information, where applicable: -Accession codes, unique identifiers, or web links for publicly available datasets -A list of figures that have associated raw data -A description of any restrictions on data availability Representative MAGs for the 2,058 new species have been deposited in the European Nucleotide Archive (ENA) under accession PRJEB31003. The entire data set and related metadata is freely available at https://github.com/snayfach/IGGdb.

Field-specific reporting
Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection. All studies must disclose on these points even when the disclosure is negative.

Study description
Previous studies have found that a large proportion of species in the gut microbiome lack a sequenced genome. We addressed this problem by systematically recovering >60,000 draft genomes from nearly 4,000 metagenomes from phenotypically and geographically diverse human subjects.