Article | Open

This is an unedited manuscript that has been accepted for publication. Nature Research are providing this early version of the manuscript as a service to our customers. The manuscript will undergo copyediting, typesetting and a proof review before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers apply.

A new genomic blueprint of the human gut microbiota

Nature (2019)


The composition of the human gut microbiota is linked to health and disease, but knowledge of individual microbial species is needed to decipher their biological roles. Despite extensive culturing and sequencing efforts, the complete bacterial repertoire of the human gut microbiota remains undefined. Here we identify 1,952 uncultured candidate bacterial species by reconstructing 92,143 metagenome-assembled genomes from 11,850 human gut microbiomes. These uncultured genomes substantially expand the known species repertoire of the collective human gut microbiota, with a 281% increase in phylogenetic diversity. Although the newly identified species are less prevalent in well-studied populations compared to reference isolate genomes, they improve classification of understudied African and South American samples by more than 200%. These candidate species encode hundreds of newly identified biosynthetic gene clusters and possess a distinctive functional capacity that might explain their elusive nature. Our work expands the known diversity of uncultured gut bacteria, which provides unprecedented resolution for taxonomic and functional characterization of the intestinal microbiota.

Code availability

Custom scripts used to generate data and figures are available at

Data availability

The UMGS genomes generated in this work were deposited in ENA, under the study accession ERP108418. The 92,143 MAGs with QS >50, as well as the quantification results from BWA and sourmash, all phylogenetic trees and the functional analysis results with InterProScan, GP and GhostKOALA are available at

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.




We thank all the authors who generated the raw data used in this study. We also thank P. Glaser and A. Zhu for comments and suggestions. This work was funded by the European Molecular Biology Laboratory (EMBL); the European Commission within the Research Infrastructures Programme of Horizon 2020 (676559) (ELIXIR-EXCELERATE); the Biotechnology and Biological Sciences Research Council (BB/N018354/1); the Wellcome Trust (098051); the Australian National Health and Medical Research Council (1091097 and 1141564 to S.C.F.); the Victorian Government Operational Infrastructure Support Program; and the Natural Sciences and Engineering Research Council (RGPIN-03878-2015).

Author information


  1. European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, UK

    • Alexandre Almeida, Alex L. Mitchell, Miguel Boland, Aleksandra Tarkowska & Robert D. Finn
  2. Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK

    • Alexandre Almeida, Samuel C. Forster & Trevor D. Lawley
  3. Centre for Innate Immunity and Infectious Diseases, Hudson Institute of Medical Research, Clayton, Victoria, Australia

    • Samuel C. Forster
  4. Department of Molecular and Translational Sciences, Monash University, Clayton, Victoria, Australia

    • Samuel C. Forster
  5. Department of Biochemistry, University of Western Ontario, London, Ontario, Canada

    • Gregory B. Gloor




A.A., A.L.M., S.C.F., T.D.L. and R.D.F. conceived the study. A.A. wrote the manuscript and performed assembly, binning and downstream bioinformatics analyses. M.B. developed the assembly pipeline. A.L.M., M.B. and R.D.F. performed assembly and binning. G.B.G. contributed to the statistical analyses. A.T. developed the mg-toolkit and contributed to the extraction of sample metadata. A.A., A.L.M., S.C.F., G.B.G., T.D.L. and R.D.F. revised the manuscript and contributed to the interpretation of the data. All authors read and approved the final manuscript.

Competing interests

S.C.F., T.D.L. and R.D.F. are either employees of, or consultants to, Microbiotica Pty Ltd.

Corresponding authors

Correspondence to Alexandre Almeida or Robert D. Finn.

Extended data figures and tables

  1. Extended Data Fig. 1 Metadata of the human gut datasets.

    Percentage of the 13,133 metagenomic datasets according to location, health state and age group of the individual sampled, as depicted in the figure key.

  2. Extended Data Fig. 2 CheckM quality assessment of bins.

    a, Quality metrics estimated by CheckM for the 242,836 bins generated by MetaBAT. b, Number of bins recovered according to the level of genome completeness and contamination. QS = completeness – (5 × contamination).

  3. Extended Data Fig. 3 Technical reproducibility of MAGs.

    a, MAGs resulting from the MetaWRAP pipeline (left, n = 9,552) and from a modified co-assembly approach (right, n = 4,404) compared to the original MAGs generated with SPAdes and MetaBAT for 1,000 random datasets. A good match was defined as ≥95% ANI over an alignment fraction of ≥60%, whereas an excellent match indicates ≥98% ANI over an alignment fraction of ≥80%. b, Proportion of MAGs generated with each pipeline (MetaWRAP and co-assembly) coloured by their level of match to the original set.

  4. Extended Data Fig. 4 Phylogenetic diversity of the human-specific isolate genomes.

    Phylogenetic tree of the 2,468 HR genomes, labelled according to class, with the bar graphs in the outer layer depicting the log-transformed number of near-complete MAGs matching that corresponding genome.

  5. Extended Data Fig. 5 Analysis of Mash similarity clusters.

    Pearson correlation between the log-transformed number of MAGs and the corresponding number of distinct samples (a) or studies (b) per Mash cluster. Data points represent each of the 702 similarity groups (defined with a Mash distance <0.2). The coefficient of determination (R²) is depicted in each graph.

  6. Extended Data Fig. 6 Quality metrics of the metagenomic species.

    a, Distribution of completeness (min: 55.5; Q1: 80.5; median: 92.3; Q3: 97.1; max: 100) and contamination levels (min: 0; Q1: 0.1; median: 0.8; Q3: 1.7; max: 4.1) estimated by CheckM for the 2,068 metagenomic species (MGS). b, Number of tRNAs coding for the 20 standard amino acids detected across the MGS genomes. c, MCC calculated for all the 2,068 MGS, based on the Mash clustering structure and an average amino acid identity threshold of 97%.

  7. Extended Data Fig. 7 Defining genome presence, and prevalence distribution.

    a, b, Depth (a) and variation (b) penalty scores plotted against the level of genome coverage of the 1,952 UMGS across all 13,133 metagenomic samples. The depth penalty score was calculated by multiplying the missing coverage (100 − genome coverage) by the log-transformed mean read depth. The variation penalty score was based on the missing coverage multiplied by the depth coefficient of variation (standard deviation of read depth divided by the mean). Dashed red lines correspond to the 99th percentile, set as the upper threshold used to define genome presence. c, Number of UMGS detected in the corresponding number of metagenomic samples. The distribution of UMGS found in up to 100 samples is illustrated as an inset. The vertical dashed line represents the median value of all the data.

  8. Extended Data Fig. 8 Biosynthetic gene clusters found in the human gut species.

    a, Number of BGCs found in the UMGS and the HGR genomes, subdivided by functional category. Only the 25 most abundant categories are depicted. PKS, polyketide synthases. b, Fraction of all BGCs that did not match the MIBiG database.

  9. Extended Data Fig. 9 Functional capacity of cultured and uncultured species.

    a, PCA based on GPs of the 553 HGR genomes and the 1,952 UMGS for the five most prevalent phyla (Actinobacteria, Bacteroidetes, Firmicutes, Proteobacteria and Tenericutes). b, Number of genes found to be enriched with an absolute effect size >0.2 in either the UMGS or HGR genomes across the analyses of each of the five major phyla, grouped by their corresponding KEGG functional category.

  10. Extended Data Table 1 Genome Properties overrepresented in the UMGS genomes
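
The scoring rules given in the captions above (the quality score of Extended Data Fig. 2, the match thresholds of Extended Data Fig. 3 and the penalty scores of Extended Data Fig. 7) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function names are hypothetical, all inputs are assumed to be percentages, and because the captions do not state the base of the logarithm or the type of standard deviation, the natural logarithm and the population standard deviation are assumed here.

```python
import math
from statistics import mean, pstdev
from typing import Sequence, Tuple


def quality_score(completeness: float, contamination: float) -> float:
    """QS = completeness - (5 x contamination), both given as percentages."""
    return completeness - 5.0 * contamination


def match_level(ani: float, aligned_fraction: float) -> str:
    """Classify a genome comparison with the Extended Data Fig. 3 thresholds.

    Both arguments are percentages. Returns "excellent", "good" or "none".
    """
    if ani >= 98.0 and aligned_fraction >= 80.0:
        return "excellent"
    if ani >= 95.0 and aligned_fraction >= 60.0:
        return "good"
    return "none"


def penalty_scores(genome_coverage: float,
                   read_depths: Sequence[float]) -> Tuple[float, float]:
    """Return (depth penalty, variation penalty) for one genome in one sample.

    depth penalty     = (100 - coverage) * log(mean read depth)
    variation penalty = (100 - coverage) * (sd of read depth / mean read depth)

    Assumptions (not stated in the caption): natural log, population sd.
    """
    mean_depth = mean(read_depths)
    if mean_depth <= 0:
        raise ValueError("mean read depth must be positive")
    missing_coverage = 100.0 - genome_coverage
    depth_penalty = missing_coverage * math.log(mean_depth)
    variation_penalty = missing_coverage * (pstdev(read_depths) / mean_depth)
    return depth_penalty, variation_penalty


if __name__ == "__main__":
    # Median completeness and contamination reported for the MGS genomes.
    print(quality_score(92.3, 0.8))
    print(match_level(96.0, 70.0))
    print(penalty_scores(80.0, [2.0, 4.0]))
```

For instance, a genome with 92.3% completeness and 0.8% contamination scores 92.3 − 5 × 0.8 = 88.3, which passes the QS >50 cut-off mentioned under Data availability.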

Supplementary information

  1. Supplementary Discussion

    This file contains the Supplementary Discussion, with further technical details and results concerning the reproducibility of the assembly/binning pipeline; detection of non-prokaryotic bins; genome de-replication and quality assessment; and comparison with publicly available uncultured genomes.

  2. Reporting Summary

  3. Supplementary Table 1

    File containing Supplementary Table 1 with information on the human gut datasets analysed

  4. Supplementary Table 2

    File with Supplementary Table 2 indicating the bins predicted to belong to eukaryotic organisms

  5. Supplementary Table 3

    File with Supplementary Table 3 with information on the near-complete bacterial genomes generated in this work

  6. Supplementary Table 4

    Detailed genome and quality statistics of the 1,952 unclassified metagenomic species (UMGS) identified in this work

  7. Supplementary Table 5

    Number and type of biosynthetic gene clusters detected with antiSMASH in the unclassified metagenomic species (UMGS)
