Towards the biogeography of prokaryotic genes

Abstract

Microbial genes encode the majority of the functional repertoire of life on Earth. However, despite increasing efforts in metagenomic sequencing of various habitats (refs. 1–3), little is known about the distribution of genes across the global biosphere, with implications for human and planetary health. Here we constructed a non-redundant gene catalogue of 303 million species-level genes (clustered at 95% nucleotide identity) from 13,174 publicly available metagenomes across 14 major habitats and used it to show that most genes are specific to a single habitat. The small fraction of genes found in multiple habitats is enriched in antibiotic-resistance genes and markers for mobile genetic elements. By further clustering these species-level genes into 32 million protein families, we observed that a small fraction of these families contain the majority of the genes (0.6% of families account for 50% of the genes). The majority of species-level genes and protein families are rare. Furthermore, species-level genes, and in particular the rare ones, show low rates of positive (adaptive) selection, supporting a model in which most genetic variability observed within each protein family is neutral or nearly neutral.

Fig. 1: Global Microbial Gene Catalogue, version 1.
Fig. 2: The number of conspecific genes (gene pool per species) and the functional redundancy in each metagenome show significantly less variation within than between habitats.
Fig. 3: Most genes are rare.
Fig. 4: Rare unigenes are under lower selection pressure.

Data availability

All data analysed during the current study are publicly available. Supplementary Table 1 contains the accession numbers for all the metagenomes used. GMGCv1 is available for download at https://gmgc.embl.de. The full catalogue is available for download, as are sub-catalogues specialized to individual habitats and the subset derived only from sequenced genomes (which can be further subset to obtain the pangenome of a species of interest). Both the full catalogue and a version containing only complete ORFs are available, as they represent different tradeoffs: the complete catalogue achieves higher coverage, while the version with only complete ORFs may be more appropriate for analyses that require the whole gene. Similarly, protein families are available at different amino acid identity thresholds (see ‘Protein family cluster calculation’). In addition to being available for download, the catalogue can be queried with an amino acid sequence. We developed and use a novel k-mer-based algorithm (see ‘k-mer based homology search’) to enable fast queries over the complete database of 303 million proteins and to allow interactive use.
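As a rough illustration of why a k-mer index can make protein queries fast, the sketch below builds an inverted index from k-mers to sequence identifiers and ranks database entries by the number of k-mers they share with the query. It is a generic toy example and not the GMGC search algorithm: the function names, the two toy sequences, and the choice of k = 5 are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(proteins, k=5):
    """Map every k-mer to the set of sequence names that contain it."""
    index = defaultdict(set)
    for name, seq in proteins.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def query(index, seq, k=5, top=3):
    """Rank indexed sequences by the number of k-mers shared with the query."""
    hits = Counter()
    for kmer in kmers(seq, k):
        # .get avoids inserting empty entries into the defaultdict
        for name in index.get(kmer, ()):
            hits[name] += 1
    return hits.most_common(top)


# Hypothetical toy database (made-up sequences, not catalogue entries)
proteins = {
    "protA": "MKTAYIAKQRQISFVKSHFSRQ",
    "protB": "MSDNGTLLLWERVPAQGHHKLM",
}
index = build_index(proteins, k=5)
result = query(index, "KTAYIAKQRQISFVK", k=5)
print(result)
```

In a practical setting the top-ranked candidates from such a lookup would typically be passed to a full alignment step, so that the index only has to narrow the search rather than decide the final hit.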

Code availability

The source code implementing the analyses in this manuscript is available on GitHub (https://github.com/luispedro/Coelho2021_GMGCv1) and is archived at Zenodo (https://doi.org/10.5281/zenodo.4769556).

References

  1. Sunagawa, S. et al. Structure and function of the global ocean microbiome. Science 348, 1261359 (2015).

  2. Zou, Y. et al. 1,520 reference genomes from cultivated human gut bacteria enable functional microbiome analyses. Nat. Biotechnol. 37, 179–185 (2019).

  3. Bahram, M. et al. Structure and function of the global topsoil microbiome. Nature 560, 233–237 (2018).

  4. Qin, J. et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature 464, 59–65 (2010).

  5. Xiao, L. et al. A catalog of the mouse gut metagenome. Nat. Biotechnol. 33, 1103–1108 (2015).

  6. Coelho, L. P. et al. Similarity of the dog and human gut microbiomes in gene content and response to diet. Microbiome 6, 72 (2018).

  7. Pasolli, E. et al. Extensive unexplored human microbiome diversity revealed by over 150,000 genomes from metagenomes spanning age, geography, and lifestyle. Cell 176, 649–662.e20 (2019).

  8. Partridge, S. R., Kwong, S. M., Firth, N. & Jensen, S. O. Mobile genetic elements associated with antimicrobial resistance. Clin. Microbiol. Rev. 31, (2018).

  9. Mende, D. R. et al. ProGenomes2: An improved database for accurate and consistent habitat, taxonomic and functional annotations of prokaryotic genomes. Nucleic Acids Res. 48, D621–D625 (2020).

  10. Jain, C., Rodriguez-R, L. M., Phillippy, A. M., Konstantinidis, K. T. & Aluru, S. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat. Commun. 9, 5114 (2018).

  11. Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).

  12. Haft, D. H. et al. RefSeq: an update on prokaryotic genome annotation and curation. Nucleic Acids Res. 46, D851–D860 (2018).

  13. Mering, C. von et al. Quantitative phylogenetic assessment of microbial communities in diverse environments. Science 315, 1126–1130 (2007).

  14. Richardson, E. J. et al. Gene exchange drives the ecological success of a multi-host bacterial pathogen. Nat. Ecol. Evol. 2, 1468–1478 (2018).

  15. Nielsen, H. B. et al. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nat. Biotechnol. 32, 822–828 (2014).

  16. Mende, D. R., Sunagawa, S., Zeller, G. & Bork, P. Accurate and universal delineation of prokaryotic species. Nat. Methods 10, 881–884 (2013).

  17. Huerta-Cepas, J. et al. Fast genome-wide functional annotation through orthology assignment by eggNOG-mapper. Mol. Biol. Evol. 34, 2115–2122 (2017).

  18. Louca, S. et al. Function and functional redundancy in microbial systems. Nat. Ecol. Evol. 2, 936–943 (2018).

  19. Maistrenko, O. M. et al. Disentangling the impact of environmental and phylogenetic constraints on prokaryotic within-species diversity. ISME J. 14, 1247–1259 (2020).

  20. Baumdicker, F., Hess, W. R. & Pfaffelhuber, P. The diversity of a distributed genome in bacterial populations. Ann. Appl. Probab. 20, 1567–1606 (2010).

  21. Sela, I., Wolf, Y. I. & Koonin, E. V. Theory of prokaryotic genome evolution. Proc. Natl Acad. Sci. USA 113, 11399–11407 (2016).

  22. Dandekar, T., Snel, B., Huynen, M. & Bork, P. Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem. Sci. 23, 324–328 (1998).

  23. Nei, M., Suzuki, Y. & Nozawa, M. The neutral theory of molecular evolution in the genomic era. Annu. Rev. Genomics Hum. Genet. 11, 265–289 (2010).

  24. Iranzo, J., Cuesta, J. A., Manrubia, S., Katsnelson, M. I. & Koonin, E. V. Disentangling the effects of selection and loss bias on gene dynamics. Proc. Natl Acad. Sci. USA 114, E5616–E5624 (2017).

  25. Wolf, Y. I., Makarova, K. S., Lobkovsky, A. E. & Koonin, E. V. Two fundamentally different classes of microbial genes. Nat. Microbiol. 2, 16208 (2016).

  26. Rasko, D. A. et al. The pangenome structure of Escherichia coli: comparative genomic analysis of E. coli commensal and pathogenic isolates. J. Bacteriol. 190, 6881–6893 (2008).

  27. Koskella, B., Hall, L. J. & Metcalf, C. J. E. The microbiome beyond the horizon of ecological and evolutionary theory. Nat. Ecol. Evol. 1, 1606–1615 (2017).

  28. Liu, R. et al. Gut microbiome and serum metabolome alterations in obesity and after weight-loss intervention. Nat. Med. 23, 859–868 (2017).

  29. Metcalf, J. L. et al. Microbial community assembly and metabolic function during mammalian corpse decomposition. Science 351, 158–162 (2015).

  30. Vincent, C. et al. Bloom and bust: intestinal microbiota dynamics in response to hospital exposures and Clostridium difficile colonization or infection. Microbiome 4, 12 (2016).

  31. Zeller, G. et al. Potential of fecal microbiota for early‐stage detection of colorectal cancer. Mol. Syst. Biol. 10, 766 (2014).

  32. Gibson, M. K. et al. Developmental dynamics of the preterm infant gut microbiota and antibiotic resistome. Nat. Microbiol. 1, 16024 (2016).

  33. Zhang, X. et al. The oral and gut microbiomes are perturbed in rheumatoid arthritis and partly normalized after treatment. Nat. Med. 21, 895–905 (2015).

  34. Brito, I. L. et al. Mobile genes in the human microbiome are structured from global to individual scales. Nature 535, 435–439 (2016).

  35. Vatanen, T. et al. Variation in microbiome LPS immunogenicity contributes to autoimmunity in humans. Cell 165, 842–853 (2016).

  36. Turnbaugh, P. J. et al. The human microbiome project. Nature 449, 804–810 (2007).

  37. Hannigan, G. D. et al. The human skin double-stranded DNA virome: topographical and temporal diversity, genetic enrichment, and dynamic associations with the host microbiome. MBio 6, e01578-15 (2015).

  38. Taft, D. H. et al. Intestinal microbiota of preterm infants differ over time and between hospitals. Microbiome 2, 36 (2014).

  39. Zeevi, D. et al. Personalized nutrition by prediction of glycemic responses. Cell 163, 1079–1094 (2015).

  40. Wilhelm, R. C. et al. Biogeography and organic matter removal shape long-term effects of timber harvesting on forest soil microbial communities. ISME J. 11, 2552–2568 (2017).

  41. Xie, H. et al. Shotgun metagenomics of 250 adult twins reveals genetic and environmental impacts on the gut microbiome. Cell Syst. 3, 572–584.e3 (2016).

  42. The MetaSUB International Consortium. The metagenomics and metadesign of the subways and urban biomes (MetaSUB) international consortium inaugural meeting report. Microbiome 4, 24 (2016).

  43. Chatelier, E. L. et al. Richness of human gut microbiome correlates with metabolic markers. Nature 500, 541–546 (2013).

  44. Li, J. et al. Gut microbiota dysbiosis contributes to the development of hypertension. Microbiome 5, (2017).

  45. Pehrsson, E. C. et al. Interconnected microbiomes and resistomes in low-income human habitats. Nature 533, 212–216 (2016).

  46. Li, J. et al. An integrated catalog of reference genes in the human gut microbiome. Nat. Biotechnol. 32, 834–841 (2014).

  47. Feng, Q. et al. Gut microbiome development along the colorectal adenoma–carcinoma sequence. Nat. Commun. 6, 6528 (2015).

  48. Gu, Y. et al. Analyses of gut microbiota and plasma bile acids enable stratification of patients for antidiabetic treatment. Nat. Commun. 8, 1785 (2017).

  49. Karlsson, F. H. et al. Gut metagenome in European women with normal, impaired and diabetic glucose control. Nature 498, 99–103 (2013).

  50. Yu, J. et al. Metagenomic analysis of faecal microbiome as a tool towards targeted non-invasive biomarkers for colorectal cancer. Gut 66, 70–78 (2017).

  51. Youngster, I. et al. Fecal microbiota transplant for relapsing Clostridium difficile infection using a frozen inoculum from unrelated donors: a randomized, open-label, controlled pilot study. Clin. Infect. Dis. 58, 1515–1522 (2014).

  52. Guittar, J., Shade, A. & Litchman, E. Trait-based community assembly and succession of the infant gut microbiome. Nat. Commun. 10, 512 (2019).

  53. Vogtmann, E. et al. Colorectal cancer and the human gut microbiome: reproducibility with whole-genome shotgun sequencing. PLoS ONE 11, e0155362 (2016).

  54. Chng, K. R. et al. Whole metagenome profiling reveals skin microbiome-dependent susceptibility to atopic dermatitis flare. Nat. Microbiol. 1, 16106 (2016).

  55. Chu, D. M. et al. Maturation of the infant microbiome community structure and function across multiple body sites and in relation to mode of delivery. Nat. Med. 23, 314–326 (2017).

  56. Van Rossum, T. et al. Spatiotemporal dynamics of river viruses, bacteria and microeukaryotes. Preprint at https://doi.org/10.1101/259861 (2018).

  57. Feng, Q. et al. Integrated metabolomics and metagenomics analysis of plasma and urine identified microbial metabolites associated with coronary heart disease. Sci. Rep. 6, 22525 (2016).

  58. Oh, J., Byrd, A. L., Park, M., Kong, H. H. & Segre, J. A. Temporal stability of the human skin microbiome. Cell 165, 854–866 (2016).

  59. Xiao, L. et al. A reference gene catalogue of the pig gut microbiome. Nat. Microbiol. 1, 16161 (2016).

  60. R Core Team. R: a language and environment for statistical computing (R Foundation for Statistical Computing, 2014).

  61. Coelho, L. P. et al. NG-meta-profiler: Fast processing of metagenomes using ngless, a domain-specific language. Microbiome 7, 84 (2019).

  62. Li, D., Liu, C.-M., Luo, R., Sadakane, K. & Lam, T.-W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct De Bruijn graph. Bioinformatics 31, 1674–1676 (2015).

  63. Besemer, J. & Borodovsky, M. GeneMark: web software for gene finding in prokaryotes, eukaryotes and viruses. Nucleic Acids Res. 33, W451–W454 (2005).

  64. Coelho, L. P. Jug: Software for parallel reproducible computation in Python. J. Open Res. Softw. 5, 30 (2017).

  65. Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59–60 (2015).

  66. Eberhardt, R. Y. et al. AntiFam: A tool to help identify spurious ORFs in protein annotation. Database 2012, bas003 (2012).

  67. Kang, D. et al. MetaBAT 2: An adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ 7, e7359 (2019).

  68. Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://arxiv.org/abs/1303.3997 (2013).

  69. Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015).

  70. Bowers, R. M. et al. Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat. Biotechnol. 35, 725–731 (2017).

  71. Zhou, W., Gay, N. & Oh, J. ReprDB and panDB: minimalist databases with maximal microbial representation. Microbiome 6, 15 (2018).

  72. Hingamp, P. et al. Exploring nucleo-cytoplasmic large DNA viruses in Tara Oceans microbial metagenomes. ISME J. 7, 1678–1695 (2013).

  73. Seemann, T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 30, 2068–2069 (2014).

  74. Huerta-Cepas, J. et al. eggNOG 5.0: A hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 47, D309–D314 (2019).

  75. Eddy, S. R. Accelerated profile HMM searches. PLoS Comput. Biol. 7, e1002195 (2011).

  76. Smyshlyaev, G., Barabas, O. & Bateman, A. Sequence analysis allows functional annotation of tyrosine recombinases in prokaryotic genomes. Mol. Syst. Biol. 17, e9880 (2021).

  77. Jia, B. et al. CARD 2017: Expansion and model-centric curation of the comprehensive antibiotic resistance database. Nucleic Acids Res. 45, D566–D573 (2017).

  78. Gibson, M. K., Forsberg, K. J. & Dantas, G. Improved annotation of antibiotic resistance determinants reveals microbial resistomes cluster by ecology. ISME J. 9, 207–216 (2015).

  79. Li, T., Fan, K., Wang, J. & Wang, W. Reduction of protein sequence complexity by residue grouping. Protein Eng. 16, 323–330 (2003).

  80. Zhao, M., Lee, W.-P., Garrison, E. P. & Marth, G. T. SSW library: an SIMD Smith–Waterman C/C++ library for use in genomic applications. PLoS ONE 8, e82138 (2013).

  81. Li, H. Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).

  82. Milanese, A. et al. Microbial abundance, activity and population genomic profiling with mOTUs2. Nat. Commun. 10, 1014 (2019).

  83. Salter, S. J. et al. Reagent and laboratory contamination can critically impact sequence-based microbiome analyses. BMC Biol. 12, 87 (2014).

  84. Kumar, R., Acharya, V., Singh, D. & Kumar, S. Strategies for high-altitude adaptation revealed from high-quality draft genome of non-violacein producing Janthinobacterium lividum ERGS5:01. Stand. Genomic Sci. 13, 11 (2018).

  85. Patijanasoontorn, B. et al. Hospital acquired Janthinobacterium lividum septicemia in Srinagarind Hospital. J. Med. Assoc. Thai. 75 Suppl 2, 6–10 (1992).

  86. Harris, C. R. et al. Array programming with NumPy. Nature 585, 357–362 (2020).

  87. Virtanen, P. et al. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).

  88. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

  89. Collins, R. E. & Higgs, P. G. Testing the infinitely many genes model for the evolution of the bacterial core genome and pangenome. Mol. Biol. Evol. 29, 3413–3425 (2012).

  90. Sievers, F. et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 7, 539 (2011).

  91. Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2—approximately maximum-likelihood trees for large alignments. PLoS ONE 5, e9490 (2010).

  92. Huerta-Cepas, J., Serra, F. & Bork, P. ETE 3: reconstruction, analysis, and visualization of phylogenomic data. Mol. Biol. Evol. 33, 1635–1638 (2016).

  93. Edgar, R. C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797 (2004).

  94. Suyama, M., Torrents, D. & Bork, P. PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Res. 34, W609–12 (2006).

  95. Murrell, B. et al. FUBAR: a fast, unconstrained Bayesian approximation for inferring selection. Mol. Biol. Evol. 30, 1196–1205 (2013).

  96. Smith, M. D. et al. Less is more: an adaptive branch-site random effects model for efficient detection of episodic diversifying selection. Mol. Biol. Evol. 32, 1342–1353 (2015).

  97. Washietl, S. et al. RNAcode: robust discrimination of coding and noncoding regions in comparative sequence data. RNA 17, 578–594 (2011).

Acknowledgements

Funding was provided by the European Union’s Horizon 2020 Research and Innovation Programme (grant 686070: DD-DeCaF to P.B.) and Marie Skłodowska-Curie Actions (grant 713673 to A.R.d.R.), the European Research Council (ERC) MicrobioS (ERC-AdG-669830 to P.B.), JTC project jumpAR (01KI1706 to P.B.), a BMBF grant (grant 031L0181A: LAMarCK to P.B.), the European Molecular Biology Laboratory (P.B.), the ETH and Helmut Horten Foundation (S.S.), the National Key R&D Program of China (grant 2020YFA0712403 to X.-M.Z.), National Natural Science Foundation of China (grant 61932008 to X.-M.Z.; grant 61772368 to X.-M.Z.; grant 31950410544 to L.P.C.), the Shanghai Municipal Science and Technology Major Project (grant 2018SHZDZX01 to X.-M.Z. and L.P.C.) and Zhangjiang Lab (X.-M.Z. and L.P.C.), the International Development Research Centre (grant 109304, EMBARK under the JPI AMR framework; to L.P.C.), la Caixa Foundation (grant 100010434, fellowship code LCF/BQ/DI18/11660009 to A.R.d.R.), the Severo Ochoa Program for Centres of Excellence in R&D from the Agencia Estatal de Investigación of Spain (grant SEV-2016-0672 (2017–2021) to C.P.C.), the Ministerio de Ciencia, Innovación y Universidades (grant PGC2018-098073-A-I00 MCIU/AEI/FEDER to J.H.-C. and J.G.-L.), the Innovation Fund Denmark (grant 4203-00005B, PNM), the Biotechnology and Biological Sciences Research Council (BBSRC) Institute Strategic Programme Gut Microbes and Health BB/R012490/1 and its constituent project BBS/E/F/000PR10355 (F.H.). R.A. is a member of the Collaboration for joint PhD degree between EMBL and Heidelberg University, Faculty of Biosciences. The authors thank the Bork group for helpful discussion, in particular A. Głazek for discussions of algorithm design, J. C. Somody (EMBL) for help with figure design, and A. Fullam (EMBL) for computational assistance in processing the MAGs.

Author information

Contributions

The study was conceived and supervised by P.B. and designed by L.P.C., S.S., J.H.-C. and P.B. L.P.C., R.A., A.R.d.R., P.N.M., T.S.S., A.O., F.H., T.V.R., S.K.F., S.K., O.M.M., P.F. and J.H.-C. analysed data. L.P.C., T.S.S., F.H., T.V.R., S.K.F., P.F., J.H.-C. and P.B. drafted the manuscript. L.P.C., R.A., A.R.d.R., C.P.C. and D.R.M. built the unigene, protein clusters and protein family catalogues. L.P.C., R.A., T.S.S., D.R.M., I.L., F.H., S.K.F., S.K. and J.H.-C. annotated the catalogue. A.R.d.R., C.P.C., J.G.-L., O.M.M. and J.H.-C. performed the selection pressure analyses. P.N.M. and H.B.N. built the MGSs. L.P.C., R.A., I.L., S.P., L.J., X.-M.Z., T.V.R. and J.H.-C. designed and implemented the web resource, including the search algorithms and the associated GMGC-mapper tool. L.P.C., T.S.S., F.H. and O.M.M. annotated metagenomes. T.S.S. and A.O. built the MAGs. All authors contributed to the review of the manuscript before submission for publication and approved the final version.

Corresponding authors

Correspondence to Luis Pedro Coelho, Jaime Huerta-Cepas or Peer Bork.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Gene accumulation curves.

Legend: (a) For most (but not all) habitats, unigenes with high prevalence (≥ 5%) have been well captured, while rare unigenes continue to be found in each new sample. (b-d) New unigenes continue to be found in each sample. Each grey line represents a random permutation of the samples, while the solid black line shows the mean over these random permutations. The dotted red line is a least-squares fit of Heaps' law (N = k · sample^alpha). In all cases, the parameter fit indicates that the number of unigenes has not reached saturation. (e) The number of assembled/detected genes per sample grows with sequencing depth without a plateau being reached. (f) Similarly, the number of detected ORFs per insert grows with sequencing depth.
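The accumulation curves and the power-law fit described in this legend can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the presence/absence matrix is random toy data, the function names are hypothetical, and the fit is done by ordinary least squares on log-transformed values (the legend says only "least-squares fit", so fitting in log space is an assumption).

```python
import numpy as np


def accumulation_curve(presence, rng):
    """Cumulative number of distinct genes along one random sample order.

    presence: boolean array of shape (samples, genes).
    """
    order = rng.permutation(presence.shape[0])
    seen = np.logical_or.accumulate(presence[order], axis=0)
    return seen.sum(axis=1)


def fit_heaps_law(n_samples, n_genes):
    """Fit N = k * n**alpha by least squares on log-transformed data."""
    log_n = np.log(np.asarray(n_samples, dtype=float))
    log_genes = np.log(np.asarray(n_genes, dtype=float))
    alpha, log_k = np.polyfit(log_n, log_genes, deg=1)
    return float(np.exp(log_k)), float(alpha)


rng = np.random.default_rng(0)
# Toy data: 50 samples x 2,000 genes, each gene present with probability 0.02
presence = rng.random((50, 2000)) < 0.02

# One curve per random permutation, plus the mean curve over permutations
curves = np.array([accumulation_curve(presence, rng) for _ in range(20)])
mean_curve = curves.mean(axis=0)

n = np.arange(1, presence.shape[0] + 1)
k, alpha = fit_heaps_law(n, mean_curve)
print(f"k = {k:.1f}, alpha = {alpha:.3f}")
```

Under this parametrization the fitted curve keeps growing for any exponent above zero, so a fitted alpha close to zero is what a saturating catalogue would look like.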

Extended Data Fig. 2 Identity thresholds and their relationship to taxonomy and function in the GMGCv1.

Legend: (a) A 95% nucleotide identity threshold is a proxy for species. Shown is nucleotide identity of closest gene homolog within the same species or within the same genus (excluding within-species comparisons). The threshold used in this work (95%) is marked with a dashed red line. (b) Within well-conserved, universal, 40 single-copy orthologues (see Methods), the average pairwise amino acid identity is 49%, albeit with a wide range (27-75%) when considering within-orthologue averages. In dashed red, the thresholds used for building protein families are highlighted. Boxplots display quartiles and ranges (see Methods). (c) Proportion of genes annotated at each taxonomic level.

Extended Data Fig. 3 Short reads map to the GMGCv1 at higher rates (compared to a reference database of reference genomes).

Legend: (a) Mapping rates for short reads from metagenomes mapped against the GMGCv1 or the reference genomes in proGenomes2. (b) Fraction of short reads from human gut metagenomes mapping to a collection of sequenced genomes and the GMGCv1, per country. (c) Same data as (b), aggregated by the World Bank’s classification of countries into income groups. In all panels, boxplots show quartiles (including median) and range (except for outliers, see Methods). Blue boxes show mapping rates to proGenomes2, while orange boxes show mapping rates to GMGCv1.

Extended Data Fig. 4 MAGs only capture a small fraction of all genes in a sample.

Legend: Fraction of undetected genes when mapping to only the genes captured by metagenome-assembled genomes (MAGs) across the habitats compared to mapping to the full GMGCv1.

Extended Data Fig. 5 Species and protein cluster sharing between habitats is similar to unigene sharing, but sharing of protein families is more extensive.

Legend: (a) The sharing of metagenomic species between habitats mimics unigene sharing. The width of each ribbon represents the number of MGSs shared between the habitats (the largest number shared is between the human and the pig gut, which share 166 MGSs out of 1,908 MGSs in the human gut and 898 in the pig gut, respectively). (b) Species-level unigene sharing between habitats by fraction of the number of unigenes from each habitat (cf. Fig. 1b, which uses abundance weighting). (c) Sharing of protein clusters (90% amino acid identity clusters) between habitats, abundance-weighted. (d) Sharing of protein families between habitats, abundance-weighted. When considering coarser clusterings of sequences, gene sharing between habitats increases, yet we still observed higher rates of sharing between similar habitats and significant fractions of habitat-specific families (e.g., in the marine environment, 31.3% of the genes, by abundance, are in marine-specific protein families).

Extended Data Fig. 6 Antibiotic resistance and mobile genes are more likely to be multi-habitat genes, while most species are found in a single habitat.

Legend: (a) Fraction of unigenes within each habitat which are multi-habitat genes (for all unigenes, or when considering only mobile elements or antibiotic resistance genes). (b) A total of 7,443 MGSs were built across all the habitats as species proxies to reliably assess their habitats. Each circle shows the number of metagenomic species for each habitat; the x-axis represents the number of genes in the catalogue specific to each habitat, and the y-axis represents the number of samples. Note that differing sampling depth and habitat-specific biodiversity impact those numbers.

Extended Data Fig. 7 Determinants of functional community structure.

Legend: (a) Principal coordinate analysis of all samples by protein family profile and the correlations with taxonomic and protein family richness (after rarefying to 1 million inserts to remove effects of sample depth). (b) Hierarchical clustering of the habitats based on high-level functional profiles.
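Rarefying to a fixed depth before comparing richness can be sketched as below. This is an illustrative example rather than the authors' pipeline: it assumes rarefaction is done by drawing inserts without replacement from a per-family count vector, the counts are made up, and the toy depth of 100 stands in for the 1 million inserts mentioned in the legend.

```python
import numpy as np


def rarefy(counts, depth, rng):
    """Subsample a count vector to `depth` without replacement.

    Returns None when the sample has fewer than `depth` inserts,
    so that shallow samples can be dropped rather than compared unfairly.
    """
    counts = np.asarray(counts, dtype=np.int64)
    if counts.sum() < depth:
        return None
    return rng.multivariate_hypergeometric(counts, depth)


def richness(counts):
    """Number of features (e.g. protein families) with a non-zero count."""
    return int(np.count_nonzero(counts))


rng = np.random.default_rng(42)
# Hypothetical insert counts per protein family in one sample
sample = np.array([500, 300, 150, 40, 8, 2])

rarefied = rarefy(sample, 100, rng)
print(rarefied, richness(rarefied))
```

Because every retained sample ends up with the same number of inserts, differences in richness after this step are less likely to reflect sequencing depth alone.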

Extended Data Fig. 8 Marine and soil richness patterns are a mixture of subpatterns.

Legend: Conspecific genes per species in (a) marine and (b) soil sub-habitats. The differences in the marine environment are particularly large when comparing the samples in the photic zones (the shallower, light-accessible, surface and deep-chlorophyll maximum samples) to the non-photic mesopelagic samples (deeper, beyond the reach of sunlight). The differences in the soil environment follow differences in acidity (with Podzol, Dystric Brunisol and Ultic soils being acidic, while Luvisols are usually neutral or alkaline) and differences in moisture (with Xeralfs being dry in the summer, while Glossudalfs are moist year round).

Extended Data Fig. 9 Most genes are detected only infrequently and rare genes are (on average) present at a lower abundance in metagenomes.

Legend: (a) Shown is the percentage of genes detected in at most 1,...,50 metagenomes (out of a total of 13,174). (b, c) Histograms of gene prevalence are roughly linear on a log-log scale, as predicted from neutral or nearly-neutral evolution models. Shown are histograms for 90% amino acid identity protein clusters (b) and 20% amino acid identity protein families (c), which behave similarly to species-level unigenes (see Fig. 3). (d) Shown is the percentage of genes in each sample that is composed of rare genes (Count) and the total abundance represented by these (Abundance). Except for wastewater (likely due to under-sampling), rare genes represent a lower fraction of the abundance than of detection. Boxplots show quartiles (including median drawn as a line) and whiskers show the range of the data excluding outliers, which are shown as extra elements (see Methods).
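The quantities in panels (a) to (c) can be illustrated with a small sketch: the share of genes detected in at most t metagenomes, and the slope of the prevalence histogram on a log-log scale. The presence/absence matrix below is a tiny made-up example and the function names are hypothetical; the sketch only shows the bookkeeping and does not reproduce any result from the study.

```python
import numpy as np


def prevalence_summary(presence, max_count):
    """Per-gene prevalence and % of detected genes found in <= t samples.

    presence: boolean array of shape (samples, genes).
    """
    prevalence = presence.sum(axis=0)
    detected = prevalence[prevalence > 0]  # ignore genes never observed
    thresholds = np.arange(1, max_count + 1)
    pct = np.array([(detected <= t).mean() * 100 for t in thresholds])
    return detected, pct


def loglog_slope(prevalence):
    """Slope of log(number of genes) against log(prevalence)."""
    counts = np.bincount(prevalence)[1:]  # counts[i]: genes seen in i+1 samples
    values = np.arange(1, counts.size + 1)
    keep = counts > 0                     # log is undefined for empty bins
    slope, _ = np.polyfit(np.log(values[keep]), np.log(counts[keep]), 1)
    return float(slope)


# Toy matrix: 4 samples x 6 genes (illustrative only)
presence = np.array([
    [1, 0, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 0, 1],
], dtype=bool)

detected, pct = prevalence_summary(presence, max_count=4)
slope = loglog_slope(detected)
print(pct, slope)
```

A negative slope means that genes found in few samples outnumber widespread ones; a roughly straight line on the log-log plot is the pattern the legend describes.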

Extended Data Fig. 10 More abundant and larger protein families are under more intense selection.

Legend: (a) dN/dS within each protein family, with protein families split into 5 abundance quintiles, showing a downward trend with abundance (higher negative selection). (b) dN/dS within each gene size category, similarly showing a downward trend with size. Categories are defined by increasing size, with each bin representing the same number of unigenes. Boxplots show quartiles and ranges (see Methods).

Supplementary information

Reporting Summary

Supplementary Table 1

List of samples included in the global gene catalogue including their habitat annotation. This table includes the accession ID of the metagenome, the GC fraction, and the size of the input (measured in total basepairs and number of inserts) both prior to (Raw) and post quality filtering (HQ, for high-quality). It also includes the number of ORFs predicted.

Supplementary Table 2

Mapping of our human-readable habitat terms to predefined ontology terms. List of habitat annotations used and their correspondence to EnvO/Uberon terms and NCBI taxonomic ID of the host (for host-associated habitat).

Supplementary Table 3

Statistics of (redundant) ORF predictions. The tables summarize the assembly statistics per habitat (including number of contigs, average contig size, N50) and the ORF stats per habitat, both complete ORFs and fragmentary ORFs.

Supplementary Table 4

Robustness checks for protein family definition and inclusion or exclusion of fragmentary ORFs. The major statements in the manuscript were tested using alternative protein family thresholds as well as including or excluding fragmentary ORFs to ensure robustness.

Supplementary Table 5

List of metagenome-assembled genomes and their statistics. All the assembled bins are included, including those marked as low-quality (following standard criteria, see Methods).

Supplementary Table 6

Pearson correlations between observed gene prevalence and theoretical prediction assuming neutral evolution.

Cite this article

Coelho, L.P., Alves, R., del Río, Á.R. et al. Towards the biogeography of prokaryotic genes. Nature 601, 252–256 (2022). https://doi.org/10.1038/s41586-021-04233-4

  • DOI: https://doi.org/10.1038/s41586-021-04233-4
