Towards the biogeography of prokaryotic genes

Coelho, Luis Pedro; Alves, Renato; del Río, Álvaro Rodríguez; Myers, Pernille Neve; Cantalapiedra, Carlos P.; Giner-Lamia, Joaquín; Schmidt, Thomas Sebastian; Mende, Daniel R.; Orakov, Askarbek; Letunic, Ivica; Hildebrand, Falk; Van Rossum, Thea; Forslund, Sofia K.; Khedkar, Supriya; Maistrenko, Oleksandr M.; Pan, Shaojun; Jia, Longhao; Ferretti, Pamela; Sunagawa, Shinichi; Zhao, Xing-Ming; Nielsen, Henrik Bjørn; Huerta-Cepas, Jaime; Bork, Peer

doi:10.1038/s41586-021-04233-4

Article
Published: 15 December 2021

Nature volume 601, pages 252–256 (2022)

Abstract

Microbial genes encode the majority of the functional repertoire of life on Earth. However, despite increasing efforts in metagenomic sequencing of various habitats^1,2,3, little is known about the distribution of genes across the global biosphere, with implications for human and planetary health. Here we constructed a non-redundant gene catalogue of 303 million species-level genes (clustered at 95% nucleotide identity) from 13,174 publicly available metagenomes across 14 major habitats and used it to show that most genes are specific to a single habitat. The small fraction of genes found in multiple habitats is enriched in antibiotic-resistance genes and markers for mobile genetic elements. By further clustering these species-level genes into 32 million protein families, we observed that a small fraction of these families contain the majority of the genes (0.6% of families account for 50% of the genes). The majority of species-level genes and protein families are rare. Furthermore, species-level genes, and in particular the rare ones, show low rates of positive (adaptive) selection, supporting a model in which most genetic variability observed within each protein family is neutral or nearly neutral.
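To make the statement "0.6% of families account for 50% of the genes" concrete, the following minimal sketch shows how such a concentration statistic can be computed from a list of family sizes. The function name `fraction_of_families_for_share` and the toy data are hypothetical and are not taken from the catalogue or from the authors' code.

```python
import numpy as np

def fraction_of_families_for_share(family_sizes, share=0.5):
    """Smallest fraction of families (largest first) that together hold `share` of all genes."""
    sizes = np.sort(np.asarray(family_sizes))[::-1]   # largest families first
    cumulative = np.cumsum(sizes)                     # genes covered by the top-i families
    target = share * cumulative[-1]
    # First position where the cumulative count reaches the target, as a 1-based count.
    needed = np.searchsorted(cumulative, target) + 1
    return needed / len(sizes)

# Toy, heavy-tailed family sizes (illustrative only, not the GMGCv1 distribution).
rng = np.random.default_rng(0)
toy_sizes = rng.zipf(2.0, size=100_000)
print(f"{100 * fraction_of_families_for_share(toy_sizes):.2f}% of toy families hold 50% of toy genes")
```

With a heavy-tailed size distribution, a small fraction of families is enough to reach half of all genes, which is the pattern the abstract describes.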

**Fig. 1: Global Microbial Gene Catalogue, version 1.**

**Fig. 2: The number of conspecific genes (gene pool per species) and the functional redundancy in each metagenome show significantly less variation within than between habitats.**

**Fig. 4: Rare unigenes are under lower selection pressure.**

Data availability

All data analysed during the current study are publicly available. Supplementary Table 1 contains the accession numbers for all the metagenomes used. GMGCv1 is available for download at https://gmgc.embl.de. The full catalogue is available for download, as are sub-catalogues specialized to individual habitats and the subset derived only from sequenced genomes (which can be further subset to obtain the pangenome of a species of interest). Both the full catalogue and a version containing only complete ORFs are available, as they represent different tradeoffs: the complete catalogue achieves higher coverage, while the version with only complete ORFs may be more appropriate for analyses that require the whole gene. Similarly, protein families are available at different amino acid identity thresholds (see ‘Protein family cluster calculation’). In addition to being available for download, the catalogue can be queried with an amino acid sequence. We developed and used a novel k-mer-based algorithm (see ‘k-mer based homology search’) to enable fast queries over the complete database of 303 million proteins and allow interactive use.
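The details of the k-mer based search are given in the Methods and are not reproduced here. As a generic illustration of the idea only, the sketch below builds an inverted index from k-mers to sequences and ranks candidates by the number of k-mers they share with a query. This is not the authors' algorithm: the function names, the value of `k`, and the toy sequences are all hypothetical, and in practice candidates found this way would usually be verified with a proper alignment.

```python
from collections import Counter, defaultdict

def kmers(seq, k):
    """Set of all overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_index(proteins, k=5):
    """Inverted index: k-mer -> names of the proteins that contain it."""
    index = defaultdict(set)
    for name, seq in proteins.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index

def candidates(query, index, k=5, top=3):
    """Rank indexed proteins by the number of k-mers shared with the query."""
    hits = Counter()
    for kmer in kmers(query, k):
        for name in index.get(kmer, ()):
            hits[name] += 1
    return hits.most_common(top)

# Toy database (made-up sequences, not catalogue entries).
toy_proteins = {
    "p1": "MKTAYIAKQRQISFVKSHFSRQ",
    "p2": "MSDNELLKKAGVTRRW",
    "p3": "MKTAYIAKQRGISFVKSHFARQ",
}
toy_index = build_index(toy_proteins)
print(candidates("MKTAYIAKQRQISFVKSHFSRQ", toy_index))
```

The appeal of this kind of prefilter is that the cost of a query depends on the number of k-mers in the query and on the sizes of the matching index entries, rather than on the total size of the database.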

Code availability

The source code implementing the analyses in this manuscript is available on GitHub (https://github.com/luispedro/Coelho2021_GMGCv1) and is archived at Zenodo (https://doi.org/10.5281/zenodo.4769556).

References

Sunagawa, S. et al. Structure and function of the global ocean microbiome. Science 348, 1261359 (2015).
Zou, Y. et al. 1,520 reference genomes from cultivated human gut bacteria enable functional microbiome analyses. Nat. Biotechnol. 37, 179–185 (2019).
Bahram, M. et al. Structure and function of the global topsoil microbiome. Nature 560, 233–237 (2018).
Qin, J. et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature 464, 59–65 (2010).
Xiao, L. et al. A catalog of the mouse gut metagenome. Nat. Biotechnol. 33, 1103–1108 (2015).
Coelho, L. P. et al. Similarity of the dog and human gut microbiomes in gene content and response to diet. Microbiome 6, 72 (2018).
Pasolli, E. et al. Extensive unexplored human microbiome diversity revealed by over 150,000 genomes from metagenomes spanning age, geography, and lifestyle. Cell 176, 649–662.e20 (2019).
Partridge, S. R., Kwong, S. M., Firth, N. & Jensen, S. O. Mobile genetic elements associated with antimicrobial resistance. Clin. Microbiol. Rev. 31, (2018).
Mende, D. R. et al. ProGenomes2: An improved database for accurate and consistent habitat, taxonomic and functional annotations of prokaryotic genomes. Nucleic Acids Res. 48, D621–D625 (2020).
Jain, C., Rodriguez-R, L. M., Phillippy, A. M., Konstantinidis, K. T. & Aluru, S. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat. Commun. 9, 5114 (2018).
Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
Haft, D. H. et al. RefSeq: an update on prokaryotic genome annotation and curation. Nucleic Acids Res. 46, D851–D860 (2018).
Mering, C. von et al. Quantitative phylogenetic assessment of microbial communities in diverse environments. Science 315, 1126–1130 (2007).
Richardson, E. J. et al. Gene exchange drives the ecological success of a multi-host bacterial pathogen. Nat. Ecol. Evol. 2, 1468–1478 (2018).
Nielsen, H. B. et al. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nat. Biotechnol. 32, 822–828 (2014).
Mende, D. R., Sunagawa, S., Zeller, G. & Bork, P. Accurate and universal delineation of prokaryotic species. Nat. Methods 10, 881–884 (2013).
Huerta-Cepas, J. et al. Fast genome-wide functional annotation through orthology assignment by eggNOG-mapper. Mol. Biol. Evol. 34, 2115–2122 (2017).
Louca, S. et al. Function and functional redundancy in microbial systems. Nat. Ecol. Evol. 2, 936–943 (2018).
Maistrenko, O. M. et al. Disentangling the impact of environmental and phylogenetic constraints on prokaryotic within-species diversity. ISME J. 14, 1247–1259 (2020).
Baumdicker, F., Hess, W. R. & Pfaffelhuber, P. The diversity of a distributed genome in bacterial populations. Ann. Appl. Probab. 20, 1567–1606 (2010).
Sela, I., Wolf, Y. I. & Koonin, E. V. Theory of prokaryotic genome evolution. Proc. Natl Acad. Sci. USA 113, 11399–11407 (2016).
Dandekar, T., Snel, B., Huynen, M. & Bork, P. Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem. Sci. 23, 324–328 (1998).
Nei, M., Suzuki, Y. & Nozawa, M. The neutral theory of molecular evolution in the genomic era. Annu. Rev. Genomics Hum. Genet. 11, 265–289 (2010).
Iranzo, J., Cuesta, J. A., Manrubia, S., Katsnelson, M. I. & Koonin, E. V. Disentangling the effects of selection and loss bias on gene dynamics. Proc. Natl Acad. Sci. USA 114, E5616–E5624 (2017).
Wolf, Y. I., Makarova, K. S., Lobkovsky, A. E. & Koonin, E. V. Two fundamentally different classes of microbial genes. Nat. Microbiol. 2, 16208 (2016).
Rasko, D. A. et al. The pangenome structure of Escherichia coli: comparative genomic analysis of E. coli commensal and pathogenic isolates. J. Bacteriol. 190, 6881–6893 (2008).
Koskella, B., Hall, L. J. & Metcalf, C. J. E. The microbiome beyond the horizon of ecological and evolutionary theory. Nat. Ecol. Evol. 1, 1606–1615 (2017).
Liu, R. et al. Gut microbiome and serum metabolome alterations in obesity and after weight-loss intervention. Nat. Med. 23, 859–868 (2017).
Metcalf, J. L. et al. Microbial community assembly and metabolic function during mammalian corpse decomposition. Science 351, 158–162 (2015).
Vincent, C. et al. Bloom and bust: intestinal microbiota dynamics in response to hospital exposures and Clostridium difficile colonization or infection. Microbiome 4, 12 (2016).
Zeller, G. et al. Potential of fecal microbiota for early‐stage detection of colorectal cancer. Mol. Syst. Biol. 10, 766 (2014).
Gibson, M. K. et al. Developmental dynamics of the preterm infant gut microbiota and antibiotic resistome. Nat. Microbiol. 1, 16024 (2016).
Zhang, X. et al. The oral and gut microbiomes are perturbed in rheumatoid arthritis and partly normalized after treatment. Nat. Med. 21, 895–905 (2015).
Brito, I. L. et al. Mobile genes in the human microbiome are structured from global to individual scales. Nature 535, 435–439 (2016).
Vatanen, T. et al. Variation in microbiome LPS immunogenicity contributes to autoimmunity in humans. Cell 165, 842–853 (2016).
Turnbaugh, P. J. et al. The human microbiome project. Nature 449, 804–810 (2007).
Hannigan, G. D. et al. The human skin double-stranded DNA virome: topographical and temporal diversity, genetic enrichment, and dynamic associations with the host microbiome. MBio 6, e01578-15 (2015).
Taft, D. H. et al. Intestinal microbiota of preterm infants differ over time and between hospitals. Microbiome 2, 36 (2014).
Zeevi, D. et al. Personalized nutrition by prediction of glycemic responses. Cell 163, 1079–1094 (2015).
Wilhelm, R. C. et al. Biogeography and organic matter removal shape long-term effects of timber harvesting on forest soil microbial communities. ISME J. 11, 2552–2568 (2017).
Xie, H. et al. Shotgun metagenomics of 250 adult twins reveals genetic and environmental impacts on the gut microbiome. Cell Syst. 3, 572–584.e3 (2016).
The MetaSUB International Consortium. The metagenomics and metadesign of the subways and urban biomes (metasub) international consortium inaugural meeting report. Microbiome 4, 24 (2016).
Chatelier, E. L. et al. Richness of human gut microbiome correlates with metabolic markers. Nature 500, 541–546 (2013).
Li, J. et al. Gut microbiota dysbiosis contributes to the development of hypertension. Microbiome 5, (2017).
Pehrsson, E. C. et al. Interconnected microbiomes and resistomes in low-income human habitats. Nature 533, 212–216 (2016).
Li, J. et al. An integrated catalog of reference genes in the human gut microbiome. Nat. Biotechnol. 32, 834–841 (2014).
Feng, Q. et al. Gut microbiome development along the colorectal adenoma–carcinoma sequence. Nat. Commun. 6, 6528 (2015).
Gu, Y. et al. Analyses of gut microbiota and plasma bile acids enable stratification of patients for antidiabetic treatment. Nat. Commun. 8, 1785 (2017).
Karlsson, F. H. et al. Gut metagenome in european women with normal, impaired and diabetic glucose control. Nature 498, 99–103 (2013).
Yu, J. et al. Metagenomic analysis of faecal microbiome as a tool towards targeted non-invasive biomarkers for colorectal cancer. Gut 66, 70–78 (2017).
Youngster, I. et al. Fecal microbiota transplant for relapsing clostridium difficile infection using a frozen inoculum from unrelated donors: a randomized, open-label, controlled pilot study. Clin. Infect. Dis. 58, 1515–1522 (2014).
Guittar, J., Shade, A. & Litchman, E. Trait-based community assembly and succession of the infant gut microbiome. Nat. Commun. 10, 512 (2019).
Vogtmann, E. et al. Colorectal cancer and the human gut microbiome: reproducibility with whole-genome shotgun sequencing. PLoS ONE 11, e0155362 (2016).
Chng, K. R. et al. Whole metagenome profiling reveals skin microbiome-dependent susceptibility to atopic dermatitis flare. Nat. Microbiol. 1, 16106 (2016).
Chu, D. M. et al. Maturation of the infant microbiome community structure and function across multiple body sites and in relation to mode of delivery. Nat. Med. 23, 314–326 (2017).
Van Rossum, T. et al. Spatiotemporal dynamics of river viruses, bacteria and microeukaryotes. Preprint at https://doi.org/10.1101/259861 (2018).
Feng, Q. et al. Integrated metabolomics and metagenomics analysis of plasma and urine identified microbial metabolites associated with coronary heart disease. Sci. Rep. 6, 22525 (2016).
Oh, J., Byrd, A. L., Park, M., Kong, H. H. & Segre, J. A. Temporal stability of the human skin microbiome. Cell 165, 854–866 (2016).
Xiao, L. et al. A reference gene catalogue of the pig gut microbiome. Nat. Microbiol. 1, 16161 (2016).
R Core Team. R: a language and environment for statistical computing (R Foundation for Statistical Computing, 2014).
Coelho, L. P. et al. NG-meta-profiler: Fast processing of metagenomes using ngless, a domain-specific language. Microbiome 7, 84 (2019).
Li, D., Liu, C.-M., Luo, R., Sadakane, K. & Lam, T.-W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct De Bruijn graph. Bioinformatics 31, 1674–1676 (2015).
Besemer, J. & Borodovsky, M. GeneMark: web software for gene finding in prokaryotes, eukaryotes and viruses. Nucleic Acids Res. 33, W451–W454 (2005).
Coelho, L. P. Jug: Software for parallel reproducible computation in Python. J. Open Res. Softw. 5, 30 (2017).
Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using diamond. Nat. Methods 12, 59–60 (2015).
Eberhardt, R. Y. et al. AntiFam: A tool to help identify spurious ORFs in protein annotation. Database 2012, bas003 (2012).
Kang, D. et al. MetaBAT 2: An adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ 7, e7359 (2019).
Li, H. Aligning sequence reads, clone sequences and assembly contigs with bwa-mem. Preprint at https://arxiv.org/abs/1303.3997 (2013).
Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015).
Bowers, R. M. et al. Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat. Biotechnol. 35, 725–731 (2017).
Zhou, W., Gay, N. & Oh, J. ReprDB and panDB: minimalist databases with maximal microbial representation. Microbiome 6, 15 (2018).
Hingamp, P. et al. Exploring nucleo-cytoplasmic large DNA viruses in tara oceans microbial metagenomes. ISME J. 7, 1678–1695 (2013).
Seemann, T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 30, 2068–2069 (2014).
Huerta-Cepas, J. et al. eggNOG 5.0: A hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 47, D309–D314 (2019).
Eddy, S. R. Accelerated profile HMM searches. PLoS Comput. Biol. 7, e1002195 (2011).
Smyshlyaev, G., Barabas, O. & Bateman, A. Sequence analysis allows functional annotation of tyrosine recombinases in prokaryotic genomes. Mol. Syst. Biol. 17, e9880 (2021).
Jia, B. et al. CARD 2017: Expansion and model-centric curation of the comprehensive antibiotic resistance database. Nucleic Acids Res. 45, D566–D573 (2017).
Gibson, M. K., Forsberg, K. J. & Dantas, G. Improved annotation of antibiotic resistance determinants reveals microbial resistomes cluster by ecology. ISME J. 9, 207–216 (2015).
Li, T., Fan, K., Wang, J. & Wang, W. Reduction of protein sequence complexity by residue grouping. Protein Eng. 16, 323–330 (2003).
Zhao, M., Lee, W.-P., Garrison, E. P. & Marth, G. T. SSW library: an SIMD Smith–Waterman C/C++ library for use in genomic applications. PLoS ONE 8, e82138 (2013).
Li, H. Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2017).
Milanese, A. et al. Microbial abundance, activity and population genomic profiling with mOTUs2. Nat. Commun. 10, 1014 (2019).
Salter, S. J. et al. Reagent and laboratory contamination can critically impact sequence-based microbiome analyses. BMC Biol. 12, 87 (2014).
Kumar, R., Acharya, V., Singh, D. & Kumar, S. Strategies for high-altitude adaptation revealed from high-quality draft genome of non-violacein producing Janthinobacterium lividum ERGS5:01. Stand. Genomic Sci. 13, 11 (2018).
Patijanasoontorn, B. et al. Hospital acquired Janthinobacterium lividum septicemia in srinagarind hospital. J. Med. Assoc. Thai. 75 Suppl 2, 6–10 (1992).
Harris, C. R. et al. Array programming with NumPy. Nature 585, 357–362 (2020).
Virtanen, P. et al. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Collins, R. E. & Higgs, P. G. Testing the infinitely many genes model for the evolution of the bacterial core genome and pangenome. Mol. Biol. Evol. 29, 3413–3425 (2012).
Sievers, F. et al. Fast, scalable generation of high-quality protein multiple sequence alignments using clustal omega. Mol. Syst. Biol. 7, 539 (2011).
Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2—approximately maximum-likelihood trees for large alignments. PLoS ONE 5, e9490 (2010).
Huerta-Cepas, J., Serra, F. & Bork, P. ETE 3: reconstruction, analysis, and visualization of phylogenomic data. Mol. Biol. Evol. 33, 1635–1638 (2016).
Edgar, R. C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797 (2004).
Suyama, M., Torrents, D. & Bork, P. PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Res. 34, W609–12 (2006).
Murrell, B. et al. FUBAR: a fast, unconstrained Bayesian approximation for inferring selection. Mol. Biol. Evol. 30, 1196–1205 (2013).
Smith, M. D. et al. Less is more: an adaptive branch-site random effects model for efficient detection of episodic diversifying selection. Mol. Biol. Evol. 32, 1342–1353 (2015).
Washietl, S. et al. RNAcode: robust discrimination of coding and noncoding regions in comparative sequence data. RNA 17, 578–594 (2011).

Acknowledgements

Funding was provided by the European Union’s Horizon 2020 Research and Innovation Programme (grant 686070: DD-DeCaF to P.B.) and Marie Skłodowska-Curie Actions (grant 713673 to A.R.d.R.), the European Research Council (ERC) MicrobioS (ERC-AdG-669830 to P.B.), JTC project jumpAR (01KI1706 to P.B.), a BMBF grant (grant 031L0181A: LAMarCK to P.B.), the European Molecular Biology Laboratory (P.B.), the ETH and Helmut Horten Foundation (S.S.), the National Key R&D Program of China (grant 2020YFA0712403 to X.-M.Z.), National Natural Science Foundation of China (grant 61932008 to X.-M.Z.; grant 61772368 to X.-M.Z.; grant 31950410544 to L.P.C.), the Shanghai Municipal Science and Technology Major Project (grant 2018SHZDZX01 to X.-M.Z. and L.P.C.) and Zhangjiang Lab (X.-M.Z. and L.P.C.), the International Development Research Centre (grant 109304, EMBARK under the JPI AMR framework; to L.P.C.), la Caixa Foundation (grant 100010434, fellowship code LCF/BQ/DI18/11660009 to A.R.d.R.), the Severo Ochoa Program for Centres of Excellence in R&D from the Agencia Estatal de Investigación of Spain (grant SEV-2016-0672 (2017–2021) to C.P.C.), the Ministerio de Ciencia, Innovación y Universidades (grant PGC2018-098073-A-I00 MCIU/AEI/FEDER to J.H.-C. and J.G.-L.), the Innovation Fund Denmark (grant 4203-00005B to P.N.M.), and the Biotechnology and Biological Sciences Research Council (BBSRC) Institute Strategic Programme Gut Microbes and Health BB/R012490/1 and its constituent project BBS/E/F/000PR10355 (to F.H.). R.A. is a member of the Collaboration for joint PhD degree between EMBL and Heidelberg University, Faculty of Biosciences. The authors thank the Bork group for helpful discussion, in particular A. Głazek for discussions of algorithm design, J. C. Somody (EMBL) for help with figure design, and A. Fullam (EMBL) for computational assistance in processing the MAGs.

Author information

Authors and Affiliations

Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
Luis Pedro Coelho, Shaojun Pan, Longhao Jia & Xing-Ming Zhao
MOE Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, and MOE Frontiers Center for Brain Science, Shanghai, China
Luis Pedro Coelho, Shaojun Pan, Longhao Jia & Xing-Ming Zhao
Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
Luis Pedro Coelho, Renato Alves, Thomas Sebastian Schmidt, Daniel R. Mende, Askarbek Orakov, Falk Hildebrand, Thea Van Rossum, Sofia K. Forslund, Supriya Khedkar, Oleksandr M. Maistrenko, Pamela Ferretti, Shinichi Sunagawa, Jaime Huerta-Cepas & Peer Bork
Centro de Biotecnología y Genómica de Plantas, Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA-CSIC), Madrid, Spain
Álvaro Rodríguez del Río, Carlos P. Cantalapiedra, Joaquín Giner-Lamia & Jaime Huerta-Cepas
Department of Biotechnology and Biomedicine, Technical University of Denmark, Kongens Lyngby, Denmark
Pernille Neve Myers
Departamento de Biotecnología-Biología Vegetal, Escuela Técnica Superior de Ingeniería Agronómica, Alimentaria y de Biosistemas, Universidad Politécnica de Madrid (UPM), Madrid, Spain
Joaquín Giner-Lamia
Daniel K. Inouye Center for Microbial Oceanography: Research and Education, University of Hawai‘i at Mānoa, Honolulu, HI, USA
Daniel R. Mende
biobyte solutions GmbH, Heidelberg, Germany
Ivica Letunic
Earlham Institute, Norwich Research Park, Norwich, UK
Falk Hildebrand
Gut Health and Microbes Programme, Quadram Institute, Norwich Research Park, Norwich, UK
Falk Hildebrand
Experimental and Clinical Research Center (ECRC), a joint venture of the Max Delbrück Centre (MDC) and Charité University Hospital, Berlin, Germany
Sofia K. Forslund
Berlin Institute of Health, Berlin, Germany
Sofia K. Forslund
Department of Biology, Institute of Microbiology and Swiss Institute of Bioinformatics, ETH Zürich, Zürich, Switzerland
Shinichi Sunagawa
Clinical Microbiomics A/S, Copenhagen, Denmark
Henrik Bjørn Nielsen
Max Delbrück Centre for Molecular Medicine, Berlin, Germany
Peer Bork
Yonsei Frontier Lab (YFL), Yonsei University, Seoul, South Korea
Peer Bork
Department of Bioinformatics, Biocenter, University of Würzburg, Würzburg, Germany
Peer Bork

Contributions

The study was conceived and supervised by P.B. and designed by L.P.C., S.S., J.H.-C. and P.B. L.P.C., R.A., A.R.d.R., P.N.M., T.S.S., A.O., F.H., T.V.R., S.K.F., S.K., O.M.M., P.F. and J.H.-C. analysed data. L.P.C., T.S.S., F.H., T.V.R., S.K.F., P.F., J.H.-C. and P.B. drafted the manuscript. L.P.C., R.A., A.R.d.R., C.P.C. and D.R.M. built the unigene, protein clusters and protein family catalogues. L.P.C., R.A., T.S.S., D.R.M., I.L., F.H., S.K.F., S.K. and J.H.-C. annotated the catalogue. A.R.d.R., C.P.C., J.G.-L., O.M.M. and J.H.-C. performed the selection pressure analyses. P.N.M. and H.B.N. built the MGSs. L.P.C., R.A., I.L., S.P., L.J., X.-M.Z., T.V.R. and J.H.-C. designed and implemented the web resource, including the search algorithms and the associated GMGC-mapper tool. L.P.C., T.S.S., F.H. and O.M.M. annotated metagenomes. T.S.S. and A.O. built the MAGs. All authors contributed to the review of the manuscript before submission for publication and approved the final version.

Corresponding authors

Correspondence to Luis Pedro Coelho, Jaime Huerta-Cepas or Peer Bork.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Gene accumulation curves.

Legend: (a) For most (but not all) habitats, unigenes with high prevalence (≥ 5%) have been well captured, while rare unigenes continue to be found in each new sample. (b-d) New unigenes continue to be found in each sample. Each grey line represents a random permutation of the samples, while the solid black line shows the mean over these random permutations. The dotted red line is a least-squares fit of Heaps’ law (N = k · n^α, where n is the number of samples). In all cases, the parameter fit indicates that the number of unigenes has not reached saturation. (e) The number of assembled/detected genes per sample grows with sequencing depth without a plateau being reached. (f) Similarly, the number of detected ORFs per insert grows with sequencing depth.
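The accumulation curves and the power-law fit described in this legend can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' analysis code: the presence/absence matrix, the function names, and the choice to fit by linear least squares in log-log space are all assumptions (the legend states only that a least-squares fit was used).

```python
import numpy as np

def accumulation_curve(presence, rng):
    """Number of distinct genes seen as samples are added in a random order.

    presence: boolean array of shape (n_samples, n_genes).
    """
    order = rng.permutation(presence.shape[0])
    seen = np.logical_or.accumulate(presence[order], axis=0)
    return seen.sum(axis=1)

def fit_heaps(curve):
    """Fit N = k * n**alpha by linear least squares on log(N) versus log(n)."""
    n = np.arange(1, len(curve) + 1)
    alpha, log_k = np.polyfit(np.log(n), np.log(curve), 1)
    return np.exp(log_k), alpha

rng = np.random.default_rng(0)
# Synthetic data: 200 samples x 5,000 genes, most genes having low prevalence.
prevalence = rng.beta(0.2, 5.0, size=5000)
presence = rng.random((200, 5000)) < prevalence

# One curve per random permutation of the samples, plus their mean.
curves = np.array([accumulation_curve(presence, rng) for _ in range(20)])
mean_curve = curves.mean(axis=0)

k, alpha = fit_heaps(mean_curve)
print(f"k = {k:.1f}, alpha = {alpha:.2f}")
```

In this parameterization, a fitted exponent clearly above zero means that the curve is still rising as samples are added, which corresponds to the absence of saturation noted in the legend.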

Extended Data Fig. 2 Identity thresholds and their relationship to taxonomy and function in the GMGCv1.

Legend: (a) A 95% nucleotide identity threshold is a proxy for species. Shown is the nucleotide identity of the closest gene homologue within the same species or within the same genus (excluding within-species comparisons). The threshold used in this work (95%) is marked with a dashed red line. (b) Within 40 well-conserved, universal single-copy orthologues (see Methods), the average pairwise amino acid identity is 49%, albeit with a wide range (27–75%) when considering within-orthologue averages. The thresholds used for building protein families are highlighted in dashed red. Boxplots display quartiles and ranges (see Methods). (c) Proportion of genes annotated at each taxonomic level.

Extended Data Fig. 3 Short reads map to the GMGCv1 at higher rates (compared to a reference database of reference genomes).

Legend: (a) Mapping rates for short reads from metagenomes mapped against the GMGCv1 or the reference genomes in proGenomes2. (b) Fraction of short reads from human gut metagenomes mapping to a collection of sequenced genomes and the GMGCv1, per country. (c) Same data as (b), aggregated by the World Bank’s classification of countries into income groups. In all panels, boxplots show quartiles (including median) and range (except for outliers, see Methods). Blue boxes show mapping rates to proGenomes2, while orange boxes show mapping rates to GMGCv1.

Extended Data Fig. 4 MAGs only capture a small fraction of all genes in a sample.

Legend: Fraction of undetected genes when mapping to only the genes captured by metagenome-assembled genomes (MAGs) across the habitats compared to mapping to the full GMGCv1.

Extended Data Fig. 5 Species and protein cluster sharing between habitats is similar to unigene sharing, but sharing of protein families is more extensive.

Legend: (a) The sharing of metagenomic species (MGSs) between habitats mimics unigene sharing. The width of each ribbon represents the number of MGSs shared between the habitats (the largest number shared is between the human and the pig gut, which share 166 MGSs out of 1,908 MGSs in the human gut and 898 in the pig gut). (b) Species-level unigene sharing between habitats by fraction of the number of unigenes from each habitat (cf. Fig. 1b, which uses abundance weighting). (c) Sharing of protein clusters (90% amino acid identity clusters) between habitats, abundance-weighted. (d) Sharing of protein families between habitats, abundance-weighted. When considering coarser clusterings of sequences, gene sharing between habitats increases, yet we still observed higher rates of sharing between similar habitats and significant fractions of habitat-specific families (e.g., in the marine environment, 31.3% of the genes, by abundance, are in marine-specific protein families).

Extended Data Fig. 6 Antibiotic resistance and mobile genes are more likely to be multi-habitat genes, while most species are found in a single habitat.

Legend: (a) Fraction of unigenes within each habitat which are multi-habitat genes (for all unigenes, or when considering only mobile elements or antibiotic resistance genes). (b) A total of 7,443 MGSs were built across all the habitats as species proxies to reliably assess their habitats. Each circle shows the number of metagenomic species for each habitat; the x-axis represents the number of genes in the catalogue specific to each habitat and the y-axis represents the number of samples. Note that differing sampling depth and habitat-specific biodiversity impact those numbers.

Extended Data Fig. 7 Determinants of functional community structure.

Legend: (a) Principal coordinate analysis of all samples by protein family profile and the correlations with taxonomic and protein family richness (after rarefying to 1 million inserts to remove effects of sample depth). (b) Hierarchical clustering of the habitats using high-level functional profiles based.

Extended Data Fig. 8 Marine and soil richness patterns are a mixture of subpatterns.

Legend: Conspecific genes per species in (a) marine and (b) soil sub-habitats. The differences in the marine environment are particularly large when comparing the samples in the photic zones (the shallower, light-accessible, surface and deep-chlorophyll maximum samples) to the non-photic mesopelagic samples (deeper, beyond the reach of sunlight). The differences in the soil environment follow differences in acidity (with Podzol, Dystric Brunisol and Ultic soils being acidic, while Luvisols are usually neutral or alkaline) and differences in moisture (with Xeralfs being dry in the summer, while Glossudalfs are moist year round).

Extended Data Fig. 9 Most genes are detected only infrequently and rare genes are (on average) present at a lower abundance in metagenomes.

Legend: (a) Shown is the percentage of genes detected in at most 1,...,50 metagenomes (out of a total of 13,174). (b, c) Histograms of gene prevalence are roughly linear on a log-log scale, as predicted from neutral or nearly-neutral evolution models. Shown are histograms for 90% amino acid identity protein clusters (b) and 20% amino acid identity protein families (c), which behave similarly to species-level unigenes (see Fig. 3). (d) Shown is the percentage of genes in each sample that is composed of rare genes (Count) and the total abundance represented by these (Abundance). Except for wastewater (likely due to under-sampling), rare genes represent a lower fraction of the abundance than of detection. Boxplots show quartiles (including median drawn as a line) and whiskers show the range of the data excluding outliers, which are shown as extra elements (see Methods).

Extended Data Fig. 10 More abundant and larger protein families are under more intense selection.

Legend: (a) dN/dS within each protein family, with protein families split into 5 abundance quintiles, showing a downward trend with abundance (higher negative selection). (b) dN/dS within each gene size category, similarly showing a downward trend with size. Categories are defined by increasing size, with each bin representing the same number of unigenes. Boxplots show quartiles and ranges (see Methods).
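The binning described in this legend (quintiles, or size categories that each hold the same number of unigenes) is equal-frequency binning. The sketch below shows one way to do it and to summarise a score per bin; it is not the authors' code, the helper names equal_frequency_bins and median_by_bin are hypothetical, and the abundance and dN/dS values are made up for illustration.

```python
from statistics import median

def equal_frequency_bins(values, n_bins=5):
    """Assign each value a bin index 0..n_bins-1 by rank, so that bins hold
    (as nearly as possible) the same number of items.
    Assumes len(values) >= n_bins, so that no bin is left empty."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // n
    return bins

def median_by_bin(bins, scores, n_bins=5):
    """Median score within each bin, ordered from lowest to highest bin."""
    grouped = [[] for _ in range(n_bins)]
    for b, s in zip(bins, scores):
        grouped[b].append(s)
    return [median(g) for g in grouped]

# Hypothetical per-family abundances and dN/dS values (not from the paper).
abundance = [10, 50, 20, 40, 30, 60, 80, 70, 100, 90]
dnds = [0.30, 0.22, 0.28, 0.24, 0.26, 0.20, 0.16, 0.18, 0.12, 0.14]

bins = equal_frequency_bins(abundance)
medians = median_by_bin(bins, dnds)
print(bins)
print(medians)
```

With these made-up values the per-bin median decreases from the lowest to the highest abundance bin, which is the kind of downward trend the legend describes.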

Supplementary information

Reporting Summary

Supplementary Table 1

List of samples included in the global gene catalogue including their habitat annotation. This table includes the accession ID of the metagenome, the GC fraction, and the size of the input (measured in total basepairs and number of inserts) both prior to (Raw) and post quality filtering (HQ, for high-quality). It also includes the number of ORFs predicted.

Supplementary Table 2

Mapping of our human-readable habitat terms to predefined ontology terms. List of habitat annotations used and their correspondence to EnvO/Uberon terms and the NCBI taxonomic ID of the host (for host-associated habitats).

Supplementary Table 3

Statistics of (redundant) ORF predictions. The tables summarize the assembly statistics per habitat (including number of contigs, average contig size, N50) and the ORF stats per habitat, both complete ORFs and fragmentary ORFs.

Supplementary Table 4

Robustness checks for protein family definition and inclusion or exclusion of fragmentary ORFs. The major statements in the manuscript were tested using alternative protein family thresholds as well as including or excluding fragmentary ORFs to ensure robustness.

Supplementary Table 5

List of metagenome-assembled genomes and their statistics. All the assembled bins are included, including those marked as low-quality (following standard criteria, see Methods).

Supplementary Table 6

Pearson correlations between observed gene prevalence and theoretical prediction assuming neutral evolution.

About this article

Cite this article

Coelho, L.P., Alves, R., del Río, Á.R. et al. Towards the biogeography of prokaryotic genes. Nature 601, 252–256 (2022). https://doi.org/10.1038/s41586-021-04233-4

Received: 16 June 2019
Accepted: 12 November 2021
Published: 15 December 2021
Issue Date: 13 January 2022
DOI: https://doi.org/10.1038/s41586-021-04233-4
