Microbial genes encode the majority of the functional repertoire of life on earth. However, despite increasing efforts in metagenomic sequencing of various habitats1,2,3, little is known about the distribution of genes across the global biosphere, with implications for human and planetary health. Here we constructed a non-redundant gene catalogue of 303 million species-level genes (clustered at 95% nucleotide identity) from 13,174 publicly available metagenomes across 14 major habitats and use it to show that most genes are specific to a single habitat. The small fraction of genes found in multiple habitats is enriched in antibiotic-resistance genes and markers for mobile genetic elements. By further clustering these species-level genes into 32 million protein families, we observed that a small fraction of these families contain the majority of the genes (0.6% of families account for 50% of the genes). The majority of species-level genes and protein families are rare. Furthermore, species-level genes, and in particular the rare ones, show low rates of positive (adaptive) selection, supporting a model in which most genetic variability observed within each protein family is neutral or nearly neutral.
This is a preview of subscription content
Subscribe to Nature+
Get immediate online access to the entire Nature family of 50+ journals
Subscribe to Journal
Get full journal access for 1 year
only $3.90 per issue
All prices are NET prices.
VAT will be added later in the checkout.
Tax calculation will be finalised during checkout.
Get time limited or full article access on ReadCube.
All prices are NET prices.
All data analysed during the current study are publicly available. Supplementary Table 1 contains the accession numbers for all the metagenomes used. GMGCv1 is available for download at https://gmgc.embl.de. The full catalogue is available for download as are sub-catalogues specialized to individual habitats and the subset derived only from sequenced genomes (which can be further subset to obtain the pangenome of a species of interest). Both the full catalogue and a version containing only complete ORFs are available as they represent different tradeoffs: the complete catalogue achieves higher coverage, while the version with only complete ORFs may be more appropriate for analyses that require the whole gene. Similarly, protein families are available at different amino acid identity thresholds (see ‘Protein family cluster calculation’). In addition to being available for download, the catalogue can be queried with an amino acid sequence. We developed and use a novel k-mer based algorithm (see ‘k-mer based homology search’) to enable fast queries over the complete 303 million protein database and allow interactive use.
Sunagawa, S. et al. Structure and function of the global ocean microbiome. Science 348, 1261359 (2015).
Zou, Y. et al. 1,520 reference genomes from cultivated human gut bacteria enable functional microbiome analyses. Nat. Biotechnol. 37, 179–185 (2019).
Mohammad, B. F. et al. Structure and function of the global topsoil microbiome. Nature 560 233–237 (2018).
Qin, J. et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature 464, 59–65 (2010).
Xiao, L. et al. A catalog of the mouse gut metagenome. Nat. Biotechnol. 33, 1103–1108 (2015).
Coelho, L. P. et al. Similarity of the dog and human gut microbiomes in gene content and response to diet. Microbiome 6, 72 (2018).
Pasolli, E. et al. Extensive unexplored human microbiome diversity revealed by over 150,000 genomes from metagenomes spanning age, geography, and lifestyle. Cell 176, 649–662.e20 (2019).
Partridge, S. R., Kwong, S. M., Firth, N. & Jensen, S. O. Mobile genetic elements associated with antimicrobial resistance. Clin. Microbiol. Rev. 31, (2018).
Mende, D. R. et al. ProGenomes2: An improved database for accurate and consistent habitat, taxonomic and functional annotations of prokaryotic genomes. Nucleic Acids Res. 48, D621–D625 (2020).
Jain, C., Rodriguez-R, L. M., Phillippy, A. M., Konstantinidis, K. T. & Aluru, S. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat. Commun. 9, 5114 (2018).
Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
Daniel H. et al. RefSeq: an update on prokaryotic genome annotation and curation. Nuc. Acids Res. 46, D851–D860 (2018).
Mering, C. von et al. Quantitative phylogenetic assessment of microbial communities in diverse environments. Science 315, 1126–1130 (2007).
Richardson, E. J. et al. Gene exchange drives the ecological success of a multi-host bacterial pathogen. Nat. Ecol. Evol. 2, 1468–1478 (2018).
Nielsen, H. B. et al. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nat. Biotechnol. 32, 822–828 (2014).
Mende, D. R., Sunagawa, S., Zeller, G. & Bork, P. Accurate and universal delineation of prokaryotic species. Nat. Methods 10, 881–884 (2013).
Huerta-Cepas, J. et al. Fast genome-wide functional annotation through orthology assignment by eggNOG-mapper. Mol. Biol. Evol. 34, 2115–2122 (2017).
Louca, S. et al. Function and functional redundancy in microbial systems. Nat. Ecol. Evol. 2, 936–943 (2018).
Maistrenko, O. M. et al. Disentangling the impact of environmental and phylogenetic constraints on prokaryotic within-species diversity. ISME J. 14, 1247–1259 (2020).
Baumdicker, F., Hess, W. R. & Pfaffelhuber, P. The diversity of a distributed genome in bacterial populations. Ann. Appl. Probab. 20, 1567–1606 (2010).
Sela, I., Wolf, Y. I. & Koonin, E. V. Theory of prokaryotic genome evolution. Proc. Natl Acad. Sci. USA 113, 11399–11407 (2016).
Dandekar, T., Snel, B., Huynen, M. & Bork, P. Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem. Sci. 23, 324–328 (1998).
Nei, M., Suzuki, Y. & Nozawa, M. The neutral theory of molecular evolution in the genomic era. Annu. Rev. Genomics Hum. Genet. 11, 265–289 (2010).
Iranzo, J., Cuesta, J. A., Manrubia, S., Katsnelson, M. I. & Koonin, E. V. Disentangling the effects of selection and loss bias on gene dynamics. Proc. Natl Acad. Sci. USA 114, E5616–E5624 (2017).
Wolf, Y. I., Makarova, K. S., Lobkovsky, A. E. & Koonin, E. V. Two fundamentally different classes of microbial genes. Nat. Microbiol. 2, 16208 (2016).
Rasko, D. A. et al. The pangenome structure of Escherichia coli: comparative genomic analysis of E. coli commensal and pathogenic isolates. J. Bacteriol. 190, 6881–6893 (2008).
Koskella, B., Hall, L. J. & Metcalf, C. J. E. The microbiome beyond the horizon of ecological and evolutionary theory. Nat. Ecol. Evol. 1, 1606–1615 (2017).
Liu, R. et al. Gut microbiome and serum metabolome alterations in obesity and after weight-loss intervention. Nat. Med. 23, 859–868 (2017).
Metcalf, J. L. et al. Microbial community assembly and metabolic function during mammalian corpse decomposition. Science 351, 158–162 (2015).
Vincent, C. et al. Bloom and bust: intestinal microbiota dynamics in response to hospital exposures and Clostridium difficile colonization or infection. Microbiome 4, 12 (2016).
Zeller, G. et al. Potential of fecal microbiota for early‐stage detection of colorectal cancer. Mol. Syst. Biol. 10, 766 (2014).
Gibson, M. K. et al. Developmental dynamics of the preterm infant gut microbiota and antibiotic resistome. Nat. Microbiol. 1, 16024 (2016).
Zhang, X. et al. The oral and gut microbiomes are perturbed in rheumatoid arthritis and partly normalized after treatment. Nat. Med. 21, 895–905 (2015).
Brito, I. L. et al. Mobile genes in the human microbiome are structured from global to individual scales. Nature 535, 435–439 (2016).
Vatanen, T. et al. Variation in microbiome LPS immunogenicity contributes to autoimmunity in humans. Cell 165, 842–853 (2016).
Turnbaugh, P. J. et al. The human microbiome project. Nature 449, 804–810 (2007).
Hannigan, G. D. et al. The human skin double-stranded DNA virome: topographical and temporal diversity, genetic enrichment, and dynamic associations with the host microbiome. MBio 6, e01578-15 (2015).
Taft, D. H. et al. Intestinal microbiota of preterm infants differ over time and between hospitals. Microbiome 2, 36 (2014).
Zeevi, D. et al. Personalized nutrition by prediction of glycemic responses. Cell 163, 1079–1094 (2015).
Wilhelm, R. C. et al. Biogeography and organic matter removal shape long-term effects of timber harvesting on forest soil microbial communities. ISME J. 11, 2552–2568 (2017).
Xie, H. et al. Shotgun metagenomics of 250 adult twins reveals genetic and environmental impacts on the gut microbiome. Cell Syst. 3, 572–584.e3 (2016).
The MetaSUB International Consortium. The metagenomics and metadesign of the subways and urban biomes (metasub) international consortium inaugural meeting report. Microbiome 4, 24 (2016).
Chatelier, E. L. et al. Richness of human gut microbiome correlates with metabolic markers. Nature 500, 541–546 (2013).
Li, J. et al. Gut microbiota dysbiosis contributes to the development of hypertension. Microbiome 5, (2017).
Pehrsson, E. C. et al. Interconnected microbiomes and resistomes in low-income human habitats. Nature 533, 212–216 (2016).
Li, J. et al. An integrated catalog of reference genes in the human gut microbiome. Nat. Biotechnol. 32, 834–841 (2014).
Feng, Q. et al. Gut microbiome development along the colorectal adenoma–carcinoma sequence. Nat. Commun. 6, 6528 (2015).
Gu, Y. et al. Analyses of gut microbiota and plasma bile acids enable stratification of patients for antidiabetic treatment. Nat. Commun. 8, 1785 (2017).
Karlsson, F. H. et al. Gut metagenome in european women with normal, impaired and diabetic glucose control. Nature 498, 99–103 (2013).
Yu, J. et al. Metagenomic analysis of faecal microbiome as a tool towards targeted non-invasive biomarkers for colorectal cancer. Gut 66, 70–78 (2017).
Youngster, I. et al. Fecal microbiota transplant for relapsing clostridium difficile infection using a frozen inoculum from unrelated donors: a randomized, open-label, controlled pilot study. Clin. Infect. Dis. 58, 1515–1522 (2014).
Guittar, J., Shade, A. & Litchman, E. Trait-based community assembly and succession of the infant gut microbiome. Nat. Commun. 10, 512 (2019).
Vogtmann, E. et al. Colorectal cancer and the human gut microbiome: reproducibility with whole-genome shotgun sequencing. PLoS ONE 11, e0155362 (2016).
Chng, K. R. et al. Whole metagenome profiling reveals skin microbiome-dependent susceptibility to atopic dermatitis flare. Nat Microbiol 1, 16106 (2016).
Chu, D. M. et al. Maturation of the infant microbiome community structure and function across multiple body sites and in relation to mode of delivery. Nat. Med. 23, 314–326 (2017).
Van Rossum, T. et al. Spatiotemporal dynamics of river viruses, bacteria and microeukaryotes. Preprint at https://doi.org/10.1101/259861 (2018).
Feng, Q. et al. Integrated metabolomics and metagenomics analysis of plasma and urine identified microbial metabolites associated with coronary heart disease. Sci. Rep. 6, 22525 (2016).
Oh, J., Byrd, A. L., Park, M., Kong, H. H. & Segre, J. A. Temporal stability of the human skin microbiome. Cell 165, 854–866 (2016).
Xiao, L. et al. A reference gene catalogue of the pig gut microbiome. Nat. Microbiol. 1, 16161 (2016).
R Core Team. R: a language and environment for statistical computing (R Foundation for Statistical Computing, 2014).
Coelho, L. P. et al. NG-meta-profiler: Fast processing of metagenomes using ngless, a domain-specific language. Microbiome 7, 84 (2019).
Li, D., Liu, C.-M., Luo, R., Sadakane, K. & Lam, T.-W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct De Bruijn graph. Bioinformatics 31, 1674–1676 (2015).
Besemer, J. & Borodovsky, M. GeneMark: web software for gene finding in prokaryotes, eukaryotes and viruses. Nucleic Acids Res. 33, W451–W454 (2005).
Coelho, L. P. Jug: Software for parallel reproducible computation in Python. J. Open Res. Softw. 5, 30 (2017).
Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using diamond. Nat. Methods 12, 59–60 (2015).
Eberhardt, R. Y. et al. AntiFam: A tool to help identify spurious ORFs in protein annotation. Database 2012, bas003 (2012).
Kang, D. et al. MetaBAT 2: An adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ 7, e7359 (2019).
Li, H. Aligning sequence reads, clone sequences and assembly contigs with bwa-mem. Preprint at https://arxiv.org/abs/1303.3997 (2013).
Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015).
Bowers, R. M. et al. Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat. Biotechnol. 35, 725–731 (2017).
Zhou, W., Gay, N. & Oh, J. ReprDB and panDB: minimalist databases with maximal microbial representation. Microbiome 6, 15 (2018).
Hingamp, P. et al. Exploring nucleo-cytoplasmic large DNA viruses in tara oceans microbial metagenomes. ISME J. 7, 1678–1695 (2013).
Seemann, T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 30, 2068–2069 (2014).
Huerta-Cepas, J. et al. eggNOG 5.0: A hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 47, D309–D314 (2019).
Eddy, S. R. Accelerated profile HMM searches. PLoS Comput. Biol. 7, e1002195 (2011).
Smyshlyaev, G., Barabas, O. & Bateman, A. Sequence analysis allows functional annotation of tyrosine recombinases in prokaryotic genomes. Mol. Syst. Biol. 17, e9880 (2021).
Jia, B. et al. CARD 2017: Expansion and model-centric curation of the comprehensive antibiotic resistance database. Nucleic Acids Res. 45, D566–D573 (2017).
Gibson, M. K., Forsberg, K. J. & Dantas, G. Improved annotation of antibiotic resistance determinants reveals microbial resistomes cluster by ecology. ISME J. 9, 207–216 (2015).
Li, T., Fan, K., Wang, J. & Wang, W. Reduction of protein sequence complexity by residue grouping. Protein Eng. 16, 323–330 (2003).
Zhao, M., Lee, W.-P., Garrison, E. P. & Marth, G. T. SSW library: an SIMD Smith–Waterman C/C++ library for use in genomic applications. PLoS ONE 8, e82138 (2013).
Li, H. Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2017).
Milanese, A. et al. Microbial abundance, activity and population genomic profiling with mOTUs2. Nat. Commun. 10, 1014 (2019).
Salter, S. J. et al. Reagent and laboratory contamination can critically impact sequence-based microbiome analyses. BMC Biol. 12, 87 (2014).
Kumar, R., Acharya, V., Singh, D. & Kumar, S. Strategies for high-altitude adaptation revealed from high-quality draft genome of non-violacein producing Janthinobacterium lividum ERGS5:01. Stand. Genomic Sci. 13, 11 (2018).
Patijanasoontorn, B. et al. Hospital acquired Janthinobacterium lividum septicemia in srinagarind hospital. J. Med. Assoc. Thai. 75 Suppl 2, 6–10 (1992).
Harris, C. R. et al. Array programming with NumPy. Nature 585, 357–362 (2020).
Virtanen, P. et al. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Collins, R. E. & Higgs, P. G. Testing the infinitely many genes model for the evolution of the bacterial core genome and pangenome. Mol. Biol. Evol. 29, 3413–3425 (2012).
Sievers, F. et al. Fast, scalable generation of high-quality protein multiple sequence alignments using clustal omega. Mol. Syst. Biol. 7, 539 (2011).
Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2—approximately maximum-likelihood trees for large alignments. PLoS ONE 5, e9490 (2010).
Huerta-Cepas, J., Serra, F. & Bork, P. ETE 3: reconstruction, analysis, and visualization of phylogenomic data. Mol. Biol. Evol. 33, 1635–1638 (2016).
Edgar, R. C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797 (2004).
Suyama, M., Torrents, D. & Bork, P. PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Res. 34, W609–12 (2006).
Murrell, B. et al. FUBAR: a fast, unconstrained Bayesian approximation for inferring selection. Mol. Biol. Evol. 30, 1196–1205 (2013).
Smith, M. D. et al. Less is more: an adaptive branch-site random effects model for efficient detection of episodic diversifying selection. Mol. Biol. Evol. 32, 1342–1353 (2015).
Washietl, S. et al. RNAcode: robust discrimination of coding and noncoding regions in comparative sequence data. RNA 17, 578–594 (2011).
Funding was provided by the European Union’s Horizon 2020 Research and Innovation Programme (grant 686070: DD-DeCaF to P.B.) and Marie Skłodowska-Curie Actions (grant 713673 to A.R.d.R.), the European Research Council (ERC) MicrobioS (ERC-AdG-669830 to P.B.), JTC project jumpAR (01KI1706 to P.B.), a BMBF Grant (grant 031L0181A: LAMarCK to P.B.), the European Molecular Biology Laboratory (P.B.), the ETH and Helmut Horten Foundation (S.S.), the National Key R&D Program of China (grant 2020YFA0712403 to X.-M.Z.), National Natural Science Foundation of China (grant 61932008 to X.-M.Z.; grant 61772368 to X.-M.Z.; grant 31950410544 to L.P.C.), the Shanghai Municipal Science and Technology Major Project (grant 2018SHZDZX01 to X.-M.Z. and L.P.C.) and Zhangjiang Lab (X.-M.Z. and L.P.C.), the International Development Research Centre (grant 109304, EMBARK under the JPI AMR framework; to L.P.C.), la Caixa Foundation (grant 100010434, fellowship code LCF/BQ/DI18/11660009 to A.R.d.R.), the Severo Ochoa Program for Centres of Excellence in R&D from the Agencia Estatal de Investigación of Spain (grant SEV-2016-0672 (2017–2021) to C.P.C.), the Ministerio de Ciencia, Innovación y Universidades (grant PGC2018-098073-A-I00 MCIU/AEI/FEDER to J.H.-C. and J.G.-L.), the Innovation Fund Denmark (grant 4203-00005B, PNM), the Biotechnology and Biological Sciences research Council (BBSrC) Institute Strategic Programme Gut Microbes and Health BB/r012490/1 and its constituent project BBS/e/F/000Pr10355 (F.H.). R.A. is a member of the Collaboration for joint PhD degree between EMBL and Heidelberg University, Faculty of Biosciences. The authors thank the Bork group for helpful discussion, in particular A. Głazek for discussions of algorithm design, J. C. Somody (EMBL) for help with figure design, and A. Fullam (EMBL) for computational assistance in processing the MAGs.
The authors declare no competing interests.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data figures and tables
(a) For most (but not all) habitats, unigenes with high prevalence (≥ 5%) have been well-captured, while rare unigenes continue to be found in each new sample. (b-d) New unigenes continue to be found in each sample. Each grey line represents a random permutation of the samples, while the solid black line shows the mean over these random permutations. The dotted red line is least-squares fit of Heap’s Law (N = k · sample^alpha). In all cases, the parameter fit indicates that the number of has not reached saturation. (e) The number of assembled/detected genes per sample grows with sequencing depth without a plateau being reached. (f) Similarly, the number of detected ORFs per insert grows with sequencing depth.
Extended Data Fig. 2 Identity thresholds and their relationship to taxonomy and function in the GMGCv1.
Legend: (a) A 95% nucleotide identity threshold is a proxy for species. Shown is nucleotide identity of closest gene homolog within the same species or within the same genus (excluding within-species comparisons). The threshold used in this work (95%) is marked with a dashed red line. (b) Within well-conserved, universal, 40 single-copy orthologues (see Methods), the average pairwise amino acid identity is 49%, albeit with a wide range (27-75%) when considering within-orthologue averages. In dashed red, the thresholds used for building protein families are highlighted. Boxplots display quartiles and ranges (see Methods). (c) Proportion of genes annotated at each taxonomic level.
Extended Data Fig. 3 Short reads map to the GMGCv1 at higher rates (compared to a reference database of reference genomes).
Legend: (a) Mapping rates for short reads from metagenomes mapped against the GMGCv1 or the reference genomes in proGenomes2. (b) Fraction of short reads from human gut metagenomes mapping to a collection of sequenced genomes and the GMGCv1, per country, (c) Same data as (b), aggregated by the World Bank’s classification of countries into income groups. In all panels, boxplots show quartiles (including median) and range (except for outliers, see Methods). Blue boxes show mapping rates to proGenomes2, while orange boxes show mapping rates to GMGCv1.
Legend: Fraction of undetected genes when mapping to only the genes captured by metagenome-assembled genomes (MAGs) across the habitats compared to mapping to the full GMGCv1.
Extended Data Fig. 5 Species and protein cluster sharing between habitats is similar to unigene sharing, but sharing of protein families is more extensive.
Legend: (a) The sharing of metagenomic species between habitats mimics unigene sharing. Width of each ribbons represents the number of MGSs shared between the habitats (the largest number shared is between the human and the pig gut, which share 166 MGSs out of 1,908 MGSs in the human gut and 898 in pig gut, respectively). (b) Species-level unigene sharing between habitats by fraction of the number of unigenes from each habitat (cf. Fig. 1b, which uses abundance weighting). (c) Sharing of protein clusters (90% amino acid identity clusters) between habitats, abundance-weighted. (d) Sharing of protein families between habitats, abundance-weighted. When considering coarser clusterings of sequences, gene sharing between habitats increases, yet we still observed higher rates of sharing between similar habitats and significant fractions of habitat-specific families (e.g., in the marine environment, 31.3% of the genes, by abundance, are in marine-specific protein families).
Extended Data Fig. 6 Antibiotic resistance and mobile genes are more likely to be multi-habitat genes, while most species are found in a single habitat.
Legend: (a) Fraction of unigenes within each habitat which are multi-habitat genes (for all unigenes, or when considering only mobile elements or antibiotic resistance genes). (b) A total of 7,443 MGSs were built, across all the habitats as species proxies to reliably assess their habitats. Each circle shows the number of metagenomic species for each habitat, x-axis represents the number of genes in the catalogue specific to each habitat, the y-axis represents the number of samples. Note that differing sampling depth and habitat-specific biodiversity impact those numbers.
Legend: (a) principal coordinate analysis of all samples by protein family profile and the correlations with taxonomic and protein family richness (after rarefying to 1 million inserts to remove effects of sample depth). (b) Hierarchical clustering of the habitats using high-level functional profiles based.
Legend: Conspecific genes per species in marine (a) and (b) soil sub-habitats. The differences in the marine environment are particularly large when comparing the samples in the photic zones (the shallower, light-accessible, surface and deep-chlorophyll maximum samples) to the non-photic mesopelagic samples (deeper, beyond the reach of sunlight). The differences in the soil environment follow differences in acidity (with Podzol, Dystric Brunisol and Ultic soils being acidic, while Luvisols are usually neutral or alkaline) and differences in moisture (with Xeralfs being dry in the summer, while Glossudalfs are moist year round).
Extended Data Fig. 9 Most genes are detected only infrequently and rare genes are (on average) present at a lower abundance in metagenomes.
Legend: (a) Shown are the percentage of genes detected in at most 1,...,50 metagenomes (out of a total of 13,174). (b, c) Histograms of gene prevalence are roughly linear on a log-log scale, as predicted from neutral or nearly-neutral evolution models. Shown are histograms for 90% amino acid identity protein clusters (b) and 20% amino acid identity protein families (c), which behave similar to species-level unigenes (see Fig. 3). (d) Shown is the percentage of genes in each sample that is composed of rare genes (Count) and the total abundance represented by these (Abundance). Except for wastewater (likely due to under-sampling), rare genes represent a lower fraction of the abundance than of detection. Boxplots show quartiles (including median drawn as a line) and whiskers show the range of the data excluding outliers, which are shown as extra elements (see Methods).
Legend: (a) dN/dS within each protein family, with protein families split into 5 abundance quintiles, showing a downward trend with abundance (higher negative selection). (b) dN/dS within each gene size category, similarly showing a downward trend with size. Categories are defined by increasing size, with each bin representing the same number of unigenes. Boxplots show quartiles and ranges (see Methods).
List of samples included in the global gene catalogue including their habitat annotation. This table includes the accession ID of the metagenome, the GC fraction, and the size of the input (measured in total basepairs and number of inserts) both prior to (Raw) and post quality filtering (HQ, for high-quality). It also includes the number of ORFs predicted.
Mapping of our human-readable habitat terms to predefined ontology terms. List of habitat annotations used and their correspondence to EnvO/Uberon terms and NCBI taxonomic ID of the host (for host-associated habitat).
Statistics of (redundant) ORF predictions. The tables summarize the assembly statistics per habitat (including number of contigs, average contig size, N50) and the ORF stats per habitat, both complete ORFs and fragmentary ORFs.
Robustness checks for protein family definition and inclusion or exclusion of fragmentary ORFs. The major statements in the manuscript were tested using alternative protein family thresholds as well as including or excluding fragmentary ORFs to ensure robustness.
List of metagenome-assembled genomes and their statistics. All the assembled bins are included, including those marked as low-quality (following standard criteria, see Methods).
Pearson correlations between observed gene prevalence and theoretical prediction assuming neutral evolution.
About this article
Cite this article
Coelho, L.P., Alves, R., del Río, Á.R. et al. Towards the biogeography of prokaryotic genes. Nature 601, 252–256 (2022). https://doi.org/10.1038/s41586-021-04233-4
Nature Microbiology (2022)
A deep siamese neural network improves metagenome-assembled genomes in microbiome datasets across different environments
Nature Communications (2022)
The Epidermal Microbiome Within an Aggregation of Leopard Sharks (Triakis semifasciata) Has Taxonomic Flexibility with Gene Functional Stability Across Three Time-points
Microbial Ecology (2022)