Bacterial specialized metabolites are a proven source of antibiotics and cancer therapies, but whether we have sampled all the secondary metabolite chemical diversity of cultivated bacteria is not known. We analysed ~170,000 bacterial genomes and ~47,000 metagenome-assembled genomes (MAGs) using a modified BiG-SLiCE and the new clust-o-matic algorithm. We estimate that only 3% of the natural products potentially encoded in bacterial genomes have been experimentally characterized. We show that the variation in secondary metabolite biosynthetic diversity drops significantly at the genus level, identifying it as an appropriate taxonomic rank for comparison. Equal comparison of genera based on relative evolutionary distance revealed that Streptomyces bacteria encode the largest biosynthetic diversity by far, with Amycolatopsis, Kutzneria and Micromonospora also encoding substantial diversity. Finally, we find that several less-well-studied taxa, such as Weeksellaceae (Bacteroidota), Myxococcaceae (Myxococcota), Pleurocapsa and Nostocaceae (Cyanobacteria), have the potential to produce highly diverse sets of secondary metabolites that warrant further investigation.
The clust-o-matic code is available at https://github.com/Helmholtz-HIPS.
The modified BiG-SLiCE script, which takes a regular BiG-SLiCE output folder as input and writes the GCF membership to a TSV file, is available both in our Zenodo repository (file name: perform_l2norm_clustering.py) and at the following link: https://github.com/medema-group/bigslice/blob/master/misc/useful_scripts/perform_l2norm_clustering.py.
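As an illustration of the idea suggested by the script's file name, the following is a minimal sketch of L2 normalization followed by threshold-based grouping, with the membership written as a TSV table. It is not the authors' script: the greedy assignment rule, the function names, the column names and the toy feature vectors are all assumptions made for this example, and the real script reads a BiG-SLiCE output folder rather than an in-memory dictionary.

```python
import csv
import io
import math


def l2_normalize(vector):
    """Scale a feature vector to unit Euclidean length (zero vectors stay unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return list(vector) if norm == 0 else [x / norm for x in vector]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_gcfs(features, threshold):
    """Greedy stand-in for clustering: a BGC joins the first GCF whose seed
    vector lies within `threshold`; otherwise it seeds a new GCF."""
    seeds = []        # normalized seed vector of each GCF
    membership = {}   # BGC id -> GCF id
    for bgc_id, vector in features.items():
        unit = l2_normalize(vector)
        for gcf_id, seed in enumerate(seeds):
            if euclidean(unit, seed) <= threshold:
                membership[bgc_id] = gcf_id
                break
        else:
            seeds.append(unit)
            membership[bgc_id] = len(seeds) - 1
    return membership


def write_membership_tsv(membership, handle):
    """Write one row per BGC with its assigned GCF (hypothetical column names)."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["bgc_id", "gcf_id"])
    for bgc_id, gcf_id in membership.items():
        writer.writerow([bgc_id, gcf_id])


# Made-up feature vectors, not real BiG-SLiCE features.
toy_features = {
    "bgc_a": [3.0, 4.0, 0.0],
    "bgc_b": [6.0, 8.0, 0.0],  # same direction as bgc_a, different magnitude
    "bgc_c": [0.0, 0.0, 5.0],
}
toy_membership = assign_gcfs(toy_features, threshold=0.4)
buffer = io.StringIO()
write_membership_tsv(toy_membership, buffer)
```

In this toy input, bgc_a and bgc_b differ only in magnitude, so after normalization they coincide and fall into the same group, whereas bgc_c seeds its own group.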
A.G. is grateful for the support of the Deutsche Forschungsgemeinschaft (DFG; Project ID No. 398967434-TRR 261). N. Ziemert was supported by the German Center for Infection Research (DZIF) (TTU 09.716). M.H.M. was supported by a European Research Council Starting Grant 948770-DECIPHER. S.K. was supported by the Graduate School for Experimental Plant Sciences (EPS) of Wageningen University. Work in the lab of R.M. was supported by BMBF (16GW0243), DFG and DZIF (807-5-8-0982600). A.G. and N. Ziemert thank the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy – EXC 2124 – 390838134 for the infrastructural support. A.G. thanks M. Direnc Mungan for valued discussions on optimizing the analysis, as well as C. Bagci for his imaginative suggestion on dealing with large data. We also thank L. do Presti for invaluable comments on the manuscript.
M.H.M. is a co-founder of Design Pharmaceuticals and a member of the scientific advisory board of Hexagon Bio. The other authors declare no competing interests.
Peer review information
Nature Microbiology thanks Nigel Mouncey and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended Data Fig. 1 Illustrating the correlation between BGC clustering thresholds and the grouping of their pathway products.
a) A snippet of a complete-linkage hierarchical dendrogram constructed from pairwise distance comparisons of L2-normalized BGC features within the MIBiG dataset, highlighting the grouping of the BGCs for the enediynes Uncialamycin (UCM) and Tiancimycin (TNM) under the threshold T = 0.5, and their further grouping with another related enediyne BGC, Dynemicin (DNM), under the looser threshold of T = 0.7. b) Comparative gene analysis generated using the clinker tool v0.0.23 (ref. 92) shows that the UCM and TNM BGCs are much more similar to each other than to DNM (same-colored genes indicate >70% amino acid similarity, while colored edges indicate >50% amino acid similarity), which is consistent with the structural diversity of their compounds (pictured).
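The effect of the two thresholds in panel a can be sketched with a small complete-linkage clustering on L2-normalized vectors. This is an illustration under stated assumptions only: the three two-dimensional vectors are invented so that the toy reproduces the grouping pattern described in the caption, they are not the real MIBiG features of UCM, TNM and DNM, and the function names are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def complete_linkage_clusters(points, threshold):
    """Agglomerative clustering: repeatedly merge the two clusters with the
    smallest complete-linkage distance (largest pairwise distance between
    their members) until that distance exceeds `threshold`."""
    clusters = [[name] for name in points]

    def linkage(c1, c2):
        return max(euclidean(points[a], points[b]) for a in c1 for b in c2)

    while len(clusters) > 1:
        distance, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if distance > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(cluster) for cluster in clusters)


# Invented toy vectors (NOT real BGC features), chosen only for illustration.
raw_features = {
    "UCM": [2.0, 0.0],
    "TNM": [2.95, 0.52],
    "DNM": [0.77, 0.64],
}
unit_features = {name: l2_normalize(v) for name, v in raw_features.items()}

tight = complete_linkage_clusters(unit_features, threshold=0.5)
loose = complete_linkage_clusters(unit_features, threshold=0.7)
```

With these toy values, the stricter threshold keeps DNM apart from the UCM and TNM pair, and the looser threshold merges all three into one group.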
Extended Data Fig. 2 Intersections and distribution of biosynthetic diversity values among different ecosystem types.
The bar plot on the left depicts the number of Gene Cluster Families (GCFs as defined by BiG-SLiCE with T = 0.4) found in each biome type. The bar plot on top shows the size (number of GCFs) of each intersection. The matrix below the bar plot shows which sets (biome types) are included in each intersection, with dark dots marking the included sets. If more than one set is part of an intersection, connecting lines are drawn for better visibility. The data presented in this graph come only from the MAGs in the GEMS dataset (see Supplementary Table 1), which was the only one with sufficient metadata. Only the top 63 most sizable intersections are depicted here, and only the 35 ecosystem types (with the most GCFs out of the 63) that were part of them are shown on the left. The data indicate that there is barely any overlap between the ecosystem types: most GCFs (74.43%) are specific to a single biome (a complete overview of unique GCFs per ecosystem type can be found in Supplementary Table 7), while the largest intersection (the one including the most habitats, not visible in this figure) includes 50 of the 63 ecosystem types.
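The intersection sizes and the share of biome-specific GCFs described above can be derived from a simple table of GCF and biome observations. The following is a minimal sketch with an invented toy input; the function names and the (GCF, biome) pair layout are assumptions for illustration, not the format of the source data.

```python
from collections import Counter, defaultdict


def biome_intersections(observations):
    """Map each exact combination of biomes to the number of GCFs that occur
    in precisely that combination (the exclusive intersections of an UpSet plot).

    `observations` is an iterable of (gcf_id, biome) pairs; duplicates are allowed.
    """
    biomes_per_gcf = defaultdict(set)
    for gcf_id, biome in observations:
        biomes_per_gcf[gcf_id].add(biome)
    return Counter(frozenset(biomes) for biomes in biomes_per_gcf.values())


def single_biome_fraction(intersections):
    """Fraction of GCFs that are found in exactly one biome."""
    total = sum(intersections.values())
    single = sum(n for biomes, n in intersections.items() if len(biomes) == 1)
    return single / total


# Invented toy observations, not data from the study.
toy_observations = [
    ("gcf1", "soil"), ("gcf2", "soil"), ("gcf3", "marine"),
    ("gcf4", "soil"), ("gcf4", "marine"), ("gcf1", "soil"),
]
counts = biome_intersections(toy_observations)
fraction = single_biome_fraction(counts)
```

In the toy input, three of the four GCFs occur in a single biome, so the biome-specific fraction is 0.75; sorting `counts` by value would give the bar order of the top plot.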
Extended Data Fig. 3 Overview of actual and potential biosynthetic diversity of the bacterial kingdom, compared at REDgroup level.
Extended Data Fig. 3 is interactive and can be accessed online on iTOL: https://itol.embl.de/shared/1B6W5n9MixSdJ. GTDB bacterial tree up to REDgroup level (for more details see Methods - REDgroup definition), colour-coded by phylum, decorated with barplots of actual (orange) and potential (purple) Gene Cluster Families (GCFs) as defined by BiG-SLiCE (T = 0.4). Potential GCFs were computed by rarefaction analyses (for more details see Results - Well known and less popular taxa as sources of biosynthetic diversity). REDgroup names are displayed around the tree as leaf node labels; hovering over them provides further taxonomic information (for full REDgroup metadata see Supplementary Table 8). Phyla known to be enriched in NP producers are immediately visible (Actinobacteriota, Proteobacteria), with the most promising groups coming from the Actinobacteriota phylum (the highest peak belongs to a REDgroup containing Streptomyces strains). At the same time, the underexplored phyla also show notable biosynthetic diversity and potential. This figure is meant to be explored by zooming in and out, searching for keywords and visualizing different kinds of information by switching between Tree Views. Any other attempt at modification (for example, turning datasets on and off) may result in an unreadable graph.
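To give a sense of how a potential richness value can exceed the observed one, the sketch below applies the bias-corrected Chao1 estimator to the number of BGCs observed per GCF. This is a simpler stand-in chosen for illustration, not the rarefaction procedure used for the figure; the function name and the toy counts are assumptions.

```python
from collections import Counter


def chao1_bias_corrected(bgcs_per_gcf):
    """Bias-corrected Chao1 richness estimate.

    S_est = S_obs + f1 * (f1 - 1) / (2 * (f2 + 1)), where f1 and f2 are the
    numbers of GCFs observed exactly once and exactly twice.
    """
    counts = [c for c in bgcs_per_gcf if c > 0]
    observed = len(counts)
    frequency = Counter(counts)
    singletons, doubletons = frequency[1], frequency[2]
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))


# Invented toy data: number of BGCs seen for each of eight GCFs in one group.
toy_counts = [5, 3, 1, 1, 1, 2, 2, 1]
estimate = chao1_bias_corrected(toy_counts)
```

Here eight GCFs are observed, four of them once and two of them twice, giving an estimate of 10 GCFs; many singletons relative to doubletons signal that further sampling would likely reveal more families.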
Unique GCFs, as defined by BiG-SLiCE (T = 0.4), of bacterial phyla and Streptomyces (solid shapes) and pairwise overlaps of phyla - phyla and phyla - Streptomyces (ribbons). Each taxon has a distinct colour. The genus Streptomyces (1) appears to have a very high number of unique GCFs, comparable to that of entire phyla such as Proteobacteria (43).
Supplementary Methods and Figs. 1–7.
Supplementary Tables 1–5 are available in the project’s Zenodo repository (https://doi.org/10.5281/zenodo.6365726).
Table 6. BGC IDs, MIBiG IDs, producer GTDB-based taxonomic information and GCF assignment (for BiG-SLiCE T = 0.4) for all MIBiG BGCs included in the creation of Fig. 1c.
Table 7. Biogeography analysis of the Nayfach MAGs dataset. Number of genomes, BGCs, GCFs and unique GCFs per ecosystem type (as defined in the corresponding paper).
Table 8. REDgroup full metadata: node IDs (can be used in the exploration of the tree in Extended Data Fig. 3), labels, number of members, number of BGCs, number of GCFs and potential GCFs (pGCFs) as defined by BiG-SLiCE (T = 0.4) and clust-o-matic (T = 0.5), GTDB taxonomic information and number of products in the NPASS database whose producer is a member of the REDgroup (NPASS_hits).
Table 9. Comparison of random sampling analysis results to original results, including node IDs, labels, number of members, number of BGCs, number of GCFs and potential GCFs (pGCFs) as defined by BiG-SLiCE (T = 0.4), the original ranking based on the pGCFs, the average pGCFs from all random sampling iterations, the ranking based on the random sampling and GTDB taxonomic information.
Source Data for Fig. 1.
Source Data for Fig. 2 except for the tree in a, which is provided in the Zenodo repository.
Source Data for Fig. 3.
Source Data for Fig. 4 except for the tree in a and the source data of b, which are provided in the Zenodo repository.
Source Data for Fig. 5.
Source Data for Extended Data Fig. 1. Accompanying source data are provided in the Zenodo repository.
Source Data for Extended Data Fig. 2.
Source Data for Extended Data Fig. 3 except for the tree, which is provided in the Zenodo repository.
Source Data for Extended Data Fig. 4.
Cite this article
Gavriilidou, A., Kautsar, S.A., Zaburannyi, N. et al. Compendium of specialized metabolite biosynthetic diversity encoded in bacterial genomes. Nat Microbiol 7, 726–735 (2022). https://doi.org/10.1038/s41564-022-01110-2