A roadmap for natural product discovery based on large-scale genomics and metabolomics

Doroghazi, James R; Albright, Jessica C; Goering, Anthony W; Ju, Kou-San; Haines, Robert R; Tchalukov, Konstantin A; Labeda, David P; Kelleher, Neil L; Metcalf, William W

doi:10.1038/nchembio.1659

Article
Published: 28 September 2014

James R Doroghazi^1,
Jessica C Albright^2,3,4 (ORCID: orcid.org/0000-0001-6368-1331),
Anthony W Goering^2,3,4,
Kou-San Ju^1,
Robert R Haines^5,
Konstantin A Tchalukov^5,
David P Labeda^6,
Neil L Kelleher^2,3,4 &
William W Metcalf^1,5 (ORCID: orcid.org/0000-0002-0182-0671)

Nature Chemical Biology volume 10, pages 963–968 (2014)

Abstract

Actinobacteria encode a wealth of natural product biosynthetic gene clusters, whose systematic study is complicated by numerous repetitive motifs. By combining several metrics, we developed a method for the global classification of these gene clusters into families (GCFs) and analyzed the biosynthetic capacity of Actinobacteria in 830 genome sequences, including 344 obtained for this project. The GCF network, comprising 11,422 gene clusters grouped into 4,122 GCFs, was validated in hundreds of strains by correlating confident mass spectrometric detection of known small molecules with the presence or absence of their established biosynthetic gene clusters. The method also linked previously unassigned GCFs to known natural products, an approach that will enable de novo, bioassay-free discovery of new natural products using large data sets. Extrapolation from the 830-genome data set reveals that Actinobacteria encode hundreds of thousands of future drug leads, and the strong correlation between phylogeny and GCFs frames a roadmap to efficiently access them.
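The validation step described in the abstract, correlating detection of a known compound with the presence or absence of its gene cluster across strains, can be illustrated with a 2x2 contingency table and a phi coefficient. This is only a minimal sketch under stated assumptions: the strain data are invented toy values, the function names are hypothetical, and the phi coefficient is a generic association measure chosen for illustration, not necessarily the statistic used by the authors.

```python
from math import sqrt


def contingency(detected, has_gcf):
    """Count strains in each cell of the 2x2 detection-vs-presence table.

    detected[i] is True if the compound was observed by MS in strain i;
    has_gcf[i] is True if strain i carries the gene cluster family.
    Returns (both, ms_only, gcf_only, neither).
    """
    if len(detected) != len(has_gcf):
        raise ValueError("inputs must cover the same strains")
    pairs = list(zip(detected, has_gcf))
    both = sum(1 for d, g in pairs if d and g)
    ms_only = sum(1 for d, g in pairs if d and not g)
    gcf_only = sum(1 for d, g in pairs if g and not d)
    neither = len(pairs) - both - ms_only - gcf_only
    return both, ms_only, gcf_only, neither


def phi_coefficient(detected, has_gcf):
    """Phi coefficient of association between two binary vectors."""
    a, b, c, d = contingency(detected, has_gcf)
    denom = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    # A zero denominator means one variable is constant; report no association.
    return (a * d - b * c) / denom if denom else 0.0


# Toy data (hypothetical): eight strains, one compound, one GCF.
toy_detected = [True, True, True, False, False, False, False, False]
toy_has_gcf = [True, True, True, True, False, False, False, False]

print(contingency(toy_detected, toy_has_gcf))   # (3, 0, 1, 4)
print(round(phi_coefficient(toy_detected, toy_has_gcf), 3))
```

In this toy case one strain carries the cluster without detectable product, which lowers the association below 1; a compound detected in strains that lack the cluster (the `ms_only` cell) would be the stronger sign of a wrong assignment.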

**Figure 1: Similarity metrics for NPGC comparisons.**

**Figure 2: Genomic NPGC content and extrapolation.**

**Figure 3: GCF conservation over genetic distance.**

Accession codes

BioProject: PRJNA238534

Acknowledgements

J.R.D. was funded through an Institute for Genomic Biology fellowship. This work was supported in part by US National Institutes of Health grants GM PO1 GM077596 and GM 067725 (N.L.K.) and an Institute for Genomic Biology Proof of Concept grant. D.P.L. and the Agricultural Research Service (ARS) Culture Collection Current Research Information System project are funded through ARS National Program 301.

Author information

James R Doroghazi and Jessica C Albright: These authors contributed equally to this work.

Authors and Affiliations

1. Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA: James R Doroghazi, Kou-San Ju & William W Metcalf
2. Department of Chemistry, Northwestern University, Evanston, Illinois, USA: Jessica C Albright, Anthony W Goering & Neil L Kelleher
3. Department of Molecular Biosciences, Northwestern University, Evanston, Illinois, USA: Jessica C Albright, Anthony W Goering & Neil L Kelleher
4. Feinberg School of Medicine, Northwestern University, Evanston, Illinois, USA: Jessica C Albright, Anthony W Goering & Neil L Kelleher
5. Department of Microbiology, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA: Robert R Haines, Konstantin A Tchalukov & William W Metcalf
6. US Department of Agriculture, Bacterial Foodborne Pathogens and Mycology Research, Agricultural Research Service, National Center for Agricultural Utilization Research, Peoria, Illinois, USA: David P Labeda

Contributions

J.R.D. designed and performed bioinformatic analyses. J.R.D., R.R.H. and K.A.T. produced microbial extracts. J.C.A. designed LC/MS experiments, and J.C.A. and A.W.G. collected and analyzed LC/MS data. K.-S.J. received and processed germplasm and contributed to genomic library preparation. D.P.L. selected and provided germplasm from the ARS Culture Collection. W.W.M. and N.L.K. designed and directed the work. J.R.D., J.C.A., W.W.M. and N.L.K. wrote the manuscript. Draft genomes sequenced as part of this project are available through NCBI BioProject PRJNA238534.

Corresponding authors

Correspondence to Neil L Kelleher or William W Metcalf.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Text and Figures

Supplementary Results, Supplementary Tables 1–3, Supplementary Figures 1–6 and Supplementary Note. (PDF 2814 kb)

Supplementary Data Set 1

A spreadsheet that lists the locus tags and corresponding organism for each genome in this study. (XLSX 33 kb)

Supplementary Data Set 2

All gene cluster families that have at least one characterized natural product are listed in this Excel spreadsheet. (XLSX 50 kb)

Supplementary Data Set 3

A spreadsheet containing raw MS data for the known compounds identified in this study. (XLSX 33 kb)

About this article

Cite this article

Doroghazi, J., Albright, J., Goering, A. et al. A roadmap for natural product discovery based on large-scale genomics and metabolomics. Nat Chem Biol 10, 963–968 (2014). https://doi.org/10.1038/nchembio.1659

Received: 11 March 2014
Accepted: 04 September 2014
Published: 28 September 2014
Issue Date: November 2014
DOI: https://doi.org/10.1038/nchembio.1659