Developments in computational omics technologies have provided new means to access the hidden diversity of natural products, unearthing new potential for drug discovery. In parallel, artificial intelligence approaches such as machine learning have led to exciting developments in the computational drug design field, facilitating biological activity prediction and de novo drug design for molecular targets of interest. Here, we describe current and future synergies between these developments to effectively identify drug candidates from the plethora of molecules produced by nature. We also discuss how to address key challenges in realizing the potential of these synergies, such as the need for high-quality datasets to train deep learning algorithms and appropriate strategies for algorithm validation.
This is a preview of subscription content, access via your institution
Access Nature and 54 other Nature Portfolio journals
Get Nature+, our best-value online-access subscription
$29.99 / 30 days
cancel any time
Subscribe to this journal
Receive 12 print issues and online access
$209.00 per year
only $17.42 per issue
Rent or buy this article
Prices vary by article type
Prices may be subject to local taxes which are calculated during checkout
Dobson, P. D., Patel, Y. & Kell, D. B. ‘Metabolite-likeness’ as a criterion in the design and selection of pharmaceutical drug libraries. Drug Discov. Today 14, 31–40 (2009).
Newman, D. J. & Cragg, G. M. Natural products as sources of new drugs over the nearly four decades from 01/1981 to 09/2019. J. Nat. Prod. 83, 770–803 (2020).
Koehn, F. E. & Carter, G. T. The evolving role of natural products in drug discovery. Nat. Rev. Drug. Discov. 4, 206–220 (2005).
Terlouw, B. R. et al. MIBiG 3.0: a community-driven effort to annotate experimentally validated biosynthetic gene clusters. Nucleic Acids Res. 51, D603–D610 (2023).
Gavriilidou, A. et al. Compendium of specialized metabolite biosynthetic diversity encoded in bacterial genomes. Nat. Microbiol. 7, 726–735 (2022).
van der Hooft, J. J. J. et al. Linking genomics and metabolomics to chart specialized metabolic diversity. Chem. Soc. Rev. 49, 3297–3314 (2020).
Doerr, S. et al. TorchMD: a deep learning framework for molecular simulations. J. Chem. Theory Comput. 17, 2355–2363 (2021).
Rodríguez-Espigares, I. et al. GPCRmd uncovers the dynamics of the 3D-GPCRome. Nat. Methods 17, 777–787 (2020).
Liu, X., IJzerman, A. P. & van Westen, G. J. P. Computational approaches for de novo drug design: past, present, and future. Methods Mol. Biol. 2190, 139–165 (2021).
Choudhury, C., Arul Murugan, N. & Priyakumar, U. D. Structure-based drug repurposing: traditional and advanced AI/ML-aided methods. Drug Discov. Today 27, 1847–1861 (2022).
Blin, K. et al. antiSMASH 6.0: improving cluster detection and comparison capabilities. Nucleic Acids Res. 49, W29–W35 (2021).
Skinnider, M. A. et al. Comprehensive prediction of secondary metabolite structure and biological activity from microbial genome sequences. Nat. Commun. 11, 6058 (2020).
Medema, M. H. & Fischbach, M. A. Computational approaches to natural product discovery. Nat. Chem. Biol. 11, 639–648 (2015).
Medema, M. H., de Rond, T. & Moore, B. S. Mining genomes to illuminate the specialized chemistry of life. Nat. Rev. Genet. 22, 553–571 (2021).
Cimermancic, P. et al. Insights into secondary metabolism from a global analysis of prokaryotic biosynthetic gene clusters. Cell 158, 412–421 (2014).
Hannigan, G. D. et al. A deep learning genome-mining strategy for biosynthetic gene cluster prediction. Nucleic Acids Res. 47, e110 (2019).
Carroll, L. M. et al. Accurate de novo identification of biosynthetic gene clusters with GECCO. Preprint at bioRxiv https://doi.org/10.1101/2021.05.03.442509 (2021).
Sanchez, S. et al. Expansion of novel biosynthetic gene clusters from diverse environments using SanntiS. Preprint at bioRxiv https://doi.org/10.1101/2023.05.23.540769 (2023).
Kloosterman, A. M. et al. Expansion of RiPP biosynthetic space through integration of pan-genomics and machine learning uncovers a novel class of lanthipeptides. PLoS Biol. 18, e3001026 (2020).
de Los Santos, E. L. C. NeuRiPP: neural network identification of RiPP precursor peptides. Sci. Rep. 9, 13406 (2019).
Merwin, N. J. et al. DeepRiPP integrates multiomics data to automate discovery of novel ribosomally synthesized natural products. Proc. Natl Acad. Sci. USA 117, 371–380 (2020).
Tietz, J. I. et al. A new genome-mining tool redefines the lasso peptide biosynthetic landscape. Nat. Chem. Biol. 13, 470–478 (2017).
Louwen, J. J. R. & van der Hooft, J. J. J. Comprehensive large-scale integrative analysis of omics data to accelerate specialized metabolite discovery. mSystems 6, e0072621 (2021).
Huber, F. et al. Spec2Vec: improved mass spectral similarity scoring through learning of structural relationships. PLoS Comput. Biol. 17, e1008724 (2021).
Huber, F., van der Burg, S., van der Hooft, J. J. J. & Ridder, L. MS2DeepScore: a novel deep learning similarity measure to compare tandem mass spectra. J. Cheminform. 13, 84 (2021).
Ludwig, M. et al. Databse-independent molecular formula annotation using Gibbs sampling through ZODIAC. Nat. Mach. Intell. 2, 629–641 (2020).
Hoffmann, M. A. et al. High-confidence structural annotation of metabolites absent from spectral libraries. Nat. Biotechnol. 40, 411–421 (2022).
Dührkop, K. et al. Systematic classification of unknown metabolites using high-resolution fragmentation mass spectra. Nat. Biotechnol. 39, 462–471 (2021).
Kim, H. W. et al. NPClassifier: a deep neural network-based structural classification tool for natural products. J. Nat. Prod. 84, 2795–2807 (2021).
Aalizadeh, R., Nika, M.-C. & Thomaidis, N. S. Development and application of retention time prediction models in the suspect and non-target screening of emerging contaminants. J. Hazard. Mater. 363, 277–285 (2019).
Chen, D., Wang, Z., Guo, D., Orekhov, V. & Qu, X. Review and prospect: deep learning in nuclear magnetic resonance spectroscopy. Chemistry 26, 10391–10401 (2020).
Wu, K. et al. Improvement in signal-to-noise ratio of liquid-state NMR spectroscopy via a deep neural network DN-unet. Anal. Chem. 93, 1377–1382 (2021).
Ito, K., Xu, X. & Kikuchi, J. Improved prediction of carbonless NMR spectra by the machine learning of theoretical and fragment descriptors for environmental mixture analysis. Anal. Chem. 93, 6901–6906 (2021).
Li, D.-W., Hansen, A. L., Yuan, C., Bruschweiler-Li, L. & Brüschweiler, R. DEEP picker is a deep neural network for accurate deconvolution of complex two-dimensional NMR spectra. Nat. Commun. 12, 5229 (2021).
Zheng, S. et al. Deep learning driven biosynthetic pathways navigation for natural products with BioNavi-NP. Nat. Commun. 13, 3342 (2022).
Milanowski, D. J. et al. Unequivocal determination of caulamidines A and B: application and validation of new tools in the structure elucidation tool box. Chem. Sci. 9, 307–314 (2018).
Audoin, C. et al. Metabolome consistency: additional parazoanthines from the mediterranean zoanthid parazoanthus axinellae. Metabolites 4, 421–432 (2014).
Fox Ramos, A. E. et al. CANPA: computer-assisted natural products anticipation. Anal. Chem. 91, 11247–11252 (2019).
Jones, C. G. et al. The CryoEM method MicroED as a powerful tool for small molecule structure determination. ACS Cent. Sci. 4, 1587–1592 (2018).
Kim, L. J. et al. Prospecting for natural products by genome mining and microcrystal electron diffraction. Nat. Chem. Biol. 17, 872–877 (2021).
Dührkop, K., Shen, H., Meusel, M., Rousu, J. & Böcker, S. Searching molecular structure databases with tandem mass spectra using CSI:fingerID. Proc. Natl Acad. Sci. USA 112, 12580–12585 (2015).
Lindsay, R. K. Applications of Artificial Intelligence for Organic Chemistry: The DENDRAL Project (McGraw-Hill, 1980).
Dührkop, K. et al. SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information. Nat. Methods 16, 299–302 (2019).
Stravs, M. A., Dührkop, K., Böcker, S. & Zamboni, N. MSNovelist: de novo structure generation from mass spectra. Nat. Methods 19, 865–870 (2022).
Colby, S. M., Nuñez, J. R., Hodas, N. O., Corley, C. D. & Renslow, R. R. Deep learning to generate chemical property libraries and candidate molecules for small molecule identification in complex samples. Anal. Chem. 92, 1720–1729 (2020).
Burns, D. C., Mazzola, E. P. & Reynolds, W. F. The role of computer-assisted structure elucidation (CASE) programs in the structure elucidation of complex natural products. Nat. Prod. Rep. 36, 919–933 (2019).
Reher, R. et al. A convolutional neural network-based approach for the rapid annotation of molecularly diverse natural products. J. Am. Chem. Soc. 142, 4114–4120 (2020).
Kim, H. W., Zhang, C., Cottrell, G. W. & Gerwick, W. H. SMART‐Miner: a convolutional neural network‐based metabolite identification from 1H‐13C HSQC spectra. Magn. Reson. Chem. 60, 1070–1075 (2022).
Wang, C. et al. COLMAR lipids web server and ultrahigh-resolution methods for two-dimensional nuclear magnetic resonance- and mass spectrometry-based lipidomics. J. Proteome Res. 19, 1674–1683 (2020).
Smith, S. G. & Goodman, J. M. Assigning stereochemistry to single diastereoisomers by GIAO NMR calculation: the DP4 probability. J. Am. Chem. Soc. 132, 12946–12959 (2010).
Howarth, A., Ermanis, K. & Goodman, J. DP4-AI automated NMR data analysis: straight from spectrometer to structure. Chem. Sci. 11, 4351–4359 (2020).
Das, S., Edison, A. S. & Merz, K. M. Jr. Metabolite structure assignment using in silico NMR techniques. Anal. Chem. 92, 10412–10419 (2020).
Rodrigues, T., Reker, D., Schneider, P. & Schneider, G. Counting on natural products for drug design. Nat. Chem. 8, 531–541 (2016).
Lanz, J. & Riedl, R. Merging allosteric and active site binding motifs: de novo generation of target selectivity and potency via natural-product-derived fragments. ChemMedChem 10, 451–454 (2015).
Reker, D. et al. Revealing the macromolecular targets of complex natural products. Nat. Chem. 6, 1072–1078 (2014).
Wassermann, A. M. et al. A screening pattern recognition method finds new and divergent targets for drugs and natural products. ACS Chem. Biol. 9, 1622–1631 (2014).
Rollinger, J. M., Hornick, A., Langer, T., Stuppner, H. & Prast, H. Acetylcholinesterase inhibitory activity of scopolin and scopoletin discovered by virtual screening of natural products. J. Med. Chem. 47, 6248–6254 (2004).
Reker, D. et al. Machine learning uncovers food- and excipient-drug interactions. Cell Rep. 30, 3710–3716.e4 (2020).
Conde, J. et al. Allosteric antagonist modulation of TRPV2 by piperlongumine impairs glioblastoma progression. ACS Cent. Sci. 7, 868–881 (2021).
Lagunin, A., Filimonov, D. & Poroikov, V. Multi-targeted natural products evaluation based on biological activity prediction with PASS. Curr. Pharm. Des. 16, 1703–1717 (2010).
Sá, M. S. et al. Antimalarial activity of physalins B, D, F, and G. J. Nat. Prod. 74, 2269–2272 (2011).
Schneider, G. et al. Deorphaning the macromolecular targets of the natural anticancer compound doliculide. Angew. Chem. Int. Ed. Engl. 55, 12408–12411 (2016).
Bertoni, M. et al. Bioactivity descriptors for uncharacterized chemical compounds. Nat. Commun. 12, 3932 (2021).
Stokes, J. M. et al. A deep learning approach to antibiotic discovery. Cell 181, 475–483 (2020).
Yang, K. et al. Analyzing learned molecular representations for property prediction. J. Chem. Inf. Model. 59, 3370–3388 (2019).
Liu, G. et al. Deep learning-guided discovery of an antibiotic targeting Acinetobacter baumannii. Nat. Chem. Biol. https://doi.org/10.1038/s41589-023-01349-8 (2023).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Pandey, M. et al. The transformational role of GPU computing and deep learning in drug discovery. Nat. Mach. Intell. 4, 211–221 (2022).
Schindler, C. E. M. et al. Large-scale assessment of binding free energy calculations in active drug discovery projects. J. Chem. Inf. Model. 60, 5457–5474 (2020).
Walker, A. S. & Clardy, J. A machine learning bioinformatics method to predict biological activity from biosynthetic gene clusters. J. Chem. Inf. Model. 61, 2560–2571 (2021).
Yang, Z. et al. Deep-BGCpred: a unified deep learning genome-mining framework for biosynthetic gene cluster prediction. Preprint at bioRxiv https://doi.org/10.1101/2021.11.15.468547 (2021).
Mikolov, T., Chen, K., Corrado, G. & Dean, J. Efficient estimation of word representations in vector space. Preprint at arXiv. https://doi.org/10.48550/ARXIV.1301.3781 (2013).
Thaker, M. N. et al. Identifying producers of antibacterial compounds by screening for antibiotic resistance. Nat. Biotechnol. 31, 922–927 (2013).
Alcock, B. P. et al. CARD 2020: antibiotic resistome surveillance with the comprehensive antibiotic resistance database. Nucleic Acids Res. 48, D517–D525 (2020).
Bortolaia, V. et al. ResFinder 4.0 for predictions of phenotypes from genotypes. J. Antimicrob. Chemother. 75, 3491–3500 (2020).
Mungan, M. D. et al. ARTS 2.0: feature updates and expansion of the antibiotic resistant target seeker for comparative genome mining. Nucleic Acids Res. 48, W546–W552 (2020).
Jia, B. et al. CARD 2017: expansion and model-centric curation of the comprehensive antibiotic resistance database. Nucleic Acids Res. 45, D566–D573 (2017).
Sélem-Mojica, N., Aguilar, C., Gutiérrez-García, K., Martínez-Guerrero, C. E. & Barona-Gómez, F. EvoMining reveals the origin and fate of natural product biosynthetic enzymes. Microb. Genom. 5, e000260 (2019).
Chevrette, M. G. et al. Evolutionary dynamics of natural product biosynthesis in bacteria. Nat. Prod. Rep. 37, 566–599 (2020).
Cereto-Massagué, A. et al. Molecular fingerprint similarity search in virtual screening. Methods 71, 58–63 (2015).
Willighagen, E. L. et al. The chemistry development kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching. J. Cheminform. 9, 33 (2017).
Todeschini, R. & Consonni, V. Handbook of Molecular Descriptors (John Wiley & Sons, 2008).
Skinnider, M. A., Dejong, C. A., Franczak, B. C., McNicholas, P. D. & Magarvey, N. A. Comparative analysis of chemical similarity methods for modular natural products with a hypothetical structure enumeration algorithm. J. Cheminform. 9, 46 (2017).
Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).
Riniker, S. & Landrum, G. A. Open-source platform to benchmark fingerprints for ligand-based virtual screening. J. Cheminform. 5, 26 (2013).
O’Boyle, N. M. & Sayle, R. A. Comparing structural fingerprints using a literature-based similarity benchmark. J. Cheminform. 8, 36 (2016).
Grisoni, F. et al. Scaffold hopping from natural products to synthetic mimetics by holistic molecular similarity. Commun. Chem. 1, 44 (2018).
Capecchi, A., Probst, D. & Reymond, J.-L. One molecular fingerprint to rule them all: drugs, biomolecules, and the metabolome. J. Cheminform. 12, 43 (2020).
Capecchi, A. & Reymond, J.-L. Assigning the origin of microbial natural products by chemical space map and machine learning. Biomolecules 10, 1385 (2020).
Riniker, S. Molecular dynamics fingerprints (MDFP): machine learning from MD data to predict free-energy differences. J. Chem. Inf. Model. 57, 726–741 (2017).
Esposito, C., Wang, S., Lange, U. E. W., Oellien, F. & Riniker, S. Combining machine learning and molecular dynamics to predict p-glycoprotein substrates. J. Chem. Inf. Model. 60, 4730–4749 (2020).
Bannan, C. C. et al. Blind prediction of cyclohexane–water distribution coefficients from the SAMPL5 challenge. J. Comput. Aided Mol. Des. 30, 927–944 (2016).
Wang, S. & Riniker, S. Use of molecular dynamics fingerprints (MDFPs) in SAMPL6 octanol-water log P blind challenge. J. Comput. Aided Mol. Des. 34, 393–403 (2020).
Gorostiola González, M. et al. 3DDPDs: describing protein dynamics for proteochemometric bioactivity prediction. A case for (mutant) G protein-coupled receptors. Preprint at ChemRxiv https://doi.org/10.26434/chemrxiv-2023-90082 (2023).
Durairaj, J., Akdel, M., de Ridder, D. & van Dijk, A. D. J. Geometricus represents protein structures as shape-mers derived from moment invariants. Bioinformatics 36, i718–i725 (2020).
Paull, K. D. et al. Display and analysis of patterns of differential activity of drugs against human tumor cell lines: development of mean graph and COMPARE algorithm. J. Natl Cancer Inst. 81, 1088–1092 (1989).
Kauvar, L. M. et al. Predicting ligand binding to proteins by affinity fingerprinting. Chem. Biol. 2, 107–118 (1995).
Petrone, P. M. et al. Rethinking molecular similarity: comparing compounds on the basis of biological activity. ACS Chem. Biol. 7, 1399–1409 (2012).
Norinder, U., Spjuth, O. & Svensson, F. Using predicted bioactivity profiles to improve predictive modeling. J. Chem. Inf. Model. 60, 2830–2837 (2020).
Mater, A. C. & Coote, M. L. Deep learning in chemistry. J. Chem. Inf. Model. 59, 2545–2559 (2019).
Bronstein, M. M., Bruna, J., Cohen, T. & Veličković, P. Geometric deep learning: grids, groups, graphs, geodesics, and gauges. Preprint at arXiv. https://doi.org/10.48550/arXiv.2104.13478 (2021).
Wu, Z. et al. MoleculeNet: a benchmark for molecular machine learning. Chem. Sci. 9, 513–530 (2018).
van Tilborg, D., Alenicheva, A. & Grisoni, F. Exposing the limitations of molecular machine learning with activity cliffs. J. Chem. Inf. Model. 62, 5938–5951 (2022).
Samek, W., Montavon, G., Vedaldi, A., Hansen, L. K. & Müller, K.-R. Explainable AI: Interpreting, Explaining and Visualizing Deep Learning (Springer Nature, 2019).
Jiménez-Luna, J., Grisoni, F. & Schneider, G. Drug discovery with explainable artificial intelligence. Nat. Mach. Intell. 2, 573–584 (2020).
Jiménez-Luna, J., Skalic, M., Weskamp, N. & Schneider, G. Coloring molecules with explainable artificial intelligence for preclinical relevance assessment. J. Chem. Inf. Model. 61, 1083–1094 (2021).
Preuer, K., Klambauer, G., Rippmann, F., Hochreiter, S. & Unterthiner, T. in Explainable AI: Interpreting, Explaining and Visualizing Deep Learning (eds Samek, W., Montavon, G., Vedaldi, A., Hansen, L. K. & Müller, K.-R.) 331–345 (Springer International Publishing, 2019).
Webel, H. E. et al. Revealing cytotoxic substructures in molecules using deep learning. J. Comput. Aided Mol. Des. 34, 731–746 (2020).
Kearnes, S., McCloskey, K., Berndl, M., Pande, V. & Riley, P. Molecular graph convolutions: moving beyond fingerprints. J. Comput. Aided Mol. Des. 30, 595–608 (2016).
Coley, C. W., Barzilay, R., Green, W. H., Jaakkola, T. S. & Jensen, K. F. Convolutional embedding of attributed molecular graphs for physical property prediction. J. Chem. Inf. Model. 57, 1757–1772 (2017).
Duvenaud, D. et al. Convolutional networks on graphs for learning molecular fingerprints. in Advances in Neural Information Processing Systems 28 (NIPS 015).
Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O. & Dahl, G. E. in Proceedings of the 34th International Conference on Machine Learning 1263–1272 (2017).
Nguyen, T. et al. GraphDTA: predicting drug-target binding affinity with graph neural networks. Bioinformatics 37, 1140–1147 (2021).
Yuan, W. et al. Chemical space mimicry for drug discovery. J. Chem. Inf. Model. 57, 875–882 (2017).
Segler, M. H. S., Kogej, T., Tyrchan, C. & Waller, M. P. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Cent. Sci. 4, 120–131 (2018).
Liu, X., Ye, K., van Vlijmen, H. W. T., IJzerman, A. P. & van Westen, G. J. P. DrugEx v3: scaffold-constrained drug design with graph transformer-based reinforcement learning. J. Cheminform. 15, 24 (2023).
Li, X. & Fourches, D. Inductive transfer learning for molecular activity prediction: next-gen QSAR models with MolPMoFiT. J. Cheminform. 12, 27 (2020).
Karpov, P., Godin, G. & Tetko, I. V. Transformer-CNN: Swiss knife for QSAR modeling and interpretation. J. Cheminform. 12, 17 (2020).
Gainza, P. et al. Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning. Nat. Methods 17, 184–192 (2020).
Winter, R., Montanari, F., Noé, F. & Clevert, D.-A. Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations. Chem. Sci. 10, 1692–1701 (2019).
Bjerrum, E. J. & Sattarov, B. Improving chemical autoencoder latent space and molecular generation diversity with heteroencoders. Biomolecules 8, 131 (2018).
Gómez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 4, 268–276 (2018).
Atz, K., Grisoni, F. & Schneider, G. Geometric deep learning on molecular representations. Nat. Mach. Intell. 3, 1023–1032 (2021).
Callaway, E. After AlphaFold: protein-folding contest seeks next big breakthrough. Nature 613, 13–14 (2023).
Wallner, B. AFsample: improving multimer prediction with alphafold using aggressive sampling. Preprint at bioRxiv https://doi.org/10.1101/2022.12.20.521205 (2022).
Bender, A. & Cortés-Ciriano, I. Artificial intelligence in drug discovery: what is realistic, what are illusions? Part 1: ways to make an impact, and why we are not there yet. Drug Discov. Today 26, 511–524 (2021).
Bender, A. & Cortés-Ciriano, I. Artificial intelligence in drug discovery: what is realistic, what are illusions? Part 2: a discussion of chemical and biological data. Drug Discov. Today 26, 1040–1052 (2021).
Sydow, D., Rodríguez-Guerra, J. & Volkamer, A. in Teaching Programming across the Chemistry Curriculum 135–158 ACS Symposium Series vol. 1387 (American Chemical Society, 2021).
Korshunova, M., Ginsburg, B., Tropsha, A. & Isayev, O. OpenChem: a deep learning toolkit for computational chemistry and drug design. J. Chem. Inf. Model. 61, 7–13 (2021).
Sieg, J., Flachsenberg, F. & Rarey, M. In need of bias control: evaluating chemical data for machine learning in structure-based virtual screening. J. Chem. Inf. Model. 59, 947–961 (2019).
Lenselink, E. B. et al. Beyond the hype: deep neural networks outperform established methods using a ChEMBL bioactivity benchmark set. J. Cheminform. 9, 45 (2017).
Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 1, 206–215 (2019).
Topçuoğlu, B. D., Lesniak, N. A., Ruffin, M. T. 4th, Wiens, J. & Schloss, P. D. A framework for effective application of machine learning to microbiome-based classification problems. MBio 11, e00434-20 (2020).
Quinn, T. P. & Erb, I. Examining microbe–metabolite correlations by linear methods. Nat. Methods 18, 37–39 (2021).
Morger, A. et al. KnowTox: pipeline and case study for confident prediction of potential toxic effects of compounds in early phases of development. J. Cheminform. 12, 24 (2020).
Soleimany, A. P. et al. Evidential deep learning for guided molecular property prediction and discovery. ACS Cent. Sci. 7, 1356–1367 (2021).
Manica, M. et al. Toward explainable anticancer compound sensitivity prediction via multimodal attention-based convolutional encoders. Mol. Pharm. 16, 4797–4806 (2019).
Grinsztajn, L., Oyallon, E. & Varoquaux, G. in Advances in Neural Information Processing Systems 35 (NeurIPS 2022) 507–520 (2022).
Chithrananda, S., Grand, G. & Ramsundar, B. ChemBERTa: large-scale self-supervised pretraining for molecular property prediction. Preprint at https://doi.org/10.48550/arXiv.2010.09885 (2020).
Irwin, R., Dimitriadis, S., He, J. & Bjerrum, E. J. Chemformer: a pre-trained transformer for computational chemistry. Mach. Learn. Sci. Technol. 3, 015022 (2022).
Chapelle, O., Zien, A. & Schölkopf, B. (Eds) Semi-Supervised Learning (MIT, 2006).
Zhang, Y. & Lee, A. A. Bayesian semi-supervised learning for uncertainty-calibrated prediction of molecular properties and active learning. Chem. Sci. 10, 8154–8163 (2019).
Röttig, M. et al. NRPSpredictor2—a web server for predicting NRPS adenylation domain specificity. Nucleic Acids Res. 39, W362–W367 (2011).
Torrey, L. & Shavlik, J. in Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques 242–264 (IGI Global, 2010).
Cai, C. et al. Transfer learning for drug discovery. J. Med. Chem. 63, 8683–8694 (2020).
Moret, M., Helmstädter, M., Grisoni, F., Schneider, G. & Merk, D. Beam search for automated design and scoring of novel ROR ligands with machine intelligence. Angew. Chem. Int. Ed. Engl. 60, 19477–19482 (2021).
Moret, M., Friedrich, L., Grisoni, F., Merk, D. & Schneider, G. Generative molecular design in low data regimes. Nat. Mach. Intell. 2, 171–180 (2020).
Moret, M. et al. Leveraging molecular structure and bioactivity with chemical language models for de novo drug design. Nat. Commun. 14, 114 (2023).
Reker, D. Practical considerations for active machine learning in drug discovery. Drug Discov. Today Technol. 32–33, 73–79 (2019).
Reker, D., Schneider, P. & Schneider, G. Multi-objective active machine learning rapidly improves structure-activity models and reveals new protein-protein interaction inhibitors. Chem. Sci. 7, 3919–3927 (2016).
Djoumbou Feunang, Y. et al. ClassyFire: automated chemical classification with a comprehensive, computable taxonomy. J. Cheminform. 8, 61 (2016).
Reher, R. et al. Native metabolomics identifies the rivulariapeptolide family of protease inhibitors. Nat. Commun. 13, 4619 (2022).
Olivecrona, M., Blaschke, T., Engkvist, O. & Chen, H. Molecular de-novo design through deep reinforcement learning. J. Cheminform. 9, 48 (2017).
Popova, M., Isayev, O. & Tropsha, A. Deep reinforcement learning for de novo drug design. Sci. Adv. 4, eaap7885 (2018).
Liu, X., Ye, K., van Vlijmen, H. W. T., IJzerman, A. P. & van Westen, G. J. P. An exploration strategy improves the diversity of de novo ligands using deep reinforcement learning: a case for the adenosine A2A receptor. J. Cheminform. 11, 35 (2019).
Segler, M. H. S., Preuss, M. & Waller, M. P. Planning chemical syntheses with deep neural networks and symbolic AI. Nature 555, 604–610 (2018).
Coley, C. W. et al. A robotic platform for flow synthesis of organic compounds informed by AI planning. Science 365, eaax1566 (2019).
Thakkar, A., Kogej, T., Reymond, J.-L., Engkvist, O. & Bjerrum, E. J. Datasets and their influence on the development of computer assisted synthesis planning tools in the pharmaceutical domain. Chem. Sci. 11, 154–168 (2020).
Koch, M., Duigou, T. & Faulon, J.-L. Reinforcement learning for bioretrosynthesis. ACS Synth. Biol. 9, 157–168 (2020).
Kramer, C., Kalliokoski, T., Gedeck, P. & Vulpetti, A. The experimental uncertainty of heterogeneous public ki data. J. Med. Chem. 55, 5165–5173 (2012).
Tiikkainen, P., Bellis, L., Light, Y. & Franke, L. Estimating error rates in bioactivity databases. J. Chem. Inf. Model. 53, 2499–2505 (2013).
Sorokina, M. & Steinbeck, C. Review on natural products databases: where to find data in 2020. J. Cheminform. 12, 1–51 (2020).
Mendez, D. et al. ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Res. 47, D930–D940 (2019).
Liu, T., Lin, Y., Wen, X., Jorissen, R. N. & Gilson, M. K. BindingDB: a web-accessible database of experimentally determined protein-ligand binding affinities. Nucleic Acids Res. 35, D198–D201 (2007).
Wimalaratne, S. M. et al. Uniform resolution of compact identifiers for biomedical data. Sci. Data 5, 180029 (2018).
Rajan, K., Zielesny, A. & Steinbeck, C. DECIMER 1.0: deep learning for chemical image recognition using transformers. J. Cheminformatics 13, 61 (2021).
Rajan, K., Brinkhaus, H. O., Sorokina, M., Zielesny, A. & Steinbeck, C. DECIMER-segmentation: automated extraction of chemical structure depictions from scientific literature. J. Cheminform. 13, 20 (2021).
Schymanski, E. L. & Bolton, E. E. FAIR chemical structures in the Journal of Cheminformatics. J. Cheminform. 13, 50 (2021).
Kautsar, S. A. et al. MIBiG 2.0: a repository for biosynthetic gene clusters of known function. Nucleic Acids Res. 48, D454–D458 (2020).
van Santen, J. A. et al. The natural products atlas: an open access knowledge base for microbial natural products discovery. ACS Cent. Sci. 5, 1824–1833 (2019).
van Santen, J. A. et al. The natural products atlas 2.0: a database of microbially-derived natural products. Nucleic Acids Res. 50, D1317–D1323 (2021).
Wang, M. et al. Sharing and community curation of mass spectrometry data with global natural products social molecular networking. Nat. Biotechnol. 34, 828–837 (2016).
Wishart, D. S. et al. NP-MRD: the natural products magnetic resonance database. Nucleic Acids Res. 50, D665–D677 (2022).
Flissi, A. et al. Norine: update of the nonribosomal peptide resource. Nucleic Acids Res. 48, D465–D469 (2020).
Jarmusch, S. A., van der Hooft, J. J. J., Dorrestein, P. C. & Jarmusch, A. K. Advancements in capturing and mining mass spectrometry data are transforming natural products research. Nat. Prod. Rep. 38, 2066–2082 (2021).
Jarmusch, A. K. et al. ReDU: a framework to find and reanalyze public mass spectrometry data. Nat. Methods 17, 901–904 (2020).
Proteau, P. J. Journal of Natural Products 2022: perspectives, monthly cover art, and more. J. Nat. Products 85, 1–2 (2022).
Clark, T. N. et al. Interlaboratory comparison of untargeted mass spectrometry data uncovers underlying causes for variability. J. Nat. Prod. 84, 824–835 (2021).
Fiehn, O. et al. The metabolomics standards initiative (MSI). Metabolomics 3, 175–178 (2007).
Frank, A. M. et al. Clustering millions of tandem mass spectra. J. Proteome Res. 7, 113–122 (2008).
Miller, I. J. et al. Autometa: automated extraction of microbial genomes from individual shotgun metagenomes. Nucleic Acids Res. 47, e57 (2019).
Schymanski, E. L. et al. Identifying small molecules via high resolution mass spectrometry: communicating confidence. Environ. Sci. Technol. 48, 2097–2098 (2014).
Deutsch, E. W. et al. Universal spectrum identifier for mass spectra. Nat. Methods 18, 768–770 (2021).
Bittremieux, W. et al. Universal MS/MS visualization and retrieval with the metabolomics spectrum resolver web service. Preprint at BioRxiv https://doi.org/10.1101/2020.05.09.086066 (2020).
Gordon, J. E. Chemical inference. 2. formalization of the language of organic chemistry: generic systematic nomenclature. J. Chem. Inf. Comput. Sci. 24, 81–92 (1984).
Wang, Y. et al. PubChem’s bioassay database. Nucleic Acids Res. 40, D400–D412 (2012).
Banerjee, P. et al. Super Natural II—a database of natural products. Nucleic Acids Res. 43, D935–D939 (2015).
Zeng, X. et al. NPASS: natural product activity and species source database for natural product research, discovery and tool development. Nucleic Acids Res. 46, D1217–D1222 (2018).
van der Hooft, J. J. J. A community-driven paired data platform to accelerate natural product mining by combining structural information from genomes and metabolomes. Preprint at https://doi.org/10.18174/fairdata2018.16286 (2018).
Eldjárn, G. H. et al. Ranking microbial metabolomic and genomic links in the NPLinker framework using complementary scoring functions. PLoS Comput. Biol. 17, e1008920 (2021).
Schorn, M. A. et al. A community resource for paired genomic and metabolomic data mining. Nat. Chem. Biol. 17, 363–368 (2021).
Doroghazi, J. R. et al. A roadmap for natural product discovery based on large-scale genomics and metabolomics. Nat. Chem. Biol. 10, 963–968 (2014).
McClure, R. A. et al. Elucidating the rimosamide-detoxin natural product families and their biosynthesis using metabolite/gene cluster correlations. ACS Chem. Biol. 11, 3452–3460 (2016).
Goering, A. W. et al. Metabologenomics: correlation of microbial gene clusters with metabolites drives discovery of a nonribosomal peptide with an unusual amino acid monomer. ACS Cent. Sci. 2, 99–108 (2016).
Parkinson, E. I. et al. Discovery of the tyrobetaine natural products and their biosynthetic gene cluster via metabologenomics. ACS Chem. Biol. 13, 1029–1037 (2018).
Caesar, L. K. et al. Correlative metabologenomics of 110 fungi reveals metabolite-gene cluster pairs. Nat. Chem. Biol. 19, 846–854 (2023).
Soldatou, S. et al. Comparative metabologenomics analysis of polar actinomycetes. Mar. Drugs 19, 103 (2021).
Sulheim, S. et al. Enzyme-constrained models and omics analysis of streptomyces coelicolor reveal metabolic changes that enhance heterologous production. iScience 23, 101525 (2020).
Amos, G. C. A. et al. Comparative transcriptomics as a guide to natural product discovery and biosynthetic gene cluster functionality. Proc. Natl Acad. Sci. USA 114, E11121–E11130 (2017).
Wandy, J. & Daly, R. GraphOmics: an interactive platform to explore and integrate multi-omics data. BMC Bioinform. 22, 603 (2021).
Eren, A. M. et al. Community-led, integrated, reproducible multi-omics with anvi’o. Nat. Microbiol. 6, 3–6 (2020).
Sorokina, M., Merseburger, P., Rajan, K., Yirik, M. A. & Steinbeck, C. COCONUT online: collection of open natural products database. J. Cheminform. 13, 2 (2021).
Rutz, A. et al. The LOTUS initiative for open knowledge management in natural products research. eLife 11, e70780 (2022).
Chen, Y., Stork, C., Hirte, S. & Kirchmair, J. NP-scout: machine learning approach for the quantification and visualization of the natural product-likeness of small molecules. Biomolecules 9, 43 (2019).
Cao, L. et al. MolDiscovery: learning mass spectrometry fragmentation of small molecules. Nat. Commun. 12, 3718 (2021).
Visser, U. et al. BioAssay Ontology (BAO): a semantic description of bioassays and high-throughput screening results. BMC Bioinform. 12, 257 (2011).
Sarntivijai, S. et al. CLO: the cell line ontology. J. Biomed. Semant. 5, 37 (2014).
Shoemaker, R. H. The NCI60 human tumour cell line anticancer drug screen. Nat. Rev. Cancer 6, 813–823 (2006).
Cooper, M. A. A community-based approach to new antibiotic discovery. Nat. Rev. Drug. Discov. 14, 587–588 (2015).
Cech, N. B., Medema, M. H. & Clardy, J. Benefiting from big data in natural products: importance of preserving foundational skills and prioritizing data quality. Nat. Prod. Rep. 38, 1947–1953 (2021).
Blin, K., Shaw, S., Kautsar, S. A., Medema, M. H. & Weber, T. The antiSMASH database version 3: increased taxonomic coverage and new query features for modular enzymes. Nucleic Acids Res. 49, D639–D643 (2021).
Horai, H. et al. MassBank: a public repository for sharing mass spectral data for life sciences. J. Mass. Spectrom. 45, 703–714 (2010).
Haug, K. et al. MetaboLights: a resource evolving in response to the needs of its scientific community. Nucleic Acids Res. 48, D440–D444 (2020).
Kuhn, S. & Schlörer, N. E. Facilitating quality control for spectra assignments of small organic molecules: nmrshiftdb2–a free in-house NMR database with integrated LIMS for academic service laboratories. Magn. Reson. Chem. 53, 582–589 (2015).
Irwin, J. J. et al. ZINC20—a free ultralarge-scale chemical database for ligand discovery. J. Chem. Inf. Model. 60, 6065–6073 (2020).
Hastings, J. et al. ChEBI in 2016: improved services and an expanding collection of metabolites. Nucleic Acids Res. 44, D1214–D1219 (2016).
Martens, M. et al. WikiPathways: connecting communities. Nucleic Acids Res. 49, D613–D621 (2021).
Jassal, B. et al. The reactome pathway knowledgebase. Nucleic Acids Res. 48, D498–D503 (2020).
Blaskovich, M. A. T., Zuegg, J., Elliott, A. G. & Cooper, M. A. Helping chemists discover new antibiotics. ACS Infect. Dis. 1, 285–287 (2015).
Waagmeester, A. et al. Wikidata as a knowledge graph for the life sciences. eLife 9, e52614 (2020).
Reker, D., Rodrigues, T., Schneider, P. & Schneider, G. Target prediction by cascaded self-organizing maps for ligand de-orphaning and side-effect investigation. J. Cheminform. 6, P47 (2014).
Navarro-Muñoz, J. C. et al. A computational framework to explore large-scale biosynthetic diversity. Nat. Chem. Biol. 16, 60–68 (2020).
van der Hooft, J. J. J., Wandy, J., Barrett, M. P., Burgess, K. E. V. & Rogers, S. Topic modeling for untargeted substructure exploration in metabolomics. Proc. Natl Acad. Sci. USA 113, 13738–13743 (2016).
Reymond, J.-L. The chemical space project. Acc. Chem. Res. 48, 722–730 (2015).
Lipinski, C. A., Lombardo, F., Dominy, B. W. & Feeney, P. J. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv. Drug. Deliv. Rev. 46, 3–26 (2001).
Janssen, A. P. A. et al. Drug discovery maps, a machine learning model that visualizes and predicts kinome–inhibitor interaction landscapes. J. Chem. Inf. Model. 59, 1221–1229 (2019).
McInnes, L., Healy, J., Saul, N. & Großberger, L. UMAP: uniform manifold approximation and projection. J. Open. Source Softw. 3, 861 (2018).
Probst, D. & Reymond, J.-L. Visualization of very large high-dimensional data sets as minimum spanning trees. J. Cheminform. 12, 12 (2020).
Feher, M. & Schmidt, J. M. Property distributions: differences between drugs, natural products, and molecules from combinatorial chemistry. J. Chem. Inf. Comput. Sci. 43, 218–227 (2003).
Béquignon, O. J. M. et al. Papyrus: a large-scale curated dataset aimed at bioactivity predictions. J. Cheminform. 15, 3 (2023).
All authors thank the Lorentz Center and Leiden University for funding the Lorentz Workshop ‘Artificial Intelligence for Natural Product Drug Discovery’ that laid the foundation for this Review. M.W.M. was supported by funds from the Duchossois Family Institute at the University of Chicago. K.R.D. was supported by the UK Research and Innovation Biotechnology and Biological Sciences Research Council (BB/R022054/1). N.G. was supported by an NSF CAREER award (award number 2047235). J.J.J.v.d.H. was supported by an ASDI eScience grant from the Netherlands eScience Center (award number ASDI.2017.030). N.I.M. is supported by funding from the European Research Council (ERC consolidator grant agreement no. 725523). K.B. was supported by a Novo Nordisk Foundation grant NNF20CC0035580. M.G.G. was supported by ONCODE funding. E.J.N.H. was supported by the LOEWE Center for Translational Biodiversity Genomics and the Funds of the Chemical Industry Germany. M.S. was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), Project-ID 239748522, SFB 1127 ChemBioSys. M.A.B. was supported by the National French Agency (ANR grants 15-CE29-0001 and 20-CE43-0010). C.M.C. was supported by a National Library of Medicine training grant to the Computation and Informatics in Biology and Medicine Training Program (NLM 5T15LM007359). S.F. was supported by MASTS/IbioIC/Xanthella. O.V.K. was funded by the Klaus Faber Foundation. H.K. was supported by a National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (grants NRF 2018R1A5A2023127 and NRF 2022R1F1A107462311). J.M. was supported by a grant from the Research Foundation – Flanders (G061821N). E.R.R. was supported by the US National Science Foundation (DBI-1845890). D.R. was supported, in part, by a Flash Grant from NC Biotech (2021-FLG-3819), a UNC CGIBD Pilot Award (NIH NIDDK DK034987), a Duke Cancer Institute and Duke Microbiome Center Pilot Award (NIH NCI CA014236), the Engineering Research Center for Precision Microbiome Engineering (NSF EEC-2133504), and the Duke Science and Technology Initiative. P.S. acknowledges support from the NCCR Catalysis (grant number 180544), a National Centre of Competence in Research funded by the Swiss National Science Foundation. N.Z. was supported by Germany’s Excellence Strategy – EXC 2124-390838134. H.U.K. was supported by the KAIST Key Research Institute (Interdisciplinary Research Group) Project. R.G.L. was supported by the US NIH (U41-AT008718 and U24-AT010811). S.L.R. was supported by Eawag discretionary funding. M.H.M. was supported by the Leiden University ‘van der Klaauw’ chair for theoretical biology and an ERC Starting Grant (DECIPHER-948770). We thank A. R. Leach for discussion of the content of this manuscript. We thank all participants of the Lorentz Workshop ‘Artificial Intelligence for Natural Product Drug Discovery’ who did not participate in the review writing for providing inspiration through their talks and/or discussions.
J.J.J.v.d.H. is a member of the scientific advisory board of NAICONS Srl., Milan, Italy. C.A.D. is a founding member of Adapsyn Bioscience. M.A.S. is a consultant to Adapsyn Bioscience. M.H.M. is on the scientific advisory board of Hexagon Bio and co-founder of Design Pharmaceuticals. The other authors declare no competing interests.
Peer review information
Nature Reviews Drug Discovery thanks Hosein Mohimani and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
National Database of Antibiotic Resistant Organisms (NDARO): https://www.ncbi.nlm.nih.gov/pathogens/antimicrobial-resistance/
RDKit: Open-Source Cheminformatics Software: http://www.rdkit.org/
Wikidata Query Service: https://w.wiki/5bpq
About this article
Cite this article
Mullowney, M.W., Duncan, K.R., Elsayed, S.S. et al. Artificial intelligence for natural product drug discovery. Nat Rev Drug Discov 22, 895–916 (2023). https://doi.org/10.1038/s41573-023-00774-7