Generative AI for designing and validating easily synthesizable and structurally novel antibiotics

Swanson, Kyle; Liu, Gary; Catacutan, Denise B.; Arnold, Autumn; Zou, James; Stokes, Jonathan M.

doi:10.1038/s42256-024-00809-7

Article
Published: 22 March 2024

Nature Machine Intelligence volume 6, pages 338–353 (2024)

Abstract

The rise of pan-resistant bacteria is creating an urgent need for structurally novel antibiotics. Artificial intelligence methods can discover new antibiotics, but existing methods have notable limitations. Property prediction models, which evaluate molecules one-by-one for a given property, scale poorly to large chemical spaces. Generative models, which directly design molecules, rapidly explore vast chemical spaces but generate molecules that are challenging to synthesize. Here we introduce SyntheMol, a generative model that designs new compounds, which are easy to synthesize, from a chemical space of nearly 30 billion molecules. We apply SyntheMol to design molecules that inhibit the growth of Acinetobacter baumannii, a burdensome Gram-negative bacterial pathogen. We synthesize 58 generated molecules and experimentally validate them, with six structurally novel molecules demonstrating antibacterial activity against A. baumannii and several other phylogenetically diverse bacterial pathogens. This demonstrates the potential of generative artificial intelligence to design structurally novel, synthesizable and effective small-molecule antibiotic candidates from vast chemical spaces, with empirical validation.

**Fig. 1: Generative AI for antibiotic discovery.**

**Fig. 2: Property prediction model development.**

**Fig. 4: Generative model development.**

**Fig. 5: In vitro validation of generated molecules.**

Data availability

The data used in this paper, including training data and generated molecules, are available in the Supplementary Data. The data, along with trained model checkpoints and LC–MS and ¹H-NMR spectra, are available at https://doi.org/10.5281/zenodo.10257839 (ref. ⁶⁸). The ChEMBL database can be accessed from www.ebi.ac.uk/chembl.

Code availability

Code for data processing and analysis, property prediction model training and SyntheMol molecule generation is available at https://github.com/swansonk14/SyntheMol (ref. ⁶⁹). This code repository makes use of general cheminformatics functions from https://github.com/swansonk14/chemfunc as well as Chemprop model code from https://github.com/chemprop/chemprop.

References

Murray, C. J. et al. Global burden of bacterial antimicrobial resistance in 2019: a systematic analysis. Lancet 399, 629–655 (2022).
Rice, L. B. Federal funding for the study of antimicrobial resistance in nosocomial pathogens: No ESKAPE. J. Infect. Dis. 197, 1079–1081 (2008).
Ma, Y. et al. Considerations and caveats in combating ESKAPE pathogens against nosocomial infections. Adv. Sci. 7, 1901872 (2020).
Tacconelli, E. et al. Discovery, research, and development of new antibiotics: the WHO priority list of antibiotic-resistant bacteria and tuberculosis. Lancet Infect. Dis. 18, 318–327 (2018).
Lee, C. R. et al. Biology of Acinetobacter baumannii: pathogenesis, antibiotic resistance mechanisms, and prospective treatment options. Front. Cell. Infect. Microbiol. 7, 55 (2017).
Carracedo-Reboredo, P. et al. A review on machine learning approaches and trends in drug discovery. Comput. Struct. Biotechnol. J. 19, 4538–4558 (2021).
Gaudelet, T. et al. Utilizing graph machine learning within drug discovery and development. Brief. Bioinform. 22, bbab159 (2021).
Stokes, J. M. et al. A deep learning approach to antibiotic discovery. Cell 180, 688–702.e13 (2020).
Rahman, A. S. M. Z. et al. A machine learning model trained on a high-throughput antibacterial screen increases the hit rate of drug discovery. PLoS Comput. Biol. 18, e1010613 (2022).
Zeng, X. et al. Deep generative molecular design reshapes drug discovery. Cell Rep. Med. 3, 100794 (2022).
Bilodeau, C., Jin, W., Jaakkola, T., Barzilay, R. & Jensen, K. F. Generative models for molecular discovery: recent advances and challenges. WIREs Comput. Mol. Sci. 12, e1608 (2022).
Bian, Y. & Xie, X. Q. Generative chemistry: drug discovery with deep learning generative models. J. Mol. Model. 27, 71 (2021).
Liu, G. & Stokes, J. M. A brief guide to machine learning for antibiotic discovery. Curr. Opin. Microbiol. 69, 102190 (2022).
Gao, W. & Coley, C. W. The synthesizability of molecules proposed by generative models. J. Chem. Inf. Model. 60, 5714–5723 (2020).
Bradshaw, J., Paige, B., Kusner, M. J., Segler, M. H. S. & Hernández-Lobato, J. M. A model to search for synthesizable molecules. In Proc. 33rd International Conference on Neural Information Processing Systems (eds Wallach, H. M., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F. & Fox, E. B.) 7937–7949 (Curran Associates Inc., 2019).
Bradshaw, J., Paige, B., Kusner, M. J., Segler, M. H. S. & Hernández-Lobato, J. M. Barking up the right tree: an approach to search over molecule synthesis DAGs. In Proc. 34th International Conference on Neural Information Processing Systems (eds Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F. & Lin, H.) 6852–6866 (Curran Associates Inc., 2020).
Gottipati, S. K. et al. Learning to navigate the synthetically accessible chemical space using reinforcement learning. In Proc. 37th International Conference on Machine Learning (eds Daumé III, H. & Singh, A.) 3668–3679 (PMLR, 2020).
Gao, W., Mercado, R. & Coley, C. W. Amortized tree generation for bottom-up synthesis planning and synthesizable molecular design. In Proc. 10th International Conference on Learning Representations (2022); https://openreview.net/forum?id=FRxhHdnxt1
Pedawi, A., Gniewek, P., Chang, C., Anderson, B. M. & Bedem, H. van den. An efficient graph generative model for navigating ultra-large combinatorial synthesis libraries. In Proc. 36th International Conference on Neural Information Processing Systems (eds Oh, A. H., Agarwal, A., Belgrave, D. & Cho, K.) (2022); https://openreview.net/forum?id=VBbxHvbJd94
Kocsis, L. & Szepesvári, C. Bandit based Monte-Carlo planning. In Proc. European Conference on Machine Learning, ECML 2006 Vol. 4212 (eds Furnkranz, J. et al.) 282–293 (Springer, 2006).
Coulom, R. Efficient selectivity and backup operators in Monte-Carlo tree search. In Proc. International Conference on Computers and Games, CG 2006 Vol. 4630 (eds van den Herik, H. J. et al.) 72–83 (Springer, 2007).
Grygorenko, O. O. et al. Generating multibillion chemical space of readily accessible screening compounds. iScience 23, 101681 (2020).
Stokes, J. M., Davis, J. H., Mangat, C. S., Williamson, J. R. & Brown, E. D. Discovery of a small molecule that inhibits bacterial ribosome biogenesis. eLife 3, e03574 (2014).
van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
Mendez, D. et al. ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Res. 47, D930–D940 (2019).
Yang, K. et al. Analyzing learned molecular representations for property prediction. J. Chem. Inf. Model. 59, 3370–3388 (2019).
RDKit: open-source cheminformatics. RDKit https://www.rdkit.org/. Accessed 28 Mar 2022.
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).
Tversky, A. Features of similarity. Psychol. Rev. 84, 327–352 (1977).
Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).
Arthur, D. & Vassilvitskii, S. K-Means++: the advantages of careful seeding. In Proc. Eighteenth Annu. ACM-SIAM Symp. Discrete Algorithms 1027–1035 (SIAM, 2007).
Maggiora, G., Vogt, M., Stumpfe, D. & Bajorath, J. Molecular similarity in medicinal chemistry: miniperspective. J. Med. Chem. 57, 3186–3204 (2014).
Tanimoto, T. T. IBM Internal Report (IBM, 1957).
Nikaido, H. Molecular basis of bacterial outer membrane permeability revisited. Microbiol. Mol. Biol. Rev. 67, 593–656 (2003).
Zurawski, D. V. et al. SPR741, an antibiotic adjuvant, potentiates the in vitro and in vivo activity of rifampin against clinically relevant extensively drug-resistant Acinetobacter baumannii. Antimicrob. Agents Chemother. 61, e01239-17 (2017).
Eckburg, P. B. et al. Safety, tolerability, pharmacokinetics, and drug interaction potential of SPR741, an intravenous potentiator, after single and multiple ascending doses and when combined with β-lactam antibiotics in healthy subjects. Antimicrob. Agents Chemother. 63, e00892-19 (2019).
Moffatt, J. H. et al. Colistin resistance in Acinetobacter baumannii is mediated by complete loss of lipopolysaccharide production. Antimicrob. Agents Chemother. 54, 4971–4977 (2010).
O’Neill, A. J., Cove, J. H. & Chopra, I. Mutation frequencies for resistance to fusidic acid and rifampicin in Staphylococcus aureus. J. Antimicrob. Chemother. 47, 647–650 (2001).
Björkholm, B. et al. Mutation frequency and biological cost of antibiotic resistance in Helicobacter pylori. Proc. Natl Acad. Sci. USA 98, 14607–14612 (2001).
Nicholson, W. L. & Maughan, H. The spectrum of spontaneous rifampin resistance mutations in the rpoB Gene of Bacillus subtilis 168 spores differs from that of vegetative cells and resembles that of Mycobacterium tuberculosis. J. Bacteriol. 184, 4936–4940 (2002).
Wu, Z. et al. MoleculeNet: a benchmark for molecular machine learning. Chem. Sci. 9, 513–530 (2018).
Melo, M. C. R., Maasch, J. R. M. A. & de la Fuente-Nunez, C. Accelerating antibiotic discovery through artificial intelligence. Commun. Biol. 4, 1050 (2021).
Yan, J. et al. Recent progress in the discovery and design of antimicrobial peptides using traditional machine learning and deep learning. Antibiotics 11, 1451 (2022).
Mahlapuu, M., Håkansson, J., Ringstad, L. & Björn, C. Antimicrobial peptides: an emerging category of therapeutic agents. Front. Cell. Infect. Microbiol. 6, 194 (2016).
Mahlapuu, M., Björn, C. & Ekblom, J. Antimicrobial peptides as therapeutic agents: opportunities and challenges. Crit. Rev. Biotechnol. 40, 978–992 (2020).
Gómez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 4, 268–276 (2018).
Kang, S. & Cho, K. Conditional molecular design with deep generative models. J. Chem. Inf. Model. 59, 43–52 (2019).
Krenn, M., Häse, F., Nigam, A., Friederich, P. & Aspuru-Guzik, A. Self-referencing embedded strings (SELFIES): a 100% robust molecular string representation. Mach. Learn. Sci. Technol. 1, 045024 (2020).
Liu, Q., Allamanis, M., Brockschmidt, M. & Gaunt, A. L. Constrained graph variational autoencoders for molecule design. In Proc. 32nd International Conference on Neural Information Processing Systems (eds Wallach, H. M., Larochelle, H., Grauman, K. & Cesa-Bianchi, N.) 7806–7815 (Curran Associates Inc., 2018).
You, J., Liu, B., Ying, R., Pande, V. & Leskovec, J. Graph convolutional policy network for goal-directed molecular graph generation. In Proc. 32nd International Conference on Neural Information Processing Systems (eds Wallach, H. M., Larochelle, H., Grauman, K. & Cesa-Bianchi, N.) 6412–6422 (Curran Associates Inc., 2018).
Jin, W., Barzilay, R. & Jaakkola, T. Junction tree variational autoencoder for molecular graph generation. ICML 80, 2323–2332 (2018).
Jin, W., Barzilay, R. & Jaakkola, T. Hierarchical generation of molecular graphs using structural motifs. ICML 119, 4839–4848 (2020).
Bilodeau, C. et al. Generating molecules with optimized aqueous solubility using iterative graph translation. React. Chem. Eng. 7, 297–309 (2022).
Sadybekov, A. A. et al. Synthon-based ligand discovery in virtual libraries of over 11 billion compounds. Nature 601, 452–459 (2022).
Yang, X., Zhang, J., Yoshizoe, K., Terayama, K. & Tsuda, K. ChemTS: an efficient python library for de novo molecular generation. Sci. Technol. Adv. Mater. 18, 972–976 (2017).
Qian, H., Lin, C., Zhao, D., Tu, S. & Xu, L. AlphaDrug: protein target specific de novo molecular generation. PNAS Nexus 1, pgac227 (2022).
Jin, W., Barzilay, R. & Jaakkola, T. Multi-objective molecule generation using interpretable substructures. ICML 119, 4849–4859 (2020).
Segler, M. H. S., Preuss, M. & Waller, M. P. Planning chemical syntheses with deep neural networks and symbolic AI. Nature 555, 604–610 (2018).
Coley, C. W. et al. A robotic platform for flow synthesis of organic compounds informed by AI planning. Science 365, eaax1566 (2019).
Walters, W. P. & Murcko, M. Assessing the impact of generative AI on medicinal chemistry. Nat. Biotechnol. 38, 143–145 (2020).
Corsello, S. M. et al. The Drug Repurposing Hub: a next-generation drug library and information resource. Nat. Med. 23, 405–408 (2017).
Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Model. 28, 31–36 (1988).
Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Proc. 33rd International Conference on Neural Information Processing Systems (eds Wallach, H. M., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F. & Fox, E. B.) 8026–8037 (2019).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Daylight Theory. SMARTS - a language for describing molecular patterns. Daylight Chemical Information Systems Inc. www.daylight.com/dayhtml/doc/theory/theory.smarts.html (2022).
Wildman, S. A. & Crippen, G. M. Prediction of physicochemical parameters by atomic contributions. J. Chem. Inf. Comput. Sci. 39, 868–873 (1999).
Swanson, K. et al. Generative AI for designing and validating easily synthesizable and structurally novel antibiotics: data and models. Zenodo https://doi.org/10.5281/zenodo.10257839 (2023).
Swanson, K. & Liu, G. swansonk/SyntheMol: SyntheMol. Zenodo https://doi.org/10.5281/zenodo.10278151 (2023).
Liu, G. et al. Deep learning-guided discovery of an antibiotic targeting Acinetobacter baumannii. Nat. Chem. Biol. 19, 1342–1350 (2023).

Acknowledgements

This research was kindly supported by the Weston Family Foundation (POP and Catalyst to J.M.S.); the David Braley Centre for Antibiotic Discovery (J.M.S.); the Canadian Institutes of Health Research (J.M.S.); a generous gift from M. and M. Heersink (J.M.S.) and the Chan-Zuckerberg Biohub (J.Z.). We thank Y. Moroz for his help accessing and answering our questions about the Enamine REAL Space building blocks, reactions and molecules. We thank G. Dubinina for help obtaining generated compounds. We thank M. Karelina, J. Miguel Hernández-Lobato, A. Tripp and M. Segler for insightful discussions about our property prediction and generative methods. We thank J. Boyce and A. Wright for providing mutant strains of A. baumannii used in this study. K.S. acknowledges support from the Knight-Hennessy scholarship.

Author information

These authors contributed equally: Kyle Swanson, Gary Liu.

Authors and Affiliations

Department of Computer Science, Stanford University, Stanford, CA, USA
Kyle Swanson & James Zou
Department of Biochemistry and Biomedical Sciences, Michael G. DeGroote Institute for Infectious Disease Research, David Braley Centre for Antibiotic Discovery, McMaster University, Hamilton, Ontario, Canada
Gary Liu, Denise B. Catacutan, Autumn Arnold & Jonathan M. Stokes
Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
James Zou

Contributions

Conceptualization was carried out by K.S., G.L., J.Z. and J.M.S. Model development was performed by K.S. and G.L. Biological validation was carried out by D.B.C., A.A. and J.M.S. K.S., G.L., D.B.C., J.Z. and J.M.S. wrote the paper. J.Z. and J.M.S. supervised the work.

Corresponding authors

Correspondence to James Zou or Jonathan M. Stokes.

Ethics declarations

Competing interests

These authors declare the following competing interests: K.S. is employed part-time by Greenstone Biosciences; J.Z. is on the scientific advisory board of Greenstone Biosciences and J.M.S. is cofounder and scientific director of Phare Bio. The other authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Feixiong Cheng and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Additional Property Prediction Model Development.

(a) Normalized growth of A. baumannii ATCC 17978 in biological duplicate for each of the three training set chemical libraries. Note: compounds with normalized growth > 3 are removed for visual purposes. The R² values are the coefficient of determination. (b) Receiver operating characteristic (ROC) curves and (c) precision-recall (PRC) curves for the Chemprop, Chemprop-RDKit, and random forest models. For each model, the black lines show the performance of each of the ten models in the ensemble and the blue curve shows the average. (d, e) Model performance of each of our three property prediction models when generalizing between our three training set libraries. Values on the diagonal are the average test set performance of a model on a single library across tenfold cross-validation. Values on the off-diagonals are the result of applying an ensemble of ten models trained on one library to a different library and evaluating those predictions. (d) Performance measured by area under the receiver operating characteristic curve (ROC-AUC). (e) Performance measured by area under the precision-recall curve (PRC-AUC).
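
The two metrics in panels (b) to (e) can be illustrated with a minimal, dependency-free sketch. It is not the authors' evaluation code: the function names are hypothetical, ROC-AUC is computed through its rank interpretation (the probability that a randomly chosen active outscores a randomly chosen inactive), and PRC-AUC is approximated by average precision, which is one common estimator and may differ from the one used in the paper. The last helper shows the simple averaging we assume when an ensemble of ten models is applied to a new library.

```python
from statistics import mean


def roc_auc(labels, scores):
    """ROC-AUC as P(score of a random active > score of a random inactive).

    Ties between an active and an inactive count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision: mean of the precision values at each recovered active.

    Tied scores are ordered arbitrarily (stable sort), which is an assumption.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


def ensemble_average(per_model_scores):
    """Average the predictions of several models, molecule by molecule."""
    return [mean(column) for column in zip(*per_model_scores)]


# Toy example: four molecules scored by two hypothetical models.
labels = [1, 0, 1, 0]
scores = ensemble_average([[0.9, 0.2, 0.7, 0.1], [0.8, 0.4, 0.6, 0.3]])
print(roc_auc(labels, scores), average_precision(labels, scores))
```

Because actives are rare in antibacterial screens, PRC-AUC is usually the more informative of the two: its baseline equals the fraction of actives, whereas the ROC-AUC baseline is 0.5 regardless of class balance.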

Extended Data Fig. 2 REAL Space Analysis, Comparisons to Training Set, and Reactions.

(a) The cumulative percent of molecules in REAL Space that can be produced by each of the 169 REAL chemical reactions. (b) The percent of molecules in REAL Space that include each of the REAL building blocks. (c) The molecular weight distribution of a random sample of 25,000 REAL molecules (blue), the REAL building blocks (black), and our training set molecules (red). (d) The cLogP distribution of a random sample of 25,000 REAL molecules (blue), the REAL building blocks (black), and our training set molecules (red). (e) The remaining 5 REAL chemical reactions we used (first 8 in Fig. 4b), along with the number and percent of REAL molecules produced by each reaction.

Extended Data Fig. 3 REAL Building Block and Full Molecule Scores from Chemprop-RDKit and Random Forest.

(a, b) The distribution of antibacterial model scores on the REAL building blocks using the (a) Chemprop-RDKit or (b) random forest models. (c, d) The correlation between the antibacterial model score of a REAL molecule and the average antibacterial model score of its constituent building blocks using the (c) Chemprop-RDKit or (d) random forest models. The R² values are the coefficient of determination.
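
As a rough illustration of panels (c) and (d), the sketch below computes an R² value between the score of each full molecule and the mean score of its building blocks. It assumes R² is the squared Pearson correlation, which equals the coefficient of determination of a least-squares line; the caption does not say which definition was used, and all names and numbers here are hypothetical.

```python
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation between two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


def mean_building_block_score(block_ids, block_scores):
    """Average the model scores of the building blocks that make up a molecule."""
    return mean(block_scores[b] for b in block_ids)


# Toy data: hypothetical building block scores and full-molecule scores.
block_scores = {"bb1": 0.10, "bb2": 0.30, "bb3": 0.60}
molecules = [("bb1", "bb2"), ("bb2", "bb3"), ("bb1", "bb3")]
molecule_scores = [0.15, 0.50, 0.30]

averages = [mean_building_block_score(m, block_scores) for m in molecules]
print(r_squared(averages, molecule_scores))
```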

Extended Data Fig. 4 Comparison of Generated Sets with and without Building Block Diversity.

(a–c) The frequency with which building blocks were used in the generated molecules of SyntheMol, with and without the building block diversity score penalty for (a) Chemprop, (b) Chemprop-RDKit, and (c) random forest.

Extended Data Fig. 5 Model Scores by Rollout from Chemprop-RDKit and Random Forest.

(a, b) Violin plots of the distribution of antibacterial model scores for every 2,000 rollouts of the MCTS algorithm over 20,000 rollouts. SyntheMol uses the (a) Chemprop-RDKit or (b) random forest models for antibacterial prediction scores. The lines in each violin indicate the first quartile, the median, and the third quartile.
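
The summary statistics drawn inside each violin can be reproduced in outline as follows. This is a sketch under stated assumptions: each record is a hypothetical (rollout number, model score) pair, rollouts are numbered from 1, and quartiles use the inclusive method of Python's `statistics.quantiles`, which may not match the plotting library used for the figure.

```python
from statistics import quantiles


def quartiles_by_rollout_bin(records, bin_size=2000):
    """Group (rollout_number, score) pairs into consecutive bins of rollouts
    and return the first quartile, median and third quartile for each bin.

    Each bin needs at least two scores for quantiles() to be defined.
    """
    bins = {}
    for rollout, score in records:
        bins.setdefault((rollout - 1) // bin_size, []).append(score)

    summary = {}
    for index in sorted(bins):
        q1, median, q3 = quantiles(bins[index], n=4, method="inclusive")
        first, last = index * bin_size + 1, (index + 1) * bin_size
        summary[(first, last)] = (q1, median, q3)
    return summary


# Synthetic scores that drift upwards with the rollout number.
records = [(r, (r % 100) / 200 + r / 40000) for r in range(1, 20001)]
for (first, last), stats in quartiles_by_rollout_bin(records).items():
    print(first, last, stats)
```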

Extended Data Fig. 6 Additional Analysis of Chemprop Generated and Selected Sets.

(a) The percent of building blocks that appear at different frequencies among the generated or selected compounds by SyntheMol with Chemprop. Building blocks are assigned to bins on the x-axis based on the number of generated or selected compounds that contain that building block, with the final bin including building blocks that appear in at least six compounds (max 137). (b) The distribution of chemical reactions used by the generated or selected compounds by SyntheMol with Chemprop. (c) A t-SNE visualization of the training set along with all generated and selected molecules from each of the three property predictor models.
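
The binning described for panel (a) can be sketched as below. The function name and the toy data are hypothetical, and two details are assumptions rather than facts from the caption: a building block is counted once per compound even if it appears twice in it, and percentages are taken over the building blocks that are used at least once.

```python
from collections import Counter


def building_block_frequency_bins(molecules, last_bin=6):
    """Return the percent of building blocks that appear in 1, 2, ... compounds.

    `molecules` is a list of tuples of building block IDs. Blocks found in
    `last_bin` or more compounds share the final bin.
    """
    usage = Counter()
    for blocks in molecules:
        usage.update(set(blocks))  # count each compound once per block
    if not usage:
        return {}

    bin_counts = Counter(min(count, last_bin) for count in usage.values())
    total_blocks = len(usage)
    percents = {}
    for k in range(1, last_bin + 1):
        label = str(k) if k < last_bin else f">={last_bin}"
        percents[label] = 100.0 * bin_counts[k] / total_blocks
    return percents


# Toy set: block "A" is in seven compounds, "D" in two, "B" and "C" in one each.
compounds = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "D"), ("A",), ("A",), ("A",)]
print(building_block_frequency_bins(compounds))
```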

Extended Data Fig. 7 Analysis of Chemprop-RDKit Generated and Selected Sets.

(a) The percent of building blocks that appear at different frequencies among the generated or selected compounds by SyntheMol with Chemprop-RDKit. Building blocks are assigned to bins on the x-axis based on the number of generated or selected compounds that contain that building block, with the final bin including building blocks that appear in at least six compounds (max 185). (b) The distribution of chemical reactions used by the generated or selected compounds by SyntheMol with Chemprop-RDKit. (c–f) A comparison of the properties of the 25,828 molecules generated by SyntheMol with the Chemprop-RDKit antibacterial model and the 50 molecules selected from that set after applying post-hoc filters. (c) The distribution of nearest neighbor Tversky similarities between the generated or selected compounds and the active molecules in the training set. (d) The distribution of nearest neighbor Tversky similarities between the generated or selected compounds and the known antibacterial compounds from ChEMBL. (e) The distribution of Chemprop-RDKit antibacterial model scores on the generated or selected compounds, as well as on a random set of 25,000 REAL molecules. (f) The distribution of nearest neighbor Tanimoto similarities among the generated or selected compounds.
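
For readers unfamiliar with the two similarity measures in panels (c), (d) and (f), a minimal sketch follows. It treats a fingerprint as a set of on-bit indices, which is an assumption made for brevity; real fingerprints would come from a cheminformatics toolkit. The Tversky weights shown as defaults are illustrative only and are not taken from the paper. With both weights set to 1, the Tversky index reduces to the Tanimoto coefficient.

```python
def tanimoto(a, b):
    """Tanimoto coefficient: shared bits divided by bits present in either set."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def tversky(query, reference, alpha=1.0, beta=0.0):
    """Tversky index with weights on bits unique to the query (alpha)
    and bits unique to the reference (beta)."""
    q, r = set(query), set(reference)
    common = len(q & r)
    denom = common + alpha * len(q - r) + beta * len(r - q)
    return common / denom if denom else 0.0


def nearest_neighbor_similarity(query, references, similarity):
    """Highest similarity between one molecule and any molecule in a reference set."""
    return max(similarity(query, ref) for ref in references)


# Toy fingerprints expressed as sets of on-bit indices.
generated = {1, 2, 3}
training_actives = [{2, 3, 4}, {7, 8, 9}, {1, 2, 3, 4, 5}]
print(nearest_neighbor_similarity(generated, training_actives, tanimoto))
print(nearest_neighbor_similarity(generated, training_actives, tversky))
```

The asymmetry is the reason for having both measures: Tanimoto penalizes any difference between two molecules, whereas an asymmetric Tversky index can score a molecule as highly similar when one fingerprint is largely contained in the other.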

Extended Data Fig. 8 Analysis of Random Forest Generated and Selected Sets.

(a) The percent of building blocks that appear at different frequencies among the generated or selected compounds by SyntheMol with random forest. Building blocks are assigned to bins on the x-axis based on the number of generated or selected compounds that contain that building block, with the final bin including building blocks that appear in at least six compounds (max 212). (b) The distribution of chemical reactions used by the generated or selected compounds by SyntheMol with random forest. (c–f) A comparison of the properties of the 27,396 molecules generated by SyntheMol with the random forest antibacterial model and the 50 molecules selected from that set after applying post-hoc filters. (c) The distribution of nearest neighbor Tversky similarities between the generated or selected compounds and the active molecules in the training set. (d) The distribution of nearest neighbor Tversky similarities between the generated or selected compounds and the known antibacterial compounds from ChEMBL. (e) The distribution of random forest antibacterial model scores on the generated or selected compounds as well as on a random set of 25,000 REAL molecules. (f) The distribution of nearest neighbor Tanimoto similarities among the generated or selected compounds.

Extended Data Fig. 9 Additional In Vitro Validation.

(a) Gram-negative bacterial isolates tested for growth inhibition against SPR 741 or colistin. Experiments were performed in biological duplicate. Error bars represent absolute range of optical density measurements at 600 nm. (b) Heat map summarizing MICs of 58 randomly selected compounds from the REAL Space against A. baumannii ATCC 17978 in I) LB medium, II) LB medium + a quarter MIC SPR 741, and III) LB medium + a quarter MIC colistin. Compounds were tested at concentrations from 256 µg/mL to 4 µg/mL in two-fold serial dilutions. Lighter colours indicate lower MIC values for each random REAL molecule. No compounds displayed potent antibacterial activity using the threshold of MIC ≤ 8 µg/mL. Experiments were performed in at least biological duplicate. (c, d) Chequerboard analysis to quantify synergy, as defined by FICI, with SPR 741 or colistin against Gram-negative isolates. Chequerboard experiments were performed using two-fold serial dilution series with the maximum and minimum concentrations of the potentiator (x-axis) and compound (y-axis) shown in µg/mL. Darker blue represents higher bacterial growth. Experiments were performed in biological duplicate. The mean growth of each well is shown. (c) Chequerboard assays using the six bioactive compounds, in combination with colistin, against A. baumannii ATCC 17978. (d) Chequerboard assays using rifampicin – a control antibiotic – in combination with SPR 741 or colistin against a panel of Gram-negative bacterial species.
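
Two quantities in this caption lend themselves to a short worked example: the two-fold dilution series and the fractional inhibitory concentration index (FICI). The sketch below uses the standard FICI definition, the sum of each agent's MIC in combination divided by its MIC alone. The synergy cutoff of 0.5 is a widely used convention and an assumption here, since the caption does not state the threshold the authors applied; the MIC values are invented for illustration.

```python
def twofold_dilutions(highest=256.0, lowest=4.0):
    """Concentrations (µg/mL) from `highest` down to `lowest`, halving each step."""
    series, concentration = [], highest
    while concentration >= lowest:
        series.append(concentration)
        concentration /= 2
    return series


def fici(mic_a_alone, mic_b_alone, mic_a_combo, mic_b_combo):
    """Fractional inhibitory concentration index for a two-agent combination."""
    return mic_a_combo / mic_a_alone + mic_b_combo / mic_b_alone


def is_synergistic(fici_value, cutoff=0.5):
    """Conventional interpretation: FICI at or below the cutoff indicates synergy."""
    return fici_value <= cutoff


print(twofold_dilutions())  # seven concentrations from 256 to 4 µg/mL

# Hypothetical example: a compound with MIC 64 µg/mL alone and 8 µg/mL when
# combined with a potentiator whose MIC drops from 2 to 0.25 µg/mL.
value = fici(mic_a_alone=64, mic_b_alone=2, mic_a_combo=8, mic_b_combo=0.25)
print(value, is_synergistic(value))
```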

Extended Data Fig. 10 Toxicity Predictions.

Predictions of the probability of clinical toxicity using an ensemble of ten Chemprop-RDKit models trained on the ClinTox dataset. ‘Non-toxic compounds’ show the toxicity predictions of the model on the non-toxic molecules in the dataset (n = 1,372), where each molecule’s prediction score comes from the one model in the ensemble for which that molecule was in the test set. ‘Toxic compounds’ shows the same toxicity predictions for the toxic molecules in the dataset (n = 112). ‘Selected six’ shows the average prediction of the ensemble of ten models on the six potent generated molecules. Blue horizontal lines represent the mean predictions for each set.
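
The caption distinguishes two ways of obtaining a prediction: a held-out score for molecules in the dataset and an ensemble average for new molecules. A minimal sketch of that distinction follows, with hypothetical names throughout. The round-robin fold assignment and the stand-in models are assumptions for illustration; the paper's actual data splits and Chemprop-RDKit models are not reproduced here.

```python
from statistics import mean


def assign_folds(n_molecules, n_folds=10):
    """Assign each molecule to one test fold (round-robin, for illustration)."""
    return [i % n_folds for i in range(n_molecules)]


def held_out_predictions(molecules, folds, models):
    """Score each dataset molecule only with the model that had it in its test set."""
    return [models[fold](molecule) for molecule, fold in zip(molecules, folds)]


def ensemble_prediction(molecule, models):
    """Score a new molecule with the average prediction of all models."""
    return mean(model(molecule) for model in models)


# Stand-in "models": model k always predicts k / 10. Real models would be
# trained on the folds that exclude fold k.
models = [lambda molecule, k=k: k / 10 for k in range(10)]
dataset = [f"mol{i}" for i in range(12)]

print(held_out_predictions(dataset, assign_folds(len(dataset)), models))
print(ensemble_prediction("new_molecule", models))
```

Using held-out scores for dataset molecules avoids evaluating a model on compounds it was trained on, while averaging is appropriate for generated molecules that none of the models has seen.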

Supplementary information

Supplementary Information

Supplementary Tables 1–6 and Extended Discussion.

Reporting Summary

Supplementary Data 1–9

Supplementary Data 1: Training sets for antibacterial activity and clinical toxicity. Supplementary Data 2: Antibiotic and antibacterial molecules from ChEMBL. Supplementary Data 3: Antibacterial Chemprop/Chemprop-RDKit/random forest models performance summary. Supplementary Data 4: Analysis of the Enamine REAL Space. Supplementary Data 5: cLogP Chemprop model performance summary trained for 1 and 30 epochs. Supplementary Data 6: Molecules generated by SyntheMol, using the antibacterial Chemprop model as a scoring function. Supplementary Data 7: Molecules generated by SyntheMol, using the antibacterial Chemprop-RDKit model as a scoring function. Supplementary Data 8: Molecules generated by SyntheMol, using the antibacterial random forest model as a scoring function. Supplementary Data 9: Molecules selected for synthesis by Enamine.

Supplementary Data

Quality control LC–MS data for SyntheMol-generated compounds.

Supplementary Data

Quality control LC–MS data for randomly selected control compounds.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Swanson, K., Liu, G., Catacutan, D.B. et al. Generative AI for designing and validating easily synthesizable and structurally novel antibiotics. Nat Mach Intell 6, 338–353 (2024). https://doi.org/10.1038/s42256-024-00809-7

Received: 10 March 2023
Accepted: 08 February 2024
Published: 22 March 2024
Issue Date: March 2024
DOI: https://doi.org/10.1038/s42256-024-00809-7