Generative AI for designing and validating easily synthesizable and structurally novel antibiotics

Abstract

The rise of pan-resistant bacteria is creating an urgent need for structurally novel antibiotics. Artificial intelligence methods can discover new antibiotics, but existing methods have notable limitations. Property prediction models, which evaluate molecules one-by-one for a given property, scale poorly to large chemical spaces. Generative models, which directly design molecules, rapidly explore vast chemical spaces but generate molecules that are challenging to synthesize. Here we introduce SyntheMol, a generative model that designs new compounds, which are easy to synthesize, from a chemical space of nearly 30 billion molecules. We apply SyntheMol to design molecules that inhibit the growth of Acinetobacter baumannii, a burdensome Gram-negative bacterial pathogen. We synthesize 58 generated molecules and experimentally validate them, with six structurally novel molecules demonstrating antibacterial activity against A. baumannii and several other phylogenetically diverse bacterial pathogens. This demonstrates the potential of generative artificial intelligence to design structurally novel, synthesizable and effective small-molecule antibiotic candidates from vast chemical spaces, with empirical validation.

Fig. 1: Generative AI for antibiotic discovery.
Fig. 2: Property prediction model development.
Fig. 3: SyntheMol.
Fig. 4: Generative model development.
Fig. 5: In vitro validation of generated molecules.

Data availability

The data used in this paper, including training data and generated molecules, are available in the Supplementary Data. The data, along with trained model checkpoints and LC–MS and ¹H-NMR spectra, are available at https://doi.org/10.5281/zenodo.10257839 (ref. 68). The ChEMBL database can be accessed from www.ebi.ac.uk/chembl.

Code availability

Code for data processing and analysis, property prediction model training and SyntheMol molecule generation is available at https://github.com/swansonk14/SyntheMol (ref. 69). This code repository makes use of general cheminformatics functions from https://github.com/swansonk14/chemfunc as well as Chemprop model code from https://github.com/chemprop/chemprop.

References

  1. Murray, C. J. et al. Global burden of bacterial antimicrobial resistance in 2019: a systematic analysis. Lancet 399, 629–655 (2022).

  2. Rice, L. B. Federal funding for the study of antimicrobial resistance in nosocomial pathogens: No ESKAPE. J. Infect. Dis. 197, 1079–1081 (2008).

  3. Ma, Y. et al. Considerations and caveats in combating ESKAPE pathogens against nosocomial infections. Adv. Sci. 7, 1901872 (2020).

  4. Tacconelli, E. et al. Discovery, research, and development of new antibiotics: the WHO priority list of antibiotic-resistant bacteria and tuberculosis. Lancet Infect. Dis. 18, 318–327 (2018).

  5. Lee, C. R. et al. Biology of Acinetobacter baumannii: pathogenesis, antibiotic resistance mechanisms, and prospective treatment options. Front. Cell. Infect. Microbiol. 7, 55 (2017).

  6. Carracedo-Reboredo, P. et al. A review on machine learning approaches and trends in drug discovery. Comput. Struct. Biotechnol. J. 19, 4538–4558 (2021).

  7. Gaudelet, T. et al. Utilizing graph machine learning within drug discovery and development. Brief. Bioinform. 22, bbab159 (2021).

  8. Stokes, J. M. et al. A deep learning approach to antibiotic discovery. Cell 180, 688–702.e13 (2020).

  9. Rahman, A. S. M. Z. et al. A machine learning model trained on a high-throughput antibacterial screen increases the hit rate of drug discovery. PLoS Comput. Biol. 18, e1010613 (2022).

  10. Zeng, X. et al. Deep generative molecular design reshapes drug discovery. Cell Rep. Med. 3, 100794 (2022).

  11. Bilodeau, C., Jin, W., Jaakkola, T., Barzilay, R. & Jensen, K. F. Generative models for molecular discovery: recent advances and challenges. WIREs Comput. Mol. Sci. 12, e1608 (2022).

  12. Bian, Y. & Xie, X. Q. Generative chemistry: drug discovery with deep learning generative models. J. Mol. Model. 27, 71 (2021).

  13. Liu, G. & Stokes, J. M. A brief guide to machine learning for antibiotic discovery. Curr. Opin. Microbiol. 69, 102190 (2022).

  14. Gao, W. & Coley, C. W. The synthesizability of molecules proposed by generative models. J. Chem. Inf. Model. 60, 5714–5723 (2020).

  15. Bradshaw, J., Paige, B., Kusner, M. J., Segler, M. H. S. & Hernández-Lobato, J. M. A model to search for synthesizable molecules. In Proc. 33rd International Conference on Neural Information Processing Systems (eds Wallach, H. M., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F. & Fox, E. B.) 7937–7949 (Curran Associates Inc., 2019).

  16. Bradshaw, J., Paige, B., Kusner, M. J., Segler, M. H. S. & Hernández-Lobato, J. M. Barking up the right tree: an approach to search over molecule synthesis DAGs. In Proc. 34th International Conference on Neural Information Processing Systems (eds Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F. & Lin, H.) 6852–6866 (Curran Associates Inc., 2020).

  17. Gottipati, S. K. et al. Learning to navigate the synthetically accessible chemical space using reinforcement learning. In Proc. 37th International Conference on Machine Learning (eds Daumé III, H. & Singh, A.) 3668–3679 (PMLR, 2020).

  18. Gao, W., Mercado, R. & Coley, C. W. Amortized tree generation for bottom-up synthesis planning and synthesizable molecular design. In Proc. 10th International Conference on Learning Representations (2022); https://openreview.net/forum?id=FRxhHdnxt1

  19. Pedawi, A., Gniewek, P., Chang, C., Anderson, B. M. & Bedem, H. van den. An efficient graph generative model for navigating ultra-large combinatorial synthesis libraries. In Proc. 36th International Conference on Neural Information Processing Systems (eds Oh, A. H., Agarwal, A., Belgrave, D. & Cho, K.) (2022); https://openreview.net/forum?id=VBbxHvbJd94

  20. Kocsis, L. & Szepesvári, C. Bandit based Monte-Carlo planning. In Proc. European Conference on Machine Learning, ECML 2006 Vol. 4212 (eds Furnkranz, J. et al.) 282–293 (Springer, 2006).

  21. Coulom, R. Efficient selectivity and backup operators in Monte-Carlo tree search. In Proc. International Conference on Computers and Games, CG 2006 Vol. 4630 (eds van den Herik, H. J. et al.) 72–83 (Springer, 2007).

  22. Grygorenko, O. O. et al. Generating multibillion chemical space of readily accessible screening compounds. iScience 23, 101681 (2020).

  23. Stokes, J. M., Davis, J. H., Mangat, C. S., Williamson, J. R. & Brown, E. D. Discovery of a small molecule that inhibits bacterial ribosome biogenesis. eLife 3, e03574 (2014).

  24. van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).

  25. Mendez, D. et al. ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Res. 47, D930–D940 (2019).

  26. Yang, K. et al. Analyzing learned molecular representations for property prediction. J. Chem. Inf. Model. 59, 3370–3388 (2019).

  27. RDKit: open-source cheminformatics. RDKit https://www.rdkit.org/. Accessed 28 Mar 2022.

  28. Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).

  29. Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).

  30. Tversky, A. Features of similarity. Psychol. Rev. 84, 327–352 (1977).

  31. Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).

  32. Arthur, D. & Vassilvitskii, S. K-Means++: the advantages of careful seeding. In Proc. Eighteenth Annu. ACM-SIAM Symp. Discrete Algorithms 1027–1035 (SIAM, 2007).

  33. Maggiora, G., Vogt, M., Stumpfe, D. & Bajorath, J. Molecular similarity in medicinal chemistry: miniperspective. J. Med. Chem. 57, 3186–3204 (2014).

  34. Tanimoto, T. T. IBM Internal Report (IBM, 1957).

  35. Nikaido, H. Molecular basis of bacterial outer membrane permeability revisited. Microbiol. Mol. Biol. Rev. 67, 593–656 (2003).

  36. Zurawski, D. V. et al. SPR741, an antibiotic adjuvant, potentiates the in vitro and in vivo activity of rifampin against clinically relevant extensively drug-resistant Acinetobacter baumannii. Antimicrob. Agents Chemother. 61, e01239-17 (2017).

  37. Eckburg, P. B. et al. Safety, tolerability, pharmacokinetics, and drug interaction potential of SPR741, an intravenous potentiator, after single and multiple ascending doses and when combined with β-lactam antibiotics in healthy subjects. Antimicrob. Agents Chemother. 63, e00892-19 (2019).

  38. Moffatt, J. H. et al. Colistin resistance in Acinetobacter baumannii is mediated by complete loss of lipopolysaccharide production. Antimicrob. Agents Chemother. 54, 4971–4977 (2010).

  39. O’Neill, A. J., Cove, J. H. & Chopra, I. Mutation frequencies for resistance to fusidic acid and rifampicin in Staphylococcus aureus. J. Antimicrob. Chemother. 47, 647–650 (2001).

  40. Björkholm, B. et al. Mutation frequency and biological cost of antibiotic resistance in Helicobacter pylori. Proc. Natl Acad. Sci. USA 98, 14607–14612 (2001).

  41. Nicholson, W. L. & Maughan, H. The spectrum of spontaneous rifampin resistance mutations in the rpoB gene of Bacillus subtilis 168 spores differs from that of vegetative cells and resembles that of Mycobacterium tuberculosis. J. Bacteriol. 184, 4936–4940 (2002).

  42. Wu, Z. et al. MoleculeNet: a benchmark for molecular machine learning. Chem. Sci. 9, 513–530 (2018).

  43. Melo, M. C. R., Maasch, J. R. M. A. & de la Fuente-Nunez, C. Accelerating antibiotic discovery through artificial intelligence. Commun. Biol. 4, 1050 (2021).

  44. Yan, J. et al. Recent progress in the discovery and design of antimicrobial peptides using traditional machine learning and deep learning. Antibiotics 11, 1451 (2022).

  45. Mahlapuu, M., Håkansson, J., Ringstad, L. & Björn, C. Antimicrobial peptides: an emerging category of therapeutic agents. Front. Cell. Infect. Microbiol. 6, 194 (2016).

  46. Mahlapuu, M., Björn, C. & Ekblom, J. Antimicrobial peptides as therapeutic agents: opportunities and challenges. Crit. Rev. Biotechnol. 40, 978–992 (2020).

  47. Gómez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 4, 268–276 (2018).

  48. Kang, S. & Cho, K. Conditional molecular design with deep generative models. J. Chem. Inf. Model. 59, 43–52 (2019).

  49. Krenn, M., Häse, F., Nigam, A., Friederich, P. & Aspuru-Guzik, A. Self-referencing embedded strings (SELFIES): a 100% robust molecular string representation. Mach. Learn. Sci. Technol. 1, 045024 (2020).

  50. Liu, Q., Allamanis, M., Brockschmidt, M. & Gaunt, A. L. Constrained graph variational autoencoders for molecule design. In Proc. 32nd International Conference on Neural Information Processing Systems (eds Wallach, H. M., Larochelle, H., Grauman, K. & Cesa-Bianchi, N.) 7806–7815 (Curran Associates Inc., 2018).

  51. You, J., Liu, B., Ying, R., Pande, V. & Leskovec, J. Graph convolutional policy network for goal-directed molecular graph generation. In Proc. 32nd International Conference on Neural Information Processing Systems (eds Wallach, H. M., Larochelle, H., Grauman, K. & Cesa-Bianchi, N.) 6412–6422 (Curran Associates Inc., 2018).

  52. Jin, W., Barzilay, R. & Jaakkola, T. Junction tree variational autoencoder for molecular graph generation. ICML 80, 2323–2332 (2018).

  53. Jin, W., Barzilay, R. & Jaakkola, T. Hierarchical generation of molecular graphs using structural motifs. ICML 119, 4839–4848 (2020).

  54. Bilodeau, C. et al. Generating molecules with optimized aqueous solubility using iterative graph translation. React. Chem. Eng. 7, 297–309 (2022).

  55. Sadybekov, A. A. et al. Synthon-based ligand discovery in virtual libraries of over 11 billion compounds. Nature 601, 452–459 (2022).

  56. Yang, X., Zhang, J., Yoshizoe, K., Terayama, K. & Tsuda, K. ChemTS: an efficient python library for de novo molecular generation. Sci. Technol. Adv. Mater. 18, 972–976 (2017).

  57. Qian, H., Lin, C., Zhao, D., Tu, S. & Xu, L. AlphaDrug: protein target specific de novo molecular generation. PNAS Nexus 1, pgac227 (2022).

  58. Jin, W., Barzilay, R. & Jaakkola, T. Multi-objective molecule generation using interpretable substructures. ICML 119, 4849–4859 (2020).

  59. Segler, M. H. S., Preuss, M. & Waller, M. P. Planning chemical syntheses with deep neural networks and symbolic AI. Nature 555, 604–610 (2018).

  60. Coley, C. W. et al. A robotic platform for flow synthesis of organic compounds informed by AI planning. Science 365, eaax1566 (2019).

  61. Walters, W. P. & Murcko, M. Assessing the impact of generative AI on medicinal chemistry. Nat. Biotechnol. 38, 143–145 (2020).

  62. Corsello, S. M. et al. The Drug Repurposing Hub: a next-generation drug library and information resource. Nat. Med. 23, 405–408 (2017).

  63. Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Model. 28, 31–36 (1988).

  64. Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Proc. 33rd International Conference on Neural Information Processing Systems (eds Wallach, H. M., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F. & Fox, E. B.) 8026–8037 (2019).

  65. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

  66. Daylight Theory. SMARTS - a language for describing molecular patterns. Daylight Chemical Information Systems Inc. www.daylight.com/dayhtml/doc/theory/theory.smarts.html (2022).

  67. Wildman, S. A. & Crippen, G. M. Prediction of physicochemical parameters by atomic contributions. J. Chem. Inf. Comput. Sci. 39, 868–873 (1999).

  68. Swanson, K. et al. Generative AI for designing and validating easily synthesizable and structurally novel antibiotics: data and models. Zenodo https://doi.org/10.5281/zenodo.10257839 (2023).

  69. Swanson, K. & Liu, G. swansonk/SyntheMol: SyntheMol. Zenodo https://doi.org/10.5281/zenodo.10278151 (2023).

  70. Liu, G. et al. Deep learning-guided discovery of an antibiotic targeting Acinetobacter baumannii. Nat. Chem. Biol. 19, 1342–1350 (2023).

Acknowledgements

This research was kindly supported by the Weston Family Foundation (POP and Catalyst to J.M.S.); the David Braley Centre for Antibiotic Discovery (J.M.S.); the Canadian Institutes of Health Research (J.M.S.); a generous gift from M. and M. Heersink (J.M.S.) and the Chan-Zuckerberg Biohub (J.Z.). We thank Y. Moroz for his help accessing and answering our questions about the Enamine REAL Space building blocks, reactions and molecules. We thank G. Dubinina for help obtaining generated compounds. We thank M. Karelina, J. Miguel Hernández-Lobato, A. Tripp and M. Segler for insightful discussions about our property prediction and generative methods. We thank J. Boyce and A. Wright for providing mutant strains of A. baumannii used in this study. K.S. acknowledges support from the Knight-Hennessy scholarship.

Author information

Contributions

Conceptualization was carried out by K.S., G.L., J.Z. and J.M.S. Model development was performed by K.S. and G.L. Biological validation was carried out by D.B.C., A.A. and J.M.S. K.S., G.L., D.B.C., J.Z. and J.M.S. wrote the paper. J.Z. and J.M.S. supervised the work.

Corresponding authors

Correspondence to James Zou or Jonathan M. Stokes.

Ethics declarations

Competing interests

The authors declare the following competing interests: K.S. is employed part-time by Greenstone Biosciences; J.Z. is on the scientific advisory board of Greenstone Biosciences; and J.M.S. is cofounder and scientific director of Phare Bio. The other authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Feixiong Cheng and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Additional Property Prediction Model Development.

(a) Normalized growth of A. baumannii ATCC 17978 in biological duplicate for each of the three training set chemical libraries. Note: compounds with normalized growth > 3 are removed for visual purposes. The R² values are the coefficient of determination. (b) Receiver operating characteristic (ROC) curves and (c) precision-recall (PRC) curves for the Chemprop, Chemprop-RDKit, and random forest models. For each model, the black lines show the performance of each of the ten models in the ensemble and the blue curve shows the average. (d, e) Model performance of each of our three property prediction models when generalizing between our three training set libraries. Values on the diagonal are the average test set performance of a model on a single library across tenfold cross-validation. Values on the off-diagonals are the result of applying an ensemble of ten models trained on one library to a different library and evaluating those predictions. (d) Performance measured by area under the receiver operating characteristic curve (ROC-AUC). (e) Performance measured by area under the precision-recall curve (PRC-AUC).
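The two metrics named in panels (d) and (e) can be illustrated with a minimal pure-Python sketch. This is not the authors' evaluation code: the function names are hypothetical, and average precision is used here as one common step-wise estimate of PRC-AUC, which may differ slightly from the interpolation used in the paper.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the probability that a random active outscores a random
    inactive, counting ties as half a win (the Mann-Whitney formulation)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive molecule")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise area under the precision-recall curve.

    Molecules are ranked by descending score; precision is accumulated at
    each rank where an active is found. Tied scores are not treated
    specially in this sketch."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one active molecule")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / n_pos
```

Because actives are typically rare in antibacterial screens, PRC-AUC tends to be the more demanding of the two metrics: a model can rank most inactives correctly (high ROC-AUC) while still placing few actives at the very top of the list.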

Extended Data Fig. 2 REAL Space Analysis, Comparisons to Training Set, and Reactions.

(a) The cumulative percent of molecules in REAL Space that can be produced by each of the 169 REAL chemical reactions. (b) The percent of molecules in REAL Space that include each of the REAL building blocks. (c) The molecular weight distribution of a random sample of 25,000 REAL molecules (blue), the REAL building blocks (black), and our training set molecules (red). (d) The cLogP distribution of a random sample of 25,000 REAL molecules (blue), the REAL building blocks (black), and our training set molecules (red). (e) The remaining 5 REAL chemical reactions we used (first 8 in Fig. 4b), along with the number and percent of REAL molecules produced by each reaction.

Extended Data Fig. 3 REAL Building Block and Full Molecule Scores from Chemprop-RDKit and Random Forest.

(a, b) The distribution of antibacterial model scores on the REAL building blocks using the (a) Chemprop-RDKit or (b) random forest models. (c, d) The correlation between the antibacterial model score of a REAL molecule and the average antibacterial model score of its constituent building blocks using the (c) Chemprop-RDKit or (d) random forest models. The R² values are the coefficient of determination.

Extended Data Fig. 4 Comparison of Generated Sets with and without Building Block Diversity.

(a–c) The frequency with which building blocks were used in the generated molecules of SyntheMol, with and without the building block diversity score penalty for (a) Chemprop, (b) Chemprop-RDKit, and (c) random forest.

Extended Data Fig. 5 Model Scores by Rollout from Chemprop-RDKit and Random Forest.

(a, b) Violin plots of the distribution of antibacterial model scores for every 2,000 rollouts of the MCTS algorithm over 20,000 rollouts. SyntheMol uses the (a) Chemprop-RDKit or (b) random forest models for antibacterial prediction scores. The lines in each violin indicate the first quartile, the median, and the third quartile.
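For readers unfamiliar with the rollout loop behind these plots, the following is a minimal sketch of the generic UCT selection and backpropagation steps described in ref. 20. It is an illustration under stated assumptions, not SyntheMol's actual node-scoring function, which this page does not describe; the `Node` class, the exploration constant `c` and all function names are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A search-tree node, e.g. a partial combination of building blocks."""
    name: str
    visits: int = 0
    total_score: float = 0.0
    children: List["Node"] = field(default_factory=list)

    @property
    def mean_score(self) -> float:
        # Average property-model score of rollouts that passed through this node.
        return self.total_score / self.visits if self.visits else 0.0


def ucb1(parent_visits: int, child: Node, c: float = 1.4) -> float:
    """Generic UCB1 value: exploitation term plus exploration bonus."""
    if child.visits == 0:
        return math.inf  # always try unvisited children first
    return child.mean_score + c * math.sqrt(math.log(parent_visits) / child.visits)


def select_child(parent: Node, c: float = 1.4) -> Node:
    """Pick the child with the highest UCB1 value."""
    return max(parent.children, key=lambda child: ucb1(parent.visits, child, c))


def backpropagate(path: List[Node], score: float) -> None:
    """After a rollout ends in a scored molecule, update every node on the path."""
    for node in path:
        node.visits += 1
        node.total_score += score
```

Under a rule of this kind, later rollouts concentrate on branches whose earlier rollouts scored well, which is consistent with score distributions shifting upward as the number of rollouts grows.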

Extended Data Fig. 6 Additional Analysis of Chemprop Generated and Selected Sets.

(a) The percent of building blocks that appear at different frequencies among the generated or selected compounds by SyntheMol with Chemprop. Building blocks are assigned to bins on the x-axis based on the number of generated or selected compounds that contain that building block, with the final bin including building blocks that appear in at least six compounds (max 137). (b) The distribution of chemical reactions used by the generated or selected compounds by SyntheMol with Chemprop. (c) A t-SNE visualization of the training set along with all generated and selected molecules from each of the three property predictor models.

Extended Data Fig. 7 Analysis of Chemprop-RDKit Generated and Selected Sets.

(a) The percent of building blocks that appear at different frequencies among the generated or selected compounds by SyntheMol with Chemprop-RDKit. Building blocks are assigned to bins on the x-axis based on the number of generated or selected compounds that contain that building block, with the final bin including building blocks that appear in at least six compounds (max 185). (b) The distribution of chemical reactions used by the generated or selected compounds by SyntheMol with Chemprop-RDKit. (c–f) A comparison of the properties of the 25,828 molecules generated by SyntheMol with the Chemprop-RDKit antibacterial model and the 50 molecules selected from that set after applying post-hoc filters. (c) The distribution of nearest neighbor Tversky similarities between the generated or selected compounds and the active molecules in the training set. (d) The distribution of nearest neighbor Tversky similarities between the generated or selected compounds and the known antibacterial compounds from ChEMBL. (e) The distribution of Chemprop-RDKit antibacterial model scores on the generated or selected compounds, as well as on a random set of 25,000 REAL molecules. (f) The distribution of nearest neighbor Tanimoto similarities among the generated or selected compounds.
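The Tversky (ref. 30) and Tanimoto (ref. 34) similarities referred to in panels (c), (d) and (f) can be sketched on fingerprints represented as sets of "on" bit indices. This representation, the function names and the example weights are assumptions for illustration only; the weights α and β used in the paper are not stated on this page, and a real pipeline would more likely compute these quantities on Morgan fingerprints (ref. 31) with RDKit.

```python
from typing import Callable, Iterable, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Shared bits divided by bits present in either fingerprint."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def tversky(a: Set[int], b: Set[int], alpha: float, beta: float) -> float:
    """Asymmetric similarity: alpha weights bits unique to a, beta bits unique to b.

    With alpha = beta = 1 this reduces to the Tanimoto similarity."""
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    return common / denom if denom else 0.0


def nearest_neighbor_similarity(
    query: Set[int],
    references: Iterable[Set[int]],
    similarity: Callable[[Set[int], Set[int]], float],
) -> float:
    """Highest similarity between the query and any reference fingerprint."""
    return max(similarity(query, ref) for ref in references)
```

The asymmetry is the reason to prefer Tversky for novelty checks: with suitable weights, a generated molecule that fully contains a known active scaffold scores as highly similar even if it carries many additional atoms, whereas Tanimoto would dilute the match.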

Extended Data Fig. 8 Analysis of Random Forest Generated and Selected Sets.

(a) The percent of building blocks that appear at different frequencies among the generated or selected compounds by SyntheMol with random forest. Building blocks are assigned to bins on the x-axis based on the number of generated or selected compounds that contain that building block, with the final bin including building blocks that appear in at least six compounds (max 212). (b) The distribution of chemical reactions used by the generated or selected compounds by SyntheMol with random forest. (c–f) A comparison of the properties of the 27,396 molecules generated by SyntheMol with the random forest antibacterial model and the 50 molecules selected from that set after applying post-hoc filters. (c) The distribution of nearest neighbor Tversky similarities between the generated or selected compounds and the active molecules in the training set. (d) The distribution of nearest neighbor Tversky similarities between the generated or selected compounds and the known antibacterial compounds from ChEMBL. (e) The distribution of random forest antibacterial model scores on the generated or selected compounds as well as on a random set of 25,000 REAL molecules. (f) The distribution of nearest neighbor Tanimoto similarities among the generated or selected compounds.

Extended Data Fig. 9 Additional In Vitro Validation.

(a) Gram-negative bacterial isolates tested for growth inhibition against SPR 741 or colistin. Experiments were performed in biological duplicate. Error bars represent absolute range of optical density measurements at 600 nm. (b) Heat map summarizing MICs of 58 randomly selected compounds from the REAL Space against A. baumannii ATCC 17978 in I) LB medium, II) LB medium + a quarter MIC SPR 741, and III) LB medium + a quarter MIC colistin. Compounds were tested at concentrations from 256 µg/mL to 4 µg/mL in two-fold serial dilutions. Lighter colours indicate lower MIC values for each random REAL molecule. No compounds displayed potent antibacterial activity using the threshold of MIC ≤ 8 µg/mL. Experiments were performed in at least biological duplicate. (c, d) Chequerboard analysis to quantify synergy, as defined by FICI, with SPR 741 or colistin against Gram-negative isolates. Chequerboard experiments were performed using two-fold serial dilution series with the maximum and minimum concentrations of the potentiator (x-axis) and compound (y-axis) shown in µg/mL. Darker blue represents higher bacterial growth. Experiments were performed in biological duplicate. The mean growth of each well is shown. (c) Chequerboard assays using the six bioactive compounds, in combination with colistin, against A. baumannii ATCC 17978. (d) Chequerboard assays using rifampicin – a control antibiotic – in combination with SPR 741 or colistin against a panel of Gram-negative bacterial species.
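The arithmetic behind the dilution series and the FICI values can be sketched as follows. The function names are hypothetical, and the synergy cut-off of FICI ≤ 0.5 is a widely used convention assumed here for illustration; the threshold applied in the paper is not stated on this page.

```python
from typing import List


def twofold_dilution_series(high: float, low: float) -> List[float]:
    """Concentrations from `high` down to `low`, halving at each step."""
    concentrations = []
    current = high
    while current >= low:
        concentrations.append(current)
        current /= 2
    return concentrations


def fici(mic_a_alone: float, mic_a_combo: float,
         mic_b_alone: float, mic_b_combo: float) -> float:
    """Fractional inhibitory concentration index for compounds A and B.

    Each term is the MIC of a compound in combination divided by its MIC alone."""
    if mic_a_alone <= 0 or mic_b_alone <= 0:
        raise ValueError("MICs measured alone must be positive")
    return mic_a_combo / mic_a_alone + mic_b_combo / mic_b_alone


def is_synergistic(fici_value: float, threshold: float = 0.5) -> bool:
    """Assumed convention: FICI at or below 0.5 indicates synergy."""
    return fici_value <= threshold
```

For example, `twofold_dilution_series(256, 4)` yields the seven concentrations 256, 128, 64, 32, 16, 8 and 4 µg/mL, matching the range described in panel (b).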

Extended Data Fig. 10 Toxicity Predictions.

Predictions of the probability of clinical toxicity using an ensemble of ten Chemprop-RDKit models trained on the ClinTox dataset. ‘Non-toxic compounds’ show the toxicity predictions of the model on the non-toxic molecules in the dataset (n = 1,372), where each molecule’s prediction score comes from the one model in the ensemble for which that molecule was in the test set. ‘Toxic compounds’ shows the same toxicity predictions for the toxic molecules in the dataset (n = 112). ‘Selected six’ shows the average prediction of the ensemble of ten models on the six potent generated molecules. Blue horizontal lines represent the mean predictions for each set.
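The caption distinguishes two ways of reading predictions from a ten-model ensemble, which a short sketch may make concrete. The data layout and function names below are assumptions for illustration: `fold_scores[k][i]` is taken to be model k's score for molecule i, and `test_fold[i]` the index of the one model for which molecule i was held out.

```python
from typing import List, Sequence


def out_of_fold_scores(test_fold: Sequence[int],
                       fold_scores: Sequence[Sequence[float]]) -> List[float]:
    """For dataset molecules: keep only the score from the model that never
    saw that molecule during training (it was in that model's test set)."""
    return [fold_scores[k][i] for i, k in enumerate(test_fold)]


def ensemble_average(fold_scores: Sequence[Sequence[float]]) -> List[float]:
    """For new molecules, unseen by every model: average across all models."""
    return [sum(column) / len(column) for column in zip(*fold_scores)]
```

Using out-of-fold scores for the dataset molecules keeps the 'toxic' and 'non-toxic' distributions free of training-set leakage, so they are a fair reference for the averaged predictions on the six generated molecules.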

Supplementary information

Supplementary Information

Supplementary Tables 1–6 and Extended Discussion.

Reporting Summary

Supplementary Data 1–9

  • Supplementary Data 1: Training sets for antibacterial activity and clinical toxicity.
  • Supplementary Data 2: Antibiotic and antibacterial molecules from ChEMBL.
  • Supplementary Data 3: Antibacterial Chemprop/Chemprop-RDKit/random forest models performance summary.
  • Supplementary Data 4: Analysis of the Enamine REAL Space.
  • Supplementary Data 5: cLogP Chemprop model performance summary trained for 1 and 30 epochs.
  • Supplementary Data 6: Molecules generated by SyntheMol, using the antibacterial Chemprop model as a scoring function.
  • Supplementary Data 7: Molecules generated by SyntheMol, using the antibacterial Chemprop-RDKit model as a scoring function.
  • Supplementary Data 8: Molecules generated by SyntheMol, using the antibacterial random forest model as a scoring function.
  • Supplementary Data 9: Molecules selected for synthesis by Enamine.

Supplementary Data

Quality control LC–MS data for SyntheMol-generated compounds.

Supplementary Data

Quality control LC–MS data for randomly selected control compounds.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Swanson, K., Liu, G., Catacutan, D.B. et al. Generative AI for designing and validating easily synthesizable and structurally novel antibiotics. Nat Mach Intell 6, 338–353 (2024). https://doi.org/10.1038/s42256-024-00809-7

  • DOI: https://doi.org/10.1038/s42256-024-00809-7
