  • Review Article

Machine learning-enabled retrobiosynthesis of molecules

Abstract

Retrobiosynthesis provides an effective and sustainable approach to producing functional molecules. The past few decades have witnessed a rapid expansion of biosynthetic approaches. With the recent advances in data-driven sciences, machine learning (ML) is enriching the retrobiosynthesis design toolbox and being applied to each step of the synthesis design workflow, including retrosynthesis planning, enzyme identification and engineering, and pathway optimization. The ability to learn from existing knowledge, recognize complex patterns and generalize to the unknown has made ML a promising solution to biological problems. In this Review, we summarize the recent progress in the development of ML models for assisting with molecular synthesis. We highlight the key advantages of ML-based biosynthesis design methods and discuss the challenges and outlook for the further development of ML-based approaches.

Fig. 1: An overview of an ML-enabled retrobiosynthesis workflow.
Fig. 2: A conceptualized pathway design workflow based on a retrobiosynthesis tool.
Fig. 3: Multi-task transfer learning for retrobiosynthesis.
Fig. 4: Similarity- and ML-based EC number prediction tools.
Fig. 5: An overview of the MLDE workflow.
Fig. 6: Overview of the ML approaches for designing novel enzymes.
Fig. 7: Reaction network optimization in vivo and in vitro.

Acknowledgements

This work was supported by the Molecule Maker Lab Institute, an AI Research Institutes programme supported by the US National Science Foundation under grant no. 2019897 (H.Z.). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect those of the NSF.

Author information

Corresponding author

Correspondence to Huimin Zhao.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Catalysis thanks William Finnigan, Pablo Carbonell and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yu, T., Boob, A.G., Volk, M.J. et al. Machine learning-enabled retrobiosynthesis of molecules. Nat Catal 6, 137–151 (2023). https://doi.org/10.1038/s41929-022-00909-w

  • DOI: https://doi.org/10.1038/s41929-022-00909-w
