PocketFlow is a data-and-knowledge-driven structure-based molecular generative model

A preprint version of the article is available at Research Square.

Abstract

Deep learning-based molecular generation has extensive applications in many fields, particularly drug discovery. However, the majority of current deep generative models are ligand-based and do not consider chemical knowledge in the molecular generation process, often resulting in a relatively low success rate. We herein propose a structure-based molecular generative framework with chemical knowledge explicitly considered (named PocketFlow), which generates novel ligand molecules inside protein binding pockets. In various computational evaluations, PocketFlow showed state-of-the-art performance, with generated molecules being 100% chemically valid and highly drug-like. Ablation experiments prove the critical role of chemical knowledge in ensuring the validity and drug-likeness of the generated molecules. We applied PocketFlow to two new target proteins that are related to epigenetic regulation, HAT1 and YTHDC1, and successfully obtained wet-lab validated bioactive compounds. The binding modes of the active compounds with target proteins are close to those predicted by molecular docking and further confirmed by the X-ray crystal structure. All the results suggest that PocketFlow is a useful deep generative model, capable of generating innovative bioactive molecules from scratch given a protein binding pocket.

Fig. 1: Architecture and generative process of PocketFlow.
Fig. 2: Evaluation of the geometry for generated molecules.
Fig. 3: Distributions of atom positions for 1,000 molecules randomly selected from molecules generated by different DGMs.
Fig. 4: Application of PocketFlow leads to the discovery of new small molecule inhibitors of HAT1 and YTHDC1.

Data availability

The pretraining dataset used in this study was randomly selected from the ZINC database: https://zinc.docking.org. The fine-tuning dataset was extracted from CrossDocked2020: https://bits.csb.pitt.edu/files/crossdock2020/. Both the pretraining and fine-tuning data are available at Zenodo (https://doi.org/10.5281/zenodo.10142813).

Code availability

The PocketFlow source code is available at https://github.com/Saoge123/PocketFlow (https://doi.org/10.5281/zenodo.10460455) (ref. 64).

References

  1. Li, Y. et al. Generative deep learning enables the discovery of a potent and selective RIPK1 inhibitor. Nat. Commun. 13, 6891 (2022).


  2. Isert, C., Atz, K. & Schneider, G. Structure-based drug design with geometric deep learning. Curr. Opin. Struct. Biol. 79, 102548 (2023).


  3. Moret, M. et al. Leveraging molecular structure and bioactivity with chemical language models for de novo drug design. Nat. Commun. 14, 114 (2023).


  4. Ramesh, A. et al. Hierarchical text-conditional image generation with CLIP latents. Preprint at https://doi.org/10.48550/arXiv.2204.06125 (2022).

  5. Tong, X. et al. Generative models for de novo drug design. J. Med. Chem. 64, 14011–14027 (2021).


  6. Wang, J. et al. Multi-constraint molecular generation based on conditional transformer, knowledge distillation and reinforcement learning. Nat. Mach. Intell. 3, 914–922 (2021).


  7. Li, Y., Pei, J. & Lai, L. Structure-based de novo drug design using 3D deep generative models. Chem. Sci. 12, 13664–13675 (2021).


  8. Zheng, S. et al. Accelerated rational PROTAC design via deep learning and molecular simulations. Nat. Mach. Intell. 4, 739–748 (2022).


  9. Zhang, J. & Chen, H. De novo molecule design using molecular generative models constrained by ligand–protein interactions. J. Chem. Inf. Model. 62, 3291–3306 (2022).


  10. Godinez, W. J. et al. Design of potent antimalarials with generative chemistry. Nat. Mach. Intell. 4, 180–186 (2022).


  11. Bagal, V. et al. MolGPT: molecular generation using a transformer-decoder model. J. Chem. Inf. Model. 62, 2064–2076 (2022).


  12. Blaschke, T. et al. REINVENT 2.0: An AI tool for de novo drug design. J. Chem. Inf. Model. 60, 5918–5922 (2020).


  13. Schwaller, P. et al. Molecular transformer: a model for uncertainty-calibrated chemical reaction prediction. ACS Cent. Sci. 5, 1572–1583 (2019).


  14. Moret, M. et al. Beam search for automated design and scoring of novel ROR ligands with machine intelligence. Angew. Chem. Int. Ed. 60, 19477–19482 (2021).


  15. Liu, M. et al. Generating 3D molecules for target protein binding. Preprint at https://doi.org/10.48550/arXiv.2204.09410 (2022).

  16. Peng, X. et al. Pocket2Mol: efficient molecular sampling based on 3D protein pockets. In Proceedings of the International Conference on Machine Learning 162, 17644–17655 (2022).

  17. Ragoza, M., Masuda, T. & Koes, D. R. Generating 3D molecules conditional on receptor binding sites with deep generative models. Chem. Sci. 13, 2701–2713 (2022).


  18. Pearl, J. Radical empiricism and machine learning research. J. Causal Inference 9, 78–82 (2021).


  19. Pan, Y. Heading toward artificial intelligence 2.0. Engineering 2, 409–413 (2016).


  20. Cheng, G., Gong, X.-G. & Yin, W.-J. Crystal structure prediction by combining graph network and optimization algorithm. Nat. Commun. 13, 1492 (2022).


  21. Jiang, Y. et al. Coupling complementary strategy to flexible graph neural network for quick discovery of coformer in diverse co-crystal materials. Nat. Commun. 12, 5950 (2021).


  22. O’Boyle, N. M. et al. Open Babel: an open chemical toolbox. J. Cheminform. 3, 33 (2011).


  23. Bickerton, G. R. et al. Quantifying the chemical beauty of drugs. Nat. Chem. 4, 90–98 (2012).


  24. Ertl, P. & Schuffenhauer, A. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J. Cheminform. 1, 8 (2009).


  25. Polykovskiy, D. et al. Molecular sets (MOSES): a benchmarking platform for molecular generation models. Front. Pharmacol. 11, 565644 (2020).


  26. Francoeur, P. G. et al. Three-dimensional convolutional neural networks and a cross-docked data set for structure-based drug design. J. Chem. Inf. Model. 60, 4200–4215 (2020).


  27. Eldridge, M. D. et al. Empirical scoring functions: I. The development of a fast empirical scoring function to estimate the binding affinity of ligands in receptor complexes. J. Comput.-Aided Mol. Des. 11, 425–445 (1997).


  28. Hartshorn, M. J. et al. Diverse, high-quality test set for the validation of protein-ligand docking performance. J. Med. Chem. 50, 726–741 (2007).


  29. Hopkins, A. L., Groom, C. R. & Alex, A. Ligand efficiency: a useful metric for lead selection. Drug Discov. Today 9, 430–431 (2004).


  30. Kenny, P. W. The nature of ligand efficiency. J. Cheminform. 11, 8 (2019).


  31. Chen, H. et al. in Comprehensive Medicinal Chemistry III (eds Chackalamannil, S. et al.) Ch. 2.08 (Elsevier, 2017).

  32. Verdonk, M. L. et al. Docking performance of fragments and druglike compounds. J. Med. Chem. 54, 5422–5431 (2011).


  33. Wu, H. et al. Structural basis for substrate specificity and catalysis of human histone acetyltransferase 1. Proc. Natl Acad. Sci. USA 109, 8925–8930 (2012).


  34. Fan, P. et al. Overexpressed histone acetyltransferase 1 regulates cancer immunity by increasing programmed death-ligand 1 expression in pancreatic cancer. J. Exp. Clin. Cancer Res. 38, 47 (2019).


  35. Xue, L. et al. RNAi screening identifies HAT1 as a potential drug target in esophageal squamous cell carcinoma. Int. J. Clin. Exp. Pathol. 7, 3898–3907 (2014).


  36. Xia, P. et al. MicroRNA-377 exerts a potent suppressive role in osteosarcoma through the involvement of the histone acetyltransferase 1-mediated Wnt axis. J. Cell. Physiol. 234, 22787–22798 (2019).


  37. Kumar, N. et al. Histone acetyltransferase 1 (HAT1) acetylates hypoxia-inducible factor 2 alpha (HIF2A) to execute hypoxia response. Biochim. Biophys. Acta Gene Regul. Mech. 1866, 194900 (2023).


  38. Lahue, B. R. et al. Diversity & tractability revisited in collaborative small molecule phenotypic screening library design. Bioorg. Med. Chem. 28, 115192 (2020).


  39. Roundtree, I. A. et al. YTHDC1 mediates nuclear export of N6-methyladenosine methylated mRNAs. eLife 6, e31311 (2017).


  40. Xiao, W. et al. Nuclear m6A reader YTHDC1 regulates mRNA splicing. Mol. Cell 61, 507–519 (2016).


  41. Sheng, Y. et al. A critical role of nuclear m6A reader YTHDC1 in leukemogenesis by regulating MCM complex–mediated DNA replication. Blood 138, 2838–2852 (2021).


  42. Bubeck, S. & Sellke, M. A universal law of robustness via isoperimetry. J. ACM 70, 1–18 (2023).


  43. Nakkiran, P. et al. Deep double descent: where bigger models and more data hurt. J. Stat. Mech.: Theory Exp. 2021, 124003 (2021).


  44. Schulman, J. et al. Proximal policy optimization algorithms. Preprint at https://doi.org/10.48550/arXiv.1707.06347 (2017).

  45. Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).


  46. Sutton, R. S. et al. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems https://proceedings.neurips.cc/paper_files/paper/2018/hash/d60678e8f2ba9c540798ebbde31177e8-Abstract.html (1999).

  47. Haarnoja, T. et al. Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning 80, 1861–1870 (2018).

  48. Jing, B. et al. Learning from protein structure with geometric vector perceptrons. Preprint at https://doi.org/10.48550/arXiv.2009.01411 (2020).

  49. Aykent, S. & Xia, T. GBPNet: universal geometric representation learning on protein structures. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining https://doi.org/10.1145/3534678.3539441 (2022).

  50. Deng, C. et al. Vector neurons: a general framework for SO(3)-equivariant networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision https://openaccess.thecvf.com/content/ICCV2021/html/Deng_Vector_Neurons_A_General_Framework_for_SO3-Equivariant_Networks_ICCV_2021_paper.html (2021).

  51. He, K. et al. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition https://openaccess.thecvf.com/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html (2016).

  52. Gasteiger, J. et al. Fast and uncertainty-aware directional message passing for non-equilibrium molecules. Preprint at https://doi.org/10.48550/arXiv.2011.14115 (2020).

  53. Yu, D. & Seltzer, M. L. Improved bottleneck features using pretrained deep neural networks. In Twelfth Annual Conference of the International Speech Communication Association https://jackyguo624.github.io/img/2020-02-12-bottle-feature-for-asr/Bottleneck-Interspeech2011-pub.pdf (2011).

  54. Ranzato, M. A. et al. Sequence level training with recurrent neural networks. Preprint at https://doi.org/10.48550/arXiv.1511.06732 (2015).

  55. Schmidt, F. J. Generalization in generation: a closer look at exposure bias. Preprint at https://doi.org/10.48550/arXiv.1910.00292 (2019).

  56. Bishop, C. M. Mixture density networks. Technical Report. https://publications.aston.ac.uk/id/eprint/373/ (Aston University, 1994).

  57. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).


  58. Luo, Y., Yan, K. & Ji, S. GraphDF: a discrete flow model for molecular graph generation. In Proceedings of the 38th International Conference on Machine Learning 139, 7192–7203 (2021).

  59. Shi, C. et al. GraphAF: a flow-based autoregressive model for molecular graph generation. Preprint at https://doi.org/10.48550/arXiv.2001.09382 (2020).

  60. You, J. et al. Graph convolutional policy network for goal-directed molecular graph generation. In Advances in Neural Information Processing Systems https://proceedings.neurips.cc/paper_files/paper/2018/hash/d60678e8f2ba9c540798ebbde31177e8-Abstract.html (2018).

  61. Popova, M. et al. MolecularRNN: generating realistic molecular graphs with optimized properties. Preprint at https://doi.org/10.48550/arXiv.1905.13372 (2019).

  62. Irwin, J. J. et al. ZINC20—a free ultralarge-scale chemical database for ligand discovery. J. Chem. Inf. Model. 60, 6065–6073 (2020).


  63. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. Preprint at https://doi.org/10.48550/arXiv.1412.6980 (2014).

  64. Jiang, Y. et al. PocketFlow is a data-and-knowledge driven structure-based molecular generative model. Zenodo https://doi.org/10.5281/zenodo.10460455 (2024).

Acknowledgements

This work was supported by the National Key R&D Program of China (grant no. 2023YFF1204905, S.Y.); the National Natural Science Foundation of China (grant nos. T2221004, S.Y.; 81930125, S.Y.; and 82273787, L.L.); the New Cornerstone Science Foundation; the Major Project of Guangzhou National Laboratory (grant no. GZNL2024A01005); the 1.3.5 project for disciplines of excellence, West China Hospital, Sichuan University (grant nos. ZYXY21001, S.Y.; ZYGD23006, S.Y.); the Frontiers Medical Center, Tianfu Jincheng Laboratory Foundation (grant no. TFJC2023010009, S.Y.); the Natural Science Foundation of Sichuan Province (grant no. 24NSFSC6411); and the Sichuan University Postdoctoral Interdisciplinary Innovation Fund (grant no. JCXK2227). We also thank the staff of beamlines BL18U1 and BL19U1 at the Shanghai Synchrotron Radiation Facility of the National Facility for Protein Science (Shanghai, China) for their support.

Author information

Contributions

S.Y. conceived and supervised the research and designed the experiments. S.Y. and Y.J. established and validated the DGM model. Y.J., H.X. and M.D. performed molecular docking. G.Z., J.Y., Z.X. and Y.W. carried out chemical synthesis. H.Z. and R.Y. performed the bioactivity assays. S.Y., Y.J., G.Z., J.Y., H.Z., L.Z., R.Y. and L.L. analysed the data. S.Y. and Y.J. wrote the manuscript.

Corresponding author

Correspondence to Shengyong Yang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks the anonymous reviewers for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Bond angle distributions of molecules generated by different generative models and molecules in the CrossDocked2020 dataset.

(a) CCC, (b) CC=C, (c) COC, (d) CNC, (e) OC=O, (f) CN=O, (g) CN=C, (h) CSC. Results for molecules generated by PocketFlow, Pocket2Mol, GraphBP, and LiGAN, and molecules in the CrossDocked2020 dataset are shown in yellow, blue, green, red, and cyan, respectively.
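
As a rough illustration of the quantity plotted in these panels, the following minimal sketch computes a single bond angle from three atomic positions. It is not taken from the PocketFlow code; the function name `bond_angle` and the example coordinates are hypothetical, and only the Python standard library is assumed.

```python
import math

def bond_angle(a, b, c):
    """Return the angle a-b-c in degrees, where b is the central atom.

    a, b and c are (x, y, z) coordinates. Hypothetical helper for illustration.
    """
    # Vectors pointing from the central atom to each neighbour.
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_theta = max(-1.0, min(1.0, dot / (n1 * n2)))
    return math.degrees(math.acos(cos_theta))

# Made-up coordinates: two neighbours placed along tetrahedral directions
# around a central atom at the origin (cosine of the angle is -1/3).
angle_ccc = bond_angle((1.0, 1.0, 1.0), (0.0, 0.0, 0.0), (1.0, -1.0, -1.0))
print(round(angle_ccc, 2))
```

Collecting such angles for every matching atom triple (for example, all C-C-C triples) across a set of molecules would give one empirical distribution per panel.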

Extended Data Fig. 2 Distributions of atom positions for 1,000 molecules randomly selected from molecules generated by different DGMs.

Target proteins and their active pockets are displayed as protein surfaces (white). Green points indicate heavy atoms of molecules.

Extended Data Table 1 KL divergence between the bond length distribution of molecules generated by each DGM and that of molecules in CrossDocked2020
Extended Data Table 2 KL divergence between the bond angle distribution of molecules generated by each DGM and that of molecules in CrossDocked2020
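
For readers unfamiliar with the metric in these two tables, the sketch below shows one common way to estimate a Kullback-Leibler (KL) divergence between two sets of measurements using histograms. It is an illustrative assumption, not the authors' procedure: the binning range, the number of bins, the smoothing constant `eps`, the direction of the divergence (generated relative to reference) and the example bond lengths are all hypothetical.

```python
import math

def histogram(values, lo, hi, bins):
    """Count values into `bins` equal-width bins over [lo, hi)."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        if lo <= v < hi:
            # min() guards against a rounding error pushing the index to `bins`.
            counts[min(int((v - lo) / width), bins - 1)] += 1
    return counts

def kl_divergence(p_counts, q_counts, eps=1e-9):
    """Estimate D_KL(P || Q) in nats from two histograms with identical bins.

    A small constant eps is added to every bin so that empty bins do not
    produce log(0) or division by zero (an assumed smoothing choice).
    """
    p_total = sum(p_counts) + eps * len(p_counts)
    q_total = sum(q_counts) + eps * len(q_counts)
    kl = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = (pc + eps) / p_total
        q = (qc + eps) / q_total
        kl += p * math.log(p / q)
    return kl

# Made-up C-C bond lengths in angstroms, for illustration only.
reference_lengths = [1.52, 1.53, 1.54, 1.51, 1.53, 1.52]
generated_lengths = [1.50, 1.55, 1.60, 1.45, 1.53, 1.52]

ref_hist = histogram(reference_lengths, 1.3, 1.7, 8)
gen_hist = histogram(generated_lengths, 1.3, 1.7, 8)
kl_generated_vs_reference = kl_divergence(gen_hist, ref_hist)
print(kl_generated_vs_reference)
```

A value near zero indicates that the generated distribution closely matches the reference one; larger values indicate a greater mismatch. The estimate depends on the bin width and smoothing, so values are comparable only when computed with the same settings.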

Supplementary information

Supplementary Information

Supplementary Tables 1–23, Figs. 1–14, Results and Methods.

Reporting Summary

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Jiang, Y., Zhang, G., You, J. et al. PocketFlow is a data-and-knowledge-driven structure-based molecular generative model. Nat Mach Intell 6, 326–337 (2024). https://doi.org/10.1038/s42256-024-00808-8

  • DOI: https://doi.org/10.1038/s42256-024-00808-8
