Abstract
Deep learning-based molecular generation has extensive applications in many fields, particularly drug discovery. However, the majority of current deep generative models are ligand-based and do not consider chemical knowledge in the molecular generation process, often resulting in a relatively low success rate. We herein propose a structure-based molecular generative framework with chemical knowledge explicitly considered (named PocketFlow), which generates novel ligand molecules inside protein binding pockets. In various computational evaluations, PocketFlow showed state-of-the-art performance, with generated molecules being 100% chemically valid and highly drug-like. Ablation experiments prove the critical role of chemical knowledge in ensuring the validity and drug-likeness of the generated molecules. We applied PocketFlow to two new target proteins that are related to epigenetic regulation, HAT1 and YTHDC1, and successfully obtained wet-lab validated bioactive compounds. The binding modes of the active compounds with target proteins are close to those predicted by molecular docking and further confirmed by the X-ray crystal structure. All the results suggest that PocketFlow is a useful deep generative model, capable of generating innovative bioactive molecules from scratch given a protein binding pocket.
Data availability
The pretraining dataset for this study was randomly selected from the ZINC database: https://zinc.docking.org. The fine-tuning dataset was extracted from CrossDocked2020: https://bits.csb.pitt.edu/files/crossdock2020/. Both the pretraining and fine-tuning data are available at Zenodo (https://doi.org/10.5281/zenodo.10142813).
Code availability
The source code of PocketFlow is available at https://github.com/Saoge123/PocketFlow (https://doi.org/10.5281/zenodo.10460455; ref. 64).
References
Li, Y. et al. Generative deep learning enables the discovery of a potent and selective RIPK1 inhibitor. Nat. Commun. 13, 6891 (2022).
Isert, C., Atz, K. & Schneider, G. Structure-based drug design with geometric deep learning. Curr. Opin. Struct. Biol. 79, 102548 (2023).
Moret, M. et al. Leveraging molecular structure and bioactivity with chemical language models for de novo drug design. Nat. Commun. 14, 114 (2023).
Ramesh, A. et al. Hierarchical text-conditional image generation with CLIP latents. Preprint at https://doi.org/10.48550/arXiv.2204.06125 (2022).
Tong, X. et al. Generative models for de novo drug design. J. Med. Chem. 64, 14011–14027 (2021).
Wang, J. et al. Multi-constraint molecular generation based on conditional transformer, knowledge distillation and reinforcement learning. Nat. Mach. Intell. 3, 914–922 (2021).
Li, Y., Pei, J. & Lai, L. Structure-based de novo drug design using 3D deep generative models. Chem. Sci. 12, 13664–13675 (2021).
Zheng, S. et al. Accelerated rational PROTAC design via deep learning and molecular simulations. Nat. Mach. Intell. 4, 739–748 (2022).
Zhang, J. & Chen, H. De novo molecule design using molecular generative models constrained by ligand–protein interactions. J. Chem. Inf. Model. 62, 3291–3306 (2022).
Godinez, W. J. et al. Design of potent antimalarials with generative chemistry. Nat. Mach. Intell. 4, 180–186 (2022).
Bagal, V. et al. MolGPT: molecular generation using a transformer-decoder model. J. Chem. Inf. Model. 62, 2064–2076 (2022).
Blaschke, T. et al. REINVENT 2.0: An AI tool for de novo drug design. J. Chem. Inf. Model. 60, 5918–5922 (2020).
Schwaller, P. et al. Molecular transformer: a model for uncertainty-calibrated chemical reaction prediction. ACS Cent. Sci. 5, 1572–1583 (2019).
Moret, M. et al. Beam search for automated design and scoring of novel ROR ligands with machine intelligence. Angew. Chem. Int. Ed. 60, 19477–19482 (2021).
Liu, M. et al. Generating 3D molecules for target protein binding. Preprint at https://doi.org/10.48550/arXiv.2204.09410 (2022).
Peng, X. et al. Pocket2Mol: efficient molecular sampling based on 3D protein pockets. In Proceedings of the 39th International Conference on Machine Learning 162, 17644–17655 (2022).
Ragoza, M., Masuda, T. & Koes, D. R. Generating 3D molecules conditional on receptor binding sites with deep generative models. Chem. Sci. 13, 2701–2713 (2022).
Pearl, J. Radical empiricism and machine learning research. J. Causal Inference 9, 78–82 (2021).
Pan, Y. Heading toward artificial intelligence 2.0. Engineering 2, 409–413 (2016).
Cheng, G., Gong, X.-G. & Yin, W.-J. Crystal structure prediction by combining graph network and optimization algorithm. Nat. Commun. 13, 1492 (2022).
Jiang, Y. et al. Coupling complementary strategy to flexible graph neural network for quick discovery of coformer in diverse co-crystal materials. Nat. Commun. 12, 5950 (2021).
O’Boyle, N. M. et al. Open Babel: an open chemical toolbox. J. Cheminform. 3, 33 (2011).
Bickerton, G. R. et al. Quantifying the chemical beauty of drugs. Nat. Chem. 4, 90–98 (2012).
Ertl, P. & Schuffenhauer, A. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J. Cheminform. 1, 8 (2009).
Polykovskiy, D. et al. Molecular sets (MOSES): a benchmarking platform for molecular generation models. Front. Pharmacol. 11, 565644 (2020).
Francoeur, P. G. et al. Three-dimensional convolutional neural networks and a cross-docked data set for structure-based drug design. J. Chem. Inf. Model. 60, 4200–4215 (2020).
Eldridge, M. D. et al. Empirical scoring functions: I. The development of a fast empirical scoring function to estimate the binding affinity of ligands in receptor complexes. J. Comput.-Aided Mol. Des. 11, 425–445 (1997).
Hartshorn, M. J. et al. Diverse, high-quality test set for the validation of protein-ligand docking performance. J. Med. Chem. 50, 726–741 (2007).
Hopkins, A. L., Groom, C. R. & Alex, A. Ligand efficiency: a useful metric for lead selection. Drug Discov. Today 9, 430–431 (2004).
Kenny, P. W. The nature of ligand efficiency. J. Cheminform. 11, 8 (2019).
Chen, H. et al. in Comprehensive Medicinal Chemistry III (eds Chackalamannil, S. et al.) Ch. 2.08 (Elsevier, 2017).
Verdonk, M. L. et al. Docking performance of fragments and druglike compounds. J. Med. Chem. 54, 5422–5431 (2011).
Wu, H. et al. Structural basis for substrate specificity and catalysis of human histone acetyltransferase 1. Proc. Natl Acad. Sci. USA 109, 8925–8930 (2012).
Fan, P. et al. Overexpressed histone acetyltransferase 1 regulates cancer immunity by increasing programmed death-ligand 1 expression in pancreatic cancer. J. Exp. Clin. Cancer Res. 38, 47 (2019).
Xue, L. et al. RNAi screening identifies HAT1 as a potential drug target in esophageal squamous cell carcinoma. Int. J. Clin. Exp. Pathol. 7, 3898–3907 (2014).
Xia, P. et al. MicroRNA-377 exerts a potent suppressive role in osteosarcoma through the involvement of the histone acetyltransferase 1-mediated Wnt axis. J. Cell. Physiol. 234, 22787–22798 (2019).
Kumar, N. et al. Histone acetyltransferase 1 (HAT1) acetylates hypoxia-inducible factor 2 alpha (HIF2A) to execute hypoxia response. Biochim. Biophys. Acta Gene Regul. Mech. 1866, 194900 (2023).
Lahue, B. R. et al. Diversity & tractability revisited in collaborative small molecule phenotypic screening library design. Bioorg. Med. Chem. 28, 115192 (2020).
Roundtree, I. A. et al. YTHDC1 mediates nuclear export of N6-methyladenosine methylated mRNAs. eLife 6, e31311 (2017).
Xiao, W. et al. Nuclear m6A reader YTHDC1 regulates mRNA splicing. Mol. Cell 61, 507–519 (2016).
Sheng, Y. et al. A critical role of nuclear m6A reader YTHDC1 in leukemogenesis by regulating MCM complex–mediated DNA replication. Blood 138, 2838–2852 (2021).
Bubeck, S. & Sellke, M. A universal law of robustness via isoperimetry. J. ACM 70, 1–18 (2023).
Nakkiran, P. et al. Deep double descent: where bigger models and more data hurt. J. Stat. Mech.: Theory Exp. 2021, 124003 (2021).
Schulman, J. et al. Proximal policy optimization algorithms. Preprint at https://doi.org/10.48550/arXiv.1707.06347 (2017).
Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).
Sutton, R. S. et al. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (1999).
Haarnoja, T. et al. Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning 80, 1861–1870 (2018).
Jing, B. et al. Learning from protein structure with geometric vector perceptrons. Preprint at https://doi.org/10.48550/arXiv.2009.01411 (2020).
Aykent, S. & Xia, T. GBPNet: universal geometric representation learning on protein structures. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining https://doi.org/10.1145/3534678.3539441 (2022).
Deng, C. et al. Vector neurons: a general framework for SO(3)-equivariant networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision https://openaccess.thecvf.com/content/ICCV2021/html/Deng_Vector_Neurons_A_General_Framework_for_SO3-Equivariant_Networks_ICCV_2021_paper.html (2021).
He, K. et al. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition https://openaccess.thecvf.com/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html (2016).
Gasteiger, J. et al. Fast and uncertainty-aware directional message passing for non-equilibrium molecules. Preprint at https://doi.org/10.48550/arXiv.2011.14115 (2020).
Yu, D. & Seltzer, M. L. Improved bottleneck features using pretrained deep neural networks. In Twelfth Annual Conference of the International Speech Communication Association https://jackyguo624.github.io/img/2020-02-12-bottle-feature-for-asr/Bottleneck-Interspeech2011-pub.pdf (2011).
Ranzato, M. A. et al. Sequence level training with recurrent neural networks. Preprint at https://doi.org/10.48550/arXiv.1511.06732 (2015).
Schmidt, F. J. Generalization in generation: a closer look at exposure bias. Preprint at https://doi.org/10.48550/arXiv.1910.00292 (2019).
Bishop, C. M. Mixture density networks. Technical Report. https://publications.aston.ac.uk/id/eprint/373/ (Aston University, 1994).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Luo, Y., Yan, K. & Ji, S. GraphDF: a discrete flow model for molecular graph generation. In Proceedings of the 38th International Conference on Machine Learning 139, 7192–7203 (2021).
Shi, C. et al. GraphAF: a flow-based autoregressive model for molecular graph generation. Preprint at https://doi.org/10.48550/arXiv.2001.09382 (2020).
You, J. et al. Graph convolutional policy network for goal-directed molecular graph generation. In Advances in Neural Information Processing Systems https://proceedings.neurips.cc/paper_files/paper/2018/hash/d60678e8f2ba9c540798ebbde31177e8-Abstract.html (2018).
Popova, M. et al. MolecularRNN: generating realistic molecular graphs with optimized properties. Preprint at https://doi.org/10.48550/arXiv.1905.13372 (2019).
Irwin, J. J. et al. ZINC20—a free ultralarge-scale chemical database for ligand discovery. J. Chem. Inf. Model. 60, 6065–6073 (2020).
Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. Preprint at https://doi.org/10.48550/arXiv.1412.6980 (2014).
Jiang, Y. et al. PocketFlow is a data-and-knowledge driven structure-based molecular generative model. Zenodo https://doi.org/10.5281/zenodo.10460455 (2024).
Acknowledgements
This work was supported by the National Key R&D Program of China (grant no. 2023YFF1204905, S.Y.); the National Natural Science Foundation of China (grant nos. T2221004, S.Y.; 81930125, S.Y.; and 82273787, L.L.); the New Cornerstone Science Foundation; the Major Project of Guangzhou National Laboratory (grant no. GZNL2024A01005); the 1.3.5 project for disciplines of excellence, West China Hospital, Sichuan University (grant nos. ZYXY21001, S.Y.; ZYGD23006, S.Y.); the Frontiers Medical Center, Tianfu Jincheng Laboratory Foundation (grant no. TFJC2023010009, S.Y.); the Natural Science Foundation of Sichuan Province (grant no. 24NSFSC6411); and the Sichuan University Postdoctoral Interdisciplinary Innovation Fund (grant no. JCXK2227). We also thank the staff of beamlines BL18U1 and BL19U1 at the Shanghai Synchrotron Radiation Facility of the National Facility for Protein Science (Shanghai, China) for their support.
Author information
Contributions
S.Y. conceived and supervised the research and designed the experiments. S.Y. and Y.J. established and validated the DGM. Y.J., H.X. and M.D. performed molecular docking. G.Z., J.Y., Z.X. and Y.W. carried out chemical synthesis. H.Z. and R.Y. performed the bioactivity assays. S.Y., Y.J., G.Z., J.Y., H.Z., L.Z., R.Y. and L.L. analysed the data. S.Y. and Y.J. wrote the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks the anonymous reviewers for their contribution to the peer review of this work.
Extended data
Extended Data Fig. 1 Bond angle distributions of molecules generated by different generative models and molecules in the CrossDocked2020 dataset.
(a) CCC, (b) CC=C, (c) COC, (d) CNC, (e) OC=O, (f) CN=O, (g) CN=C, (h) CSC. Results for molecules generated by PocketFlow, Pocket2Mol, GraphBP, and LiGAN, and molecules in the CrossDocked2020 dataset are shown in yellow, blue, green, red, and cyan, respectively.
Extended Data Fig. 2 Distributions of atom positions for 1000 molecules randomly selected from molecules generated by different DGMs.
Target proteins and their active pockets are displayed as protein surfaces (white). Green points indicate heavy atoms of molecules.
Supplementary information
Supplementary Tables 1–23, Figs. 1–14, Results and Methods.
About this article
Cite this article
Jiang, Y., Zhang, G., You, J. et al. PocketFlow is a data-and-knowledge-driven structure-based molecular generative model. Nat Mach Intell 6, 326–337 (2024). https://doi.org/10.1038/s42256-024-00808-8
DOI: https://doi.org/10.1038/s42256-024-00808-8