Application of variational graph encoders as an effective generalist algorithm in computer-aided drug design

A preprint version of the article is available at bioRxiv.

Abstract

Although there has been considerable progress in molecular property prediction in computer-aided drug design, there is a critical need for fast and accurate models. Many currently available methods specialize in predicting specific properties, so many models must be run side by side, which leads to prohibitively high computational overheads for the typical researcher. Hence, the authors propose a single, unified generalist model exploiting graph convolutional variational encoders that can simultaneously predict multiple properties, such as absorption, distribution, metabolism, excretion and toxicity (ADMET), target-specific docking scores, and drug–drug interactions. Such a method allows for state-of-the-art virtual screening with an acceleration of up to two orders of magnitude. Minimization in the graph variational encoder’s latent space also allows for accelerated development of specific drugs for targets with Pareto optimality principles considered, and has the added advantage of explainability.
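The abstract refers to Pareto optimality when several predicted properties are optimized together. The following is a minimal sketch of non-dominated filtering, not the authors' implementation: the molecule names, the two objectives (a docking score and a toxicity probability, both treated as lower-is-better) and all numeric values are hypothetical.

```python
from typing import Dict, List, Tuple

Scores = Tuple[float, ...]


def dominates(a: Scores, b: Scores) -> bool:
    """True if `a` is no worse than `b` on every objective and strictly
    better on at least one (all objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(candidates: Dict[str, Scores]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, scores in candidates.items()
        if not any(
            dominates(other, scores)
            for other_name, other in candidates.items()
            if other_name != name
        )
    ]


# Hypothetical surrogate predictions: (docking score, toxicity probability).
predicted = {
    "mol_A": (-9.1, 0.40),
    "mol_B": (-8.5, 0.10),
    "mol_C": (-8.0, 0.30),  # worse than mol_B on both objectives
    "mol_D": (-9.1, 0.55),  # same docking score as mol_A, more toxic
}

front = pareto_front(predicted)
print(front)  # ['mol_A', 'mol_B']
```

In this toy set, mol_A and mol_B trade docking score against toxicity and neither dominates the other, so both remain on the front.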

Fig. 1: Molecules are encoded into a graph format, which is then passed through an autoencoder, with intermediate mathematical latent space used for property prediction through surrogate models.
Fig. 2: Variational graph encoder showed high accuracy in deriving fingerprints and other molecular descriptors while maintaining a Gaussian-distributed latent space.
Fig. 3: Surrogate models exploiting the variational graph encoder’s latent space can accurately predict single- and multiclassification problems, and regression problems for common datasets, even if the data is skewed.
Fig. 4: Ligand-based drug discovery is doable with latent space-trained surrogate models, with substantial speedup.
Fig. 5: Desired molecular properties can be engineered with surrogate model optimization, with explainability as to how one molecule is preferred over another in property prediction.
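The captions above describe a pipeline in which an encoder maps a molecule to a Gaussian-distributed latent space, and surrogate models read that latent vector to predict properties. The sketch below illustrates only the generic variational-encoder pieces (reparameterized sampling and the KL term against a standard normal, as in ref. 12) plus a toy logistic surrogate. The encoder outputs, weights and the two-dimensional latent size are hypothetical, and whether the authors' surrogates consume the latent mean or a sample is an assumption here.

```python
import math
import random
from typing import List, Sequence


def reparameterize(mu: Sequence[float], log_var: Sequence[float],
                   rng: random.Random) -> List[float]:
    """Sample z = mu + sigma * eps with eps ~ N(0, 1) (reparameterization trick)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu: Sequence[float], log_var: Sequence[float]) -> float:
    """KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


def surrogate_predict(z: Sequence[float], weights: Sequence[float],
                      bias: float) -> float:
    """Toy logistic surrogate: probability of a binary property from latent z."""
    logit = sum(w * x for w, x in zip(weights, z)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical encoder output for one molecule (a real encoder would be a
# graph convolutional network operating on the molecular graph).
mu = [0.3, -1.2]
log_var = [-0.5, 0.1]

rng = random.Random(0)
z = reparameterize(mu, log_var, rng)      # stochastic latent sample
kl = kl_to_standard_normal(mu, log_var)   # regularizer toward N(0, I)

# Assumption: the surrogate reads the latent mean at inference time.
prob = surrogate_predict(mu, weights=[0.8, -0.4], bias=0.1)
print(z, kl, prob)
```

Because the surrogate only needs a fixed-length vector, one encoder pass per molecule can in principle be reused by many property-specific surrogates, which is the design idea the captions point to.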

Data availability

Data used in this study are all publicly available from various datasets cited. Molecular clusters used to train the model are available at https://doi.org/10.34740/kaggle/dsv/5657232 (~19 GB). The trained model encoder weights are also provided in the GitHub repository.

Code availability

Most of the updated code is available at https://github.com/Chokyotager/NotYetAnotherNightshade (ref. 68).

References

  1. Bohacek, R. S., McMartin, C. & Guida, W. C. The art and practice of structure-based drug design: a molecular modeling perspective. Med. Res. Rev. 16, 3–50 (1996).

  2. Hutchinson, L. & Kirk, R. High drug attrition rates–where are we going wrong? Nat. Rev. Clin. Oncol. 8, 189–190 (2011).

  3. Wouters, O. J., McKee, M. & Luyten, J. Estimated research and development investment needed to bring a new medicine to market, 2009–2018. JAMA 323, 844–853 (2020).

  4. Baig, M. H., Ahmad, K., Rabbani, G., Danishuddin, M. & Choi, I. Computer aided drug design and its application to the development of potential drugs for neurodegenerative disorders. Curr. Neuropharmacol. 16, 740–748 (2018).

  5. Liu, T. et al. Applying high-performance computing in drug discovery and molecular simulation. Natl Sci. Rev. 3, 49–63 (2016).

  6. Sun, D., Gao, W., Hu, H. & Zhou, S. Why 90% of clinical drug development fails and how to improve it? Acta Pharm. Sin. B 12, 3049–3062 (2022).

  7. Tornio, A., Filppula, A. M., Niemi, M. & Backman, J. T. Clinical studies on drug–drug interactions involving metabolism and transport: methodology, pitfalls, and interpretation. Clin. Pharmacol. Ther. 105, 1345–1361 (2019).

  8. Wang, J. Comprehensive assessment of ADMET risks in drug discovery. Curr. Pharm. Des. 15, 2195–2219 (2009).

  9. Kwon, S., Bae, H., Jo, J. & Yoon, S. Comprehensive ensemble in QSAR prediction for drug discovery. BMC Bioinform. 20, 521 (2019).

  10. Wang, J. & Skolnik, S. Recent advances in physicochemical and ADMET profiling in drug discovery. Chem. Biodivers. 6, 1887–1899 (2009).

  11. Wu, F. et al. Computational approaches in preclinical studies on drug discovery and development. Front. Chem. 8, 726 (2020).

  12. Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. Preprint at https://arxiv.org/abs/1312.6114 (2013).

  13. Li, Y. et al. Generative deep learning enables the discovery of a potent and selective RIPK1 inhibitor. Nat. Commun. 13, 6891 (2022).

  14. Yang, L. et al. Transformer-based generative model accelerating the development of novel BRAF Inhibitors. ACS Omega 6, 33864–33873 (2021).

  15. Gomez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 4, 268–276 (2018).

  16. Lee, M. & Min, K. MGCVAE: multi-objective inverse design via molecular graph conditional variational autoencoder. J. Chem. Inf. Model. 62, 2943–2950 (2022).

  17. Simonovsky, M. & Komodakis, N. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Honolulu, HI, USA, 2017).

  18. Richard, A. M. et al. The Tox21 10K compound library: collaborative chemistry advancing toxicology. Chem. Res. Toxicol. 34, 189–216 (2021).

  19. Huang, K. et al. Artificial intelligence foundation for therapeutic science. Nat. Chem. Biol. 18, 1033–1036 (2022).

  20. Gaulton, A. et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 40, D1100–D1107 (2012).

  21. Maia, E. H. B., Assis, L. C., de Oliveira, T. A., da Silva, A. M. & Taranto, A. G. Structure-based virtual screening: from classical to artificial intelligence. Front. Chem. 8, 00343 (2020).

  22. International Classification of Diseases, Eleventh Revision (ICD-11) (World Health Organization, 2019).

  23. Lagunin, A. A., Dearden, J. C., Filimonov, D. A. & Poroikov, V. V. Computer-aided rodent carcinogenicity prediction. Mutat. Res. 586, 138–146 (2005).

  24. Hansen, P. & Bichel, J. Carcinogenic effect of sulfonamides. Acta Radiol. 37, 258–265 (1952).

  25. Littlefield, N. A., Sheldon, W. G., Allen, R. & Gaylor, D. W. Chronic toxicity/carcinogenicity studies of sulphamethazine in Fischer 344/N rats: two-generation exposure. Food Chem. Toxicol. 28, 157–167 (1990).

  26. Masumshah, R., Aghdam, R. & Eslahchi, C. A neural network-based method for polypharmacy side effects prediction. BMC Bioinform. 22, 385 (2021).

  27. Wang, L. et al. Long short-term memory neural network with transfer learning and ensemble learning for remaining useful life prediction. Sensors 22, 5744 (2022).

  28. Wallraven, K. et al. Adapting free energy perturbation simulations for large macrocyclic ligands: how to dissect contributions from direct binding and free ligand flexibility. Chem. Sci. 11, 2269–2276 (2020).

  29. Price, W. N. Big data and black-box medical algorithms. Sci. Transl. Med. 10, aao5333 (2018).

  30. Zeng, X. et al. Deep generative molecular design reshapes drug discovery. Cell Rep. Med. 3, 100794 (2022).

  31. Stumpfe, D., Hu, H. & Bajorath, J. Advances in exploring activity cliffs. J. Comput. Aided Mol. Des. 34, 929–942 (2020).

  32. Musigmann, M. et al. Testing the applicability and performance of Auto ML for potential applications in diagnostic neuroradiology. Sci. Rep. 12, 13648 (2022).

  33. Irwin, J. J. & Shoichet, B. K. ZINC—a free database of commercially available compounds for virtual screening. J. Chem. Inf. Model. 45, 177–182 (2005).

  34. RDKit: Open-source cheminformatics; https://www.rdkit.org

  35. Moriwaki, H., Tian, Y. S., Kawashita, N. & Takagi, T. Mordred: a molecular descriptor calculator. J. Cheminform. 10, 4 (2018).

  36. Platt, J. Probabilistic Outputs For Support Vector Machines and Comparisons to Regularized Likelihood Methods (Univ. Colorado, 1999).

  37. Wang, S. et al. ADMET evaluation in drug discovery. 16. Predicting hERG blockers by combining multiple pharmacophores and machine learning approaches. Mol. Pharm. 13, 2855–2866 (2016).

  38. Veith, H. et al. Comprehensive characterization of cytochrome P450 isozyme selectivity across chemical libraries. Nat. Biotechnol. 27, 1050–1055 (2009).

  39. Carbon-Mangels, M. & Hutter, M. C. Selecting relevant descriptors for classification by Bayesian estimates: a comparison with decision trees and support vector machines approaches for disparate data sets. Mol. Inform. 30, 885–895 (2011).

  40. Cheng, F. et al. admetSAR: a comprehensive source and free tool for assessment of chemical ADMET properties. J. Chem. Inf. Model. 52, 3099–3105 (2012).

  41. Martins, I. F., Teixeira, A. L., Pinheiro, L. & Falcao, A. O. A Bayesian approach to in silico blood–brain barrier penetration modeling. J. Chem. Inf. Model. 52, 1686–1697 (2012).

  42. Xu, C. et al. In silico prediction of chemical Ames mutagenicity. J. Chem. Inf. Model. 52, 2840–2847 (2012).

  43. Hou, T., Wang, J., Zhang, W. & Xu, X. ADME evaluation in drug discovery. 7. Prediction of oral absorption by correlation and classification. J. Chem. Inf. Model. 47, 208–218 (2007).

  44. Xu, Y. et al. Deep learning for drug-induced liver injury. J. Chem. Inf. Model. 55, 2085–2093 (2015).

  45. Alves, V. M. et al. Predicting chemically-induced skin reactions. Part I: QSAR models of skin sensitization and their application to identify potentially hazardous compounds. Toxicol. Appl. Pharmacol. 284, 262–272 (2015).

  46. National Institute of Environmental Health Sciences (NIEHS); the murine local lymph node assay: a test method for assessing the allergic contact dermatitis potential of chemicals/compounds, report now available. Public health service. Fed. Regist. 64, 14006–14007 (1999).

  47. Zhu, H. et al. Quantitative structure–activity relationship modeling of rat acute toxicity by oral exposure. Chem. Res. Toxicol. 22, 1913–1921 (2009).

  48. Lombardo, F. & Jing, Y. In silico prediction of volume of distribution in humans. Extensive data set and the exploration of linear and nonlinear methods coupled with molecular interaction fields descriptors. J. Chem. Inf. Model. 56, 2042–2052 (2016).

  49. Wenlock, M. & Tomkinson, N. Experimental In Vitro DMPK and Physicochemical Data on a Set of Publicly Disclosed Compounds (ChEMBL); https://doi.org/10.6019/CHEMBL3301361

  50. Obach, R. S., Lombardo, F. & Waters, N. J. Trend analysis of a database of intravenous pharmacokinetic parameters in humans for 670 drug compounds. Drug Metab. Dispos. 36, 1385–1405 (2008).

  51. Di, L. et al. Mechanistic insights from comparing intrinsic clearance values between human liver microsomes and hepatocytes to guide drug design. Eur. J. Med. Chem. 57, 441–448 (2012).

  52. Ma, C. Y. et al. Prediction models of human plasma protein binding rate and oral bioavailability derived by using GA-CG-SVM method. J. Pharm. Biomed. Anal. 47, 677–682 (2008).

  53. Wu, Z. et al. MoleculeNet: a benchmark for molecular machine learning. Chem. Sci. 9, 513–530 (2018).

  54. Sorkun, M. C., Khetan, A. & Er, S. AqSolDB, a curated reference set of aqueous solubility and 2D descriptors for a diverse set of compounds. Sci. Data 6, 143 (2019).

  55. Mobley, D. L. & Guthrie, J. P. FreeSolv: a database of experimental and calculated hydration free energies, with input files. J. Comput. Aided Mol. Des. 28, 711–720 (2014).

  56. Touret, F. et al. In vitro screening of a FDA approved chemical library reveals potential inhibitors of SARS-CoV-2 replication. Sci. Rep. 10, 13093 (2020).

  57. Main Protease Structure and XChem Fragment Screen (Diamond, 2020).

  58. Tatonetti, N. P., Ye, P. P., Daneshjou, R. & Altman, R. B. Data-driven prediction of drug effects and interactions. Sci. Transl. Med. 4, 125ra131 (2012).

  59. Ryu, J. Y., Kim, H. U. & Lee, S. Y. Deep learning improves prediction of drug–drug and drug–food interactions. Proc. Natl Acad. Sci. USA 115, E4304–E4311 (2018).

  60. Wishart, D. S. et al. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res. 46, D1074–D1082 (2018).

  61. Ravindranath, P. A., Forli, S., Goodsell, D. S., Olson, A. J. & Sanner, M. F. AutoDockFR: advances in protein–ligand docking with explicitly specified binding site flexibility. PLoS Comput. Biol. 11, e1004586 (2015).

  62. Alhossary, A., Handoko, S. D., Mu, Y. & Kwoh, C. K. Fast, accurate, and reliable molecular docking with QuickVina 2. Bioinformatics 31, 2214–2216 (2015).

  63. McNutt, A. T. et al. GNINA 1.0: molecular docking with deep learning. J. Cheminform. 13, 43 (2021).

  64. Zheng, L. et al. Improving protein–ligand docking and screening accuracies by incorporating a scoring function correction term. Brief. Bioinform. 23, bbac051 (2022).

  65. Shen, C. et al. Boosting protein–ligand binding pose prediction and virtual screening based on residue–atom distance likelihood potential and graph transformer. J. Med. Chem. 65, 10691–10706 (2022).

  66. Wang, Z. et al. A fully differentiable ligand pose optimization framework guided by deep learning and a traditional scoring function. Brief. Bioinform. 24, bbac520 (2022).

  67. Pincus, M. Letter to the editor—a Monte Carlo method for the approximate solution of certain types of constrained optimization problems. Oper. Res. 18, 1225–1228 (1970).

  68. Chokyotager/NotYetAnotherNightshade v.1.1 (Zenodo, 2022); https://doi.org/10.5281/zenodo.7827194

Acknowledgements

We thank T. L. Heng and S. Shikhar for their comments in the initial phase of the work, and T. L. Dawson Jr for his continued support. We would also like to dedicate this work to the memory of Jamie Hinks, a friend, colleague, and co-author of this paper, who sadly passed away in March 2023. This work is supported by the Singapore Ministry of Education (MOE), tier 1 grants RG27/21 and RG97/22 (M.Y.). H.L.Y.I. is also supported by funding from the Agency for Science, Technology and Research (A*STAR), and A*STAR BMRC EDB IAF-PP grants (H17/01/a0/004, Skin Research Institute of Singapore; H18/01a0/016 and H22J1a0040, Asian Skin Microbiome Program). Computations were mainly performed using the resources of the National Supercomputing Centre, Singapore (https://www.nscc.sg) and the HADLEY high-performance computing cluster of SCELSE. SCELSE is funded by Singapore’s National Research Foundation, the Ministry of Education, NTU, and the National University of Singapore (NUS), and is hosted by NTU in partnership with NUS.

Author information

Contributions

H.L.Y.I., R.P. and M.Y. conceptualized the work. H.L.Y.I., R.P., H.H. and M.Y. designed the methodology. H.L.Y.I. and R.P. wrote the software. H.L.Y.I., H.H. and M.Y. validated the work. H.L.Y.I. and W.Z. performed the formal analysis. H.L.Y.I., R.P., H.H., W.Z. and O.X.E. performed investigations. H.L.Y.I., O.X.E. and W.Z. curated the data. H.L.Y.I. and M.Y. wrote the original draft, whereas R.P., H.H., O.X.E., W.Z., J.H., W.Y., L.W., Z.L. and M.Y. reviewed and edited the manuscript. H.L.Y.I. and O.X.E. visualized the work. H.L.Y.I. and M.Y. supervised the work. J.H. and M.Y. obtained resources. M.Y. administered the project and acquired funding.

Corresponding authors

Correspondence to Weifeng Li, Liangzhen Zheng or Yuguang Mu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Shivam Patel and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Supplementary information

Supplementary Information

Supplementary Figs. 1–4 and legends for Supplementary Figs. 1 and 2.

Supplementary Data 1

Scores and comparisons of all surrogate models.

Supplementary Data 2

TWOSIDES polypharmacy labels reclassified using ICD-11 as a reference.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Lam, H.Y.I., Pincket, R., Han, H. et al. Application of variational graph encoders as an effective generalist algorithm in computer-aided drug design. Nat Mach Intell 5, 754–764 (2023). https://doi.org/10.1038/s42256-023-00683-9

  • DOI: https://doi.org/10.1038/s42256-023-00683-9
