Exploiting machine learning for end-to-end drug discovery and development


A variety of machine learning methods, such as naive Bayesian classifiers, support vector machines and, more recently, deep neural networks, are demonstrating their utility for drug discovery and development. These methods leverage the generally larger datasets created from high-throughput screening and allow bioactivities for targets and molecular properties to be predicted with increased accuracy. We have only just begun to exploit the potential of these techniques, but they may already be fundamentally changing the research process for identifying new molecules and/or repurposing old drugs. Applying such machine learning models in an integrated, end-to-end (E2E) fashion is broadly relevant and has considerable implications for developing future therapies and their targeting.


Fig. 1: Implementing end-to-end (E2E) machine learning models at all stages of drug discovery and development illustrating some of the key areas that could be modelled.
Fig. 2: Demonstrating iterative drug discovery using machine learning.





In memory of Rebecca J. Williams. J. Freundlich, R. J. G. Arnold, P. Madrid, J. Lage de Siqueira-Neto, A. Williams, A. Tropsha, A. Gerlach, J. Gerlach, D. Chipman, A. Davidow and M. Hupcey are kindly acknowledged for discussions and some of the collaborations described herein. S.E. acknowledges funding to Collaborations Pharmaceuticals, Inc., from NIGMS R44 GM122196-02A1, NINDS 1R43NS107079-01, NINDS 3R43NS107079-01S1, NCATS 1UH2TR002084-01 and FY2018 UNC Research Opportunities Initiative (ROI) award. Research reported in this publication was supported by the National Institute of Neurological Disorders and Stroke of the National Institutes of Health under award number R43NS107079. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Author information

Correspondence to Sean Ekins.

Ethics declarations

Competing interests

S.E. is founder and CEO, A.C.P., K.M.Z., T.L. and J.J.K. are employees, and D.P.R. and A.M.C. are consultants of Collaborations Pharmaceuticals, Inc. A.M.C. is also the founder and owner of Molecular Materials Informatics, Inc. A.J.H. has no conflicts of interest.



About this article


Cite this article

Ekins, S., Puhl, A.C., Zorn, K.M. et al. Exploiting machine learning for end-to-end drug discovery and development. Nat. Mater. 18, 435–441 (2019). https://doi.org/10.1038/s41563-019-0338-z
