Applications of machine learning in drug discovery and development


Drug discovery and development pipelines are long, complex and depend on numerous factors. Machine learning (ML) approaches provide a set of tools that can improve discovery and decision making for well-specified questions with abundant, high-quality data. Opportunities to apply ML occur in all stages of drug discovery. Examples include target validation, identification of prognostic biomarkers and analysis of digital pathology data in clinical trials. Applications have ranged in context and methodology, with some approaches yielding accurate predictions and insights. The challenges of applying ML lie primarily with the lack of interpretability and repeatability of ML-generated results, which may limit their application. In all areas, systematic and comprehensive high-dimensional data still need to be generated. With ongoing efforts to tackle these issues, as well as increasing awareness of the factors needed to validate ML approaches, the application of ML can promote data-driven decision making and has the potential to speed up the process and reduce failure rates in drug discovery and development.

Fig. 1: Machine learning applications in the drug discovery pipeline and their required data characteristics.
Fig. 2: Machine learning tools and their drug discovery applications.
Fig. 3: The challenges of compound structure representation in machine learning models.
Fig. 4: Utilizing predictive biomarkers to support drug discovery and development.
Fig. 5: Computational pathology tasks for machine learning applications.



The authors thank E. Birney and E. Papa for helpful comments, M. Segler for contributing to the small-molecule optimization subsection and A. Janowczyk for providing the pathology images in Figure 4.

Author information



Corresponding author

Correspondence to Jessica Vamathevan.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Related links


DREAM Challenges:



Graphical processing units

(GPUs). Processors designed to accelerate the rendering of graphics and that can handle tens of thousands of operations per cycle.

Central processing units

(CPUs). Processors designed to solve every computational problem in a general fashion and that can handle tens of operations per cycle. The cache and memory are designed to be optimal for any general programming problem.

Tensor processing units

(TPUs). Co-processors manufactured by Google that are designed to accelerate deep learning tasks developed using TensorFlow (a programming framework) and can handle up to 128,000 operations per cycle.

Support vector machine (SVM) classifier

A method that performs classification tasks by constructing separating boundaries (lines in two dimensions, hyperplanes in higher dimensions) that distinguish between objects with different class memberships in a multi-dimensional space.
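
As a rough illustration of this definition, the sketch below fits a linear classifier by sub-gradient descent on the hinge loss, which is one simple way to train a linear SVM. The toy points, the function names (`train_linear_svm`, `predict`) and the hyperparameter values are assumptions made for illustration and do not come from the article. In two dimensions the learned boundary is a line; with more features it is a hyperplane.

```python
from typing import List, Sequence, Tuple


def train_linear_svm(
    points: Sequence[Sequence[float]],
    labels: Sequence[int],
    epochs: int = 200,
    learning_rate: float = 0.1,
    regularization: float = 0.01,
) -> Tuple[List[float], float]:
    """Fit weights and bias of a linear boundary; labels must be +1 or -1."""
    weights = [0.0] * len(points[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            # L2 shrinkage keeps the weights small (this favours a wide margin).
            weights = [w - learning_rate * regularization * w for w in weights]
            # Hinge loss: only points inside the margin push the boundary.
            if y * score < 1:
                weights = [w + learning_rate * y * xi for w, xi in zip(weights, x)]
                bias += learning_rate * y
    return weights, bias


def predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> int:
    """Return +1 or -1 depending on which side of the boundary x falls."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score >= 0 else -1


# Illustrative, linearly separable toy data (not from the article).
points = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0), (0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
labels = [1, 1, 1, -1, -1, -1]
weights, bias = train_linear_svm(points, labels)
predictions = [predict(weights, bias, p) for p in points]
```

Production SVM implementations typically solve the underlying optimization problem more carefully and support non-linear kernels; this sketch covers only the linear case.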


CLIP-seq

Ultraviolet crosslinking immunoprecipitation (CLIP) followed by RNA sequencing to identify all RNA species bound by a protein of interest. This method can be used to map RNA–protein binding sites or RNA modification sites on a genome-wide scale.

Heuristic method

A function that calculates the approximate cost of a problem (or ranks alternatives).

Chemical fingerprint

A concept used in chemical informatics to compare molecules with each other. The structure of a molecule is encoded in a series of binary digits (bits) that represent the presence or absence of particular substructures in the molecule.
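
A minimal sketch of the idea, assuming a tiny hand-written substructure vocabulary: each molecule becomes a list of bits, and two molecules are compared with the Tanimoto coefficient (shared bits divided by bits set in either). The Tanimoto coefficient is a common choice for comparing fingerprints but is not named in the definition above. The key list, the function names and the hand-assigned substructures for the two example molecules are all illustrative assumptions; real fingerprints are computed by cheminformatics software from the molecular graph and use far more bits.

```python
from typing import Iterable, List, Sequence

# Hypothetical substructure vocabulary; one bit per key, in this order.
SUBSTRUCTURE_KEYS: List[str] = [
    "carboxylic_acid",
    "hydroxyl",
    "benzene_ring",
    "amine",
    "ester",
    "carbonyl",
]


def encode_fingerprint(present: Iterable[str]) -> List[int]:
    """Return 1 for each key that is present in the molecule, else 0."""
    present_set = set(present)
    return [1 if key in present_set else 0 for key in SUBSTRUCTURE_KEYS]


def tanimoto(fp_a: Sequence[int], fp_b: Sequence[int]) -> float:
    """Bits set in both fingerprints divided by bits set in either."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(a & b for a, b in zip(fp_a, fp_b))
    either = sum(a | b for a, b in zip(fp_a, fp_b))
    return both / either if either else 0.0


# Substructures assigned by hand for illustration only.
aspirin = encode_fingerprint(["carboxylic_acid", "benzene_ring", "ester", "carbonyl"])
salicylic_acid = encode_fingerprint(
    ["carboxylic_acid", "benzene_ring", "hydroxyl", "carbonyl"]
)
similarity = tanimoto(aspirin, salicylic_acid)  # 3 shared bits / 5 set bits = 0.6
```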

Simplified molecular input line entry system (SMILES)

A line notation for entering and representing molecules and reactions; for example, carbon dioxide is represented as O=C=O.
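
To show how such a string can be read by a program, the sketch below counts the heavy atoms in a SMILES string such as the carbon dioxide example. It is a deliberately limited illustration, not a SMILES parser: the function name `count_atoms` and the regular expression are assumptions, only the common unbracketed atoms are recognized, bracket atoms are rejected, and implicit hydrogens are not counted.

```python
import re
from collections import Counter

# Two-letter symbols come first so that "Cl" is not read as "C" followed by "l".
# Lower-case letters are aromatic atoms (for example the carbons of benzene).
_ATOM_PATTERN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")


def count_atoms(smiles: str) -> Counter:
    """Count heavy atoms in a simple SMILES string without bracket atoms."""
    if "[" in smiles or "]" in smiles:
        raise ValueError("bracket atoms are not handled in this sketch")
    compact = smiles.replace(" ", "")  # tolerate spaced-out input
    return Counter(token.capitalize() for token in _ATOM_PATTERN.findall(compact))


carbon_dioxide = count_atoms("O=C=O")   # two oxygens, one carbon
ethanol = count_atoms("CCO")            # two carbons, one oxygen
benzene = count_atoms("c1ccccc1")       # six aromatic carbons
```

Bond symbols (`=`) and ring-closure digits are simply skipped here; a full parser would use them to rebuild the molecular graph.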

About this article

Cite this article

Vamathevan, J., Clark, D., Czodrowski, P. et al. Applications of machine learning in drug discovery and development. Nat Rev Drug Discov 18, 463–477 (2019).
