Drug discovery and development pipelines are long, complex and depend on numerous factors. Machine learning (ML) approaches provide a set of tools that can improve discovery and decision making for well-specified questions with abundant, high-quality data. Opportunities to apply ML occur in all stages of drug discovery. Examples include target validation, identification of prognostic biomarkers and analysis of digital pathology data in clinical trials. Applications have ranged in context and methodology, with some approaches yielding accurate predictions and insights. The challenges of applying ML lie primarily with the lack of interpretability and repeatability of ML-generated results, which may limit their application. In all areas, systematic and comprehensive high-dimensional data still need to be generated. With ongoing efforts to tackle these issues, as well as increasing awareness of the factors needed to validate ML approaches, the application of ML can promote data-driven decision making and has the potential to speed up the process and reduce failure rates in drug discovery and development.


Additional information

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

DeepChem: https://www.deepchem.io/

DREAM Challenges: http://dreamchallenges.org/

TensorFlow: https://www.tensorflow.org/




The authors thank E. Birney and E. Papa for helpful comments, M. Segler for contributing to the small-molecule optimization subsection and A. Janowczyk for providing the pathology images in Figure 4.

Author information


  1. European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK

    • Jessica Vamathevan
    • Dominic Clark
    • Edgardo Ferran
  2. Technical University of Dortmund, Dortmund, Germany

    • Paul Czodrowski
  3. Open Targets and European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK

    • Ian Dunham
    • Michaela Spitzer
  4. Bristol-Myers Squibb, Princeton, NJ, USA

    • George Lee
  5. Takeda Pharmaceuticals International Co., Cambridge, MA, USA

    • Bin Li
  6. Case Western Reserve University, Cleveland, OH, USA

    • Anant Madabhushi
  7. Louis Stokes Cleveland Veterans Affairs Medical Center, Cleveland, OH, USA

    • Anant Madabhushi
  8. EMD Serono R&D Institute, Billerica, MA, USA

    • Parantu Shah
  9. Pfizer Worldwide Research and Development, Cambridge, MA, USA

    • Shanrong Zhao


Competing interests

The authors declare no competing interests.

Corresponding author

Correspondence to Jessica Vamathevan.


Graphical processing units

(GPUs). Processors that are designed to accelerate the rendering of graphics and can handle tens of thousands of operations per cycle.

Central processing units

(CPUs). Processors that are designed to solve every computational problem in a general fashion and can handle tens of operations per cycle. The cache and memory are designed to be optimal for any general programming problem.

Tensor processing units

(TPUs). Co-processors manufactured by Google that are designed to accelerate deep learning tasks developed using TensorFlow (a programming framework) and can handle up to 128,000 operations per cycle.

Support vector machine (SVM) classifier

A method that performs classification tasks by constructing separating hyperplanes (lines, in two dimensions) to distinguish between objects with different class memberships in a multi-dimensional space.
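
As a rough illustration of this definition, the sketch below trains a toy linear classifier of this kind in pure Python by sub-gradient descent on the regularised hinge loss. The function names, the six data points and the hyperparameters (lam, lr, epochs) are assumptions chosen for illustration and are not taken from the article; practical work would normally rely on an established ML library.

```python
# Toy linear SVM trained by sub-gradient descent on the regularised hinge loss.
# All names, data and hyperparameters are illustrative assumptions.

def train_linear_svm(points, labels, lam=0.01, lr=0.1, epochs=200):
    """Return (weights, bias) describing the boundary w.x + b = 0."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):  # y must be +1 or -1
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The regularisation term shrinks the weights on every step.
            w = [wi * (1 - lr * lam) for wi in w]
            if margin < 1:  # point is misclassified or inside the margin
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    """Assign a class from the side of the boundary on which x falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Two well-separated clusters in a two-dimensional feature space.
points = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0),
          (-1.0, -1.0), (-2.0, -1.5), (-1.5, -2.0)]
labels = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(points, labels)
predictions = [predict(w, b, x) for x in points]
```

Here the learned weights and bias define the separating boundary, and predict only checks which side of it a new point lies on.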


CLIP-seq

Ultraviolet crosslinking immunoprecipitation (CLIP) followed by RNA sequencing to identify all RNA species bound by a protein of interest. This method can be used to map RNA-protein binding sites or RNA modification sites on a genome-wide scale.

Heuristic method

A function that calculates the approximate cost of a problem (or ranks alternatives).

Chemical fingerprint

A concept used in chemical informatics to compare molecules with each other. The structure of a molecule is encoded in a series of binary digits (bits) that represent the presence or absence of particular substructures in the molecule.
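
The idea can be sketched in a few lines of Python. The pattern list below is hypothetical, and matching substructures as plain substrings of a SMILES string is a deliberate simplification, since real cheminformatics toolkits match on the molecular graph. The Tanimoto coefficient used for the comparison is a common choice for bit fingerprints, but it is an assumption here rather than something the definition specifies.

```python
# Toy chemical fingerprint: one bit per substructure pattern.
# PATTERNS and the substring matching are illustrative assumptions.

PATTERNS = ["=O", "N", "O", "c1ccccc1", "Cl"]  # hypothetical substructure keys

def fingerprint(smiles):
    """Return a list of 0/1 bits, one per pattern, for a SMILES string."""
    return [1 if pattern in smiles else 0 for pattern in PATTERNS]

def tanimoto(fp_a, fp_b):
    """Bits set in both fingerprints divided by bits set in either."""
    both = sum(a & b for a, b in zip(fp_a, fp_b))
    either = sum(a | b for a, b in zip(fp_a, fp_b))
    return both / either if either else 0.0

acetic_acid = fingerprint("CC(=O)O")
acetamide = fingerprint("CC(=O)N")
benzene = fingerprint("c1ccccc1")

similar = tanimoto(acetic_acid, acetamide)   # share the "=O" and "O" bits
dissimilar = tanimoto(acetic_acid, benzene)  # share no bits
```

Two molecules with many substructures in common score close to 1, and molecules with none in common score 0.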

Simplified molecular input line entry system (SMILES)

A line notation for entering and representing molecules and reactions; for example, carbon dioxide is represented as O=C=O.
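
To show how such a line notation can be read by a program, the following minimal sketch counts heavy atoms in simple SMILES strings. It assumes bracket-free input restricted to the common organic-subset atoms; the regular expression and the function name are illustrative, and bracket atoms, charges and isotopes are deliberately not handled.

```python
import re
from collections import Counter

# Two-letter halogens are matched first, then single-letter atoms
# (upper case = aliphatic, lower case = aromatic). Bond symbols, ring-closure
# digits and parentheses are simply skipped by findall.
ATOM = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")

def heavy_atom_counts(smiles):
    """Count heavy atoms by element in a simple, bracket-free SMILES string."""
    return Counter(token.capitalize() for token in ATOM.findall(smiles))

co2 = heavy_atom_counts("O=C=O")                   # carbon dioxide
ethanol = heavy_atom_counts("CCO")
chlorobenzene = heavy_atom_counts("Clc1ccccc1")
```

Hydrogen atoms are implicit in this notation, so only heavy atoms appear in the counts.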
