
A deep learning framework for drug repurposing via emulating clinical trials on real-world patient data

A preprint version of the article is available at arXiv.


Drug repurposing is an effective strategy to identify new uses for existing drugs, providing the quickest possible transition from bench to bedside. Real-world data, such as electronic health records and insurance claims, provide information on large cohorts of users for many drugs. Here we present an efficient and easily customized framework for generating and testing multiple candidates for drug repurposing using a retrospective analysis of real-world data. Building upon well-established causal inference and deep learning methods, our framework emulates randomized clinical trials for drugs present in a large-scale medical claims database. We demonstrate our framework on a coronary artery disease cohort of millions of patients. We successfully identify drugs and drug combinations that substantially improve coronary artery disease outcomes but have not been indicated for treating coronary artery disease, paving the way for drug repurposing.


Fig. 1: Flowchart of overall drug repurposing framework.
Fig. 2: Illustration of the deep learning model for predicting treatment probability (or propensity score) that we used to correct confounding from time sequence data (including diagnoses dt, prescriptions pt and demographics bt).
Fig. 3: Distribution of estimated ATE of drugs on defined outcomes across the 50 bootstrap samples.
Fig. 4: The SMD values of the top 20 well-balanced covariates.
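Fig. 3 summarizes average treatment effect (ATE) estimates across 50 bootstrap samples. As a rough illustration of how such a distribution can be produced, the Python sketch below computes a normalized inverse-probability-of-treatment-weighted (IPTW) ATE and repeats it over bootstrap resamples. It assumes the propensity scores are already available (in the framework they come from the deep learning model of Fig. 2). The function names, the normalized form of the estimator and the patient-level resampling are illustrative assumptions, not the authors' implementation.

```python
import random


def iptw_ate(treated, outcome, propensity):
    """Normalized IPTW estimate of the ATE.

    treated: 0/1 flags, outcome: 0/1 (or numeric) outcomes,
    propensity: estimated P(treated = 1 | covariates), strictly in (0, 1).
    """
    w1 = [t / p for t, p in zip(treated, propensity)]
    w0 = [(1 - t) / (1 - p) for t, p in zip(treated, propensity)]
    mean1 = sum(w * y for w, y in zip(w1, outcome)) / sum(w1)
    mean0 = sum(w * y for w, y in zip(w0, outcome)) / sum(w0)
    return mean1 - mean0


def bootstrap_ate(treated, outcome, propensity, n_boot=50, seed=0):
    """ATE estimates over n_boot patient-level bootstrap resamples.

    Assumption: propensity scores are kept fixed across resamples;
    a fuller pipeline might refit the propensity model each time.
    """
    n = len(treated)
    if sum(treated) in (0, n):
        raise ValueError("both user and non-user patients are required")
    rng = random.Random(seed)
    estimates = []
    while len(estimates) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        t = [treated[i] for i in idx]
        if sum(t) in (0, n):
            continue  # skip resamples that lost one of the two arms
        y = [outcome[i] for i in idx]
        p = [propensity[i] for i in idx]
        estimates.append(iptw_ate(t, y, p))
    return estimates
```

With a binary adverse outcome, a negative ATE means the outcome is less frequent among users than among non-users after weighting.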

Data availability

The data we use are from MarketScan Commercial Claims and Encounters (CCAE; more than 100 million patients, from 2012 to 2017). Details of the source data structure and a demo of the preprocessed input data are available at the GitHub repository. Access to the MarketScan data analysed in this manuscript is provided by the Ohio State University. The dataset is available from IBM at

Code availability

The source code for this paper can be downloaded from the GitHub repository and from the Zenodo repository at




This work was funded in part by the National Center for Advancing Translational Research of the National Institutes of Health under award number CTSA Grant UL1TR002733. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Author information


P.Z. conceived the project. R.L. and P.Z. developed the method. R.L. conducted the experiments. R.L., L.W. and P.Z. analysed the results. R.L., L.W. and P.Z. wrote the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Ping Zhang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Machine Intelligence thanks Daniel Merk and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 CAD cohort characteristics.

a, The distribution of patients' total time in the database. b, The distribution of patients' time before and after the CAD initiation date. c, The growth of the number of patients developing outcomes after the CAD initiation date. d, The gender distribution by age at the CAD initiation date.

Extended Data Fig. 2 Performance comparison of LSTM-IPTW and LR-IPTW using drug candidate: diltiazem (with known CAD indication).

The three panels on the top are results obtained from LSTM-IPTW, while the panels on the bottom are from LR-IPTW. a,d, The absolute SMD of each covariate in the original data (orange triangles) and in the weighted data (blue circles). b,e, The distribution of estimated propensity scores over user (orange area) and non-user (blue area) cohorts. c,f, The ROC curves for the propensity model (orange), expected value (green) and weighted propensity (blue).
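The absolute standardized mean difference (SMD) compared in panels a and d can be sketched as follows. This minimal Python example computes the absolute SMD of a single covariate between user and non-user cohorts, with or without per-patient weights. The pooled standard deviation (square root of the mean of the two group variances) and the simple weighted variance are common conventions assumed here, and `abs_smd` is a hypothetical helper name, not a function from the paper's code.

```python
import math


def _weighted_mean_var(values, weights):
    """Weighted mean and (population-style) weighted variance."""
    total = sum(weights)
    mean = sum(w * x for w, x in zip(weights, values)) / total
    var = sum(w * (x - mean) ** 2 for w, x in zip(weights, values)) / total
    return mean, var


def abs_smd(covariate, treated, weights=None):
    """Absolute SMD of one covariate between users and non-users.

    weights=None gives the unweighted (original data) SMD; passing IPTW
    weights gives the SMD in the weighted data.
    """
    if weights is None:
        weights = [1.0] * len(covariate)
    x1 = [x for x, t in zip(covariate, treated) if t == 1]
    w1 = [w for w, t in zip(weights, treated) if t == 1]
    x0 = [x for x, t in zip(covariate, treated) if t == 0]
    w0 = [w for w, t in zip(weights, treated) if t == 0]
    m1, v1 = _weighted_mean_var(x1, w1)
    m0, v0 = _weighted_mean_var(x0, w0)
    pooled_sd = math.sqrt((v1 + v0) / 2)
    if pooled_sd == 0:
        return 0.0  # covariate is constant in both groups
    return abs(m1 - m0) / pooled_sd
```

A covariate whose SMD drops towards zero after weighting is better balanced between the two cohorts.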

Extended Data Fig. 3 Distribution of estimated ATE of drug classes on defined outcomes across the 50 bootstrap samples.

All drug classes shown satisfy two conditions: an adjusted p-value of less than 0.05 and a post unbalanced ratio of less than 2%. Within each boxplot, the central line denotes the median, and the bottom and top edges denote the 25th (Q1) and 75th (Q3) percentiles, respectively. The whiskers extend to 1.5 times the interquartile range.
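The two selection conditions in this caption can be illustrated with a short Python sketch. It assumes the adjusted p-values come from the Benjamini-Hochberg procedure and that the post unbalanced ratio is supplied as a fraction of covariates that remain unbalanced after weighting. Both functions and the candidate names in the usage note are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_candidates(names, pvalues, unbalanced_ratios,
                      alpha=0.05, max_unbalanced=0.02):
    """Keep candidates with adjusted p < alpha and unbalanced ratio < 2%."""
    adjusted = benjamini_hochberg(pvalues)
    return [
        name
        for name, p_adj, ratio in zip(names, adjusted, unbalanced_ratios)
        if p_adj < alpha and ratio < max_unbalanced
    ]
```

For example, `select_candidates(["class_a", "class_b"], [0.001, 0.2], [0.0, 0.0])` keeps only `class_a`, because the second candidate's adjusted p-value exceeds 0.05.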

Extended Data Fig. 4 The list of significant drug classes.

The drug classes are denoted using ATC code and corresponding names.

Extended Data Fig. 5 The estimated treatment effects for CAD over balanced and statistically significant drug combinations.

The drug combinations are ranked by the estimated ATE values.

Extended Data Fig. 6 Performance comparison of the proposed method and three pre-clinical methods, evaluated by Precision@K.

The values of K are selected from {6, 9}.
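Precision@K is the fraction of the top K ranked candidates that are considered correct. A minimal Python sketch follows, assuming each method outputs a ranked list of drug names and that a reference set of accepted drugs is available. The drug identifiers in the usage note are placeholders, not results from the paper.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked items that appear in the relevant set."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = ranked[:k]
    hits = sum(1 for item in top_k if item in relevant)
    # Dividing by k (not len(top_k)) penalizes lists shorter than k.
    return hits / k
```

For instance, with `ranked = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9"]` and `relevant = {"d1", "d3", "d6", "d8"}`, `precision_at_k(ranked, relevant, 6)` is 0.5 because three of the first six candidates are in the reference set.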

Extended Data Fig. 7 Additional repurposing candidates retrieved under different threshold settings.

The adjusted p-value threshold is changed to 0.15, and the post unbalanced ratio threshold remains the same as in the previous setting (less than 2%).

Extended Data Fig. 8 The definition of user and non-user cohorts.

Index date refers to the first prescription of the trial's drug (user cohort) or the alternative drug (non-user cohort). The time period before the index date is the baseline period, and the time after the index date is the follow-up period. The patient covariates are collected during the baseline period, and the treatment effects are evaluated during the follow-up period.
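The cohort definition above can be sketched in Python as follows. The example assumes each patient record is a list of `(date, code)` tuples. It takes the index date as the earliest prescription of a given drug and splits other events into baseline and follow-up periods. The helper names and the choice to leave same-day events out of both periods are assumptions made for illustration.

```python
from datetime import date


def index_date_for(prescriptions, drug):
    """Earliest prescription date of `drug`, or None if never prescribed."""
    dates = [day for day, name in prescriptions if name == drug]
    return min(dates, default=None)


def split_by_index_date(events, index_date):
    """Split (date, code) events into baseline and follow-up lists.

    Events strictly before the index date are baseline (covariates);
    events strictly after it are follow-up (outcomes). Events on the
    index date itself are excluded here, which is an assumption.
    """
    baseline = [e for e in events if e[0] < index_date]
    follow_up = [e for e in events if e[0] > index_date]
    return baseline, follow_up
```

A patient in the user cohort would be processed by first calling `index_date_for` with the trial's drug, then passing the diagnosis events and the resulting date to `split_by_index_date`.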

Supplementary information

Supplementary Information

Supplementary Tables 1–6 and Figs. 1 and 2.


About this article


Cite this article

Liu, R., Wei, L. & Zhang, P. A deep learning framework for drug repurposing via emulating clinical trials on real-world patient data. Nat Mach Intell 3, 68–75 (2021).
