Abstract
Despite widespread adoption of electronic health records (EHRs), most hospitals are not ready to implement data science research in their clinical pipelines. Here, we develop MEDomics, a continuously learning infrastructure through which multimodal health data are systematically organized and data quality is assessed, with the goal of applying artificial intelligence for individual prognosis. Using this framework, which currently comprises thousands of individuals with cancer and millions of data points recorded over a decade, we demonstrate its prognostic utility in oncology. As proof of concept, we report an analysis using this infrastructure that identified the Framingham risk score as robustly associated with mortality among individuals with early-stage and advanced-stage cancer, a potentially actionable finding from a real-world cohort. Finally, we show how natural language processing (NLP) of medical notes could be used to continuously update estimates of prognosis as a given individual’s disease course unfolds.
Data availability
The datasets that support the findings of this article are not publicly available due to reasonable privacy and security concerns. The underlying EHR data are not easily redistributable to researchers other than those covered by the UCSF IRB approval for this study. However, access to deidentified data will be possible under a material transfer agreement (MTA) handled by the primary institution (UCSF), and the datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request. Test datasets for reusing the code are available from the Open Science Framework (OSF) repository at https://osf.io/ytge5/. Source data are provided with this paper.
Code availability
Code is directly available from the following GitHub repository: https://github.com/medomics/medomics_NatCancer. Alternatively, code is available via the Open Science Framework (OSF) repository at https://osf.io/ytge5/.
Acknowledgements
M.V. acknowledges funding from the Canada CIFAR AI Chairs Program. J.S. acknowledges funding from the Canadian Institutes of Health Research under foundation grant CIHR FDN-143257 and from the Natural Sciences and Engineering Research Council under grant NSERC RGPIN-2019-06746. P.L., H.C.W. and A.C. acknowledge financial support from ERC advanced grant (ERC-ADG-2015 number 694812 - Hypoximmuno), the European Union’s Horizon 2020 research and innovation programme under grant agreement MSCA-ITN-PREDICT number 766276, CHAIMELEON number 952172 and EuCanImage number 952103.
Author information
Contributions
O.M., M.V., S.B., J.B.G., T.U., H.C.W., A.Z., A.C., J.E.V-M., G.V., W.C., J.C.H., S.S.Y., T.D.S., S.L., J.S., C.P. and P.L. conceived and designed the overall study. O.M., S.B. and C.P. obtained data access and IRB approval. O.M., J.B.G., T.U., S.B. and P.L. created the MEDomics tables. O.M., W.C., S.B., S.S.Y., J.B.G. and T.U. performed the selection of individuals and data curation. O.M. and M.V. managed the project. O.M. and M.V. maintained the website. O.M., J.B.G. and T.U. were involved in developing the methodology, and O.M., J.B.G., T.U., M.V., A.Z., H.C.W. and A.C. were involved in developing the software. O.M., J.B.G., T.U. and W.C. designed and constructed the predictive modeling. O.M., J.B.G., T.U., S.B., W.C., J.C.H., S.S.Y., T.D.S., C.P. and P.L. interpreted the data. O.M., J.B.G., T.U. and W.C. wrote the original draft. O.M., M.V., S.B., J.B.G., T.U., H.C.W., A.Z., A.C., J.E.V-M., G.V., W.C., J.C.H., S.S.Y., T.D.S., S.L., J.S., C.P. and P.L. reviewed and edited the manuscript. O.M. and M.V. acquired funding. All five founding institutions of the MEDomics consortium provided funding.
Ethics declarations
Competing interests
O.M. reports, within and outside the submitted work, grants/sponsored research agreements from Varian Medical. He received an advisor/presenter fee and/or reimbursement of travel costs/external grant writing fee and/or in-kind manpower contribution from Varian. O.M. has shares in the company Oncoradiomics. H.C.W. has shares in the company Oncoradiomics. J.E.V-M. reports funding from GE Healthcare, outside the scope of this submitted work. S.S.Y. reports grants/funding from Genentech, Merck, Bristol-Myers Squibb, BioMimetix and personal fees (UpToDate, Springer), outside the scope of this submitted work. J.S. is founding advisor of Gray Oncology Solutions Inc. and has commercialization projects of inventions unrelated to this work with the companies Lifeline Software Inc. and Sun Nuclear Corporation. P.L. reports, within and outside the submitted work, grants/sponsored research agreements from Radiomics SA, ptTheragnostic/DNAmito, Health Innovation Ventures. He received an advisor/presenter fee and/or reimbursement of travel costs/consultancy fee and/or in-kind manpower contribution from Radiomics SA, BHV, Merck, Varian, Elekta, ptTheragnostic, BMS and Convert pharmaceuticals. P.L. has minority shares in the companies Radiomics SA, Convert pharmaceuticals, Comunicare Solutions and LivingMed Biotech, and he is co-inventor of two issued patents with royalties on radiomics (PCT/NL2014/050248, PCT/NL2014/050728) licensed to Radiomics SA, one issued patent on mtDNA (PCT/EP2014/059089) licensed to ptTheragnostic/DNAmito, one non-issued patent on LSRT (PCT/P126537PC00) licensed to Varian Medical, three non-patented inventions (software) licensed to ptTheragnostic/DNAmito, Radiomics SA and Health Innovation Ventures, and three non-issued, non-licensed patents on deep and handcrafted radiomics (US P125078US00, PCT/NL/2020/050794, number N2028271). He confirms that none of the above entities or funding was involved in the preparation of this paper.
Additional information
Peer review information: Nature Cancer thanks the anonymous reviewers for their contribution to the peer review of this work.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Kaplan-Meier survival plots on breast and lung cancer patients stratified by zodiac sign.
Comparison of Kaplan-Meier survival plots for breast and lung cancer patients stratified by astrological zodiac sign, used as a negative control (breast, n = 4273 and lung, n = 2402).
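For readers unfamiliar with the Kaplan-Meier estimator behind these plots, the following is a minimal standard-library sketch of the product-limit calculation. It is not the authors' code; the function name `kaplan_meier` and the toy follow-up values are assumptions made for illustration only.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct death time.

    durations: follow-up time per patient (any consistent unit)
    events:    1 if death was observed at that time, 0 if censored
    """
    deaths = Counter(t for t, e in zip(durations, events) if e)
    exits = Counter(durations)  # every patient leaves the risk set at their own time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving time t
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]  # remove deaths and censored patients after time t
    return curve


# Toy data (hypothetical, not from the study): months of follow-up and event flags
durations = [5, 8, 8, 12, 15, 20]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(durations, events)
for time, prob in curve:
    print(f"t={time:>2}  S(t)={prob:.3f}")
```

In a negative-control analysis such as stratification by zodiac sign, one curve would be estimated per stratum, and the curves would be expected to overlap.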
Extended Data Fig. 2 Kaplan-Meier survival plots and nomograms on breast and lung cancer patients.
a, Comparison of Kaplan-Meier survival plots for breast and lung cancer patients stratified by stage. b, Breast and lung cancer survival nomograms. Nomograms were built using penalized Cox regressions to determine the probability of breast (5 years) and lung (2 years) overall survival.
Extended Data Fig. 3 Statistical learning models for prediction of binary survival for breast and lung cancer patients (data split based on date of diagnosis).
Machine learning models created for the binary prediction of patient overall survival using patient selection and data split method 2 (Supplementary Table 2). Censored patients or patients who were alive with a follow-up shorter than the prediction time point were removed from both training and holdout testing data. a, Comparison of the performance of statistical learning algorithms (least absolute shrinkage and selection operator, LASSO; gradient boosting machines, GBM; classification and regression tree, CART; support vector machine, SVM; random forest, RF), measured as the area under the receiver operating characteristic curve for cross-validation and independent testing, for the binary prediction of breast (5 years) and lung (2 years) patient survival. Inspection of variable importance from out-of-bag penalty using the random forest classifier. b, Comparison of classifier performance with receiver operating characteristic curves and area under the curve (AUC) scores. c, Kaplan-Meier survival plots obtained from four quartile strata using the random forest classifier on the holdout test sets for breast (n = 568) and lung (n = 672) cancer. Survival curves were compared by using the log-rank test.
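The exclusion rule and the AUC metric described in this caption can be sketched as follows. This is one plausible reading of the rule, not the authors' implementation: the field names (`followup_years`, `died`, `risk`), the helper names and the toy patients are all hypothetical, and the AUC is computed with the pairwise (Mann-Whitney) formulation rather than any particular library.

```python
def binary_survival_labels(patients, horizon_years):
    """Label patients for binary survival prediction at a fixed horizon.

    Returns (patient, label) pairs with label 1 = died within the horizon,
    0 = known to be alive at the horizon. Patients who were alive with a
    follow-up shorter than the horizon are dropped, because their status
    at the horizon is unknown.
    """
    labelled = []
    for p in patients:
        if p["died"] and p["followup_years"] <= horizon_years:
            labelled.append((p, 1))
        elif p["followup_years"] >= horizon_years:
            labelled.append((p, 0))  # alive at horizon (still alive or died later)
        # otherwise: insufficient follow-up, excluded
    return labelled


def roc_auc(labels, scores):
    """AUC as the probability that a positive case outranks a negative one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy cohort (hypothetical values); "risk" stands in for a model's predicted risk
patients = [
    {"id": "a", "followup_years": 1.0, "died": True, "risk": 0.9},
    {"id": "b", "followup_years": 6.0, "died": False, "risk": 0.2},
    {"id": "c", "followup_years": 3.0, "died": False, "risk": 0.5},  # excluded
    {"id": "d", "followup_years": 7.0, "died": True, "risk": 0.4},   # died after horizon
    {"id": "e", "followup_years": 4.0, "died": True, "risk": 0.3},
]
labelled = binary_survival_labels(patients, horizon_years=5)
labels = [y for _, y in labelled]
scores = [p["risk"] for p, _ in labelled]
auc = roc_auc(labels, scores)
print(labels, auc)
```

Dropping patients with insufficient follow-up keeps the labels unambiguous, at the cost of discarding some of the cohort.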
Extended Data Fig. 4 Kaplan-Meier survival plots on breast and lung cancer patients (data split based on date of diagnosis).
Comparison of Kaplan-Meier survival plots for breast and lung cancer patients stratified by stage. The log-rank test was used to compare survival curves of groups (breast, n = 586 and lung, n = 672).
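As an illustration of the log-rank comparison mentioned above, here is a minimal two-group sketch using only the standard library. It assumes the standard hypergeometric-variance form of the test with one degree of freedom; the function name `logrank_test` and the toy data are hypothetical, and comparing more than two stage groups would need the multi-group generalization, which is not shown.

```python
import math


def logrank_test(durations_a, events_a, durations_b, events_b):
    """Two-group log-rank test; returns (chi_square, p_value), 1 degree of freedom."""
    sample = [(t, e, 0) for t, e in zip(durations_a, events_a)]
    sample += [(t, e, 1) for t, e in zip(durations_b, events_b)]
    event_times = sorted({t for t, e, _ in sample if e})

    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n = sum(1 for u, _, _ in sample if u >= t)                 # at risk, both groups
        n_a = sum(1 for u, _, g in sample if u >= t and g == 0)    # at risk, group A
        d = sum(1 for u, e, _ in sample if u == t and e)           # deaths at t, both groups
        d_a = sum(1 for u, e, g in sample if u == t and e and g == 0)
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0  # no information to separate the groups
    chi_square = observed_minus_expected ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value


# Toy data (hypothetical): identical groups, then clearly separated groups
chi_same, p_same = logrank_test([2, 4, 6], [1, 1, 0], [2, 4, 6], [1, 1, 0])
chi_sep, p_sep = logrank_test([1, 2, 3, 4], [1, 1, 1, 1], [10, 11, 12, 13], [1, 1, 1, 1])
print(chi_same, p_same)
print(round(chi_sep, 3), round(p_sep, 4))
```

Identical groups give a statistic of zero, while well-separated groups give a small p-value.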
Supplementary information
Supplementary Tables 1–12.
Source data
Source Data Figs. 1–6: statistical source data.
Source Data Extended Data Figs. 1–4: statistical source data.
About this article
Cite this article
Morin, O., Vallières, M., Braunstein, S. et al. An artificial intelligence framework integrating longitudinal electronic health records with real-world data enables continuous pan-cancer prognostication. Nat Cancer 2, 709–722 (2021). https://doi.org/10.1038/s43018-021-00236-2
DOI: https://doi.org/10.1038/s43018-021-00236-2
This article is cited by
- An overview and a roadmap for artificial intelligence in hematology and oncology. Journal of Cancer Research and Clinical Oncology (2023)
- Profile of the multicenter cohort of the German Cancer Consortium’s Clinical Communication Platform. European Journal of Epidemiology (2023)
- Introducing AI to the molecular tumor board: one direction toward the establishment of precision medicine using large-scale cancer clinical and biological information. Experimental Hematology & Oncology (2022)
- A platform for continuous learning in oncology. Nature Cancer (2021)