Delays in the identification of acute kidney injury in hospitalized patients are a major barrier to the development of effective interventions for treatment. A recent study described a series of models that outperformed previously published models in predicting acute kidney injury up to 48 h in advance, including a recurrent neural network that achieved state-of-the-art performance (area under the curve 0.92) and a gradient-boosted decision tree model that was close behind (area under the curve 0.89). Because these models were trained in a population of US veterans that was 94% male, questions have arisen about their generalizability to other health systems where the populations are more sex balanced. In this study, we aimed to evaluate how well an acute kidney injury model trained in a population of US veterans performs in females in the Veterans Affairs health system and the extent to which its performance generalizes to a large academic hospital setting. We found that the model performed worse in predicting acute kidney injury in females in both populations, with miscalibration in lower stages of acute kidney injury and worse discrimination (a lower area under the curve) in higher stages of acute kidney injury. We demonstrate that, while this discrepancy in performance can be largely corrected in non-veterans by updating the original model using data from a sex-balanced academic hospital cohort, the worse model performance persists in veterans. Our study sheds light on the importance of characterizing the generalizability of artificial intelligence studies, and on the complexity of discrepancies in model performance in subgroups that cannot be explained simply on the basis of sample size.
This study used data from the national Veterans Health Administration’s Corporate Data Warehouse and the University of Michigan. Analyses were performed in secure locations within the VA and UM information systems, respectively. The data in this study are not publicly available because they contain protected health information, and restrictions apply to their use. A sample of processed data from six patients has been made available online (ref. 19).
Researchers interested in obtaining deidentified Michigan Medicine patient data should contact PHDataHelp@umich.edu to obtain guidance on which regulatory and compliance requirements need to be fulfilled to obtain access to the Precision Health data resources. More details about the data and the access process are available at https://precisionhealth.umich.edu/. Source data are provided with this paper.
Hoste, E. A. J. et al. Global epidemiology and outcomes of acute kidney injury. Nat. Rev. Nephrol. 14, 607–625 (2018).
Wilson, F. P. et al. Automated, electronic alerts for acute kidney injury: a single-blind, parallel-group, randomised controlled trial. Lancet 385, 1966–1974 (2015).
Koyner, J. L., Adhikari, R., Edelson, D. P. & Churpek, M. M. Development of a multicenter ward-based AKI prediction model. Clin. J. Am. Soc. Nephrol. 11, 1935–1943 (2016).
Koyner, J. L., Carey, K. A., Edelson, D. P. & Churpek, M. M. The development of a machine learning inpatient acute kidney injury prediction model. Crit. Care Med. 46, 1070–1077 (2018).
Peng, J.-C. et al. Development of mortality prediction model in the elderly hospitalized AKI patients. Sci. Rep. 11, 15157 (2021).
Haines, R. W. et al. Acute kidney injury in trauma patients admitted to critical care: development and validation of a diagnostic prediction model. Sci. Rep. 8, 3665 (2018).
Motwani, S. S. et al. Development and validation of a risk prediction model for acute kidney injury after the first course of cisplatin. J. Clin. Oncol. 36, 682 (2018).
Tomašev, N. et al. A clinically applicable approach to continuous prediction of future acute kidney injury. Nature 572, 116–119 (2019).
McCradden, M. D., Stephenson, E. A. & Anderson, J. A. Clinical research underlies ethical integration of healthcare artificial intelligence. Nat. Med. 26, 1325–1326 (2020).
Tomašev, N. et al. Use of deep learning to develop continuous-risk models for adverse event prediction from electronic health records. Nat. Protoc. 16, 2765–2787 (2021).
Google. EHR modeling framework. GitHub https://github.com/google/ehr-predictions (2021).
Haibe-Kains, B. et al. Transparency and reproducibility in artificial intelligence. Nature 586, E14–E16 (2020).
McDermott, M. B. A. et al. Reproducibility in machine learning for health research: still a ways to go. Sci. Transl. Med. 13, eabb1655 (2021).
Stupple, A., Singerman, D. & Celi, L. A. The reproducibility crisis in the age of digital medicine. npj Digit. Med. 2, 2 (2019).
Carter, R. E., Attia, Z. I., Lopez-Jimenez, F. & Friedman, P. A. Pragmatic considerations for fostering reproducible research in artificial intelligence. npj Digit. Med. 2, 42 (2019).
Singh, K., Beam, A. L. & Nallamothu, B. K. Machine learning in clinical journals: moving from inscrutable to informative. Circ. Cardiovasc. Qual. Outcomes 13, e007491 (2020).
Robbins, R. et al. AI systems are worse at diagnosing disease when training data is skewed by sex. STAT https://www.statnews.com/2020/05/25/ai-systems-training-data-sex-bias/ (2020).
Larrazabal, A. J., Nieto, N., Peterson, V., Milone, D. H. & Ferrante, E. Gender imbalance in medical imaging datasets produces biased classifiers for computer-aided diagnosis. Proc. Natl Acad. Sci. USA 117, 12592–12594 (2020).
Singh, K. ML4LHS/va-aki-model: initial release. Zenodo https://doi.org/10.5281/zenodo.7129945 (2022).
World Health Organization International Classification of Diseases (ICD) https://www.who.int/standards/classifications/classification-of-diseases (2022).
Sundararajan, V. et al. New ICD-10 version of the Charlson comorbidity index predicted in-hospital mortality. J. Clin. Epidemiol. 57, 1288–1294 (2004).
Khwaja, A. KDIGO clinical practice guidelines for acute kidney injury. Nephron Clin. Pract. 120, c179–c184 (2012).
Hand, D. J. & Till, R. J. A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach. Learn. 45, 171–186 (2001).
DeLong, E. R., DeLong, D. M. & Clarke-Pearson, D. L. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44, 837–845 (1988).
Morris, N. tboot: Tilted bootstrap. R package version 0.2.1 (2020).
Friedman, J. H. Greedy function approximation: a gradient boosting machine. Ann. Stat. 29, 1189–1232 (2001).
R Core Team. R: A language and environment for statistical computing. (R Foundation for Statistical Computing, 2022) https://www.R-project.org/
Singh, K. & Meyer, S. R. ML4LHS/gpmodels: initial release. Zenodo https://doi.org/10.5281/zenodo.7158501 (2022).
LeDell, E. h2o: R interface for the ‘H2O’ scalable machine learning platform. R package version 22.214.171.124 (2022).
Pafka, S. GBM performance. GitHub https://github.com/szilard/GBM-perf (2021).
This work was supported in part by the Veterans Health Administration Innovation Program contract number 36C10B18C2766 (received by X.Z., V.S., H.Y., D.S., R.S., M.H. and K.S.) and through NIDDK R01DK133226 (received by M.M. and K.S.).
K.S.’s institution receives grant funding from Teva Pharmaceuticals and Blue Cross Blue Shield of Michigan for unrelated work, and K.S. serves on an advisory board for Flatiron Health. M.M. has received research grants from the US National Institutes of Health (NHLBI K01HL141701). G.N.N. is also supported by R01DK108803, U01HG007278, U01HG009610 and 1U01DK116100. G.N.N. reports personal income and equity and stock options from Renalytix and pulseData. G.N.N. is a scientific cofounder of Renalytix, Verici Dx, Pensieve Health, Nexus Health Connect and Data2Wisdom and owns equity in these companies. G.N.N. has received personal income from Siemens Healthineers, Variant Bio, AstraZeneca, Reata, BioVie, Daiichi Sankyo, Cambridge Health Consulting, Qiming Capital and GLG Consulting in the past three years. M.H. receives research grant funding from Astute Medical Inc. and Spectral Medical Inc., and serves as a consultant for Wolters-Kluwer Inc., Potrero Inc. and CardioSounds Inc. The remaining authors declare no competing interests.
Peer review information
Nature Machine Intelligence thanks Shalmali Joshi and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Model performance of the original model at each VA hospital in the test set, along with characteristics of each VA hospital. A. Area under the curve (AUC) with 95% CI of the original VA model for predicting AKI-1+ at each VA hospital. The center dot represents the AUC when the original model is applied to the hospital, and the 95% CI is calculated using DeLong’s method (ref. 24). B. Number of predictions (after excluding those with AKI-1+ at baseline) at each VA hospital. C. Hospitalization-level AKI-1+ incidence in the test set (after excluding those with AKI-1+ at baseline) at each VA hospital. Five VA hospitals are not shown because of small cohort sizes (<30 patients).
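As a rough illustration of how an AUC and a 95% CI based on DeLong's variance estimate can be obtained for one hospital's predictions, the Python sketch below computes both from binary outcomes and predicted risks. It is not the authors' code: the function name `delong_auc_ci`, the normal-approximation interval clipped to [0, 1] and the toy data are assumptions made for illustration, and the pairwise comparison matrix it builds is only practical for modest sample sizes.

```python
import numpy as np


def delong_auc_ci(y_true, y_score, z=1.96):
    """Return (auc, ci_low, ci_high) using DeLong's variance estimate.

    Hypothetical helper for illustration; assumes a binary outcome and a
    normal approximation for the confidence interval.
    """
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true]    # scores of patients with the outcome
    neg = y_score[~y_true]   # scores of patients without the outcome
    m, n = len(pos), len(neg)
    if m < 2 or n < 2:
        raise ValueError("need at least two positives and two negatives")

    # psi[i, j] = 1 if positive i outranks negative j, 0.5 on ties, else 0
    diff = pos[:, None] - neg[None, :]
    psi = (diff > 0).astype(float) + 0.5 * (diff == 0)

    auc = psi.mean()
    v10 = psi.mean(axis=1)   # structural components for positives
    v01 = psi.mean(axis=0)   # structural components for negatives
    var = v10.var(ddof=1) / m + v01.var(ddof=1) / n

    half_width = z * np.sqrt(var)
    return auc, max(0.0, auc - half_width), min(1.0, auc + half_width)


# Toy data (made up): two patients with the outcome, two without.
auc, ci_low, ci_high = delong_auc_ci([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(f"AUC = {auc:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")  # AUC is 0.75
```

Calling the helper once per hospital (or once per sex within a cohort) would give the kind of per-group estimates the caption describes; with very small groups the interval becomes wide, which is consistent with omitting hospitals that have fewer than 30 patients.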
The calibration of the original model on the a) VA test set and b) UM test set. The predicted probabilities (deciles) are plotted against the observed probabilities with 95% confidence intervals. The diagonal line demonstrates the ideal calibration. The model calibration is examined for all patients (red), females only (green), and males only (blue).
The calibration of the extended model in the UM test set. The predicted probabilities (deciles) are plotted against the observed probabilities with 95% CI. The diagonal line demonstrates the ideal calibration. The model calibration is examined for all patients (red), females only (green), and males only (blue).
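A decile-based calibration curve of the kind described in these captions can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function name `decile_calibration` and the synthetic data are hypothetical, and the Wald (normal-approximation) interval is an assumption, since the captions do not state how the 95% CIs were computed.

```python
import numpy as np


def decile_calibration(y_true, y_prob, n_bins=10, z=1.96):
    """Group predictions into equal-sized bins by predicted risk and
    compare mean predicted risk with the observed event rate per bin.

    Hypothetical helper; the CI is a Wald interval (an assumption).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    order = np.argsort(y_prob, kind="stable")  # lowest to highest risk
    rows = []
    for idx in np.array_split(order, n_bins):
        if len(idx) == 0:
            continue
        n = len(idx)
        observed = y_true[idx].mean()
        half_width = z * np.sqrt(observed * (1.0 - observed) / n)
        rows.append({
            "n": n,
            "mean_predicted": y_prob[idx].mean(),
            "observed": observed,
            "ci_low": max(0.0, observed - half_width),
            "ci_high": min(1.0, observed + half_width),
        })
    return rows


# Synthetic, perfectly calibrated example with a made-up sex indicator.
rng = np.random.default_rng(0)
prob = rng.uniform(0.0, 1.0, size=2000)
outcome = rng.binomial(1, prob)
is_female = rng.random(2000) < 0.5

curves = {
    "all": decile_calibration(outcome, prob),
    "female": decile_calibration(outcome[is_female], prob[is_female]),
    "male": decile_calibration(outcome[~is_female], prob[~is_female]),
}
```

Plotting `mean_predicted` against `observed` for each group, with the diagonal as the reference, gives one curve per subgroup; points that sit consistently above or below the diagonal indicate under- or over-prediction in that subgroup.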
The 20 most important predictors of the original VA model (top) and the extended VA model (bottom). Predictors are ranked by relative importance, expressed as a percentage.
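One common way to express variable importance as a percentage is to divide each predictor's raw importance by the total across all predictors. The sketch below assumes that convention, which the caption does not confirm (some software instead scales to the most important predictor); the function name `top_predictors` and the predictor names are hypothetical.

```python
def top_predictors(raw_importance, k=20):
    """Rank predictors by share of total importance, in percent.

    Assumes 'percentage' means share of the summed raw importance,
    which is an illustrative convention rather than the paper's stated one.
    """
    total = sum(raw_importance.values())
    if total <= 0:
        raise ValueError("total importance must be positive")
    percent = {name: 100.0 * value / total
               for name, value in raw_importance.items()}
    ranked = sorted(percent.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Made-up raw importances for three hypothetical predictors.
example = {"creatinine_last": 6.0, "age": 3.0, "bun_last": 1.0}
for name, pct in top_predictors(example, k=2):
    print(f"{name}: {pct:.1f}%")
```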
Cite this article
Cao, J., Zhang, X., Shahinian, V. et al. Generalizability of an acute kidney injury prediction model across health systems. Nat Mach Intell 4, 1121–1129 (2022). https://doi.org/10.1038/s42256-022-00563-8