
Generalizability of an acute kidney injury prediction model across health systems

Abstract

Delays in the identification of acute kidney injury in hospitalized patients are a major barrier to the development of effective interventions for treatment. A recent study described a series of models that outperformed previously published models in predicting acute kidney injury up to 48 h in advance, including a recurrent neural network that achieved state-of-the-art performance (area under the curve 0.92) and a gradient-boosted decision tree model that was close behind (area under the curve 0.89). Because these models were trained in a population of US veterans that was 94% male, questions have arisen about their generalizability to other health systems where the populations are more sex balanced. In this study, we aimed to evaluate how well an acute kidney injury model trained in a population of US veterans performs in females within the Veterans Affairs (VA) system, and the extent to which its performance generalizes to a large academic hospital setting. We found that the model performed worse in predicting acute kidney injury in females in both populations, with miscalibration in lower stages of acute kidney injury and worse discrimination (a lower area under the curve) in higher stages of acute kidney injury. We demonstrate that, while this discrepancy in performance can be largely corrected in non-veterans by updating the original model using data from a sex-balanced academic hospital cohort, the worse model performance persists in veterans. Our study sheds light on the importance of characterizing the generalizability of artificial intelligence studies, and on the complexity of discrepancies in model performance in subgroups that cannot be explained simply on the basis of sample size.

Fig. 1: Representation of the EHR data for the proposed model.

Data availability

This study used data from the national Veterans Health Administration’s Corporate Data Warehouse and the University of Michigan. Analyses were performed in secure locations within the VA and UM information systems, respectively. The data in this study are not publicly available because they contain protected health information, and restrictions apply to their use. A sample of processed data from six patients has been made available online (ref. 19).

Researchers interested in obtaining deidentified Michigan Medicine patient data should contact PHDataHelp@umich.edu for guidance on the regulatory and compliance requirements that must be fulfilled to access the Precision Health data resources. More details about the data and the access process are available at https://precisionhealth.umich.edu/. Source data are provided with this paper.

Code availability

Data preparation code, an example of prepared data, the original and extended models trained in this study, and code to generate predictions from the provided data are available online (ref. 19). Data preparation requires the gpmodels R package (ref. 28).

References

  1. Hoste, E. A. J. et al. Global epidemiology and outcomes of acute kidney injury. Nat. Rev. Nephrol. 14, 607–625 (2018).

  2. Wilson, F. P. et al. Automated, electronic alerts for acute kidney injury: a single-blind, parallel-group, randomised controlled trial. Lancet 385, 1966–1974 (2015).

  3. Koyner, J. L., Adhikari, R., Edelson, D. P. & Churpek, M. M. Development of a multicenter ward-based AKI prediction model. Clin. J. Am. Soc. Nephrol. 11, 1935–1943 (2016).

  4. Koyner, J. L., Carey, K. A., Edelson, D. P. & Churpek, M. M. The development of a machine learning inpatient acute kidney injury prediction model. Crit. Care Med. 46, 1070–1077 (2018).

  5. Peng, J.-C. et al. Development of mortality prediction model in the elderly hospitalized AKI patients. Sci. Rep. 11, 15157 (2021).

  6. Haines, R. W. et al. Acute kidney injury in trauma patients admitted to critical care: development and validation of a diagnostic prediction model. Sci. Rep. 8, 3665 (2018).

  7. Motwani, S. S. et al. Development and validation of a risk prediction model for acute kidney injury after the first course of cisplatin. J. Clin. Oncol. 36, 682 (2018).

  8. Tomašev, N. et al. A clinically applicable approach to continuous prediction of future acute kidney injury. Nature 572, 116–119 (2019).

  9. McCradden, M. D., Stephenson, E. A. & Anderson, J. A. Clinical research underlies ethical integration of healthcare artificial intelligence. Nat. Med. 26, 1325–1326 (2020).

  10. Tomašev, N. et al. Use of deep learning to develop continuous-risk models for adverse event prediction from electronic health records. Nat. Protoc. 16, 2765–2787 (2021).

  11. Google. EHR modeling framework. GitHub https://github.com/google/ehr-predictions (2021).

  12. Haibe-Kains, B. et al. Transparency and reproducibility in artificial intelligence. Nature 586, E14–E16 (2020).

  13. McDermott, M. B. A. et al. Reproducibility in machine learning for health research: still a ways to go. Sci. Transl. Med. 13, eabb1655 (2021).

  14. Stupple, A., Singerman, D. & Celi, L. A. The reproducibility crisis in the age of digital medicine. npj Digit. Med. 2, 2 (2019).

  15. Carter, R. E., Attia, Z. I., Lopez-Jimenez, F. & Friedman, P. A. Pragmatic considerations for fostering reproducible research in artificial intelligence. npj Digit. Med. 2, 42 (2019).

  16. Singh, K., Beam, A. L. & Nallamothu, B. K. Machine learning in clinical journals: moving from inscrutable to informative. Circ. Cardiovasc. Qual. Outcomes 13, e007491 (2020).

  17. Robbins, R. et al. AI systems are worse at diagnosing disease when training data is skewed by sex. STAT https://www.statnews.com/2020/05/25/ai-systems-training-data-sex-bias/ (2020).

  18. Larrazabal, A. J., Nieto, N., Peterson, V., Milone, D. H. & Ferrante, E. Gender imbalance in medical imaging datasets produces biased classifiers for computer-aided diagnosis. Proc. Natl Acad. Sci. USA 117, 12592–12594 (2020).

  19. Singh, K. ML4LHS/va-aki-model: initial release. Zenodo https://doi.org/10.5281/zenodo.7129945 (2022).

  20. World Health Organization. International Classification of Diseases (ICD). https://www.who.int/standards/classifications/classification-of-diseases (2022).

  21. Sundararajan, V. et al. New ICD-10 version of the Charlson comorbidity index predicted in-hospital mortality. J. Clin. Epidemiol. 57, 1288–1294 (2004).

  22. Khwaja, A. KDIGO clinical practice guidelines for acute kidney injury. Nephron Clin. Pract. 120, c179–c184 (2012).

  23. Hand, D. J. & Till, R. J. A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach. Learn. 45, 171–186 (2001).

  24. DeLong, E. R., DeLong, D. M. & Clarke-Pearson, D. L. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44, 837–845 (1988).

  25. Morris, N. tboot: Tilted bootstrap. R package version 0.2.1 (2020).

  26. Friedman, J. H. Greedy function approximation: a gradient boosting machine. Ann. Stat. 29, 1189–1232 (2001).

  27. R Core Team. R: A language and environment for statistical computing (R Foundation for Statistical Computing, 2022). https://www.R-project.org/

  28. Singh, K. & Meyer, S. R. ML4LHS/gpmodels: initial release. Zenodo https://doi.org/10.5281/zenodo.7158501 (2022).

  29. LeDell, E. h2o: R interface for the ‘H2O’ scalable machine learning platform. R package version 3.36.0.2 (2022).

  30. Pafka, S. GBM performance. GitHub https://github.com/szilard/GBM-perf (2021).

Acknowledgements

This work was supported in part by the Veterans Health Association Innovation Program contract number 36C10B18C2766 (received by X.Z., V.S., H.Y., D.S., R.S., M.H. and K.S.) and through NIDDK R01DK133226 (received by M.M. and K.S.).

Author information

Contributions

V.S., R.S., S.C., M.H. and K.S. conceived and designed the study. J.C., X.Z., H.Y., D.S., M.H. and K.S. acquired, analysed and interpreted data. J.C., X.Z. and K.S. participated in the creation of the software used in this work. J.C. drafted the manuscript. X.Z., V.S., H.Y., D.S., R.S., S.C., M.M., G.N.N., M.H. and K.S. substantively revised the manuscript. All authors have approved the submitted version and have agreed both to be personally accountable for the author’s own contributions and to ensure that questions related to the accuracy or integrity of any part of the work, even ones in which the author was not personally involved, are appropriately investigated and resolved, and the resolution documented in the literature.

Corresponding author

Correspondence to Karandeep Singh.

Ethics declarations

Competing interests

K.S.’s institution receives grant funding from Teva Pharmaceuticals and Blue Cross Blue Shield of Michigan for unrelated work, and K.S. serves on an advisory board for Flatiron Health. M.M. has received research grants from the US National Institutes of Health (NHLBI K01HL141701). G.N.N. is also supported by R01DK108803, U01HG007278, U01HG009610 and 1U01DK116100. G.N.N. reports personal income and equity and stock options from Renalytix and pulseData. G.N.N. is a scientific cofounder of Renalytix, Verici Dx, Pensieve Health, Nexus Health Connect and Data2Wisdom and owns equity in these companies. G.N.N. has received personal income from Siemens Healthineers, Variant Bio, AstraZeneca, Reata, BioVie, Daiichi Sankyo, Cambridge Health Consulting, Qiming Capital and GLG Consulting in the past three years. M.H. receives research grant funding from Astute Medical Inc. and Spectral Medical Inc., and serves as a consultant for Wolters-Kluwer Inc., Potrero Inc. and CardioSounds Inc. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Shalmali Joshi and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Model performance (AUC) of the original VA model at each VA hospital.

Model performance of the original model at each VA hospital in the test set, along with characteristics of each VA hospital. (A) Model performance, measured as area under the curve (AUC) with 95% confidence interval (CI), of the original VA model for predicting AKI-1+ at each VA hospital. The center dot represents the AUC when the original model is applied to the hospital, and the 95% CI is calculated using DeLong’s method (ref. 24). (B) Number of predictions (after excluding those with AKI-1+ at baseline) at each VA hospital. (C) Hospitalization-level AKI-1+ incidence in the test set (after excluding those with AKI-1+ at baseline) at each VA hospital. Five VA hospitals are not shown because of small cohort sizes (<30 patients).
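As a rough illustration of how a per-hospital AUC and its DeLong-style 95% CI could be computed, here is a minimal Python sketch. It is not the authors' code (the tooling cited in this paper is R-based), and the function name `auc_with_delong_ci`, the normal-approximation interval and the toy data are assumptions made for this example. It uses the variance estimator for a single AUC described by DeLong et al. (ref. 24) and compares every positive with every negative, so it is only practical for small samples.

```python
import math


def auc_with_delong_ci(scores, labels, z=1.959964):
    """Return (auc, lower, upper) for binary labels (1 = event, 0 = no event).

    Hypothetical helper for illustration: the variance follows the
    structural-component estimator of DeLong et al. for a single AUC,
    and the interval is a normal approximation clipped to [0, 1].
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    m, n = len(pos), len(neg)
    if m < 2 or n < 2:
        raise ValueError("need at least two events and two non-events")

    def psi(p, q):
        # Pairwise kernel: 1 if the event scores higher, 0.5 on ties.
        return 1.0 if p > q else 0.5 if p == q else 0.0

    # Structural components: one per event (v10) and per non-event (v01).
    v10 = [sum(psi(p, q) for q in neg) / n for p in pos]
    v01 = [sum(psi(p, q) for p in pos) / m for q in neg]

    auc = sum(v10) / m
    s10 = sum((v - auc) ** 2 for v in v10) / (m - 1)
    s01 = sum((v - auc) ** 2 for v in v01) / (n - 1)
    se = math.sqrt(s10 / m + s01 / n)

    return auc, max(0.0, auc - z * se), min(1.0, auc + z * se)


# Toy data for a single (hypothetical) hospital.
toy_scores = [0.10, 0.40, 0.35, 0.80]
toy_labels = [0, 0, 1, 1]
toy_auc, toy_lo, toy_hi = auc_with_delong_ci(toy_scores, toy_labels)
```

Applying such a function separately to each hospital's predictions would yield one point estimate and interval per hospital, which is the kind of summary shown in panel A. With very small cohorts the interval becomes wide, which is consistent with the caption's exclusion of hospitals with fewer than 30 patients.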

Extended Data Fig. 2 Calibration of the original VA model in the (a) VA test set and (b) UM test set.

Calibration of the original model in the (a) VA test set and (b) UM test set. The predicted probabilities (grouped into deciles) are plotted against the observed probabilities with 95% confidence intervals. The diagonal line represents ideal calibration. Model calibration is examined for all patients (red), females only (green) and males only (blue).
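A decile-based calibration summary of this kind could be produced along the following lines. This is a minimal Python sketch, not the authors' implementation: the function name `calibration_by_decile`, the equal-count binning of sorted predictions and the Wilson score interval for the observed event rate are all assumptions, since the caption does not state how the bins or the 95% confidence intervals were computed.

```python
import math


def calibration_by_decile(probs, outcomes, n_bins=10, z=1.959964):
    """Group predictions into equal-count bins and summarize each bin.

    Returns a list of (mean_predicted, observed_rate, ci_low, ci_high).
    The Wilson score interval is an illustrative choice, not necessarily
    the interval used in the paper.
    """
    pairs = sorted(zip(probs, outcomes))
    n = len(pairs)
    rows = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue  # fewer observations than bins
        k = len(chunk)
        mean_pred = sum(p for p, _ in chunk) / k
        obs = sum(y for _, y in chunk) / k

        # Wilson score interval for the observed proportion in this bin.
        denom = 1 + z * z / k
        centre = (obs + z * z / (2 * k)) / denom
        half = z * math.sqrt(obs * (1 - obs) / k + z * z / (4 * k * k)) / denom
        rows.append((mean_pred, obs, max(0.0, centre - half), min(1.0, centre + half)))
    return rows


# Toy example: 100 predictions, with events only in the upper half.
toy_probs = [i / 100 for i in range(100)]
toy_outcomes = [1 if i >= 50 else 0 for i in range(100)]
toy_rows = calibration_by_decile(toy_probs, toy_outcomes)
```

Plotting `mean_predicted` against `observed_rate` for each bin, with the diagonal as a reference, gives a calibration curve; running the function separately on female and male patients would give the sex-specific curves described in the caption.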

Extended Data Fig. 3 Calibration of the extended VA model at UM.

Calibration of the extended model in the UM test set. The predicted probabilities (grouped into deciles) are plotted against the observed probabilities with 95% confidence intervals (CI). The diagonal line represents ideal calibration. Model calibration is examined for all patients (red), females only (green) and males only (blue).

Extended Data Fig. 4 Predictor importance plot of the original and extended VA model.

The 20 most important predictors of the original VA model (top) and the extended VA model (bottom). Predictors are ranked by relative importance, expressed as a percentage.

Extended Data Table 1 AKI incidence in the VA and UM cohorts, by AKI stage and sex
Extended Data Table 2 Model performance (AUC) of the extended VA models at VA, by outcome stage and sex
Extended Data Table 3 Model performance (AUC) of the original and extended VA models at VA, by outcome stage and race
Extended Data Table 4 Model performance (AUC) of the original and extended VA models at UM, by outcome stage and race

Supplementary information

Source data

Source Data Extended Data Fig. 1

Statistical source data for Extended Data Figure 1

Source Data Extended Data Fig. 2

Statistical source data for Extended Data Figure 2

Source Data Extended Data Fig. 3

Statistical source data for Extended Data Figure 3

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Cao, J., Zhang, X., Shahinian, V. et al. Generalizability of an acute kidney injury prediction model across health systems. Nat Mach Intell 4, 1121–1129 (2022). https://doi.org/10.1038/s42256-022-00563-8

  • DOI: https://doi.org/10.1038/s42256-022-00563-8
