arising from Yan et al. Nature Machine Intelligence https://doi.org/10.1038/s42256-020-0180-7 (2020)
Typical deployment of an artificial intelligence or machine learning algorithm begins with model training, followed by testing under varying circumstances to enable model adaptation and to verify the model’s performance. Despite the urgency that the coronavirus disease 2019 (COVID-19) pandemic has created for deploying predictive models, proper validation is critical before any claims of utility or generalizability are made [1]. In their recent publication, Yan et al. [2] claimed to have developed a novel, simplified mortality prediction model that can be applied in clinical practice. We validated their proposed decision tree against a large database of patients with COVID-19 [3].
Yan et al. attempted to identify a subset of biomarkers, from blood samples taken throughout a patient’s hospital course, that could be used to predict mortality. The authors framed the problem as a classification task in which the inputs were the results of the last set of laboratory tests taken from patients with variable severity and the outcomes were either discharge or death. Following algorithm generation, the authors assessed the feature importance of each parameter. This yielded a set of three key features: lactate dehydrogenase (LDH), lymphocyte proportion and high-sensitivity C-reactive protein (hs-CRP). The simplicity of the three-branch model based on these three values is enticing for rapid, wide-scale adoption in clinical practice.
Although the authors reported success in identifying markers of imminent mortality in their dataset, they used a small validation sample and did not externally validate their model [3]. We have attempted, under various use cases, to validate their model using data from patients with COVID-19 treated in Northwell Health hospitals. We also attempted to recalibrate the primary branch of the proposed tree-based model on our data to account for differences between our populations and health systems [1].
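For readers who wish to reproduce this kind of validation, the three-branch rule can be written as a short function. The following is a minimal sketch, not the original authors' code. The LDH cut-off of 365 U/l is the one quoted in this article; the hs-CRP (41.2 mg/l) and lymphocyte (14.7%) cut-offs and the order of the branches are assumptions recalled from Yan et al. and should be checked against that paper. The function and parameter names are hypothetical.

```python
def predict_death(ldh_u_per_l, hs_crp_mg_per_l, lymphocyte_pct,
                  ldh_cutoff=365.0,     # U/l, quoted in this article
                  crp_cutoff=41.2,      # mg/l, ASSUMED (verify in Yan et al.)
                  lymph_cutoff=14.7):   # %, ASSUMED (verify in Yan et al.)
    """Return True if the three-branch rule predicts death, else False."""
    # Branch 1: high LDH leads directly to a "death" leaf.
    if ldh_u_per_l > ldh_cutoff:
        return True
    # Branch 2: with low LDH, low hs-CRP leads to a "survival" leaf.
    if hs_crp_mg_per_l < crp_cutoff:
        return False
    # Branch 3: otherwise the lymphocyte proportion decides.
    return lymphocyte_pct <= lymph_cutoff
```

Keeping the cut-offs as keyword arguments makes it straightforward to re-run the same rule with recalibrated thresholds.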
Model performance as a triage tool
The model is purported to be applicable to any blood sample, far ahead of the primary clinical outcome, which suggests its use as an admission triage tool. We tested the performance of their model on their validation data at the first time point at which all three blood tests had been collected, similar to how physicians would risk-stratify new patients (Fig. 1a). This critical time point was not evaluated in the original paper, an important omission given that the authors suggest their model can be used to prioritize care. Predictive clinical models are used prospectively, when the time of outcome is unknown; this model instead uses data selected retrospectively on the basis of the known date of outcome. Because the clinician cannot know when discharge or death will occur, the fact that the performance of the proposed model improves closer to the time of outcome is not clinically useful information. It is therefore important to show that the model performs well enough at admission to justify the changes to clinical care that the authors suggest.
Our analysis was performed in Python and R, using code available at https://github.com/siabolourani/YIN_reply. The precision for predicting mortality was 0.48, meaning that over half of the patients whom the model predicted would die actually survived. The accuracy was 0.88 and the F1 score was 0.41. In interpreting these results, one must account for the class imbalance in the data: their validation set had a survival rate of 0.88, meaning that the null model of always predicting survival had an accuracy similar to that of the proposed full model.
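The effect of class imbalance on accuracy can be illustrated with a few lines of Python. This is a sketch on toy labels that are not the study data; the function name `binary_metrics` and the label encoding (1 = death, 0 = survival) are assumptions made for illustration.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall and F1 for the positive class (death = 1), plus accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

# Toy labels, NOT the study data: 2 deaths among 10 patients.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

model = binary_metrics(y_true, y_pred)
# Null model: always predict survival.
null = binary_metrics(y_true, [0] * len(y_true))
```

In this toy example the model and the null model reach the same accuracy, while only precision, recall and F1 reveal that the null model never identifies a death.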
Model performance on external data
To test the clinical portability of the mortality prediction model, we validated it externally using the Northwell Health electronic health record database. Northwell Health is the largest academic health system in New York, comprising 12 acute care hospitals that serve ~11 million people in the North American epicentre of the COVID-19 pandemic [4]. The data used for this validation were collected from the enterprise electronic health record (Sunrise Clinical Manager, Allscripts, Chicago) and included patients who had COVID-19 and were discharged from Northwell hospitals between 1 March and 31 May 2020. All patients with a final outcome (death or discharged alive) and with LDH, hs-CRP and lymphocyte values measured at least once during their hospitalization were considered. Thus, from a total of 13,106 patients, 1,038 patients were included for the validation of the model.
We initially tested the model performance using the first time point when all three laboratory values were available (Fig. 1b). Simulating the operation of the model at this initial triage point, the precision was 0.40 for death (F1 score of 0.56), with an overall accuracy of 0.48.
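Because F1 is the harmonic mean of precision and recall, the recall implied by a reported precision and F1 score can be recovered by rearranging F1 = 2PR / (P + R). The sketch below applies this to the values quoted above (precision 0.40, F1 0.56). The result, roughly 0.9, is only approximate because the inputs are rounded to two decimals, and it is a derived figure rather than one reported in this article. The function name is hypothetical.

```python
def recall_from_precision_f1(precision, f1):
    """Solve F1 = 2PR / (P + R) for the recall R."""
    denom = 2 * precision - f1
    if denom <= 0:
        # A valid F1 is always smaller than 2 * precision when recall <= 1.
        raise ValueError("F1 is inconsistent with the given precision")
    return f1 * precision / denom

# Values quoted in the text for the first complete set of labs.
implied_recall = recall_from_precision_f1(0.40, 0.56)
```

A high implied recall combined with low precision and low accuracy would be consistent with a rule that flags a large share of patients as likely to die.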
The model’s accuracy is reported to increase with laboratory values drawn closer to the patient’s outcome. As previously stated, a clinical model that is contingent on knowing, in advance, the date of outcome, is of dubious use. Nevertheless, we externally validated the model using the final (pre-death or discharge) laboratory values in our dataset. The precision for death remained low at 0.41, with an overall model accuracy of 0.50 (Fig. 1c).
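The two evaluation points used here (the first complete set of laboratory values and the final set before the outcome) can be selected per patient as sketched below. This is an illustration under stated assumptions, not the code from the repository: it assumes that a "complete" set means all three values are present in the same draw, and the record layout, field names and toy values are hypothetical.

```python
REQUIRED = ("ldh", "hs_crp", "lymphocyte_pct")

def complete_panels(draws):
    """Keep draws in which all three required values are present, sorted by time."""
    complete = [d for d in draws
                if all(d.get(key) is not None for key in REQUIRED)]
    return sorted(complete, key=lambda d: d["time"])

def select_panel(draws, which="first"):
    """Return the first or last complete draw for one patient, or None."""
    if which not in ("first", "last"):
        raise ValueError("which must be 'first' or 'last'")
    panels = complete_panels(draws)
    if not panels:
        return None
    return panels[0] if which == "first" else panels[-1]

# Toy draws for a single patient (hours since admission), NOT study data.
draws = [
    {"time": 2.0, "ldh": 310.0, "hs_crp": None, "lymphocyte_pct": 18.0},
    {"time": 6.0, "ldh": 330.0, "hs_crp": 25.0, "lymphocyte_pct": 17.0},
    {"time": 48.0, "ldh": 410.0, "hs_crp": 60.0, "lymphocyte_pct": 9.0},
    {"time": 30.0, "ldh": None, "hs_crp": 40.0, "lymphocyte_pct": 12.0},
]
first = select_panel(draws, "first")
last = select_panel(draws, "last")
```

Only the "first" selection is available to a clinician at triage; the "last" selection depends on knowing when the outcome occurs.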
Recalibrated model performance on external data
LDH alone was the primary driver of the decision tree and an LDH value of >365 U l⁻¹ led to a terminal node accounting for 93.0% (146/157) of the true positive predictions of mortality in their dataset. Therefore, it could be argued that LDH alone is a sufficiently robust mortality predictor and naturally lends itself to serve as a triage tool. To test this hypothesis, from all our patients (n = 13,106), we included those with at least one LDH value from the emergency department (n = 3,595). With the proposed threshold of LDH > 365 U l⁻¹, the precision for mortality was 0.34 (Fig. 2). We then varied the LDH threshold and found that the maximal precision achieved for this branch was only 0.54, revealing its lack of prognostic utility at our institution as part of a mortality prediction model upon admission.
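The threshold sweep described above can be sketched as follows. This is a minimal illustration on toy values that are not the study data; the function names and the grid of candidate thresholds are hypothetical, and the repository linked in this article remains the reference for the actual analysis.

```python
def precision_at_threshold(ldh_values, died, threshold):
    """Precision for death when every patient with LDH > threshold is flagged."""
    flagged = [d for value, d in zip(ldh_values, died) if value > threshold]
    if not flagged:
        return None  # precision is undefined when nobody is flagged
    return sum(flagged) / len(flagged)

def best_threshold(ldh_values, died, thresholds):
    """Return (threshold, precision) with the highest defined precision."""
    scored = [(t, precision_at_threshold(ldh_values, died, t))
              for t in thresholds]
    scored = [(t, p) for t, p in scored if p is not None]
    return max(scored, key=lambda item: item[1])

# Toy data, NOT study data: LDH in U/l and outcome (1 = died, 0 = survived).
ldh = [200, 300, 400, 500, 600, 700]
died = [0, 0, 1, 0, 1, 1]

at_365 = precision_at_threshold(ldh, died, 365)
best = best_threshold(ldh, died, [250, 365, 450, 550])
```

Precision at high thresholds rests on few flagged patients, and a threshold chosen on the same data used to score it gives an optimistic estimate, so the maximum from such a sweep is best read as an upper bound.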
An interpretable mortality prediction model for patients with COVID-19 is a worthwhile pursuit to help inform clinicians in the battle against this pandemic. We have shown that the recently published model by Yan et al. does not perform adequately as a triage tool on the internal validation dataset provided by the original authors. We have also demonstrated that the decision algorithm was not portable to our large external validation dataset, with either unmodified or optimized parameters. These results underline the importance of externally validating this model before its widespread adoption in actual clinical practice, especially given the rapid and widespread dissemination of the model post-publication [5]. Our findings, consistent with other studies [6], confirm that the proposed model cannot be recommended for routine clinical implementation.
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
The datasets analysed and generated during the current study are available from the corresponding author on reasonable request.
The analysis code is available at the following repository: https://github.com/siabolourani/YIN_reply.
1. Wynants, L. et al. Prediction models for diagnosis and prognosis of Covid-19 infection: systematic review and critical appraisal. BMJ 369, m1328 (2020).
2. Yan, L. et al. An interpretable mortality prediction model for COVID-19 patients. Nat. Mach. Intell. 2, 283–288 (2020).
3. Riley, R. D. et al. Calculating the sample size required for developing a clinical prediction model. BMJ 368, m441 (2020).
4. Richardson, S. et al. Presenting characteristics, comorbidities and outcomes among 5,700 patients hospitalized with COVID-19 in the New York City area. JAMA 323, 2052–2059 (2020).
5. An interpretable mortality prediction model for COVID-19 patients. Altmetric https://www.altmetric.com/details/82019437/ (2020).
6. Gupta, R. K. et al. Systematic evaluation and external validation of 22 prognostic models among hospitalised adults with COVID-19: an observational cohort study. Eur. Resp. J. https://doi.org/10.1183/13993003.03498-2020 (2020).
The authors would like to thank J. Hirsh and K. Coppa for providing the queries and data that enabled this study, as well as the Northwell Machine Learning in Medicine group, whose discussions inspired this paper. We acknowledge and honor all our Northwell team members who consistently put themselves in harm’s way during the COVID-19 pandemic. We dedicate this article to them, as their vital contributions to knowledge about COVID-19 and their sacrifices on behalf of patients made it possible.
The authors declare no competing interests.
Cite this article
Barish, M., Bolourani, S., Lau, L.F. et al. External validation demonstrates limited clinical utility of the interpretable mortality prediction model for patients with COVID-19. Nat Mach Intell 3, 25–27 (2021). https://doi.org/10.1038/s42256-020-00254-2