arising from Yan et al. Nature Machine Intelligence https://doi.org/10.1038/s42256-020-0180-7 (2020)

Typical deployment of an artificial intelligence or machine learning algorithm begins with model training, followed by testing under varying circumstances to enable model adaptation and verification of the model’s performance. Despite the urgency that the coronavirus disease 2019 (COVID-19) pandemic has posed for deploying predictive models, proper validation is critical before any claims of utility or generalizability are made1. In a recent publication, Yan et al.2 claimed to have developed a novel simplified mortality prediction model that can be applied in clinical practice. We validated their proposed decision tree against a large database of patients with COVID-193. Yan et al. attempted to determine a subset of biomarkers from blood samples taken throughout a patient’s hospital course that could be used to predict mortality. The authors framed the problem as a classification task, in which the inputs were the results of the last set of laboratory tests taken from patients of variable severity and the outcomes were either discharge or death. After generating the algorithm, the authors assessed the feature importance of each parameter, which yielded three key features: lactate dehydrogenase (LDH), lymphocyte proportion and high-sensitivity C-reactive protein (hs-CRP). The simplicity of the three-branch model based on these three values is enticing for rapid, wide-scale adoption in clinical practice. Although the authors reported success in their attempt to determine markers of imminent mortality in their dataset, they used a small validation sample and did not externally validate their model3. We have attempted, under various use cases, to validate their model using data from patients with COVID-19 treated in Northwell Health hospitals. We also attempted to recalibrate the primary branch of the proposed tree-based model using our data, to account for differences between our populations and health systems1.
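For reference, the proposed three-branch rule is simple enough to state as a short function. A minimal sketch follows; the LDH cut-off of 365 U l−1 is discussed below, while the hs-CRP (41.2 mg l−1) and lymphocyte (14.7%) cut-offs are those reported by Yan et al. This is our paraphrase of the published rule, not the authors' code.

```python
def yan_rule(ldh, hs_crp, lymph_pct):
    """Paraphrase of the Yan et al. three-branch mortality rule.

    Thresholds (LDH 365 U/l, hs-CRP 41.2 mg/l, lymphocytes 14.7%)
    are as reported in the original paper. Returns True when the
    rule predicts death, False when it predicts survival.
    """
    if ldh > 365:             # primary branch: high LDH -> death
        return True
    if hs_crp < 41.2:         # low hs-CRP -> survival
        return False
    return lymph_pct <= 14.7  # low lymphocyte proportion -> death
```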

Model performance as a triage tool

The model is purported to be applicable to any blood sample, well in advance of the primary clinical outcome, thereby suggesting its use as an admission triage tool. We tested the performance of their model on their validation data at the first time point at which all three blood tests had been collected, similar to how physicians would risk-stratify new patients (Fig. 1a). This critical time point was not evaluated in the original paper, an important omission given that the authors suggest their model can be used to prioritize care. Predictive clinical models are used prospectively, when the time of outcome is unknown; this model, by contrast, retrospectively utilizes data selected on the basis of the known date of outcome. Because a clinician cannot know when discharge or death will occur, the fact that the performance of the proposed model improves closer to the time of outcome is not clinically useful. It is therefore important to show that the model has sufficient performance at admission to justify changes to clinical care (as the authors suggest).
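Concretely, this admission-style evaluation can be simulated by taking, for each patient, the earliest draw at which all three values are present. The sketch below assumes a long-format laboratory table with hypothetical column names; replacing .first() with .last() yields the final-values variant used later.

```python
import pandas as pd

# Hypothetical long-format table: one row per patient per draw,
# with NaN where a test was not performed on that draw.
labs = pd.read_csv("labs.csv", parse_dates=["drawn_at"])
features = ["ldh", "hs_crp", "lymph_pct"]

# Keep draws where all three tests were resulted together, then
# take the earliest such draw per patient (admission-style input).
complete = labs.dropna(subset=features)
first_complete = (complete.sort_values("drawn_at")
                          .groupby("patient_id", as_index=False)
                          .first())
```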

Fig. 1: Performance of decision rule.

a–c, Performance of the decision rule in three settings: first values from the Yan et al. validation data (a), and first values (b) and last values (c) from the Northwell patient data (n = 1,038). T, true; F, false.

Our analysis was performed in Python and R, using code available at https://github.com/siabolourani/YIN_reply. The precision for predicting mortality was 0.48, meaning that over half of the patients whom the model predicted would die actually survived. The accuracy was 0.88 and the F1 score was 0.41. In interpreting these results, one must account for the imbalanced data: the validation set had a survival rate of 0.88, meaning that a null model that always predicts survival achieves a similar accuracy to the proposed full model.
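These summary metrics follow directly from the rule's predictions; a minimal scikit-learn sketch, with placeholder arrays standing in for the validation labels and predictions, illustrates the null-model comparison:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score

# Placeholder labels (1 = died, 0 = survived) and rule outputs;
# substitute the real validation arrays to reproduce our numbers
# (precision 0.48, accuracy 0.88, F1 score 0.41).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])
y_pred = np.array([0, 0, 0, 0, 1, 0, 0, 1, 1, 0])

print("precision:", precision_score(y_true, y_pred))
print("accuracy:", accuracy_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))

# Null model: always predict survival. Its accuracy equals the
# cohort survival rate, so with 88% survivors it scores ~0.88,
# on par with the full model despite predicting no deaths at all.
print("null accuracy:", accuracy_score(y_true, np.zeros_like(y_true)))
```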

Model performance on external data

To test the clinical portability of the mortality prediction model, we validated it externally using the Northwell Health electronic health record database. Northwell Health is the largest academic health system in New York, comprising 12 acute care hospitals that serve ~11 million people in the North American epicentre of the COVID-19 pandemic4. The data used for this validation were collected from the enterprise electronic health record (Sunrise Clinical Manager; Allscripts, Chicago, IL) and included patients who had COVID-19 and were discharged from Northwell hospitals between 1 March and 31 May 2020. All patients with a final outcome (death or discharge alive) and with LDH, hs-CRP and lymphocyte values measured at least once during their hospitalization were considered. Of a total of 13,106 patients, 1,038 were thus included for the validation of the model.
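As an illustration, the inclusion criteria reduce to requiring a recorded final outcome and at least one result for each biomarker. The sketch below assumes a patient-level table with hypothetical column names, not our actual extraction code:

```python
import pandas as pd

# Hypothetical admission-level table: one row per patient, with a
# final outcome and per-biomarker result counts for the stay.
patients = pd.read_csv("admissions.csv")

eligible = patients[
    patients["outcome"].isin(["died", "discharged_alive"])
    & (patients["n_ldh"] > 0)
    & (patients["n_hs_crp"] > 0)
    & (patients["n_lymph_pct"] > 0)
]
# In our data, this filter kept 1,038 of 13,106 patients.
```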

We initially tested the model performance using the first time point at which all three laboratory values were available (Fig. 1b). Simulating the operation of the model at this initial triage point, we found a precision of 0.40 for death (F1 score, 0.56) and an overall accuracy of 0.48.

The model’s accuracy is reported to increase as laboratory values are drawn closer to the patient’s outcome. As stated above, a clinical model that is contingent on knowing the date of outcome in advance is of dubious use. Nevertheless, we externally validated the model using the final (pre-death or pre-discharge) laboratory values in our dataset. The precision for death remained low at 0.41, with an overall model accuracy of 0.50 (Fig. 1c).

Recalibrated model performance on external data

LDH alone was the primary driver of the decision tree, and an LDH value of >365 U l−1 led to a terminal node accounting for 93.0% (146/157) of the true positive predictions of mortality in their dataset. It could therefore be argued that LDH alone is a sufficiently robust mortality predictor and naturally lends itself to use as a triage tool. To test this hypothesis, we included, from all of our patients (n = 13,106), those with at least one LDH value from the emergency department (n = 3,595). With the proposed threshold of LDH > 365 U l−1, the precision for mortality was 0.34 (Fig. 2). We then varied the LDH threshold and found that the maximal precision achieved for this branch was only 0.54, revealing its lack of prognostic utility at our institution as part of an admission mortality prediction model.
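The sweep underlying Fig. 2 amounts to computing the positive predictive value of the single-split rule at each candidate cut-off. A sketch follows, with synthetic stand-ins for the n = 3,595 emergency department LDH values and outcomes:

```python
import numpy as np

# Synthetic stand-ins; substitute the first emergency department
# LDH value (U/l) and outcome (1 = died) for each of 3,595 patients.
rng = np.random.default_rng(0)
ldh = rng.normal(350.0, 120.0, 3595).clip(min=50.0)
died = rng.random(3595) < 0.2

thresholds = np.arange(100, 1000, 5)
ppv = np.full(thresholds.shape, np.nan)
for i, t in enumerate(thresholds):
    flagged = ldh > t                  # rule predicts death above t
    if flagged.any():
        ppv[i] = died[flagged].mean()  # precision (PPV) at this cut-off

best = thresholds[np.nanargmax(ppv)]
print(f"max PPV {np.nanmax(ppv):.2f} at LDH > {best} U/l")
# On the Northwell data the maximum was only 0.54; at the proposed
# 365 U/l cut-off the PPV was 0.34.
```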

Fig. 2: LDH as a mortality predictor.

Histogram of emergency department LDH values, and precision for mortality of an LDH-threshold rule at different cut-offs, based on Northwell COVID-19 patient data (n = 3,595). The dashed line indicates the Yan et al. LDH threshold of 365 U l−1. PPV, positive predictive value.

Conclusion

An interpretable mortality prediction model for patients with COVID-19 is a worthwhile pursuit to help inform clinicians in the battle against this pandemic. We have shown that the recently published model of Yan et al. does not perform adequately as a triage tool on the internal validation dataset provided by the original authors. Furthermore, we have demonstrated that the decision algorithm was not portable to our large external validation dataset, with either unmodified or optimized parameters. These results demonstrate the importance of externally validating this model before its widespread adoption in clinical practice, especially given the rapid and wide dissemination of the model post-publication5. Our findings, consistent with other studies6, confirm that the proposed model cannot be recommended for routine clinical implementation.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.