Main

The outbreak of COVID-19 has been causing worldwide health concerns since December 2019. The virus causes fever, cough, fatigue and mild to severe respiratory complications which, in the most severe cases, can lead to death. By 6 March, there were 98,192 cumulative cases of infection across the world and 3,045 deaths had been reported1. On 11 March, the outbreak was declared a pandemic by the World Health Organization2. So far, it has been reported that 13.8–19.1% of COVID-19-infected patients in Wuhan, China, became severely ill3,4,5. Furthermore, recent reports have exposed an astonishing case fatality rate of 61.5% for critical cases, increasing sharply with age and for patients with underlying comorbidities6. The severity of cases is putting great pressure on medical services, leading to a shortage of intensive care resources.

Unfortunately, no prognostic biomarker is currently available to distinguish patients who require immediate medical attention and to estimate their associated mortality rate. The capacity to identify cases at imminent risk of death has thus become an urgent yet challenging necessity. Under these circumstances, we retrospectively analysed the blood samples of 485 patients from the region of Wuhan, China, to identify robust and meaningful markers of mortality risk. A mathematical modelling approach based on state-of-the-art interpretable machine learning algorithms was devised to identify the most discriminative biomarkers of patient mortality. The problem was formulated as a classification task, where the inputs included basic information, symptoms, blood samples and the results of laboratory tests, including liver function, kidney function, coagulation function, electrolytes and inflammatory factors, taken from patients originally classified as general, severe or critical (Table 1), together with the associated outcomes, corresponding to either survival or death at the end of the examination period. Through optimization, this classifier aims to reveal the most crucial biomarkers distinguishing patients at imminent risk, thereby relieving the clinical burden and potentially reducing the mortality rate.

Table 1 Criteria for assessment of disease severity upon hospital admission

Medical records were collected by using standard case report forms that included epidemiological, demographic, clinical, laboratory and mortality outcome information (Table 2 and Supplementary Data 1). The clinical outcomes were followed up to 24 February 2020. The study was approved by the Tongji Hospital Ethics Committee.

Table 2 Epidemiological, demographic, clinical, laboratory and mortality outcome information collected from medical records

Data resources

The medical information of all patients collected between 10 January and 18 February 2020 was used for model development. Data originating from pregnant and breast-feeding women, patients younger than 18 years and records less than 80% complete were excluded from subsequent analysis. Among the remaining 375 patients, fever was the most common initial symptom (49.9%), followed by cough (13.9%), fatigue (3.7%) and dyspnoea (2.1%). The mean age of the patients was 58.83 ± 16.46 years, and 59.7% were male. The epidemiological history included Wuhan residents (37.9%), familial clusters (6.4%) and health workers (1.9%). The laboratory results are shown in Table 2. Of the 375 cases included in the subsequent analysis, 201 recovered from COVID-19 and were discharged from the hospital, while the remaining 174 died. Following this, 110 patients newly discharged or deceased between 19 February 2020 and 24 February 2020 were enrolled as an external test dataset.

The minimal, maximal and median follow-up times (from admission to hospital to death or discharge) for all 485 (375 + 110) patients were 0 days 02:01:58 (hours:minutes:seconds), 35 days 04:05:54 and 11 days 04:15:36, respectively. The high mortality rate seen in our study is related to the fact that Tongji Hospital admitted a higher proportion of severe and critical cases in Wuhan. A patient's severity was empirically assessed by medical doctors according to the criteria in Table 1 only at admission7. Figure 1 summarizes the outcome of patients in the three severity classes.

Fig. 1: A flowchart of patient enrolment.

Originally, 375 patients with a definite outcome before 18 February 2020 were used for model development, then an additional 110 patients with a definite outcome between 19 February 2020 and 24 February 2020 were used as an external test dataset.

Development of a machine learning model

Most patients had multiple blood samples taken throughout their stay in hospital. However, model training and testing use only the data from each patient's final sample as model inputs, so as to assess the crucial biomarkers of disease severity, distinguish patients requiring immediate medical assistance and accurately match the features to each outcome label. Nevertheless, the model can be applied to all other blood samples, allowing the predictive potential of the identified biomarkers to be estimated (see 'Estimation of the prediction horizon' section). Missing data were padded with '−1'. The model output corresponds to patient mortality: patients who survived were assigned to class 0 and those who died to class 1.
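As a minimal sketch of this preprocessing step, assuming the records sit in a pandas DataFrame with one row per blood sample (the column names 'patient_id', 'sample_date' and 'outcome' are hypothetical, not from the paper):

```python
import pandas as pd

# Keep only each patient's final blood sample, pad missing values
# with -1 and encode the outcome as 0 (survival) / 1 (death).
df = pd.read_csv('blood_samples.csv', parse_dates=['sample_date'])
last = (df.sort_values('sample_date')
          .groupby('patient_id', as_index=False)
          .tail(1))
X = last.drop(columns=['patient_id', 'sample_date', 'outcome']).fillna(-1)
y = (last['outcome'] == 'death').astype(int)
```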

The performance of the models was evaluated by assessing the classification accuracy (ratio of true predictions over all predictions), as well as the precision, sensitivity/recall and F1 scores (defined below):

$$\mathrm{Precision}_i = \frac{\mathrm{TP}_i}{\mathrm{TP}_i + \mathrm{FP}_i}$$
(1)
$$\mathrm{Recall}_i = \frac{\mathrm{TP}_i}{\mathrm{TP}_i + \mathrm{FN}_i}$$
(2)
$$\mathrm{F1}_i = \frac{2 \times \mathrm{Precision}_i \times \mathrm{Recall}_i}{\mathrm{Precision}_i + \mathrm{Recall}_i}$$
(3)
$$\mathrm{Accuracy} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}}$$
(4)
$$\mathrm{Macro\ average}(\mathrm{score}) = \frac{1}{C}\sum_i \mathrm{score}_i$$
(5)
$$\mathrm{Weighted\ average}(\mathrm{score}) = \frac{1}{N}\sum_i N_i \cdot \mathrm{score}_i, \quad \mathrm{score} \in \{\mathrm{Precision}, \mathrm{Recall}, \mathrm{F1}\}$$
(6)

where \(i \in \{1, \ldots, C\}\) indexes the classes, N is the total number of samples, C is the number of classes, \(N_i\) is the number of samples in class i, and \(\mathrm{TP}_i\), \(\mathrm{TN}_i\), \(\mathrm{FP}_i\) and \(\mathrm{FN}_i\) stand for the numbers of true positives, true negatives, false positives and false negatives for class i, respectively. In total, 75 features were considered.
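The per-class and averaged scores in equations (1)–(6) correspond directly to standard scikit-learn metric functions; a brief sketch, assuming arrays of true labels y_true and predicted labels y_pred:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Per-class precision/recall/F1 (equations (1)-(3)) and their support N_i.
p, r, f1, n_i = precision_recall_fscore_support(y_true, y_pred)

accuracy = accuracy_score(y_true, y_pred)      # equation (4)
f1_macro = f1.mean()                           # macro average, equation (5)
f1_weighted = (n_i * f1).sum() / n_i.sum()     # weighted average, equation (6)
```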

This study uses a supervised XGBoost classifier8 as the predictor model. XGBoost is a high-performance machine learning algorithm that offers strong interpretability owing to its recursive tree-based decision system. In contrast, the internal mechanisms of black-box modelling strategies are typically difficult to interpret. The importance of each individual feature in XGBoost is determined by its accumulated use across the decision steps of the trees. This yields a metric characterizing the relative importance of each feature, which is particularly valuable for estimating which features are the most discriminative of model outcomes, especially when they correspond to meaningful clinical parameters.

XGBoost was originally trained with the following parameter settings: maximum depth equal to 4, learning rate equal to 0.2, number of tree estimators set to 150, value of the regularization parameter α set to 1, and 'subsample' and 'colsample_bytree' both set to 0.9 to prevent overfitting in a setting with many features and a small sample size8. We refer to this as the 'Multi-tree XGBoost algorithm'.
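These settings map naturally onto the xgboost scikit-learn wrapper; a sketch of the training call under that assumption, with X and y as prepared in the earlier sketch:

```python
from xgboost import XGBClassifier

# Multi-tree XGBoost with the parameter settings quoted above.
model = XGBClassifier(
    max_depth=4,           # maximum tree depth
    learning_rate=0.2,
    n_estimators=150,      # number of tree estimators
    reg_alpha=1,           # regularization parameter alpha
    subsample=0.9,         # row subsampling against overfitting
    colsample_bytree=0.9,  # column subsampling against overfitting
)
model.fit(X, y)
```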

Feature importance for an operable decision tree

To evaluate the markers of imminent mortality risk, we assessed the contribution of each patient parameter to the decisions of the algorithm. Features were ranked by Multi-tree XGBoost according to their importance (Supplementary Figs. 1 and 2 and Supplementary algorithm 1). The model showed no improvement in area under the curve (AUC) score when the number of top features was increased to four. Hence, the number of key features was set to three: lactate dehydrogenase (LDH), lymphocytes and high-sensitivity C-reactive protein (hs-CRP).
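One plausible way to reproduce this selection step, continuing from the fitted model above, is to rank features by their importance score and track the validation AUC as the top-ranked features are added one at a time:

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Rank features by XGBoost importance, then re-fit on the top-k
# features and watch whether the validation AUC keeps improving.
order = np.argsort(model.feature_importances_)[::-1]
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)
for k in range(1, 6):
    cols = X.columns[order[:k]]
    clf = XGBClassifier(max_depth=4, learning_rate=0.2, n_estimators=150,
                        reg_alpha=1, subsample=0.9, colsample_bytree=0.9)
    clf.fit(X_tr[cols], y_tr)
    auc = roc_auc_score(y_va, clf.predict_proba(X_va[cols])[:, 1])
    print(k, list(cols), round(auc, 3))
```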

Table 3 summarizes the performance of the Multi-tree XGBoost model. The results show that the model is able to accurately identify the outcome of patients, regardless of their original diagnosis upon hospital admission. Notably, the performance on the external test set (detailed below) is similar to that on the training and validation sets, which suggests that the model captures the key biomarkers of patient mortality. The set of selected features is represented graphically for each patient in Supplementary Fig. 3, demonstrating a clear separability. Table 3 further emphasizes the importance of LDH as a crucial biomarker of patient mortality.

Table 3 Performances of the Multi-tree XGBoost classification in discriminating between mortality outcomes using 100-round fivefold cross-validation using Supplementary algorithm 1

Development of a clinically operable decision tree

Following the findings above on the importance of LDH, lymphocytes and hs-CRP, we aimed to construct a simplified and clinically operable decision model. XGBoost builds decision trees recursively from past residuals and can identify the trees that contribute the most to the decision of the predictive model. Decision trees are simple classifiers consisting of sequences of binary decisions organized hierarchically. Hence, if the accuracy of a single tree remains high, reducing the complexity of the model to such a structure has the potential to reveal a clinically portable decision algorithm. In the following, we refer to this as an 'interpretable model' or 'Single-tree XGBoost'.

There were 24 patients with incomplete measurements for at least one of the three principal biomarkers in their last blood samples, leaving 351 patients for identifying a Single-tree XGBoost model. To identify the model, XGBoost was re-trained with the same parameters as described above, except for the following: number of tree estimators set to 1, the regularization parameters α and λ both set to 0, and 'subsample' and 'colsample_bytree' both set to 1, as overfitting had already been addressed in the previous modelling step8. The interpretable decision tree was obtained by randomly splitting the 351 patients into training and validation datasets in the ratio 7:3. The resulting tree structure and performance are shown in Fig. 2 and Supplementary Tables 1 and 2, respectively.
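A sketch of this re-training, again via the xgboost scikit-learn wrapper and with hypothetical column names for the three biomarkers:

```python
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Single-tree variant: one estimator, no regularization, no subsampling.
X3 = X[['LDH', 'lymphocytes', 'hs_CRP']]   # hypothetical column names
X_tr, X_va, y_tr, y_va = train_test_split(X3, y, train_size=0.7,
                                          random_state=0)  # 7:3 split
tree = XGBClassifier(max_depth=4, learning_rate=0.2, n_estimators=1,
                     reg_alpha=0, reg_lambda=0,
                     subsample=1, colsample_bytree=1)
tree.fit(X_tr, y_tr)
```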

Fig. 2: A decision rule using three key features and their thresholds in absolute value.

Num, the number of patients in a class; T, the number of correctly classified patients; F, the number of misclassified patients.
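Read as code, the tree in Fig. 2 amounts to three nested threshold tests. The sketch below is one plausible rendering: the split order is illustrative, the threshold arguments are placeholders, and the authoritative cut-off values are those printed in Fig. 2:

```python
def clinical_route(ldh, lymphocytes, hs_crp, t_ldh, t_crp, t_lym):
    """Return 1 (predicted death) or 0 (predicted survival).

    t_ldh, t_crp and t_lym are the Fig. 2 cut-offs, passed in rather
    than hard-coded; the split order shown here is an assumption.
    """
    if ldh >= t_ldh:        # very high LDH -> high risk
        return 1
    if hs_crp < t_crp:      # low inflammation -> low risk
        return 0
    return 0 if lymphocytes > t_lym else 1  # lymphocytes decide the rest
```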

In addition, the performance of the interpretable model was estimated on the external test set, comprising the latest blood samples of 110 patients that were not part of the training or validation of the Single-tree XGBoost model (Table 4). The associated confusion matrix is presented in Supplementary Fig. 5 and shows 100% survival prediction accuracy and 81% mortality prediction accuracy. Overall, the scores for survival and death prediction, accuracy, and macro and weighted averages are consistently above 0.90.
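A sketch of this evaluation, assuming the last blood samples of the external cohort are held in hypothetical arrays X_ext, y_ext:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Evaluate the single-tree model on the 110 held-out external patients.
y_hat = tree.predict(X_ext)
print(confusion_matrix(y_ext, y_hat))       # cf. Supplementary Fig. 5
print(classification_report(y_ext, y_hat))  # per-class, macro, weighted scores
```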

Table 4 Performance of the proposed interpretable model on the external test dataset

Finally, for benchmarking purposes, the performance of the interpretable model was compared with that of standard methods such as random forest and logistic regression9. The receiver operating characteristic curves and AUC scores are shown in Supplementary Table 3 and Supplementary Fig. 4.
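A sketch of such a benchmark on the same three features; the baselines' default hyperparameters are an assumption, as the exact baseline settings are not stated here:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Compare baseline classifiers against the interpretable model via AUC.
baselines = {
    'random forest': RandomForestClassifier(random_state=0),
    'logistic regression': LogisticRegression(max_iter=1000),
}
for name, clf in baselines.items():
    clf.fit(X_tr, y_tr)
    auc = roc_auc_score(y_ext, clf.predict_proba(X_ext)[:, 1])
    print(f'{name}: AUC = {auc:.3f}')
```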

Estimation of the prediction horizon

Most patients had multiple blood samples taken throughout their hospital stay. In total, there were 909 blood samples with complete measurements of the three key features for all 485 patients, and 251 blood samples with complete measurements of these features for the 110 patients in the external test set. The predictive potential of our model was evaluated on all blood tests of all 485 patients, including the 110 patients in the external test dataset (Fig. 3 and Supplementary Figs. 6 and 7). On average, the accuracy of our algorithm was 90%, further showing that the model can be applied to any blood sample, including those taken well ahead of the day of the primary clinical outcome. On average, the model could predict the outcome of all true positive patients about 10 days (11 days for patients in the external test set) in advance of the outcome using all their blood samples (Fig. 3b,c). The model can even predict 18 days in advance with a cumulative accuracy above 90% (Fig. 3d,e). The accuracy of the prediction increases closer to the patient's outcome. This prediction horizon analysis suggests that, where a patient's condition deteriorates, the clinical route is able to give clinicians an early warning several days in advance.
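The horizon computation itself is straightforward; a sketch, continuing from the single-tree model above and assuming a per-sample DataFrame `samples` with hypothetical date and label columns:

```python
# For every blood sample, predict the outcome; for each correctly
# predicted patient, the horizon is the largest number of days between
# a correctly classified sample and the clinical outcome.
samples['pred'] = tree.predict(samples[['LDH', 'lymphocytes', 'hs_CRP']])
correct = samples[samples['pred'] == samples['label']].copy()
correct['days_ahead'] = (correct['outcome_date']
                         - correct['sample_date']).dt.days
horizon = correct.groupby('patient_id')['days_ahead'].max()
print(horizon.describe())
```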

Fig. 3: Estimation of the prediction horizon of the decision rule with three features.

a, Illustration of the concept of the correct prediction time horizon. b, Histogram of the maximum correct prediction time horizons for all 485 patients with true positive predictions. Note that there are two patients with negative days, as their only blood sample results arrived one day after their clinical outcome. c, Histogram of the maximum correct prediction time horizons for the 110 patients in the external test set with true positive predictions. d, The predictive performance (F1 score and cumulative F1 score) evaluated with respect to the day of outcome for all 485 patients. e, The predictive performance (F1 score and cumulative F1 score) evaluated with respect to the day of outcome for the 110 patients in the external test set.

Discussion

The significance of our work is twofold. First, it goes beyond identifying high-risk factors4: it provides a simple and intuitive clinical test to precisely and quickly quantify the risk of death. For example, routine sequential respiratory support therapy for patients with SpO2 below 93% comprises intranasal catheterization of oxygen, oxygen supply through a mask, high-flow oxygen supply through a nasal catheter, non-invasive ventilation support, invasive ventilation support and extracorporeal membrane oxygenation. Predicting that, for some patients, this sequential oxygen therapy will lead to unsatisfactory therapeutic effects could prompt physicians to pursue different approaches earlier. The goal is for the model to identify high-risk patients before irreversible consequences occur. Second, the three key features, LDH, lymphocytes and hs-CRP, can be easily collected in any hospital. In crowded hospitals, and with shortages of medical resources, this simple model can help to quickly prioritize patients, especially during a pandemic when limited healthcare resources have to be allocated10.

An increase in LDH reflects tissue/cell destruction and is regarded as a common sign of tissue or cell damage. Serum LDH has been identified as an important biomarker of the activity and severity of idiopathic pulmonary fibrosis11. In patients with severe pulmonary interstitial disease, the increase in LDH is significant and is one of the most important prognostic markers of lung injury11. For critically ill patients with COVID-19, a rise in LDH level indicates an increase in the activity and extent of lung injury.

An increase in hs-CRP, an important marker of poor prognosis in acute respiratory distress syndrome12,13, reflects a persistent state of inflammation14. The result of this persistent inflammatory response is large grey-white lesions in the lungs of patients with COVID-19 (seen at autopsy)15. In tissue sections, a large amount of sticky secretion is also seen overflowing from the alveoli15.

Finally, our results also suggest that lymphocytes may serve as a potential therapeutic target. This hypothesis is supported by the results of clinical studies4,16. Lymphopenia is a common feature in patients with COVID-19 and might be a critical factor associated with disease severity and mortality17. Injured alveolar epithelial cells could induce the infiltration of lymphocytes, leading to persistent lymphopenia, as was seen in SARS-CoV and MERS-CoV infections (these viruses share with SARS-CoV-2 similar alveolar-penetrating and antigen-presenting cell (APC)-impairing pathways)18,19. A biopsy study has provided strong evidence of substantially reduced counts of peripheral CD4 and CD8 T cells that were nonetheless hyperactivated20. Also, Jing and colleagues have reported that the lymphopenia is mainly related to a decrease in CD4 and CD8 T cells21. It is thus likely that lymphocytes play distinct roles in COVID-19, which deserves further investigation.

This study has room for further improvement, which is left for future work. First, given that the proposed machine learning method is purely data-driven, our model may vary if trained on different datasets. As more data become available, the whole procedure can easily be repeated to obtain more accurate models. This is a single-centre, retrospective study, which provides a preliminary assessment of the clinical course and outcome of patients; we look forward to subsequent large-sample, multi-centre studies. Second, although we had a pool of more than 70 clinical features, our modelling principle trades off a minimal number of features against good predictive capacity, thereby avoiding overfitting. Finally, this study strikes a balance between model interpretability and accuracy. Although clinical settings tend to prefer interpretable models, it is possible that a black-box model could deliver better performance.

Conclusion

In summary, this study has identified three indicators (LDH, hs-CRP and lymphocytes), together with a clinical route (Fig. 2), for COVID-19 prognostic prediction. We have developed an XGBoost machine learning-based model that can predict the mortality outcomes of individual patients more than 10 days in advance with more than 90% accuracy, enabling detection, early intervention and potentially a reduction of mortality in patients with COVID-19.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this Article.