Introduction

The coronavirus disease 2019 (COVID-19) pandemic poses a major threat to global health. Despite all efforts to slow the spreading and contain the disease, healthcare systems in countries all over the world have been overwhelmed with high demands for critical care resources. To manage these demands in the best possible way and to enable an effective and efficient allocation of critical care resources, prognosis models for individual disease courses and outcomes are essential. Accordingly, several prognosis models for critical and lethal courses in critically ill COVID-19 patients have been published over the course of the year1,2,3,4,5,6,7,8.

The reported predictors for lethal courses in COVID-19 patients can be divided into seven groups, including (1) demographic features like age and gender, (2) comorbidities like COPD, obesity, hypertension and diabetes, (3) radiological signs of disease severity like multi-lobular infiltration, (4) blood infection markers and infection associated blood count parameters like C-reactive protein, procalcitonin and lymphocyte counts, (5) other laboratory blood markers associated with organ distress like lactate dehydrogenase, bilirubin or blood urea nitrogen, (6) direct clinical signs of organ failure like respiratory rate, blood oxygenation or blood pressure and (7) intensive care treatment measures as indirect markers of organ failure like catecholamine doses or ventilation parameters.

Interestingly, the predictors that were identified to indicate critical and lethal courses in COVID-19 patients are very similar to those applied in models for the prediction of lethal courses in critically ill non-COVID-19 viral pneumonia patients9,10,11,12,13. This similarity is not entirely surprising, as the fundamental pathophysiological mechanisms of organ failure in those patients developing a critical or lethal course appear relatively similar between COVID-19 and other types of viral pneumonia, even though the rate of patients developing a critical or lethal course and the time frame of such courses may differ profoundly.

Such pathophysiological similarities of critical and lethal courses between intensive care patients with different types of viral pneumonia might allow to transfer knowledge obtained on one type of viral pneumonia to other types, even though they differ in mortality rates and time courses. Especially in a pandemic situation with a new type of disease, such knowledge transfer might be highly beneficial, as it would bridge the critical early phase by allowing the use of prediction models built for similar diseases until first models based on data of the actual disease are available.

To test our hypothesis that models developed to predict lethal courses for one type of viral pneumonia also allow to predict lethal courses for another type of viral pneumonia, even when the specific diseases differ in lethality rate and time courses, we performed this study. To specifically address the pandemic scenario, we investigated how well lethal courses in critically ill COVID-19 patients can be predicted by a machine learning model trained on data of critically ill patients with non-COVID-19 viral pneumonia.

Results

Patient sample

Of the 749 critically ill non-COVID-19 viral pneumonia patients for which we extracted data, 31 patients were excluded as their ICU treatment was shorter than 24 h or as they were also tested positive for SARS-CoV-19, leaving 718 patients (473 survivor/245 non-survivor) with a median ICU length of stay of 13 d (IQR 5–28 d) for a total of 16,180 time bins of 24 h duration for model training (Fig. 1, Table 1).

Figure 1
figure 1

Durations of ICU treatment and hospitalization of all formerly treated patients. Shown are the histograms of length of stay in intensive care units (top) and total length of stay in the hospital (bottom) for critically ill non-COVID-19 patients (left) and critically ill COVID-19 patients (right), separately for survivors (purple) and non-survivors (orange). 7 (1) non-COVID-19 (COVID-19) patients with more than 200 days in the hospital and 20 (11) non-COVID-19 (COVID-19) patients with more than 100 days in an ICU are not shown in this illustration as they are out of the depicted axis range.

Table 1 Patient characteristics.

For the COVID-19 dataset, we extracted the data of 1176 critically ill patients with completed cases. Of these, 122 were excluded as their ICU treatment was shorter than 24 h or as they were also tested positive for another virus possibly causing pneumonia, leaving 1054 patients (685 survivor/369 non-survivor) with a median ICU length of stay of 9 d (IQR 4–22 d) for a total of 18,521 time bins of 24 h duration for model testing (Fig. 1, Table 1).

Prediction model performance

The multivariate non-COVID-19 viral pneumonia gradient boosted tree model using the full feature set as well as the reduced model that only included the 20 features with the highest importance on the training dataset both showed a significantly better predictive performance than any of the clinical scores APACHE2, SAPS2 and SOFA, and the previously published prediction models (Fig. 2, Table 2).

Figure 2
figure 2

Performance metrics of the non-COVID-19 viral pneumonia mortality prediction models, clinical scores and previously published COVID-19 mortality prediction models. Shown are the receiver operating characteristics (left) and precision-recall (right) curves for the full (purple) and reduced (orange) non-COVID-19 viral pneumonia mortality prediction model and for the clinical scores APACHE2 (blue), SOFA (green), SAPS2 (red) for the prediction of mortality within the next 5 days in COVID-19 patients across all 24 h time bins of each patient’s stay on the ICU, weighted inversely by the number of time bins per patient. Additionally shown are the ROC and PRC curves of previously published COVID-19 mortality prediction models (dashed lines) and the performance of a random classifier (solid gray).

Table 2 Performance metrics.

The time courses of prediction metrics for all models that used time-varying variables increased with increasing time after admission, and reached their maximum towards the endpoint (Fig. 3). Throughout the first day after admission to the end of stay, both the full and the reduced model outperformed all clinical scores and previously published COVID-19 prediction models. Additionally, the performance of the reduced model did not systematically differ from that of the full model during the first days after admission. However, it was reduced 5 days before the endpoint, but approximated the performance of the full model towards the endpoint.

Figure 3
figure 3

Time courses of the area under the ROC curves (auROC) and area under the precision recall curve (auPRC) of the non-COVID-19 viral pneumonia mortality prediction model, clinical scores and previously published COVID-19 mortality prediction models. Shown are the auROC (top) and auPRC (bottom) time courses between admission and 20 days after admission (left) and between 120 and 1 h before the endpoint (death/control endpoint; right) for the full (purple) and reduced (orange) non-COVID-19 viral pneumonia mortality prediction models and for the clinical scores APACHE2 (blue), SOFA (green), SAPS2 (red) for the prediction of mortality within the next 5 days in COVID-19 patients. Prediction windows for the time courses after admission were 24 h and prediction windows for the time courses before the endpoints were 1 h. Additionally shown are the ROC and PRC curves of previously published COVID-19 mortality prediction models (dashed lines) and the performance of a random classifier (solid gray).

Clinical features of the reduced model

From the 251 features of the full model, we determined those 20 unique clinical features that showed the highest feature importance as quantified by the mean absolute SHAP values on the non-COVID-19 viral training dataset (Fig. 4). Most of these features showed a significant difference between patients who deceased within the next 5 days and patients who survived the next 5 days already within the first 24 h after admission, both for the non-COVID-19 patients training and the COVID-19 patients test dataset (Table 3).

Figure 4
figure 4

Impact of clinical features on the prediction of mortality in COVID-19 patients (SHAP values). Left: Shown is the impact of feature values on the reduced models’ output for prediction of mortality of COVID-19 patients within the next 5 days. Each point represents a patient’s feature value (color-coded from blue for a low feature value to red for high feature values). Negative impact (left to the vertical line) of a feature value represents an impact towards the prediction of survival, positive impact (right to the vertical line) represents an impact towards prediction of non-survival. For example, high thrombocytes concentration levels (red) are associated with a higher probability of survival whereas low mean arterial blood pressure (blue) is associated with a lower probability of survival. Right: Mean absolute impact of each clinical feature on the model output.

Table 3 Univariate analyses of the clinical features for the reduced multivariate viral pneumonia prediction model for data from the first 24 h after admission.

Discussion

We demonstrate here that lethal courses in critically ill COVID-19 patients can be predicted by a machine learning model trained on critically ill non-COVID-19 viral pneumonia patients. Furthermore, we show that the predictive performance of the model is not inferior to models developed specifically for COVID-19 patients. The plausibility of this approach is reinforced by the fact that the features that showed the highest importance in our model trained on non-COVID-19 patients and the features included in specific COVID-19 models are largely identical.

The features that are commonly included in models to predict individual mortality in COVID-19 and critically ill non-COVID-19 viral pneumonia patients can be divided in seven groups, including (1) demographic features like age and gender, (2) comorbidities like chronic obstructive pulmonary disease (COPD), obesity, hypertension and diabetes, (3) radiological signs of disease severity like multi-lobular infiltration, (4) blood infection markers and infection associated blood count parameters like C reactive protein, procalcitonin and lymphocyte counts, (5) other laboratory blood markers associated with organ distress like lactate dehydrogenase, bilirubin or blood urea nitrogen, (6) direct clinical signs of organ failure like respiratory rate, blood oxygenation or blood pressure and (7) intensive care treatment measures as indirect markers of organ failure like catecholamine doses or ventilation parameters1,2,3,4,5,6,7,8,9,10,11,12,13.

Similarly, the 20 parameters with the highest feature importance in our model trained on non-COVID-19 viral pneumonia patients included radiological signs of pulmonary infiltrates [group 3], infection-associated blood counts of neutrophils and monocytes [group 4], laboratory markers of organ distress and organ failure (thrombocytes, red blood cell distribution width, pH, P/F ratio, sodium, lactate dehydrogenase and alanine aminotransferase) [group 5], direct clinical signs of organ distress and organ failure (heart rate, blood pressure, blood oxygen saturation, urine output and respiratory rate) [group 6] or intensive care treatment measures as indirect markers of organ distress and organ failure (vasoactive inotropic score as a summary parameter of catecholamine administration, ventilation peak pressure and ventilation mode) [group 7].

While the differentiation between the latter two groups might not be sharp as the clinical signs of group 6 are always impacted by the treatment measures of group 7 and vice versa, it is clear that besides the infection parameters as the primary driving cause for mortality in viral pneumonia, all but two of the other parameters included in the 20 parameters with the highest feature importance in our model are either direct or indirect measures of organ failure and therefore represent the mechanism by which the infection induces mortality. Accordingly, the included parameters cover signs of organ distress and organ failure for all major organ systems that are in the primary focus of intensive care treatment, including heart and circulation, lungs and respiration, liver and coagulation, as well as kidneys and volume regulation.

The fact that from all demographic features [group 1] only age and height and none of the comorbidities [group 2] proved of a high enough predictive value independent of the other included parameters to show in the 20 parameters with the highest feature importance might seem unexpected at first glance, as many features from these groups have been shown in various previous studies as valuable predictors for critical and lethal courses in both critically ill COVID-19 and non-COVID viral pneumonia patients. However, when focusing on mortality, all of these features can be regarded as indirect predictors as they mediate the likelihood of specific organ failures that lead to a lethal course. Thus, in the case of the parameters included in the model that allow the prediction of lethal organ failure, the predictive value of the parameters from these first two groups of indirect parameters can be masked by the parameters indicating organ failure. For example, COPD has been shown in multiple studies to be a risk factor for a critical or lethal course in both COVID-19 and non-COVID-19 viral pneumonia patients7,14, but these critical and lethal courses are not caused by COPD directly and independently of organ failure. Instead, the effect of COPD is mediated through organ damage and associated increased risks of organ failure like lung or heart failure. Overall, this effect of organ failure parameters masking indirect risk factors in the prediction of lethal courses can be expected to increase with decreasing time between prediction and death. Thus, when focusing on the treatment phase in the intensive care unit, which is defined by immediate or impending organ distress and organ failure, the measures of the severity of the organ dysfunction can be expected to fully mask the indirect predictors, as we show here. The only indirect parameter that remained unmasked in our model was age, suggesting that other than the impact of specific diseases and disease groups the impact of age on organ function and compensation reserves for organ function during distress is not fully represented by the here included organ failure markers. In contrast, the role of the other parameter of the group of demographic features that was included in the 20 most important features—the patients’ height—is most probably not that the patients’ height is a predictor of mortality by itself, but that the patients’ height is an indirect prediction parameter that increases the information value of other predictors through individual normalization. As an example, the information value of urine output per kilogram of lean body weight (which is primarily determined by the height) is higher than the information value of urine output by itself.

Interestingly, the performance of the reduced model with 20 features approximated the performance of the full model towards the endpoints (Fig. 3, right), which suggests that these 20 features might indeed be closely related to the physiological processes during terminal organ failure due to viral pneumonia. However, the predictive performance of our non-COVID-19 viral pneumonia prediction models on COVID-19 patients cannot be explained by identical features alone, but it also requires similar relative weights between the features and similar value ranges at which the features exert their predictive value. This is supported by the fact that the 5 most important features from the full model on the non-COVID-19 patients training dataset (Supplementary Table 2) are also the 5 most important features of the reduced model on the COVID-19 patients test dataset and that all included 20 features show a significant feature importance on the test dataset (Fig. 4), although they were determined from the non-COVID-19 patients training data set. Interestingly, both the full and reduced models are also relatively well calibrated as shown by the relatively low Brier score (Table 2) and the calibration curve (Supplementary Fig. 1), further indicating a good transferability of the model trained on non-COVID-19 patients to COVID-19 patients. Thus, the most parsimonious explanation is that the fundamental pathophysiological mechanisms of terminal organ failure are highly similar in lethal courses of COVID-19 pneumonia and non-COVID-19 viral pneumonia15. Again, this is not surprising, as even if the damaging mechanisms as well as their time courses and relative rates of severe and lethal courses might differ, the signs caused by the damages can be expected to be similar. Thus, a prediction model of lethal courses for one type of viral pneumonia could be transferred to a different type of viral pneumonia and show a comparable predictive performance.

This concept of transferring knowledge obtained on one type of viral pneumonia to other types, even though they differ in mortality rates and time courses, might be highly beneficial for clinical management, especially in the context of a pandemic with a new disease. In such a situation, as experienced with COVID-19, one critical phase is the early stage of the pandemic, when the lack of knowledge and experience with the new disease puts exceptional stress on critical care resources16. In this situation, an effective and efficient allocation of critical care resources requires applicable prediction models of disease progression, which in the case of a new disease are only available with a significant delay. To bridge this gap and support resource allocation and disease management during the early phase of a pandemic with a new disease, it seems feasible to transfer prediction models from similar diseases for which data is widely available. As we have shown in this study, such a model transfer can lead to a similar predictive performance as models developed on early data of the specific disease.

Even though we here present only a methodological case study to demonstrate the possibility of knowledge transfer between similar diseases, models as we present here might actually be of practical use in situations when the demands on intensive care specialists overwhelm their capacities, like in the situation of a pandemic with a new disease. In such a situation, with an increasing number of patients overseen by one specialist, the risk that one patient begins to deteriorate unnoticed increases. Thus, models like the ones presented here, might allow to detect such deteriorations and to point out patients which might require a bit of focused attention while it is still early enough to intervene. Of course, our study only demonstrates that models like the one presented here hold such a potential, but further studies are required to investigate clinical processes how the allocation of intensive care specialists’ attention as a rare resource during a pandemic might be assisted by such models in a beneficial and ethical way.

This study has several limitations: since our datasets were retrospective, we could only analyze those parameters that had been recorded and documented in the electronic clinical databases. Thus, we had to exclude a variety of parameters, especially laboratory parameters like interleukins, tumor necrosis factor or specific leukocyte/lymphocyte subgroups, which might carry additional independent information or might prove as better predictors than parameters that we included, but which were not available in a sufficient number of patients of our cohorts. This also affected the number of patients that could be tested using the previously published models of Wang and Zhou, thus limiting the comparability between the here presented models.

Conclusions

In conclusion, we have demonstrated that a machine learning model trained on critically ill non-COVID-19 viral pneumonia patients allows to predict lethal courses in critically ill COVID-19 patients within the next 5 days with a predictive performance comparable to that of models specifically developed for COVID-19 patients. Therefore, we propose for future pandemics to apply already available prediction models to support critical care resource allocation and disease management, as the transfer of knowledge seems to be feasible, even when the specific diseases differ in time courses and in rates of critical and lethal courses.

Methods

Ethics approval and consent to participate

Before commencement, this study was approved by the local ethics committee (Ethikausschuss 4 am Campus Benjamin Franklin, Charité—Universitätsmedizin Berlin, Chairperson Prof. R. Stahlmann, Application Number EA4/008/19, approval date: 06 Feb 2019, amendment date: 14 May 2020). All methods were carried out following relevant guidelines and regulations. As a retrospective study analyzing anonymized data from standard clinical care, without any additional data collection, informed consent was waived for this study by the competent ethics committee.

Datasets, data extraction and data cleansing

For the non-COVID-19 dataset, we extracted all available data of patients treated between 11th November 2014 and 11th January 2021 in intensive care units (ICUs) from three centers of Charité – Universitätsmedizin Berlin, one of the largest university hospitals in Europe, with viral pneumonia and positive PCR tests for influenza, parainfluenza, metapneumovirus, orthopneumovirus or non-SARS-CoV-2 coronavirus from the computerized clinical information systems. For the COVID-19 dataset, we extracted the medical data of patients treated between 1st January 2020 and 19th April 2021 in ICUs of Charité – Universitätsmedizin Berlin with positive PCR tests for SARS-CoV-2.

The extracted medical data in both datasets comprises 123 clinical features including demographic data (e.g., age, gender, body mass index), blood gas analyses (e.g. paO2, pH, lactate), ventilation parameters (e.g. respiratory rate, tidal volume, oxygenation index), vital data (e.g., blood pressure, heart rate), laboratory tests (e.g. creatinine, bilirubin), fluid volume intake and efflux, diagnoses (ICD-10 codes), clinical scores (e.g. SOFA, SAPS2, APACHE2) and drug application rates (e.g. catecholamines).

Data cleansing included removal of invalid data (e.g. non-numeric strings instead of numbers), values outside of physical or physiological ranges and removal of duplicated values. No imputation of missing values has been performed. Patients were excluded from the data sets if their length of stay on the ICU was shorter than 24 h or if they were included in both datasets (i.e. tested positive for SARS-CoV-2 and another virus possibly causing viral pneumonia).

Dataset preprocessing

We aimed to predict the mortality of viral pneumonia patients (formal declaration of death) within the next 5 days relative to each time point. To that end, we aggregated each patient’s time series data into time bins of 24 h, starting at the admission to the ICU. For each time bin, we calculated the median (50th percentile) and the 10th and 90th percentiles of each time-varying variable available (i.e. measured) in this time bin. Data points available before the investigated 24 h time bin were carried over and considered “current” until a new value for that variable was available. We extracted different percentiles as each of these percentiles contains different information value. For example, a short period of extreme tachycardia during the observed 24 h time window would be captured in the 90th percentile but not in the median value.

We used only features that were present in at least 30% of the time bins in the training dataset. Static variables such as age and diagnoses were then appended to each time bin, for a total of 264 features.

Model development on non-COVID-19 patients

As we aimed to predict the mortality within the next 5 days relative to each time point, we assigned a positive class label to each of the 24 h time bins that occurred within 5 days before death of a patient. All other time bins, i.e. those that were more than 5 days before death or from surviving patients, were assigned a negative class label. To account for the different number of time bins per patient depending on how long they stayed in the ICU, we weighted each time bin by the inverse number of time bins of the respective class (positive/negative) per patient. For instance, if a patient deceased after 30 days on the ICU, we weighted the first 25 (negative) samples by 1/25 each and the last 5 (positive) samples by 1/5 each. The sample weights were used both during training and during calculation of performance metrics.

We trained gradient boosted decision tree models using XGBoost (version 1.3.3)17 on the training dataset of critically ill non-COVID-19 patients. To optimize the hyper-parameters of the XGBoost training algorithm, we applied a Bayesian optimization approach using the python BayesianOptimization package18,19 (version 1.1.0; see Supplementary Information for the parameter search range). Hyper-parameter optimization was performed using a tenfold cross-validation scheme with the average auROC value of the 10 hold-out sets (1 for each fold) as the maximization target for the optimizer (see Supplementary Table 1).

Feature selection for reduced model

To evaluate whether a model with a reduced number of features could achieve similar performance, we performed a feature selection based on feature importances on the training dataset as measured by Shapley additive explanation (SHAP) values—a game-theoretic approach to explain the output of machine learning models using the python package SHAP (version 0.39.0)20. We selected the top 20 unique features with the highest average absolute SHAP values, excluding duplicates of the same physiological parameters. Using that list of features, we trained another model on the training dataset in the same way as described above for the full model.

Prediction of mortality in COVID-19 patients and comparison with other models

To assess the ability of the gradient boosted tree models trained on non-COVID-19 viral pneumonia patients to predict the mortality of COVID-19 patients, we applied the trained model on the COVID-19 patient dataset and calculated the auROC and auPRC for the prediction of mortality, weighting each sample using the weights as described above to uniformly weigh each patient independent of their length of stay. We additionally calculated the positive predictive value (PPV)/precision, negative predictive value (NPV), F1 score, specificity and sensitivity/recall for the threshold value that yielded the highest F1 score. To assess model calibration, we determined the Brier score and calibration curves (fraction of positive vs. mean predicted value from the models). All performance metrics and the calibration curve were computed using scikit-learn (version 0.24.1). Confidence intervals were calculated using bootstrap resampling (2000 samples).

To display the time course of performance metrics, we applied the prediction models on the set of bins from the same time (e.g. first 24 h) without sample weighting, as each patient contributed at most one data point per time bin.

To compare the predictive performance of our model to those of the established clinical scores APACHE2, SAPS2 and SOFA, we calculated the respective performance metrics for the prediction of mortality within the next 5 days using the scores directly.

To further compare the predictive performance of previously published models that were specifically developed to predict mortality in COVID-19 patients1,2,8, we used each of the models as published (without retraining) to the COVID-19 dataset and evaluated the respective performance metrics for prediction of mortality within the next 5 days.

Predictive performance relative to the endpoint

To determine the predictive performance of our models not only relative to the admission to the ICU but also relative to the endpoint (death in non-surviving patients), we needed to define an endpoint for the surviving patients. Using the discharge from the ICU as the endpoint would create an unrealistic and trivial comparison, as surviving patients close to discharge will show significantly different characteristics as non-surviving patients close to death. We therefore applied a case–control matching scheme: Assuming that the time courses of the diseases are comparable between the surviving and non-surviving patients, we clipped the data of each surviving patient to the length of stay of a randomly selected non-surviving patient21,22. Specifically, we determined the length of ICU stay for all non-surviving patients (time between admission and death). Then, for each surviving patient, we randomly picked a length of ICU stay from the non-surviving patients that was less or equal than the length of ICU stay of that surviving patient. This length of stay of the non-surviving patient then determined the endpoint for the surviving patient. We thus matched the durations of ICU treatment of the surviving patients to those of the non-surviving patients, without changing the prevalence of mortality in the dataset, as all surviving and non-surviving patients are still maintained in the dataset.

To provide a higher time resolution of the predictive performance, we aggregated data in 1 h time bins with respect to the endpoint, calculating the variable aggregates (10th, 50th and 90th percentile) as described above.