At-admission prediction of mortality and pulmonary embolism in an international cohort of hospitalised patients with COVID-19 using statistical and machine learning methods

By September 2022, more than 600 million cases of SARS-CoV-2 infection had been reported globally, resulting in over 6.5 million deaths. However, COVID-19 mortality risk estimators are often developed from small, unrepresentative samples and with methodological limitations. Developing predictive tools for pulmonary embolism (PE), one of the most severe preventable complications of COVID-19, is therefore highly important: early recognition can enable life-saving targeted anti-coagulation therapy right at admission. Using a dataset of more than 800,000 COVID-19 patients from an international cohort, we propose a cost-sensitive gradient-boosted machine learning model that predicts the occurrence of PE and death at admission. Logistic regression, Cox proportional hazards models, and Shapley values were used to identify key predictors for PE and death. On a highly diverse, held-out test set, our prediction model achieved test AUROCs of 75.9% and 74.2%, and sensitivities of 67.5% and 72.7%, for PE and all-cause mortality respectively. The PE prediction model was also evaluated separately on patients in the UK and Spain, with test results of 74.5% AUROC and 63.5% sensitivity, and 78.9% AUROC and 95.7% sensitivity, respectively. Age, sex, region of admission, comorbidities (chronic cardiac and pulmonary disease, dementia, diabetes, hypertension, cancer, obesity, smoking), and symptoms (any, confusion, chest pain, fatigue, headache, fever, muscle or joint pain, shortness of breath) were the most important clinical predictors at admission. Age, overall presence of symptoms, shortness of breath, and hypertension were found to be key predictors for PE using our extreme gradient-boosted model. This analysis, based on what is to date the largest global dataset for this set of problems, can inform hospital prioritisation policy and guide long-term clinical research and decision-making for COVID-19 patients globally.
Our machine learning model, developed from an international cohort, can help hospitals better prioritise at-risk patients.


II. BAYESIAN OPTIMISATION METHOD
Assume a Gaussian process (GP), defined by the property that any finite set of N points {x_n ∈ X}_{n=1}^{N} induces a multivariate Gaussian distribution on R^N. Assume that the observations are of the form {x_n, y_n}_{n=1}^{N}, where y_n ∼ N(f(x_n), ν) and ν is the noise variance. This prior and the observations induce a posterior over functions; the acquisition function, denoted a : X → R^+, determines which point in X should be evaluated next via the optimisation x_next = argmax_x a(x). In other words, an acquisition function is a function of the posterior distribution that describes the utility of all hyperparameter values. The acquisition function depends on the previous observations as well as the GP hyperparameters; this dependence is written a(x; {x_n, y_n}, θ). Under the prior, the acquisition function depends on the model solely through its predictive mean function μ(x; {x_n, y_n}, θ) and predictive variance function σ²(x; {x_n, y_n}, θ), with the current best value x_best = argmin_{x_n} f(x_n) and Φ(·) denoting the cumulative distribution function of the standard normal. The strategy is to maximise the expected improvement (EI) over the current best and use the highest-utility hyperparameter values to compute the next loss.
When maximising the EI, one samples points that are expected to have either a higher utility or that were previously unexplored. This approach saves both time and computational resources when searching for the optimal combination of hyperparameters, without trying out all possible combinations. The algorithm can be summarised as:
1) Given the observed values f(x), update the posterior using the GP model.
2) Find x_new that maximises the EI: x_new = argmax_x EI(x).
3) Compute the loss for the point x_new.
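The EI criterion above depends on the posterior only through the predictive mean and variance, so it can be sketched in closed form. The following is a minimal illustration, not the study's implementation; the candidate means and standard deviations are hypothetical values standing in for a real GP posterior.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """EI for minimisation: the expected amount by which f at a candidate
    point improves on the current best value f_best, given the GP posterior
    mean mu and standard deviation sigma at that point."""
    sigma = np.maximum(sigma, 1e-12)  # guard against zero predictive variance
    z = (f_best - mu) / sigma
    return (f_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

# Hypothetical posterior over three candidate hyperparameter settings:
mu = np.array([1.0, 0.5, 0.2])      # posterior means of the loss
sigma = np.array([0.1, 0.3, 0.05])  # posterior standard deviations
# Step 2 of the algorithm: evaluate next wherever EI is largest.
x_next = np.argmax(expected_improvement(mu, sigma, f_best=0.4))
```

Note how the criterion trades off exploitation (low mean) against exploration (high variance): a point with a mean only slightly worse than the best but with large uncertainty can still receive a sizeable EI.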

III. METRICS
The metrics used to evaluate the models include:
1) Area under the receiver operating characteristic curve (AUROC): an ROC curve plots the true positive rate against the false positive rate, where each point on the curve represents a sensitivity/specificity pair corresponding to a particular decision threshold. The area under the ROC curve is a summary measure of sensitivity and specificity.
2) Accuracy: the ratio of correctly classified examples to the total number of cases in the dataset. In our case it can be misleading because of class imbalance, where simply assigning all examples to the majority class is enough to achieve high accuracy.
3) Weighted F1 score: the harmonic mean of precision and recall, which penalises extreme values of either, weighted by class proportions due to the imbalance.
4) Sensitivity: the probability of a positive prediction for patients with disease (i.e. the conditional probability of correctly identifying diseased patients), TP / (TP + FN).
5) Precision (PRE, or positive predictive value): the ratio of correctly identified positive examples to the total number of predicted positives, TP / (TP + FP).
Here TP denotes true positives (correctly classified positive), TN true negatives (correctly classified negative), FP false positives (falsely classified positive), and FN false negatives (falsely classified negative).
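The metrics listed above can be computed as follows; this is a toy sketch on six hypothetical predictions, not study data, using scikit-learn for the aggregate scores and the confusion-matrix counts directly for sensitivity and precision.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score

# Hypothetical labels and predicted probabilities for six patients
y_true = np.array([0, 0, 0, 0, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.9])
y_pred = (y_prob >= 0.5).astype(int)  # default 0.5 decision threshold

# Confusion-matrix counts
tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

sensitivity = tp / (tp + fn)  # TP / (TP + FN)
precision = tp / (tp + fp)    # TP / (TP + FP)
auroc = roc_auc_score(y_true, y_prob)              # threshold-free summary
acc = accuracy_score(y_true, y_pred)
f1w = f1_score(y_true, y_pred, average="weighted")  # class-weighted F1
```

Note that AUROC is computed from the probabilities, not the thresholded labels, which is why it remains informative under the class imbalance discussed below.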

IV. INTERPRETABILITY METHODS
Every classification made by a decision tree can be associated with a corresponding decision path, and the F-score is simply the number of times a feature is used to split the data across all trees. We use the shap library, which builds on the game-theoretic concept of treating the features in the final model as players in a voting game. The method is applied to the entire test set and is based on ideas from game theory [28], [29]. In short, the following equation is used to calculate the Shapley value φ_i of feature i:

φ_i(v) = Σ_{S ⊆ N \ {i}} [ |S|! (|N| − |S| − 1)! / |N|! ] (v(S ∪ {i}) − v(S)),   (1)

where each feature's contribution is the difference between the value of the characteristic function v on S ∪ {i} and on S, with N the set of all features and S a subset of N that does not contain feature i. The Shapley value of a particular feature i is thus the weighted average of its marginal contributions over all possible subsets.
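Equation (1) can be implemented exactly for a small number of features. The sketch below is an illustration of the formula itself, not of shap's optimised tree algorithm; the characteristic function and its additive contributions are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_value(v, n, i):
    """Exact Shapley value of feature i for characteristic function v over
    features {0, ..., n-1}: the weighted average marginal contribution
    v(S ∪ {i}) − v(S) over all subsets S not containing i, per eq. (1)."""
    others = [j for j in range(n) if j != i]
    phi = 0.0
    for r in range(len(others) + 1):
        for S in combinations(others, r):
            weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            phi += weight * (v(set(S) | {i}) - v(set(S)))
    return phi

# Additive toy game: each feature contributes a fixed amount, so its
# Shapley value should equal exactly that amount (hypothetical weights).
contrib = [1.0, 2.0, 3.0]
v = lambda S: sum(contrib[j] for j in S)
phi = [shapley_value(v, 3, i) for i in range(3)]
```

The exact computation is exponential in the number of features, which is why shap's TreeExplainer exploits the tree structure of the final model instead of enumerating all subsets.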

VI. CLASS IMBALANCE
Figure 1 shows the PE predictions for XGBoost. On the left, the XGBoost model with default parameters and setup is incapable of learning a clear probability boundary between the heavily imbalanced classes. Another example of the need for thresholding can be seen in the prediction probabilities on the training set of the logistic regression model in Figure 2 below. Clearly, the default probability threshold of 0.5 is not sufficient to capture the discrimination between the two classes, and a lower one would be more suitable. Even the optimal threshold, however, would increase the number of false positives, as the probability densities of the two classes overlap. Fortunately, our main concern is the level of sensitivity coupled with the AUROC, which together capture the majority of true positive cases in rare disease occurrence.
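One common way to pick a lower threshold of the kind described above is to maximise Youden's J statistic (sensitivity + specificity − 1) along the ROC curve. This is a minimal sketch on hypothetical imbalanced scores, where the positives cluster well below the 0.5 default, and is one of several reasonable threshold-selection rules rather than the study's procedure.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical imbalanced data: 8 negatives, 2 positives, with positive-class
# probabilities sitting below the default 0.5 threshold.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_prob = np.array([0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.25, 0.30, 0.40])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)
youden_j = tpr - fpr  # balances sensitivity against specificity
best_threshold = thresholds[np.argmax(youden_j)]
```

Here the chosen threshold lands far below 0.5, capturing both positives at no cost in false positives; with overlapping class densities, as in Figure 2, the same rule instead settles on the trade-off point where added sensitivity stops being worth the added false positives.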

Models and brief descriptions:
- Logistic Regression: maps a linear relationship, taking into account correlations between covariates.
- Linear Discriminant Analysis: maps a linear relationship, assuming the covariates are independent and normally distributed.
- Naive Bayes: a probabilistic estimator assuming conditional independence between covariates, ignoring correlations.
- Random Forest: an ensemble of decision trees whose predictions are aggregated for the final prediction.
- XGBoost: uses extreme gradient boosting over ensembles of decision trees for prediction.
- Ensemble: uses AdaBoosted decision trees, similar to XGBoost but with a different boosting mechanism, in an ensemble.
- Ensemble with XGBoost: uses our XGBoost model as the base estimator in the ensemble hierarchy instead of AdaBoost.
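The abstract describes the gradient-boosted model as cost-sensitive. One standard way to achieve this is to reweight training examples inversely to class frequency, as sketched below with scikit-learn's gradient boosting on synthetic data; the feature model, prevalence, and all parameter values are hypothetical stand-ins, not the study's configuration (XGBoost itself offers an analogous `scale_pos_weight` parameter).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic imbalanced data: a rare positive class driven by two features,
# loosely mimicking the low prevalence of PE among admissions.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
logits = 2.0 * X[:, 0] + 1.5 * X[:, 1] - 3.5
y = (rng.random(1000) < 1 / (1 + np.exp(-logits))).astype(int)

# Cost-sensitive reweighting: minority-class examples get larger weights,
# so misclassifying a rare positive costs more during boosting.
w = compute_sample_weight(class_weight="balanced", y=y)
clf = GradientBoostingClassifier(random_state=0)
clf.fit(X, y, sample_weight=w)
prob = clf.predict_proba(X)[:, 1]
```

Reweighting shifts the learned probabilities upward for the minority class, which works together with the lowered decision threshold discussed in the class-imbalance section.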

VIII. AGE SKEW FOR UK AND SPAIN PATIENTS
It is important to note that the patient populations from Spain and the UK differ, especially in their age distributions. Figures 5 and 6 show that patients in Spain are far more likely to fall in the 40-80 years band, while those in the UK fall more often into the <40 and >80 years categories. As age can be an impactful predictor of both PE occurrence and death, it is to be expected that the model results for these two patient populations may differ.

Fig. 1. Confusion Matrices of XGBoost Before and After Imbalance Adjustment

Fig. 2. Probability Prediction Density of Logistic Regression for PE Reveals Trade-off of Sensitivity and Specificity

Fig. 3. Correlation of Features with PE (Only Spain and UK Data)

TABLE I. MACHINE LEARNING METHODS DEPLOYED DURING STUDY

TABLE II. CORRELATIONS OF FEATURES WITH PE (LEFT) AND WITH DEATH (RIGHT)

TABLE III. CORRELATIONS OF UK AND SPAIN PATIENT FEATURES WITH PE (LEFT) AND ALL PATIENTS WITH DEATH (RIGHT) (CONTINUED)