Effectiveness, Explainability and Reliability of Machine Meta-Learning Methods for Predicting Mortality in Patients with COVID-19: Results of the Brazilian COVID-19 Registry

The majority of prognostic scores proposed for early assessment of coronavirus disease 2019 (COVID-19) patients are limited by methodological flaws. Our group recently developed a new risk score, ABC2-SPH, using traditional statistical methods (least absolute shrinkage and selection operator logistic regression, LASSO). In this article, we provide a thorough comparative study between modern machine learning (ML) methods and state-of-the-art statistical methods, represented by ABC2-SPH, in the task of predicting in-hospital mortality in COVID-19 patients using data collected upon hospital admission. We overcome methodological and technological issues found in previous similar studies, while exploring a large sample (5,032 patients). Additionally, we take advantage of a large and diverse set of methods and investigate the effectiveness of applying meta-learning, more specifically Stacking, in order to combine the methods' strengths and overcome their limitations. In our experiments, our Stacking solutions improved over the previous state-of-the-art by more than 26% in predicting death, achieving an AUROC of 87.1% and a macro-F1 of 73.9%. We also investigated issues related to the interpretability and reliability of the predictions produced by the most effective ML methods. Finally, we discuss the adequacy of AUROC as an evaluation metric for the highly imbalanced and skewed datasets commonly found in health-related problems.


Introduction
The number of patients with coronavirus disease 2019 (COVID-19), as well as the related deaths, has increased exponentially during the COVID-19 pandemic. Although over 7.6 billion doses of COVID-19 vaccines have been administered, variants are still emerging, and COVID-19 seems to be an issue governments worldwide will need to keep grappling with (1,2).
Given the current scenario, there is an urgent need for an early disease stratification tool upon hospital admission, to allow the early identification of the risk of death in patients with COVID-19, assisting in the management of the disease and optimizing resource allocation, hopefully helping to save lives during the pandemic. Although several scores have been proposed for the early assessment of COVID-19 patients at hospital admission, the majority of them are limited by methodological flaws and technological limitations, meaning that reliable prognostic prediction models are scarce (3)(4)(5).
A state-of-the-art method for this prediction task has recently been proposed by our group with the development of a new risk score, ABC2-SPH, using traditional statistical methods (least absolute shrinkage and selection operator, LASSO, regression), which exploits a rich set of information, including patient demographics, comorbidities, vital signs and laboratory parameters at the time of presentation, for assessing prognosis in COVID-19 patients (4). In this article, we first provide a thorough comparative study between modern machine learning (ML) methods and this state-of-the-art statistical method in the task of predicting in-hospital mortality using data collected upon hospital admission. Second, given the profusion and diversity of the compared methods, we investigate the effectiveness of meta-learning ensemble strategies, most notably Stacking (9), that combine the methods' outputs (probabilities), in order to exploit the ML methods' strengths and overcome their limitations.
Third, we study the reliability of the predictions of the most effective methods by correlating the probability of the outcome with the effectiveness (accuracy) of the methods. Few studies have investigated this important aspect of the predictions, which has a practical impact on the applicability of the methods. Fourth, we investigate how interpretable (or explainable) the predictions produced by the most effective methods are, using modern interpretability tools.

Related Work
This study also included a narrative review of existing prediction models for COVID-19 mortality. A literature search of Medline and MedRxiv was conducted in August 2021, with no language or date restrictions. The search terms "COVID-19" and "SARS-CoV-2" were used, combined with "mortality", "prognosis", "risk factors", "hospitalizations" or "score". Text screening retained 76 studies, included in the S1 Table. The existing literature largely focuses on American and Chinese in-hospital patients (53.94%). Models validated in one country cannot be readily extrapolated to other populations, since there is heterogeneity among countries in characteristics such as population features (including genetics, race, ethnicity and prevalence of comorbidities), socioeconomic factors, and the healthcare systems themselves (access, hospital patient load, practice and available resources) (10).
Another remarkable point is the sample size. Larger population studies are needed to allow certain metrics of model performance to be estimated with more accurate and reliable results. In contrast, smaller samples reduce the ability to identify risk factors and increase the likelihood of overfitting (11). Among the analyzed models, 17.10% were developed and validated in a sample of 500-1,000 patients, and 35.52% had fewer than 500 patients. Only 47.36% of the studies used a sample with more than 1,000 patients. Finally, few studies deeply analyzed the impact of the variables on the final model or on the final model outcome. Most studies did not investigate how reliable the predictions are in terms of the correlation between the probability of the prediction and the accuracy. This analysis has implications for the practical use of this technology: an accurate but unreliable method has its practical applicability diminished. We explicitly tackle these issues in our study.

Materials And Methods
This is a substudy of the Brazilian COVID-19 Registry, a multi-hospital cohort study previously described (13). Adult patients with laboratory-confirmed COVID-19, admitted consecutively to any of the 36 participating hospitals from March 1 to September 30, 2020, were enrolled. Individuals were not included if they were transferred between hospitals and data from the first or last hospital was not available, nor were those who were admitted for other reasons and developed COVID-19 symptoms during their stay (4).
Trained hospital staff or interns collected clinical and laboratory data from the medical records (S2 Table) (4,11). Laboratory exams were performed at the discretion of the treating physician. A prespecified case report form was used, applying Research Electronic Data Capture (REDCap) tools (14). Error-checking code was developed in R to identify data entry errors, as previously described (4), and the results were sent to each center for checking and correction before further data analysis.

Data analysis
The development, validation and reporting of the models followed guidance from the TRIPOD guideline and the Prediction model Risk of Bias Assessment Tool (PROBAST) (11,15).
At that time, 36 Brazilian hospitals participated in the cohort, located in 17 cities from five Brazilian states (4). A total of 5,032 patients were admitted between March 1, 2020 and September 30, 2020, and the full group was used to perform a 10-fold cross-validation procedure, which was repeated 3 times (for a total of 30 performance measurements for each of the classifiers presented in our study). The overall study population included 45.9% women, with a mean age of 60 ± 17 years; 27.17% needed mechanical ventilation and 20.15% died.
In order to properly assess the performance of different models, we chose to use three different metrics, each assessed through the aforementioned 10-fold cross-validation procedure, for each classifier. Our evaluation metrics include both the micro-average and macro-average F1-score (micro-F1 and macro-F1), and the AUROC. The F1 score is the harmonic mean between precision and recall, computed for each class (i.e. one score to estimate how well the model can predict which patients will die, and one to estimate the same regarding which patients will not die). The "average" part, described as either "micro" or "macro", refers to how these per-class results are aggregated. In "macro" averaging, all classes are taken as equally important, while in "micro" averaging, class imbalance is not accounted for in the final result and all individual predictions are considered equally important (16).
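To make the distinction concrete, the following is a minimal sketch of computing these metrics with scikit-learn; the labels and probabilities below are toy values for illustration only (1 = death, 0 = survival).

```python
from sklearn.metrics import f1_score, roc_auc_score

y_true = [0, 0, 0, 0, 1, 1, 0, 1]                   # ground-truth outcomes
y_pred = [0, 0, 1, 0, 1, 0, 0, 1]                   # hard predictions
y_prob = [0.1, 0.2, 0.6, 0.3, 0.9, 0.4, 0.2, 0.8]   # predicted death probabilities

micro_f1 = f1_score(y_true, y_pred, average="micro")  # every prediction weighs equally
macro_f1 = f1_score(y_true, y_pred, average="macro")  # every class weighs equally
auroc = roc_auc_score(y_true, y_prob)                 # ranking quality of probabilities

print(f"micro-F1={micro_f1:.3f} macro-F1={macro_f1:.3f} AUROC={auroc:.3f}")
```

Under class imbalance, as in our dataset (20.15% deaths), macro-F1 is more sensitive to poor performance on the minority (death) class than micro-F1.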
As for the specific models compared in our study, we trained two modern neural network benchmarks (the FNet transformer, with and without virtual adversarial training, which is a regularization technique) and a deep convolutional ResNet. We also experimented with a support vector machine classifier, a boosting model (Microsoft Research's Light Gradient Boosting Machine, LightGBM), and the K-nearest neighbors algorithm, as well as a Stacking of these methods.
We compare these ML alternatives to traditional statistical methods, including a Generalized Additive Model (GAM), which has rarely been used in this scenario before, and LASSO regression, the current state-of-the-art. GAM was used before in ABC2-SPH, but only to select variables for the LASSO regression, and yielded an inferior result when compared to LASSO regression, whereas in our work we directly tune GAM to the classification task, thus obtaining better results, as we shall see.
The choice of neural networks to include in our study was motivated by current state-of-the-art methods, even though, in general, neural networks tend to perform better in situations where massive amounts of data are available (17,18). The virtual adversarial training variant follows the regularization technique proposed in 2017, in which the model's decision boundary is smoothed in the most anisotropic direction through a gradient-based approximation (20).
Additionally, we included a standard support vector machine classifier, which learns a separating hyperplane between classes while maximizing the separation margin, and a K-nearest neighbors classifier, which yields predictions based on spatial similarities between training samples and new query points.
Motivated by the results shown in Shwartz et al. (2021) (21), we included a boosting algorithm (LightGBM), which is usually an effective model for tabular data, as concluded in Ke et al. (2017) (22). As the final classifier, we included a meta-learning ensemble-based Stacking model, which learns to combine the prediction outputs of all previous classifiers in order to improve classification effectiveness. We compare these methods to Generalized Additive Models (GAM) and LASSO regression, the latter being the current state-of-the-art model for this task, as demonstrated in our previous work.
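The sketch below illustrates such a Stacking ensemble with scikit-learn's StackingClassifier. The base learners and the logistic-regression meta-learner shown here are illustrative assumptions, not our exact configuration; the toy data stands in for the admission variables.

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Imbalanced toy data (roughly 80/20, mimicking the mortality rate).
X, y = make_classification(n_samples=500, weights=[0.8], random_state=0)

# Heterogeneous base learners whose predicted probabilities feed the meta-learner.
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=300, random_state=0)),
    ("lgbm", LGBMClassifier(random_state=0)),
    ("svm", make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))),
    ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
]

# The meta-learner is trained on out-of-fold base predictions to avoid leakage.
stacking = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",
    cv=10,
)
stacking.fit(X, y)
print(stacking.predict_proba(X[:3]))
```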
We ran all classification tests using a 10-fold cross-validation procedure, after which we calculated confidence intervals for each result, and confirmed statistical significance by applying a Wilcoxon signed-rank test at the 95% confidence level.
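A minimal sketch of this evaluation protocol follows, assuming scikit-learn and SciPy; the data and the two compared models are hypothetical placeholders.

```python
from scipy.stats import wilcoxon
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Hypothetical data and two models to compare fold by fold.
X, y = make_classification(n_samples=500, weights=[0.8], random_state=0)
model_a = RandomForestClassifier(random_state=0)
model_b = LogisticRegression(max_iter=1000)

# 10-fold CV repeated 3 times -> 30 paired macro-F1 measurements per model.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)
scores_a = cross_val_score(model_a, X, y, scoring="f1_macro", cv=cv)
scores_b = cross_val_score(model_b, X, y, scoring="f1_macro", cv=cv)

# Paired Wilcoxon signed-rank test; p < 0.05 -> significant at the 95% level.
stat, p_value = wilcoxon(scores_a, scores_b)
print(f"Wilcoxon statistic={stat:.1f}, p-value={p_value:.4f}")
```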
For the parametrization of our models, we used the values presented in Table 1, where the values in brackets were evaluated on the validation set of the cross-validation process. For the deep network models, we used early stopping, which optimizes the weights until the model shows no significant improvement on the validation set.

Ethics approval and consent to participate

The study protocol was approved by the Brazilian National Commission for Research Ethics (CAAE 30350820.5.1001.0008). Individual informed consent was waived due to the severity of the situation and the use of deidentified data, based on medical chart review only. All methods were carried out in accordance with relevant guidelines and regulations.

Results And Discussion
Classification results for the prediction of death are shown in Table 1. Tree-based ensemble models such as random forests and boosting ensembles tend to be more robust to small sample sizes and to overfitting, which is exactly the behavior we observed in our experiments (23). SVM and K-nearest neighbors (KNN), which are simpler models with fewer parameters, also tend to perform reasonably well on smaller datasets, and were better than the neural network models.
We should stress that the statistical models LASSO regression and GAM showed very competitive results.
Unexpectedly, GAM was the runner-up method considering all metrics, being even better than LASSO and some traditional ML methods such as SVM and KNN. In our work, we directly tuned GAM to the classification task, using the cross-validation procedure, which yielded superior performance. GAM and LightGBM are statistically tied on all evaluation metrics, considering a Wilcoxon signed-rank test at the 95% confidence level.
In any case, the best overall model, with statistical significance under all considered metrics, was the Stacking model, which combines the outputs of all other individual models and, in turn, allows us to better discriminate between patients with higher clinical risk at admission time. When considering micro-F1, macro-F1, F1 for death and AUROC in the task of predicting death, Stacking was statistically significantly better than all other models. The largest gains were in F1 for predicting death, with gains of more than 26% over LASSO, the previous state-of-the-art. The Stacking technique improves the F1-score for the class of interest (death) by 7% over RF, by 5% over LightGBM and by 6% over GAM, which were the three best individual models in this metric. The combination of models based on different classification premises potentially made Stacking more robust: if a single classifier makes a wrong prediction, the others can still make corrections, increasing the robustness of the final Stacking model.

Fig 1 presents the ROC curves of the compared models. Despite similarities in the curves and in AUROC values, these classifiers can yield quite different results when compared with micro-F1 and macro-F1, or class-specific F1 scores, which shows that (1) the AUROC score is not an adequate metric for evaluating and comparing models, especially in the face of high imbalance/skewness, and that (2) even though some models, like Stacking and GAM, have very similar AUROC scores, their capacity to discriminate relevant outcomes like death is quite different (0.532 F1 score for GAM and 0.564 for Stacking, a significant difference of 6%).
Interestingly, using such curves, we can sensibly calibrate the trade-off between sensitivity and specificity, further customizing the way such models can be used. In particular, when applying Stacking, our model can be tailored to the early identification of high-risk patients, with good discrimination capacity.
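The sketch below shows one way this calibration could be done from an ROC curve, assuming scikit-learn; the labels, probabilities and the 90% sensitivity target are illustrative assumptions, not values from our study.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical validation labels and predicted death probabilities.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 0])
y_prob = np.array([0.1, 0.3, 0.8, 0.2, 0.6, 0.9, 0.4, 0.7, 0.2, 0.5])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)

# Pick the first (highest) threshold reaching the sensitivity target,
# favoring early identification of high-risk patients.
idx = int(np.argmax(tpr >= 0.90))
print(f"threshold={thresholds[idx]:.2f} "
      f"sensitivity={tpr[idx]:.2f} specificity={1 - fpr[idx]:.2f}")
```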

Explainability
Explainability is an essential aspect of the task if ML methods are to be trusted and actually used by practitioners. Various prognostic factors have been proposed for the stratification of COVID-19 patients based on their risk of death, including clinical, laboratory and radiological variables. Among these risk factors, advanced age, multiple comorbidities on admission (such as hypertension, diabetes mellitus, cardiovascular diseases and others), and abnormal levels of C-reactive protein (CRP), lymphocytes, neutrophils, D-dimer, blood urea nitrogen (BUN) and lactate dehydrogenase (LDH) stand out.
A very interesting feature of some ML models, in particular decision trees, RF and boosting ensembles, is their explainability. This is still a very active research area, but modern advances in tools and visualization alternatives allow us to represent which features are most important to the model, and at which polarities and intervals. In this context, as previously mentioned, the best model in our tests was Stacking, which is a meta-model whose inputs are the outputs of other classifiers. Since we aim to explain a classifier that works at the level of the features themselves (instead of at a meta-level of other classifiers' outputs), we provide explanations for the second-best model, LightGBM. Furthermore, tree-based boosting and bagging algorithms rank among the most explainable machine learning models, and also lead many benchmarks, particularly for tabular data where samples are not that large. Their unique combination of explainability, reliability and performance, added to the fact that Stacking is a meta-classifier, is why we exploit the boosting model (which, in our case, outperformed the bagging model, random forests/RF) to analyze the correlations among variables.
In a sense, some traditional models, such as regression models, also offer good explainability, as it is possible to assess the coefficient of each attribute to measure how important a feature is. These models, however, do not measure up to modern tree-based algorithms in many scenarios, especially with larger datasets (24). Another key difference is that, in regression models, collinear variables must be explicitly removed, whereas tree-based models can keep them; even though such variables might not improve classification performance, they still yield valid model explanations.
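For contrast, the sketch below shows coefficient-based explainability for an L1-penalized (LASSO-style) logistic regression in scikit-learn; the data and feature names are hypothetical placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical data standing in for admission variables.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
feature_names = [f"var_{i}" for i in range(X.shape[1])]

# Standardize so coefficient magnitudes are comparable across features.
X_std = StandardScaler().fit_transform(X)

# L1 penalty drives some coefficients to exactly zero (feature selection).
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso.fit(X_std, y)

# Sign gives the polarity of each feature's effect; magnitude, its importance.
for name, coef in zip(feature_names, lasso.coef_[0]):
    print(f"{name}: {coef:+.3f}")
```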
In decision-tree-based algorithms, each node represents a feature. The closer a feature is to the root (i.e. the 'first' node of each tree), the better it is able to differentiate the data classes. For example, in Fig 2, the feature 'SF ratio' with a value below 233 and the feature 'lactate' with a value below 1.68 mmol/L identify a subset containing 5.9% of the dataset in which the 'death' outcome is more common.
These algorithms look for the values of the features that best separate the classes, while trying to decrease the Gini coefficient or entropy values of the class label (which are measures of impurity and information) in each partition of the decision tree; this coefficient is called the Gini index. Both the Gini index and the entropy score tend to isolate records that represent the most frequent class in a branch.
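A minimal sketch of fitting and inspecting a depth-2 tree like the one in Fig 2, assuming scikit-learn; the data and feature names are hypothetical stand-ins for variables such as 'SF ratio' and 'lactate'.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical tabular data.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
feature_names = [f"var_{i}" for i in range(X.shape[1])]

# Depth-2 tree, as in Fig 2; each split minimizes the Gini impurity
# (gini = 1 - sum_i p_i^2) of the resulting partitions.
tree = DecisionTreeClassifier(max_depth=2, criterion="gini", random_state=0)
tree.fit(X, y)

# Prints each node's feature and cut value, mirroring the boxes in Fig 2.
print(export_text(tree, feature_names=feature_names))
```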
In Fig 3, we present SHapley Additive exPlanations (SHAP) values for our boosting model. This is a special type of explainability technique, which allows us not only to probe which features were important to the model, but also which polarities or intervals push predictions towards each of the training classes; additionally, it allows us to evaluate why the model predicted any single instance (25).
For a simple model, such as a regression, the model itself is a reasonable explanation of what was learned. However, for more complex models, which in turn are capable of learning more complex solutions (provided enough data is present at training time), we cannot use the model to explain itself. In these situations, Shapley values build upon the idea that the explanation of a model can itself be a model. This technique has been recently introduced and further expands the explainability of machine learning models, making them even more useful as they become more interpretable (25).
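The following is a minimal sketch of computing SHAP values for a LightGBM model, as in Fig 3, assuming the `shap` package; the data and model configuration are hypothetical stand-ins for ours.

```python
import shap
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification

# Hypothetical data and model standing in for our LightGBM classifier.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
model = LGBMClassifier(random_state=0).fit(X, y)

# TreeExplainer computes exact SHAP values for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
# Older shap versions return one array per class for binary problems;
# keep the array for the positive ("death") class if so.
if isinstance(shap_values, list):
    shap_values = shap_values[1]

# Global summary: feature importance plus the direction in which
# high/low feature values push the prediction (as in Fig 3).
shap.summary_plot(shap_values, X)
```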
In the SHAP plot, low values (blue tones) of oxygenation-related features indicate higher risk. Although COVID-19 is a multisystem disease, it is well known that lung involvement is the mainstay for assessing disease severity, and oxygen requirement upon hospital admission has been shown to be an independent predictor of severe COVID-19 in several studies (31,32).
We also observed that lower platelet counts and higher levels of urea and C-reactive protein also increase the risk of mortality, which is in line with what was previously observed using statistical models (33).
Other studies suggest that C-reactive protein is a marker of a cytokine storm developing in patients with COVID-19 and is associated with disease mortality (34)(35)(36).
Interestingly, ML models can return explanations in the form of intervals, such as the behavior seen in Fig 3 for sodium and bicarbonate levels, which implies there is a "safe interval" in which risk is lower, while values either too high or too low yield a higher risk of death. Capturing this is an intrinsic limitation of regression models, in which such a variable may be deemed non-significant simply because its association with the outcome is non-linear.
From a clinical perspective, these results are in line with a recent study, which demonstrated that hypernatremia and hyponatremia during COVID-19 hospitalization are associated with a higher risk of death and respiratory failure, respectively (37). With regard to bicarbonate, low levels are related to acidosis, and high levels are usually related to advanced chronic obstructive pulmonary disease (COPD) with carbon dioxide retention, both conditions well known to be associated with worse prognosis in clinical practice (38)(39)(40). This sort of non-linearity cannot be captured by simple regression models, in which we can only measure how large coefficient values are and correlate that with the importance of each feature.
In exploiting LASSO regression in our previous work (4), we had to exclude some features which had been shown to be important in the boosting model, due to high collinearity. This may explain the difference in the features included in the two models, despite the fact that all features included in both had previous evidence of association with COVID-19 prognosis.
Another interesting remark is shown in Fig 4, in which we can see the relative importance of each feature.
Here, again, age is the most important single feature (having the highest mean SHAP value), which is in line with previous studies (3,26,27). However, the remaining features, when combined, yield higher predictive value in this task than age alone.

Reliability
Finally, we investigate issues related to the reliability of the models. Neural network models are, for instance, known for having irregular error rates, regardless of prediction confidence. At the other end of the spectrum, boosting and bagging models tend to have a very interesting reliability profile, with a tendency towards lower error rates at high confidence scores and higher error rates at lower confidence scores. This enables tuning the trade-off between accuracy and sensitivity for some specific classifiers.
Accordingly, in Fig 5, the x-axis shows prediction ranges for the model's confidence score, while the y-axis shows the percentage of hits or misses for the model. Note that the model makes more correct predictions (hits, in green) when it is more certain of the prediction (range 0.87-0.96). Thus, this classifier yields a useful reliability profile in relation to its confidence score. This characteristic means we can tune how many patients the model will flag, as well as how sensitive or specific that indication will be. Such tuning can be tailored to any healthcare service, accounting for intensive care unit beds, available professionals and so on.
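The sketch below illustrates how such a reliability profile could be computed, by binning predictions by confidence and measuring the hit rate per bin, as plotted in Fig 5; the probabilities and labels are synthetic toy values, not our data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical predicted death probabilities and (calibrated) toy labels.
y_prob = rng.uniform(0, 1, 1000)
y_true = (rng.uniform(0, 1, 1000) < y_prob).astype(int)

confidence = np.maximum(y_prob, 1 - y_prob)   # confidence of the predicted class
y_pred = (y_prob >= 0.5).astype(int)
hits = (y_pred == y_true)

# Hit rate per confidence range: higher-confidence bins should show
# higher hit rates for a reliable classifier.
bins = np.linspace(0.5, 1.0, 6)
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (confidence >= lo) & (confidence < hi)
    if mask.any():
        print(f"confidence [{lo:.1f}, {hi:.1f}): "
              f"hit rate {hits[mask].mean():.1%} (n={mask.sum()})")
```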
Based on the S1 Table, there were few prediction studies with extensive analyses utilizing AI techniques. In this study, AI techniques were compared to traditional statistical methods to develop a model to predict COVID-19 mortality, considering demographic, comorbidity, clinical presentation and laboratory data. We observed that, regarding the prediction of the class of interest (death), the best individual method was an ML one (LightGBM), closely followed by a statistical model (GAM), both being better than the neural network models, and both being surpassed by a meta-learning ensemble model, Stacking, which was the best overall solution considering all criteria for the posed prediction problem.
We would like to emphasize that, despite the fact that in medical research the AUROC is widely used as the sole measure of a model's discriminatory ability, our data confirm that it is an insufficient metric for evaluating and comparing models. The F1 score is a more robust metric, especially for larger, more complex and imbalanced datasets, which are common in health-related scenarios.

Conclusion
In this study, modern AI techniques performed better than traditional statistical methods to predict COVID-19 mortality. The meta-learning strategy based on Stacking surpassed the state-of-the-art LASSO regression method by more than 26% for predicting death. We also demonstrated that the AUROC score is an insufficient metric for evaluating and comparing models: even though some models, like Stacking and GAM, have very similar AUROC scores, their capacity to discriminate relevant outcomes like death is quite different (0.532 F1 score for GAM and 0.564 for Stacking, a relative difference of 6%). Finally, we investigated issues related to the explainability and prediction reliability of the best ML models, concluding that they are potentially very useful for practical purposes in real settings. Age was the main mortality risk predictor, but urea, C-reactive protein, lactate, respiratory rate, heart rate, NLR, neutrophils, sodium and pCO2 may also significantly influence the disease outcome.

Figure 1
Receiver Operating Characteristic (ROC) curve comparing multiple models, trained on the prediction of the death outcome.

Figure 2
A sample decision tree with depth 2, trained on our dataset. At each level but the last, the first line of text in each box shows the variable and its cut value before the split.

Figure 4
Mean SHAP values for each feature in the prediction of the death outcome.

Figure 5
Error rates for each confidence threshold in the Stacking model.

Supplementary Files
This is a list of supplementary files associated with this preprint: TableS1andS2.pdf