The escalating influx of patients into emergency departments (EDs) has given rise to a critical issue known as emergency overcrowding, a widely reported situation that creates a significant mismatch between scarce resources and the genuine needs of patients1,2. Effectively addressing this intricate phenomenon necessitates strategic interventions3,4. An essential aspect of effective management involves developing efficient assessment methods to gauge the severity of critically ill patients and to predict outcomes such as deterioration and mortality at the earliest possible stage5,6. Employing such risk stratification tools facilitates early detection, intervention, and intensive monitoring of individuals at heightened risk of morbidity or mortality7,8.

Several studies have investigated the application of scoring systems to predict in-hospital mortality, identified by a discharge status of “died” or “died in a medical facility”6,9,10,11,12,13. Within the Iranian context, specific studies have utilized scoring systems for predicting in-hospital mortality in the ED, incorporating predictors such as demographic information, vital signs, mechanical ventilation status, oxygen saturation, abnormal electrocardiography findings, and the history of underlying diseases. Notable among these systems are the Acute Physiology and Chronic Health Evaluation (APACHE)14, Simplified Acute Physiology Score (SAPS)14, and Sequential Organ Failure Assessment (SOFA)15. Additionally, an Iranian study compared in-hospital mortality prediction between emergency residents' judgment and prognostic models in the ED, highlighting the superior calibration of mortality risk prediction by SOFA16. These investigations collectively underscore the utility of scoring systems in assisting clinicians with timely intervention decisions, crucial for mitigating in-hospital mortality. However, it is noteworthy that existing scoring systems and certain severity indices primarily rely on conventional methods such as logistic regression (LR)17,18,19,20,21. These static scores may not fully capture patient progression, necessitating a deeper understanding of how to tailor interventions based on individual patient conditions.

In recent years, advances in predictive modeling, particularly through the application of machine learning (ML) methodologies, have significantly enhanced forecasting capabilities across diverse scenarios22,23,24,25,26. These cutting-edge approaches have successfully illuminated high-order nonlinear interactions among variables, thereby contributing to more robust predictions27,28. Moreover, recent developments in ML models have yielded promising outcomes in predicting clinical scenarios, including mortality within EDs29,30,31,32,33,34,35,36. Noteworthy is a study that addressed ML-based early mortality prediction in the ED by quantifying the criticality of ED patients, emphasizing the substantial potential of ML as a clinical decision-support tool to aid physicians in their routine clinical practice31. Additionally, another investigation conducted a retrospective comparison between the Modified Early Warning Score (MEWS) and an ML approach in adult non-traumatic ED patients29. The study concluded that ensemble stacking ML methods exhibit an enhanced ability to predict in-hospital mortality compared to MEWS, particularly in anticipating delayed mortality.

Ensemble learning (EL), an established ML technique, stands out as a robust approach by amalgamating predictions from multiple models to enhance overall performance and predictive accuracy37,38. In the context of predicting in-hospital mortality in emergency medicine, EL models may be a dependable alternative to classical LR-based scoring systems for several reasons: (1) In the domain of emergency medicine, patient outcomes are intricately linked to complex relationships that classical models may struggle to discern; (2) Emergency medicine datasets often exhibit missing information or anomalous values in patient records. Ensemble models exhibit robustness in providing predictions despite encountering such challenges; (3) By combining models that make errors on distinct subsets of the data, ensemble methods contribute to improved prediction accuracy. This diversity proves particularly beneficial in capturing the heterogeneity observed in emergency medicine cases; (4) Ensemble methods demonstrate superior generalization capabilities to new, unseen data. This attribute is crucial in emergency medicine, where patient populations and conditions exhibit variations, demanding a model with robust generalization capabilities; (5) The flexibility in hyperparameter tuning offered by ensemble methods is indispensable when confronted with diverse patient populations and the dynamic nature of evolving medical practices in emergency medicine.

Hence, the present study formulated the hypothesis that EL models might exhibit superior predictive capabilities for in-hospital mortality in EDs compared to traditional LR-based models. While the potential advantages and capabilities of EL techniques in constructing predictive models are acknowledged, the assessment of these models, particularly in comparison to classical LR models, remains limited, especially within the context of Iran. Consequently, the primary objective of this study is to compare the predictive performance of EL models with LR models for in-hospital mortality in EDs within a single-center setting in Iran.

Material and methods

The current study proposed a framework for comparing the performance of LR and EL models in predicting in-hospital mortality using the same predictors. EL methods included Bagging39, AdaBoost40, Random Forests (RF)41, Stacking42, and Extreme Gradient Boosting (XGB)41. The key challenges associated with predicting in-hospital mortality include mixed data types, a large number of features, unbalanced data, and the low performance of developed models in some settings such as EDs, all of which encourage the use of ML models.

To address these challenges, our framework comprises three main phases: pre-processing (descriptive analysis, data normalization, and resampling), model development, and evaluation on the real dataset. An overview of the proposed framework is illustrated in Fig. 1.

Figure 1

Overview of the proposed ensemble ML models for predicting in-hospital mortality in the emergency department (ED). For the prediction of in-hospital mortality in EDs, logistic regression and five ensemble models were developed, trained, and evaluated on a dataset of 2205 patients with 24 predictors, in which alive and deceased patients represented 81% and 19%, respectively. This dataset was randomly partitioned into two subsets: the training set includes 67% of the data (n = 1477), and the remainder (n = 728) was assigned to the test set. RF, random forests; XGB, extreme gradient boosting.

Study design and dataset description

This cross-sectional study was conducted from March 2016 to March 2017 in the largest referral ED in the northeast of Iran, which receives over 200,000 patient visits each year. The study followed the TRIPOD statement for reporting prognostic models (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis). The ethics committee of Mashhad University of Medical Sciences approved the study (Number: IR.MUMS.MEDICAL.REC.1402.129), and it conformed to the principles of the Declaration of Helsinki. Informed consent was obtained from all participants or their legal guardian(s) before the study.

Inclusion and exclusion criteria

All adult patients, aged ≥ 18 years, with Emergency Severity Index (ESI) triage levels 1 to 3 who presented to the ED throughout the research period were included. Patients triaged directly to a specific department or the intensive care unit (ICU) were excluded from the study. Detailed information about the inclusion and exclusion criteria was presented previously in another report14.

In-hospital mortality as the outcome variable

In this study, in-hospital mortality was defined as an encounter with a discharge status of “died” or “died in a medical facility.” Two classes were defined as the primary outcome, “Alive” and “Deceased,” encoded as binary target values of 0 and 1, respectively.

Covariates

The final diagnosis was reported using the International Classification of Diseases, 10th revision (ICD-10) codes. The variables considered in this study are routinely used in traditional scoring systems such as the APACHE and SOFA families for predicting in-hospital mortality or morbidity and have been previously validated internally in our setting14,15. These variables can be categorized into six primary domains: demographic data, vital signs, hematology, biochemistry, gasometry, and clinical parameters.

The demographic data include age and gender. The vital signs category incorporates body temperature (Temp), Mean Arterial Pressure (MAP) derived from diastolic and systolic blood pressure, Respiratory Rate (RR), pulse rate, and the Glasgow Coma Scale (GCS). Hematological indicators consist of Hematocrit (HCT), White Blood Cell (WBC) count, and platelet (PLT) count. The biochemistry domain encompasses plasma concentrations of Creatinine (Cr), Potassium (K), Albumin (Alb), Bilirubin (Bil), Sodium (Na), Blood Sugar (BS), pH, and Urea.

Gasometry parameters include Partial pressure of arterial oxygen (PaO2), Bicarbonate (HCO3), Partial pressure of carbon dioxide (PCO2), and Fraction of inspired oxygen (FiO2). Lastly, clinical parameters involve the use of a Mechanical Ventilator (MV), ED status (triage level measured by the Emergency Severity Index (ESI) and ED arrival method, walk-in vs. ambulance), and past medical history.

These variables were categorized and participated in model developments as follows:

Continuous predictors: Age, Pulse rate, PaO2, FiO2, GCS, Urine output, RR, Na, BS, pH, Urea, and PLT were treated as integer values, whereas MAP, Temp, HCO3, PCO2, HCT, WBC, Cr, K, Alb, and Bil were treated as real values. Both groups received the same preprocessing steps, so this distinction does not substantially affect outcome prediction.

Categorical (binary) predictors: MV and Chronic diseases.

Covariates and outcome variables preprocessing

In the first phase, to prepare input data for model development, various preprocessing techniques were applied, including descriptive analysis, data normalization, and resampling. The following subsections provide details of these techniques.

Step 1: descriptive analyses

As the first step, a descriptive analysis was conducted for both covariates and outcomes. In this analysis, possible correlations between covariates and outcomes, and the monotonic relationships among them, were evaluated using Spearman’s correlation coefficient43. The Spearman correlation is a non-parametric test that, unlike the Pearson correlation, does not rely on the normality of the data distribution.

The Spearman correlation was applied to the continuous covariates, and the significance of their correlations with the outcome was assessed based on Confidence Intervals (CIs), R2, Bayes Factors (BF10), and power44. Moreover, to avoid feature redundancy, the possible pairwise correlations between predictors were examined. Categorical variables were summarized as frequencies and percentages, while continuous variables were expressed as mean ± standard deviation (SD) in both the text and tables.
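As an illustration, this screening step can be reproduced with SciPy and pandas. The following is a minimal sketch, not the authors' code; the file name and the column names are hypothetical, and the outcome is assumed to be a binary "deceased" column.

import pandas as pd
from scipy.stats import spearmanr

df = pd.read_csv("ed_cohort.csv")  # hypothetical file name
continuous = ["Age", "Pulse", "MAP", "RR", "GCS", "Urea", "Cr"]  # illustrative subset of predictors

# Covariate-outcome correlations (Spearman's rho and p-value)
for col in continuous:
    rho, p = spearmanr(df[col], df["deceased"], nan_policy="omit")
    print(f"{col}: rho = {rho:.2f}, p = {p:.3f}")

# Pairwise correlations between predictors, to flag redundancy (|rho| >= 0.8)
corr = df[continuous].corr(method="spearman")
print(corr.round(2))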

Step 2: scaling and normalization

To mitigate the impact of the varied ranges of continuous covariates and the differing labels of categorical covariates, data scaling methods were employed. For continuous variables, the range of values was transformed into [0, 1] using MIN–MAX scaling.
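A minimal sketch of this step with scikit-learn, assuming X_train and X_test are pandas DataFrames produced by the train/test split and `continuous` lists the continuous predictor columns (variable names are illustrative, not the authors' code):

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
# Fit the scaler on the training data only, then reuse the training min/max on the test data
X_train[continuous] = scaler.fit_transform(X_train[continuous])
X_test[continuous] = scaler.transform(X_test[continuous])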

Step 3: resampling of unbalanced data

A common challenge in mortality datasets is an unbalanced class distribution, which can lead to over-fitting and under-performance of ML models29. In the current dataset, the majority class (alive) and the minority class (deceased) represented 81% and 19% of the patients, respectively. To address this issue, a combination of over-sampling and under-sampling techniques, called SMOTETomek, was applied to the training dataset45,46. SMOTETomek is a hybrid method that combines under-sampling (Tomek links) with over-sampling (SMOTE). It applies SMOTE for data augmentation of the minority class and Tomek links (a nearest-neighbors method) to remove some samples from the majority class. This method can enhance ML models’ performance by producing less noisy or ambiguous decision boundaries.
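A minimal sketch of the resampling step using the imbalanced-learn implementation of SMOTETomek, applied to the training split only (variable names follow the earlier sketches and are illustrative):

from imblearn.combine import SMOTETomek

resampler = SMOTETomek(random_state=42)
# SMOTE over-samples the deceased class; Tomek links remove borderline majority samples
X_train_res, y_train_res = resampler.fit_resample(X_train, y_train)
print(y_train_res.value_counts(normalize=True))  # classes are now roughly balanced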

Model development

In the second phase of our framework, the process of model development was performed, which consisted of (1) determining the best parameters of models using tuning techniques, (2) dividing data into training and testing datasets using cross-validation, (3) selecting performance measures for the evaluation of models, (4) developing the models, and (5) determining the importance of features in the models. These five steps are detailed below.

Step 1: tuning of models’ parameters

One of the main challenges in developing ML models was determining the best parameters. To address this issue, a hyper-parameter tuning technique called GridSearchCV47 was carried out. In hyper-parameter tuning, an exhaustive search was performed over the parameters’ space, and as a result, models were optimized based on the best parameters using performance metrics.

Step 2: K-fold cross-validation for training and testing

For the development and evaluation of models, the dataset underwent training and testing phases. The optimal parameters of models were determined using K-fold cross-validation (K-fold)48 where the training dataset was divided into K folds, models were trained and validated, and the models with the highest average performance were considered as the optimal ones.
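The following sketch combines Steps 1 and 2: an exhaustive GridSearchCV search scored by stratified 10-fold cross-validation on the resampled training set. The estimator and the parameter grid are illustrative placeholders, not the exact settings reported in Table 1.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

param_grid = {"n_estimators": [100, 300, 500], "max_depth": [None, 5, 10]}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",   # optimize discrimination on the validation folds
    cv=cv,
    n_jobs=-1,
)
search.fit(X_train_res, y_train_res)
print(search.best_params_, round(search.best_score_, 3))
best_model = search.best_estimator_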

Step 3: models’ performance evaluation

To evaluate the ML models, their discrimination power was assessed using performance measures including Precision, Sensitivity, Accuracy, F-measure (F1), Matthews Correlation Coefficient (MCC), Area Under the Receiver Operating Characteristic Curve (AUC-ROC), Area Under the Precision–Recall Curve (AUC-PRC), Calibration Plot, Brier Score (BS), Mean Squared Error (MSE), and the DeLong test49,50,51,52,53,54.

The accuracy metric checks the proportion of correctly classified samples, while F1 is the harmonic mean of precision and sensitivity. The calibration plot illustrates the consistency between predictions and observed outcomes. Comparing the calibration of all models through a scatter plot indicates the amount of agreement between the observed outcomes and predicted risk of mortality.
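A minimal sketch of how these discrimination metrics can be computed with scikit-learn on the hold-out test set; `model`, X_test, and y_test are assumptions carried over from the earlier sketches rather than the authors' exact code.

from sklearn.metrics import (precision_score, recall_score, accuracy_score, f1_score,
                             matthews_corrcoef, roc_auc_score, average_precision_score)

y_prob = model.predict_proba(X_test)[:, 1]   # predicted probability of death
y_pred = (y_prob >= 0.5).astype(int)         # default 0.5 decision threshold

print("Precision  ", precision_score(y_test, y_pred))
print("Sensitivity", recall_score(y_test, y_pred))
print("Accuracy   ", accuracy_score(y_test, y_pred))
print("F1         ", f1_score(y_test, y_pred))
print("MCC        ", matthews_corrcoef(y_test, y_pred))
print("AUC-ROC    ", roc_auc_score(y_test, y_prob))
print("AUC-PRC    ", average_precision_score(y_test, y_prob))  # average-precision estimate of AUC-PRC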

Moreover, to compare the models’ overall performance and accuracy, the Brier Score is computed, and the DeLong test is performed for pairwise comparisons of AUC-ROC values. As Eq. (1) shows, BS is calculated as the mean squared difference between predicted probabilities (P) and actual outcomes (O) for binary classification, providing a comprehensive measure of model accuracy and calibration.

$$\text{BS}=\frac{1}{N}\sum_{i=1}^{N}\left(P_{i}-O_{i}\right)^{2}$$
(1)

where N is the number of observations, Pi is the predicted probability for observation i, and Oi is the actual outcome for observation i.
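In code, Eq. (1) reduces to the mean of squared differences between the predicted probabilities and the observed 0/1 outcomes; scikit-learn's brier_score_loss returns the same quantity (variable names continue the previous sketch and are illustrative):

import numpy as np
from sklearn.metrics import brier_score_loss

bs_manual = np.mean((y_prob - np.asarray(y_test)) ** 2)  # direct implementation of Eq. (1)
bs_sklearn = brier_score_loss(y_test, y_prob)            # equivalent library call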

The DeLong test is based on the covariance between the models. The test statistic follows a standard normal distribution under the null hypothesis of no difference in AUC between the two models. The significance of the difference is then assessed using the standard normal distribution. Equation (2) shows how the DeLong test statistic is calculated.

$$Z=\frac{\text{AUC}_{1}-\text{AUC}_{2}}{\sqrt{\text{Var}\left(\text{AUC}_{1}\right)+\text{Var}\left(\text{AUC}_{2}\right)-2\,\text{Cov}\left(\text{AUC}_{1},\text{AUC}_{2}\right)}}$$
(2)

where AUC1 and AUC2 are the areas under the ROC curves for models 1 and 2, Var(AUC1) and Var(AUC2) are their respective variances, and Cov(AUC1, AUC2) is the covariance between the areas.
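As a worked illustration of Eq. (2), once a DeLong implementation has supplied the two AUCs, their variances, and their covariance, the test statistic and its two-sided p-value follow from the standard normal distribution. All input numbers below are placeholders, not results from this study.

import numpy as np
from scipy.stats import norm

def delong_z(auc1, auc2, var1, var2, cov12):
    # Eq. (2): difference in AUCs scaled by the standard error of the difference
    z = (auc1 - auc2) / np.sqrt(var1 + var2 - 2 * cov12)
    p = 2 * norm.sf(abs(z))   # two-sided p-value under the null of equal AUCs
    return z, p

z, p = delong_z(0.839, 0.826, 1e-4, 1e-4, 5e-5)  # illustrative values only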

This step ensures a robust evaluation of predictive performance and identifies any significant variations. These assessments are vital for enhancing the transparency and reliability of our models, contributing to their validity in predicting in-hospital mortality.

Step 4: ML modeling

Our framework included LR55 and five ensemble ML methods. EL models are meta-models that develop models by exploiting multiple weak classifiers and integrating obtained results to achieve stronger classifiers or regressors via voting or boosting mechanisms. In this study, EL models, Bagging56, AdaBoost57, RF58, Stacking42, and XGB59 were applied.

  • The Bootstrap AGGregating (Bagging) method is typically built from decision tree classifiers. It employs bootstrap sampling with replacement to create multiple random subsets of the training data, on which weak, homogeneous models are trained independently and in parallel. The predictions of these weak models are then combined by voting (or averaging), producing a more robust and accurate final outcome39.

  • AdaBoost is a tree-based boosting technique in which sample weights are adjusted sequentially during retraining, with misclassified samples receiving higher weights so that subsequent weak learners focus on them. The final classification is achieved by combining all weak models, with the more accurate ones carrying more weight and exerting a greater influence on the final results60.

  • RF is a robust bagging method that builds multiple decision tree models. It introduces randomness in two ways: each tree is trained on a randomly selected subset of the training data, and each split considers only a random subset of the variables, which mitigates overfitting. The final prediction is derived through a majority vote over the trees’ results. Consequently, the correlation between the individual models is reduced, leading to a more reliable final model61.

  • Stacked generalization (Stacking) is an ensemble ML model typically comprising heterogeneous models. It generates the final prediction by combining multiple strong models and aggregating their results. In the first level, stacking models consist of several base models (RF, ADA, and GradientBoostingClassifier), while in the second level, a meta-model (LR) is created, taking into account the outputs of the base models as input42.

  • XGB is a tree-based boosting method that utilizes random sample subsets to create new models, with each successive model aiming to reduce the errors of the previous ones. To mitigate overfitting and reduce time complexity, it employs regularization to penalize complex models, tree pruning, and parallel learning59.
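A sketch of how the six classifiers could be instantiated with scikit-learn and the xgboost package is given below; the hyper-parameter values are placeholders for those tuned with GridSearchCV (Table 1), and the training variables are assumptions carried over from the earlier sketches.

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (BaggingClassifier, AdaBoostClassifier, RandomForestClassifier,
                              StackingClassifier, GradientBoostingClassifier)
from xgboost import XGBClassifier

models = {
    "LR": LogisticRegression(max_iter=1000),
    "Bagging": BaggingClassifier(n_estimators=100),       # decision trees are the default base learner
    "AdaBoost": AdaBoostClassifier(n_estimators=100),
    "RF": RandomForestClassifier(n_estimators=300),
    "Stacking": StackingClassifier(
        estimators=[("rf", RandomForestClassifier()),
                    ("ada", AdaBoostClassifier()),
                    ("gb", GradientBoostingClassifier())],
        final_estimator=LogisticRegression(),             # LR meta-model on the base-model outputs
    ),
    "XGB": XGBClassifier(n_estimators=300),
}

for name, clf in models.items():
    clf.fit(X_train_res, y_train_res)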

More information about the setting of each model is provided in Table 1.

Table 1 Parameters of ensemble machine learning models for predicting in-hospital mortality in emergency department.

Step 5: feature importance

To identify the most important covariates in the deployed ML models, feature importance was assessed. In this study, SHapley Additive exPlanations (SHAP) were used to determine the importance of features in the training dataset. This method, based on cooperative game theory, increases the transparency and interpretability of ML models by measuring the local and global impacts of features. According to the SHAP values, the most relevant features for the final models were identified62.
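A minimal sketch of the SHAP analysis for a fitted tree-based model such as the XGB classifier from the previous sketch; the variable names are assumptions carried over from earlier steps.

import shap

explainer = shap.TreeExplainer(models["XGB"])
shap_values = explainer.shap_values(X_train_res)

# Global importance (mean absolute SHAP value per feature) and the beeswarm summary plot
shap.summary_plot(shap_values, X_train_res, plot_type="bar")
shap.summary_plot(shap_values, X_train_res)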

In this research, Python 3.9.1 (Anaconda), Scikit-learn, Pandas, and NumPy were used for the development and evaluation of models. Visualization of data and output results were performed using the Matplotlib library. In the following subsections, the developed EL models are evaluated and discussed from four aspects: statistical information, effects of preprocessing (resampling) on data, feature importance in modeling, and comparing results of the models through different viewpoints59.

Results

Descriptive analysis results

For predicting in-hospital mortality in EDs, LR and five EL models were developed and evaluated on a dataset comprising 2205 patients with 24 predictors and a binary outcome. The distribution of alive and deceased patients was 1779 (81%) and 426 (19%), respectively. The dataset was randomly split into two subsets: the training set, encompassing 67% of the data (n = 1477), and the test set, with the remaining data (n = 728). In both the training and testing sets, patients were classified into “alive” and “deceased” categories. In the training set, there were 1203 (81%) alive and 274 (19%) deceased patients, while in the testing set, there were 576 (79%) alive and 152 (21%) deceased patients. Although the ratio of alive to deceased patients was similar in the training and testing sets, both sets remained unbalanced.

A total of 2205 patients were included, with a mean age of 61.83 ± 18.49 years, of whom 1169 (53%) were male. Patient ages ranged from 18 to 98 years, with survivors having an age range of 63–77 years and non-survivors in the range of 70–80 years (P < 0.001). Baseline characteristics of patients are summarized in Table 2.

Table 2 Baseline characteristics of population’s study.

Additionally, the pairwise correlation coefficients between predictors were computed using Spearman correlation and illustrated in a heatmap plot (Fig. 2). In the heatmap, warm colors indicate high correlation coefficients, while cool ones indicate low correlation coefficients. The plot shows that no correlation between continuous predictors exceeded the defined threshold (± 0.8). However, notable correlations, such as high positive correlations (HCO3, PCO2: 0.74) and (Urea, Cr: 0.77), as well as moderate negative correlations (Urine output, Cr: − 0.43) and (Urine output, Urea: − 0.47), were observed.

Figure 2

Pairwise correlation coefficient between predictors.

Moreover, the correlation between covariates and outcomes was assessed, and the results are presented in Table 3, providing correlation coefficients (r), p-values, BF10, and statistical power. It is important to note that, while statistically significant correlations were observed for several predictors with the outcome, the magnitude of these correlations is modest; the two largest correlations were only 0.35 and 0.22, indicating generally small effect sizes.

Table 3 Correlation between covariates and outcome.

Feature importance

To evaluate the importance of each predictor in deploying EL models, we considered the features mentioned in Section “Covariates”, whose correlation with the outcome was analyzed in Table 3. These features in the training dataset were ranked using SHAP63, a method widely used for interpreting complex ML models.

Figure 3 depicts the estimated SHAP values across all samples for the XGB model, which demonstrated high performance among the EL models. Features are sorted by SHAP value, with red and blue colors indicating high and low impacts, respectively. Additionally, the mean SHAP value for each feature is presented, where higher values indicate greater importance.

Figure 3

Evaluation of features' importance by SHAP summary plot.

According to Fig. 3, predictors such as Urine output, BS, chronic disease, Temp, and Na were considered the least important, while Urea and MV were identified as the most influential factors.

Resampling effect on data

In the current dataset, the majority class (alive) represented 81% (n = 1779), while the minority class (deceased) represented 19% (n = 426). Applying the SMOTETomek resampling technique produced a better-balanced training set by increasing the overall number of samples from 1477 to 2402. As a result, the share of the deceased class increased from 19% (274/1477) to 50% (1201/2402), while the share of the alive class decreased from 81% (1203/1477) to 50% (1201/2402) in the training dataset. The SMOTETomek method was first tested on a basic LR model, showing improved precision, sensitivity, and F1-measure for the minority class after resampling. Additionally, resampling increased the overall AUC-ROC of the LR model from 0.52 to 0.82. As a result, SMOTETomek was selected and applied to address the imbalanced-data issue in our training data. Table 4 shows the performance comparison of the LR model before and after resampling.

Table 4 Performance comparison of ML model (LR) before and after resampling.

Quality assessment of models

To identify high-performance models, comparisons were made between Logistic Regression (LR) and Ensemble Learning (EL) models (Bagging, AdaBoost, Random Forests, Stacking, and XGB). These models were developed on a training dataset, and their parameters were tuned using GridSearchCV in tenfold cross-validation. The following sections comprehensively evaluate the developed models from three perspectives: (1) predictive performance, (2) discrimination ability, and (3) goodness-of-fit.

Evaluation of the predictive performance of models

The performance of the models was analyzed based on various measurement metrics. Table 5 demonstrates that among the six investigated models, ensemble models consistently exhibited the best values across all metrics. For instance, Bagging achieved the highest AUC-ROC (0.84) and AUC-PR (0.64) for predicting in-hospital mortality, while XGB demonstrated superior precision (0.83), sensitivity (0.831), accuracy (0.842), and F1 score (0.833). Additionally, XGB outperformed other models with the highest MCC of 0.48, indicating robust performance on unbalanced data, and RF achieved the lowest Brier Score of 0.128, which assesses model calibration. Furthermore, a comparison of confusion matrices revealed that XGB, Stacking, and RF had the highest True Negatives (TN), in the range of [0.70, 0.73], while Bagging and LR exhibited the highest True Positives (TP) at 0.15.

Table 5 Predictive performance of models on the testing dataset.

Evaluation of discrimination ability of models

The pairwise comparison of AUC-ROCs is presented in Table 6, and the ROC curves plot sensitivity on the Y-axis against 1 − specificity on the X-axis. Additionally, the AUC-PRC is utilized to evaluate how well a model balances precision and recall. In descending order, Bagging emerged as the most discriminative model with the highest AUROC (0.839, CI 0.802–0.875) and AUCPR = 0.64, followed by RF (0.833, CI 0.797–0.87) and AUCPR = 0.623, XGB (0.826, CI 0.789–0.863) and AUCPR = 0.616, AdaBoost (0.818, CI 0.78–0.857) and AUCPR = 0.61, and Stacking (0.817, CI 0.778–0.856). Figure 4 illustrates that EL models achieved the maximum AUC-PRC, with Bagging leading at 0.64, RF at 0.623, XGB at 0.62, and LR at 0.61.

Table 6 Pairwise comparison of AUCs by using the DeLong method.
Figure 4

Left: the receiver operating characteristic (ROC) curves graphically represent sensitivity versus 1 − specificity. Right: the Precision–Recall curves (AUC-PRC) represent how well a model balances precision and recall.

Evaluation of goodness-of-fitting in models

The calibration plot illustrates the consistency between predictions and observations across different percentiles of predicted values, and comparing the calibration of all models through a scatter plot reveals the agreement between predictions and observations. According to Fig. 5, Stacking and RF exhibited greater success in calibration. Moreover, the best BS, a metric comprising calibration and refinement terms, was achieved by RF at 0.128, followed by Stacking at 0.132. Conversely, AdaBoost had the highest Brier score at 0.250, indicating a less favorable calibration performance.

Figure 5

Comparison of models based on calibration plots. A calibration plot is a measure of goodness-of-fit as a graphical presentation of the actual mortality probability versus the predicted mortality probability.

Discussion

The utilization of advanced EL algorithms enables the evaluation of a more extensive range of clinical variables compared to the traditional LR approach. This approach not only allows for the exploration of clinical variables with predictive value but also facilitates the assessment of key features contributing to clinical deterioration. Additionally, EL models offer the potential for automation, eliminating the need for manual review22. In preliminary studies, including ours, EL models have proven valuable for clinical decision support, particularly in the stratification of critically ill patients in the ED based on risk factors64. Notably, the RF model stands out by providing end-users with the capability to interpret the relative importance of predictive features, enhancing its clinical utility3.

Main findings

The present study applied various ML algorithms to develop models for the prognosis of patient outcomes based on collected inpatient care data. Our study reports several important findings.

First, the highest diagnostic accuracy was achieved when models were trained with both laboratory and clinical data. Notably, the correlations between HCO3 and PCO2 (0.74) and between Urea and Cr (0.77) were the strongest observed, albeit falling just below the defined threshold of 0.8.

Second, utilizing a select set of variables, we found that ensemble methods demonstrated somewhat higher performance than classical models such as LR, although the LR model's performance remained comparable to high-ranking modern models like RF, Bagging, AdaBoost, XGB, and Stacking in predicting in-hospital mortality among ED-admitted patients, and no significant differences in discrimination power were observed between the LR and EL models. Regarding overall performance, RF ranked first due to its lowest BS value (0.128). Despite Bagging having the highest discriminatory power among the models, XGB excelled in various metrics, including the highest precision (83%), sensitivity (83.1%), accuracy (84.2%), F1 score (83.3%), MCC (48%), and the lowest MSE (40%).

Third, in pairwise comparisons of AUROC curves, no significant differences were found between XGB and either RF or Bagging, suggesting that XGB performed as well as both.

Lastly, concerning calibration, while all studied models tended to overestimate mortality risk and exhibited insufficient calibration, Stacking demonstrated relatively good agreement between predicted and actual mortality compared to others.

Comparison to other similar studies

The use of ML models has recently demonstrated effectiveness in predicting outcomes in EDs. For example, ML has been applied to triage in the ED, prediction of cardiac arrest, admission prediction, detection of sepsis and septic shock, identification of patients with suspected infections, and prediction of mortality for sepsis and suspected infections65. There is ample evidence consistently suggesting that ML approaches outperform more conventional statistical modeling methods in various contexts, such as ED patients with sepsis22, coronary artery disease66, and critically ill patients for predicting in-hospital mortality67.

In a comprehensive investigation22, an RF model was meticulously crafted utilizing an extensive dataset encompassing over 500 clinical variables extracted from electronic health records across four hospitals. Intriguingly, contrary to our findings, this study accentuated the superior performance of this locally derived big data-driven ML approach when compared to both existing clinical decision rules and classical models in predicting in-hospital mortality among ED patients with sepsis. This divergence may be attributed to the substantial scope of the dataset employed. Our study, in contrast, employed 24 variables to construct the ML model. Nevertheless, it is noteworthy that, given the exigent nature of emergency settings with limited time for decision-making, models incorporating fewer predictors may demonstrate enhanced performance and practical utility.

Additionally, another study29 utilized an extensive multicenter dataset to develop an EL model for predicting in-hospital mortality among adult non-traumatic ED patients at distinct temporal stages, stratified into intervals of 6, 24, 72, and 168 h. The performance of this model was then compared with that of an LR-based MEWS, calculated using systolic blood pressure, pulse rate, RR, Temp, and level of consciousness. In contrast to our study, this research revealed that EL methods exhibited heightened predictive accuracy for in-hospital mortality, demonstrating notable proficiency in forecasting delayed mortality. It is important to note that our study specifically focused on predicting outcomes at the time of admission, emphasizing prioritization based on the severity of illness. It is recognized that the accuracy of prediction models tends to improve as the temporal proximity to the occurrence of the desired outcome decreases.

Consistent with our investigation, Son et al.68 conducted a study in South Korea wherein they examined 21 features spanning vital signs, hematology, Gasometry, and morbidities. Their approach involved the utilization of various ML algorithms and classical models to optimize ML classification models and data-synthesis algorithms for predicting patient mortality in the ED. Notably, their top-performing model employed the Gaussian Copula data synthesis technique in conjunction with the CatBoost classifier, yielding an AUC of 0.9731. Additionally, Adaptive Synthetic Sampling (ADASYN) and SMOTE data-synthesis techniques ensembled by LR resulted in AUCs of 0.9622 and 0.9604, respectively, aligning with our findings. Two additional studies merit attention in the context of our investigation. One study, focusing on sepsis patients admitted to the ED, underscored the importance of variables such as Temp, gasometry, GCS, and the mode of arrival to the ED69, all of which align with the parameters considered in our study. The second study concentrated on statistically significant variables, including demographics, vital signs, and chronic illnesses70. These parallel investigations emphasize the relevance of these variables in predicting patient outcomes and fortify the comprehensive nature of our study, which incorporates key factors identified in similar research contexts.

Several studies have employed external validation for benchmarking ML and LR methods in various domains, such as the detection of prostate cancer71, identification of brain tumors72, prediction of in-hospital mortality in patients suffering from ischemic heart disease73, and after brain injury74. In our study, we validated the model only on the test dataset. Our findings align with those published recently on predicting mortality after traumatic brain injury75. The main reason for this concordance might be that ML methods may struggle to effectively analyze non-linear and non-additive signals37. Clinical decision-making can be strengthened through interactions with provider intuition, reducing over- and under-triage risks. These models can also help improve resource allocation and operational flow for crisis management teams.

Considering that our models were derived from data encompassing a case-mixed patient population, their applicability is envisaged in analogous settings without a predefined temporal constraint. Nevertheless, we propose the exploration of developing ML models tailored to specific patient groups, such as those afflicted with Sepsis65 and Covid-195,76,77, in future research endeavors.

Strengths and limitations

In this study, we outline both strengths and limitations. Strengths include (i) the analysis of features contributing to model predictions, (ii) the prospective design of the study, which spanned over a year and included a relatively large number of patients, (iii) a systematic comparison of models from different aspects, such as performance, discrimination, and calibration, and (iv) the comparison of classic LR and novel EL approaches.

However, we are aware of several limitations. Firstly, the results stem from a cross-sectional study conducted in a single center. External validation in additional centers is planned for the future based on the findings of this single-center study. Additionally, we limited ourselves to three levels of ESI acuity, making it unclear to what extent these models can be generalized to a broader ED population. Increasing the predictive applicability of models necessitates extended follow-up. Furthermore, clinicians may be hesitant to adopt ML techniques due to their perceived “black box” nature.

Moreover, the features considered in our analysis, such as vital signs, demographic data, and other relevant parameters, primarily exhibit a cross-sectional nature. Consequently, our approach focuses on the initial measurements taken at admission, forming the basis for model generation. We refrain from incorporating temporal features measured at multiple time points to maintain model simplicity and avoid unnecessary complexity. This decision to concentrate on the first measured parameters at admission is deliberate, aiming to strike a balance between model intricacy and practical applicability.

When employing various ML methods, a crucial point for discussion arises: how to reconcile the differences in the sets of features identified by each algorithm. The 24 features under consideration in our study have been internally validated within our setting14,15 and are widely recognized as proxies for the performance of vital organs. Consequently, we incorporated all 24 features into the six ML algorithms utilized in our analysis. Given that these features were uniformly included in the ML algorithms, we compared the models’ outputs—namely, the predicted probability of mortality—based on various performance metrics. These metrics indicate that the XGB model outperformed other models across multiple indices.

Conclusion

In the prediction of in-hospital mortality for patients admitted to the ED, LR demonstrated comparable accuracy to high-ranking EL models. Notably, Bagging exhibited a substantial discrimination power with an AUC-ROC of 0.84, while the optimal overall performance was observed with XGB (Sensitivity = 0.83, Accuracy = 0.83, F1 Score = 0.83, and MCC = 0.48). Furthermore, when compared to LR, XGB demonstrated improvements of 5% in sensitivity, 4% in accuracy, 4% in F1 measures, and 5% in MCC.

The application of these models should prioritize the identification of critically ill patients, particularly in the dynamic and rapidly changing clinical environments of the ED and ICU. This is of utmost importance given the clinical instability of patients in these settings, where conditions evolve rapidly. Future studies are encouraged to explore the development of real-time predictive models, with the integration of these models into electronic health record databases facilitating ongoing evaluation of treatment outcomes. In contrast, conventional scoring systems often necessitate comprehensive and rigid data inputs to yield predetermined outcomes.