A two-tier feature selection method for predicting mortality risk in ICU patients with acute kidney injury

Acute kidney injury (AKI) is one of the most important lethal factors for patients admitted to intensive care units (ICUs), and timely high-risk prognostic assessment and intervention are essential to improving patient prognosis. In this study, a stacking model using the MIMIC-III dataset with a two-tier feature selection approach was developed to predict the risk of in-hospital mortality in ICU patients admitted for AKI. External validation was performed using separate MIMIC-IV and eICU-CRD. The area under the curve (AUC) was calculated using the stacking model, and features were selected using the Boruta and XGBoost feature selection methods. This study compares the performance of a stacking model using two-tier feature selection with a model using single-tier feature selection (XGBoost: 85; Boruta: 83; two-tier: 0.91). The predictive effectiveness of the stacking model was further validated by using different datasets (Validation 1: 0.83; Validation 2: 0.85) and comparing it with a simpler model and traditional clinical scores (SOFA: 0.65; APACH IV: 0.61). In addition, this study combined interpretable techniques and causal inference to analyze the causal relationship between features and predicted outcomes.

pose a significant challenge to model application.Consequently, reducing the number of input features while preserving model accuracy has become a pressing concern.Zhu et al. 19 successfully developed a machine learning model for predicting the risk of death in sepsis patients, achieving a 71% reduction in the number of features.Their findings indicate that even with small samples and low-dimensional data, accurate identification of patients at risk is feasible, enabling early treatment.Additionally, Shen et al. 20 and Wu et al. 7 validated the impact of feature reduction on the stability and accuracy of model prediction performance.Although many models have been proposed to identify patients at risk for AKI, few models can predict the risk of clinically important outcomes (hospital death or dialysis) once a patient develops AKI.The application of models with clinically important predictions may help guide the early treatment of patients with AKI.
In this study, we introduce a machine learning model that employing a two-stage feature selection process to predict in-hospital mortality risk among ICU patients with AKI.Our aim is to identify crucial features for mortality prediction and, as a result, reduce feature dimensionality to enhance model interpretability without sacrificing accuracy.

Study population characteristics
For this study, data from 16,090 initial ICU admissions in the MIMIC III database were collected, with 11,182 patients meeting the inclusion criteria.These data comprised the training set, with 30% allocated for internal Table 1.The baseline characteristics of AKI patients are analyzed.Statistical analysis of mortality and survival data within the training, internal validation, and external validation sets is conducted.Age and gender are depicted as frequencies and proportions within various categories, whereas other indicators are expressed using their mean values and standard deviations.deaths.Across all three datasets, the proportion of men diagnosed with AKI exceeded that of women, and correspondingly, the mortality rate was also higher among male patients.Additionally, AKI patients aged over 60 exhibited a notably elevated mortality rate compared to patients in other age brackets.

Feature selection and Model performance
Feature selection involved initial screening with the Boruta algorithm for the first tier (see Supplementary Fig. 1A,B).Subsequently refinement was carried out using XGBoost for the second tier (see Supplementary Fig. 2A).Ultimately, a total of 24 relevant features were identified.Additionally, the features identified using only XGBoost feature selection are depicted in Supplementary Fig. 2B.
The construction of the model is carried out according to the features screened by the Boruta and XGBoost algorithms, respectively, and the prediction effect of the model is evaluated and compared.The evaluation indexes are shown in Fig. 1A-C,A-C′′).As can be seen from the figure, the best-evaluated algorithm is stacking, with an AUC (95% CI) of 0.85 (0.846-0.854) for XGBoost-Stacking (Stacking model construction using XGBoost filtered features) and an AUC (95% CI) of 0.83 (0.828-0.831) for Boruta-Stacking (Stacking model construction using Boruta filtered features).Subsequently, model construction was performed using two-tier feature selection, and different types of algorithms were used for model training.The results showed that the stacking algorithm gave the best prediction with an AUC (95% CI) of 0.91 (0.906-0.915) (Fig. 1E-G).The precision, accuracy, and F1-score of the trained model were evaluated, achieving values of 0.90, 0.89, and 0.90, respectively (Fig. 2A-C).Furthermore, a comparison was conducted between the performance of models developed using single-tier and two-tier feature selection methods (Table 2; see also Supplementary Table 2 for detailed results).This analysis revealed that the two-tier feature selection approach consistently yielded superior prediction performance compared to the single-tier method.
To further validate the predictive effectiveness of the developed two-tier feature selection model, we used both internal and external validation sets to evaluate the model's performance.Figure 1E'-G' demonstrates the validation results on validation set 1, with an AUC (95% CI) of 0.83 (0.830-0.833).Figure 2A'-C' shows the precision of 0.86, accuracy of 0.81, and F1-score of 0.80 for validation set 1.In Fig. 1E"-G", we show the validation results of validation set 2 with an AUC (95% CI) of 0.85 (0.845-0.853).Figure 2 A"-C" shows that validation set 2 has a precision of 0.84, an accuracy of 0.80, and an F1-score of 0.80.Comparing the training results with the validation results reveals (Tables 2 and 3) that the AUC values of both the internal and external validation sets are higher than 0.80, indicating that the model predicts well on different datasets.In addition, we compared the constructed model with the traditional clinical scoring systems SOFA and APACHE IV (Fig. 1E,  E' , E").The results showed that on the training set, the AUC was 0.65 for SOFA and 0.61 for APACHE; on the validation set 1, the AUC was 0.71 for SOFA and 0.64 for APACHE; and on the validation set 2, the AUC was 0.62 for SOFA and 0.64 for APACHE.These results indicate that, compared to the traditional clinical scoring system, the constructed model has a better prediction effect.The experimental results of internal validation are presented in Online Supplementary Fig. 3A-C, Fig. 4A-C, and Table 3.
To achieve a more comprehensive evaluation of the model's performance, calibration curves and Brier scores were employed.The results are presented in Fig. 3 and Table 4.The Brier scores (with 95% CI) indicated good calibration, with values of 0.103 (0.093-0.113) for the training set, 0.106 (0.096-0.118) for validation set 1, and 0.110 (0.100-0.122) for validation set 2 (Table 4).

Model interpretability
The ensemble stacking model with two-tier feature selection utilizes two perspectives for model interpretation: individual and global.From an individual perspective, the interpretation module analyzes feature weights of the base model using the PI technique, as depicted in Fig. 4A-G.The top three important features in the base model are age, BUN, and temperature.During the analysis of the meta-model's feature weights, the predicted outputs of RFs' predicted outputs significantly influenced the final predictions.Table 5 displays the feature weights of the stacking model, which align with the findings of the base model analysis.From a global perspective, a causal diagram based on significant features is presented in Fig. 5.A causal relationship is observed between CL and HB as confounders of BUN and age and BUN as confounders of death, This relationship is represented as CL, HB → BUN → Death.However, determining the specific impact of each detection value is not feasible.Therefore, the influence of specific feature values is analyzed using LIME (Fig. 6).The LIME analysis reveals that the model's predictions vary under different combinations of feature values.For instance, the model is more likely to predict a lower risk of death when BUN ranges from 14.0 to 20.5 and age is ≤ 57 years, while predicting a higher risk of death when INR exceeds 1.5.This personalized analysis helps physicians understand the causal relationships between features and characteristics during model executions, thereby enhancing their comprehension of the models' decision-making processes.
In summary, the stacking ensemble model with two-tier feature selection integrates individual and global perspectives for model interpretation.The analysis of feature weights using the PI technique for both base and stacked models, along with causal diagrams and LIME analyses based on causal inference, enhances understanding of the model's predictive process and provides reliable references for medical decision-making.

Discussion
In the building of predictive models for AKI, logistic regression with backward or forward selection is a common approach for selecting a subset of features for model construction 21 .More recent approaches, methods such as Lasso, Boruta 22 , and XGBoost 23 have been employed for feature selection in AKI prediction.However, Lasso methods are typically limited in their adaptability, often relying on linear models or assumptions.In nonlinear scenarios, Lasso methods may fail to accurately capture complex feature relationships, resulting in the selection of insufficient features for effective data interpretation.Logistic regression, which relies on a linear combination of individual features for classification, may not adequately capture feature interactions, potentially impacting prediction accuracy.While Boruta 6 , a feature selection method based on tree models, excels at uncovering complex feature relationships and handling highly correlated features.Nonetheless, it solely focuses on the relationship between features and targets, disregarding the importance of features and models.XGBoost 24 , a gradient-boosting tree model, excels at capturing complex relationships among features, particularly in nonlinear scenarios.Its feature selection process focuses on the correlation between features and the model.
Several studies have highlighted the importance of feature selection in improving model performance for AKI prediction.Zhou et al. 25 demonstrated significant improvements in model predictions by incorporating deep features alongside those extracted using convolutional neural networks (CNNs).Similarly, Zhu et al. 19 observed a substantial enhancement in prediction accuracy following a 71% reduction in feature set size.Based on these findings, we propose that the model prediction performance can be improved by selecting intersecting features and reducing redundant features.Therefore, in our experiments, we first conducted feature selection using the Boruta algorithm to filter out features that correlate with the target value.Subsequently, in the second tier of feature selection, we employed XGBoost to filter out features that correlate with the model.Experimental results demonstrate the superiority of the two-tier feature selection approach over the single-tier approach.The stacking ensemble model exhibited superior predictive performance compared to the baseline model.Notably, the stacking ensemble model with two-tier feature selection achieved the highest predictive performance.Yue et al. 26 applied the Boruta algorithm to screen 34 variables and built a random forest model for predicting mortality risk in acute kidney injury patients, achieving an AUROC of 0.82.Yang et al. 27 employed the Boruta algorithm to select 36 variables and utilized XGBoost for modeling mortality risk prediction in sepsis-associated acute kidney injury patients, achieving an AUROC of 0.85.In comparison to previous studies, our proposed model achieved an AUROC of 0.91, indicating improved model performance.These results affirm the efficacy of our proposed approach.Furthermore, by employing model interpretable techniques and causal inference, we conducted a causal analysis of factors influencing model predictions.Our findings revealed significant associations between various laboratory tests and the prediction of mortality risk in AKI patients, consistent with previous studies by Son 28 , Zhang 29 , and others, further validating the reliability of our model.Taken together with the experimental  www.nature.com/scientificreports/results, the two-tier feature selection proposed in this study can better predict the risk of death of AKI patients in the ICU, and it can better capture the complexity and diversity of AKI risk by reducing the confounding variables in the model inputs.By combining the predictive power of multiple models, it can provide a more reliable auxiliary diagnosis for clinical decision-making.However, it is worth noting that a common challenge we faced was that this study was not prospective but retrospective.We chose to use the MIMIC-III dataset, but it is important to note that this dataset does not adequately represent the entire population and the diversity of different clinical practices.This somewhat limits our ability to analyze the problem in depth and make accurate predictions, as well as our ability to generalize the model to real-world applications.It is worth noting that due to the large number of missing urine output indicators in the dataset, we chose to temporarily omit this indicator from our study after referring to the relevant literature 18,[30][31][32][33] .However, recent studies 34 suggest that urine output plays a crucial role in the disease progression of AKI.Therefore, we will fully consider urine output as an important indicator in future studies to improve the ability to predict more accurately the risk of death in patients.

Methods
The conceptual framework for our developing two-tier feature selection prediction model is presented in Fig. 7.

Study population
Data for this study were retrieved from three distinct critical care databases: MIMIC-III 35 , MIMIC-IV 36 , and eICU-CRD 37 .The prediction models were developed using the publicly accessible MIMIC-III databases.The data were divided into two sets: 30% of the data were reserved for internal validation, and the remaining 70% were used for model construction.The predictive performance of these models was validated using an entirely independent dataset, the MIMIC-IV and eICU-CRD datasets.MIMIC-III includes critical care data from 46,520 ICU patients admitted to Beth Israel Deaconess Medical Center in Boston between June 1, 2001, and October 31,  2012.This dataset encompasses 26 tables encompassing demographics, admission records, discharge summaries, ICD-9 diagnostic records, vital signs, laboratory measurements, and medication usage.In contrast, MIMIC-IV includes data from over 190,000 patients, 450,000 hospitalizations, and more than 1000 hospital admissions to Beth Israel Deaconess Medical Center (BIDMC) and Massachusetts Institute of Technology (MIT) between 2008 and 2019, totaling 1,000,000 admissions.It offers a broader array of information, covering demographics, laboratory tests, medication usage, vital signs, surgical procedures, and disease diagnoses.Although MIMIC-III and MIMIC-IV may share medical information and data types, their data collection, processing, and dissemination methodologies differ.The MIMIC-IV dataset is broader in scope, spanning more hospitals and patients and covering a longer timeframe.The eICU-CRD Collaborative Research Database (eICU-CRD) is a large public database created by MIT in collaboration with the Laboratory for Computational Physiology (LCP).
The database is a completely independent dataset that brings together data from many hospitals within the United States, expanding the scope of the study by providing data from multiple centers.The database covers routine data on more than 200,000 patients admitted to intensive care units in 2014 and 2015 and includes a wealth of high-quality clinical information such as physiological parameters, laboratory results, medication records, and diagnostic information.The data are presented in both structured and unstructured forms and are automatically collected from monitoring equipment, electronic medical records, and other healthcare information systems.

Determination of outcome variables: mortality and AKI
Mortality, defined as the death rate among patients with AKI during their ICU hospitalization, was determined through specific criteria.Firstly, AKI diagnosis followed the Kidney Disease Improving Global Prognosis (KDIGO) 1 guidelines, considering serum creatinine concentration (Scar) and urine output (UO) levels.According to literature studies [30][31][32][33]38 , serum creatinine concentration (Scar) was used as the main target of study in this experiment. AK was defined as: a 1.5-fold increase in serum creatinine concentration within the prior 7 days; a rise of ≥ 0.3 mg/dL within 48 h; or a sustained urine output of < 0.5 mL/kg/h for ≥ 6 h.In cases where baseline serum creatinine was unavailable pre-admission, the first serum creatinine at admission served as the baseline.Patients with AKI in the ICU were identified via departmental codes.Subsequently, ICU duration was computed based on admission and discharge times, and data from 24 h preceding admission were extracted 26,39,40 .Data from the initial ICU admission were used for patients with multiple admissions; the average value was calculated for repeated examinations within 24 hours 41 .

Inclusion and exclusion criteria
To ensure data safety and emphasize the effectiveness of the model in early prediction, we focused on developing a predictive model using medical data from 24 h prior to a patient's admission to the hospital to screen patients diagnosed with AKI.The final dataset for the experiment was selected from this data (Fig. 8).During the data selection process, we excluded patients who met the following criteria: (1) age < 18 years old; (2) patients who were admitted to the intensive care unit for > 24 h; (3) patients who had already received chronic renal replacement therapy prior to admission; and (4) data with < 20% of missing values or a lack of outcome information.These exclusion criteria were designed to ensure the quality and accuracy of the experimental data for better exploring the relationship between early patient status and AKI.

Data processing
In this study, datasets with missing values exceeding 20% were excluded, and outliers were identified using boxand-whisker plots and subsequently removed.To handle missing values, multiple imputations were performed utilizing the RF algorithm, known for its effectiveness in imputing missing data 42 RF offers several advantages, including the ability to handle mixed types of missing data, adaptability to interactions and nonlinearities, and scalability to large datasets 43 , while preserving the distribution of data post-imputation.Additionally, the data underwent Min-Max normalization, transforming it into a specific range of intervals to ensure uniform scaling of each feature.This normalization process ensured uniform scaling for each feature, maintaining a relative weight balance between features.By addressing issues of model bias towards certain features due to differing scales, the normalization process improved the performance and interpretability of the machine learning model, ensuring consistent contribution weights of individual features to the model.

Statistical analyses
Descriptive statistics were utilized to assess the distribution and inherent patterns of numerical characteristics within the dataset.Measures such as mean, median, mode, range, variance, and standard deviation were examined as appropriate.Pearson's correlation coefficient was employed to analyze the degree of linear correlation between variables.Descriptive statistics for continuous variables included either mean ± standard deviation or median (interquartile range), while frequencies were utilized for categorical variables.The normal distribution of each variable was evaluated using the Kolmogorov-Smirnov test.Student's t-test compared continuous variables, and Fisher's exact method was used for correlational analysis between variables.Statistical analysis was performed using R version 4.3.1 for Windows.

Feature selection
This study employs a two-tier feature selection approach to improve both the performance and interpretability of the prediction model.The Boruta algorithm was utilized in the initial tier, and the XGBoost algorithm was employed in the subsequent tier.Boruta 6,44 is a RF-based feature selection method that evaluates feature importance through modelling the distribution of random and original features.In the first tier, Boruta is applied to filter out features with significant predictive power for the target variable from the initial set.XGBoost 23 , an efficient gradient boosting tree algorithm, serves as the meta-model in the second tier, known for its excellent predictive performance and automatic feature screening.Feature selection within the XGBoost model further refines the initially selected features, resulting in a final subset with enhanced predictive power and stability.

Model construction
In this study, we have strategically employed the SEM (Stacking Ensemble Method) to build our model, with the goal of further enhancing overall model performance by adeptly integrating outputs from multiple base learners (single classifiers) as inputs to the meta-learner.Extensive prior research has demonstrated the substantial superiority of the SEM in performance compared to independent classifiers 45 .To further optimize model performance, this study has employed the voting ensemble method in the preliminary stage.This method selectively crafts the base model for the SEM based on data characteristics and the principle of model diversity.Ultimately, Logistic Regression (LR), Support Vector Machine (SVM), Naive Bayes (NB), Light Gradient Boosting Machine (LGBM), EXtreme Gradient Boosting (XGBoost), and Random Forest (RF) were identified as the base models for stacking ensemble, with LR being specifically chosen as the metamodel.This decision was made considering that the variables outputted by the base models represent linear data and align with the pursuit of model interpretability.The objective of this selection is to strike a balance between the diversity of the base models and the performance of the overall model, thereby providing a more comprehensive and reliable analytical foundation for this study.

Evaluation metrics
To assess the performance of the used models comprehensively and thoroughly, we utilize a diverse set of performance metrics, encompassing the area under the Receiver Operating Characteristic (AUROC), 95% Confidence Interval (CI), Precision-Recall Curve (AUC-PRC), Precision, Accuracy, Recall, F1 score, Calibration curves, and Brier scores.This comprehensive metric framework is designed to provide a more holistic understanding of the model's performance across various dimensions.Specifically, the evaluation is conducted using the following formulas: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN).
N is the total number of samples, f i is the predicted probability of the ith predicted sample, and o i is the actual outcome of the ith sample (usually 0 or 1).

Model interpretability
The SEM combines multiple base models to generate predictions.Thus, when interpreting the model, the feature weights of each base model undergo an initial assessment using the Permutation Importance (PI) technique.Subsequently, mathematical computation determines the feature weights of each model, which are then utilized as the feature weights of the stacked models.Features with higher weights are selected based on importance ranking, and a causal diagram is constructed using a causal inference framework 46 .In this framework, confounders are defined as variables directly influencing both the predicted outcome and the predictor.These confounders are pivotal factors contributing to AKI mortality rates 47

Figure 1 .
Figure 1.Illustrates model performance metrics with single-and two-tier feature selection, including feature selection using only XGBoost and Boruta, along with receiver operating characteristic curves (ROC), precisionrecall curves (PRC) predicted by models with two-tier feature selection.These evaluations are conducted on both the training set (MIMIC-III) and the validation sets (MIMIC-IV, eICU-CRD), with the receiver operating characteristic curves presented alongside a shaded 95% confidence interval (ROC (95% CI)).(A-C) denote the ROC, PRC, and ROC (95% CI) predicted by the model using only XGBoost for feature selection, respectively.(A′-C′) denote the ROC, PRC, and ROC (95% CI) predicted by the model using only Boruta for feature selection.(E-G) denote the ROC, PRC, and ROC (95% CI) predicted by the model using two-tier feature selection on the training set.(E′-G′) denotes the AUROC, PRC, and ROC (95% CI) predicted by the model using two-tier feature selection on Validation 1 (MIMIC-IV).(E″-G″) denotes AUROC, PRC, and ROC (95% CI) predicted by the model using two-tier feature selection on Validation2 (eICU-CRD).CI confidence interval.

Figure 3 .
Figure 3. Calibration curves and calculated Brier scores.Red indicates training, green internal validation set, yellow validation 1, and blue validation 2. CI confidence interval.

Figure 4 .
Figure 4. Feature importance analysis of stacking base models.(A) LR: Permutation Importance; (B) LGBM: Permutation Importance; (C) NB: Permutation Importance; (D) RF: Permutation Importance; (E) XGBoost: Permutation Importance; (F) SVM: Permutation Importance: The number next to each feature indicates the importance of that feature.A positive value indicates that the feature contributes positively to the model, while a negative value indicates that the feature contributes negatively to the model.The larger the number (positive or negative), the greater the influence of the feature on the model.Green indicates that the feature contributes positively to the model.Red indicates that the feature contributes negatively to the model.The shades of green and red represent the weights.

Figure 6 .
Figure 6.Interpretation of the predictions using the LIME algorithm with different random state values.(A) Predicted values for the model's disaggregated categories, feature coefficients (orange and blue indicate positive and negative relationships, respectively), and feature values in this sample are plotted against (B) local interpretations of features (red and green indicate positive and negative relationships, respectively).

Figure 7 .
Figure 7.A conceptual model for predicting outcomes in AKI patients within the ICU using a limited set of features.The initial conceptual model is designed for ongoing prediction of AKI-related hospitalization outcomes.Firstly, we gather data on the patient's laboratory tests, surgeries, and medication usage.Secondly, relevant features are identified for prediction through feature selection.Thirdly, we introduce a stacking ensemble model, employing fivefold cross-validation to assess patient outcomes.Lastly, the model undergoes analysis using various interpretable methods.

Figure 8 .
Figure 8. Flow chart of the study population selection.ICU-AKI: patients with acute kidney injury (AKI) treated in the Intensive Care Unit (ICU).
. Finally, Local Interpretable Model-Agnostic Explanations (LIME) is employed to analyze how specific values of different characteristics impact the model's predicted outcomes across various categories.This elucidation of clinical parameters leading to high patient mortality facilitates targeted interventions for potentially critical illnesses during clinical practice.

Characteristics Train DataSet 7828 Validation 1 7822 Validation 2 5928 Survival 5555 Death 2273 Survival 6705 Death 840 Survival 5403 Death 525
Patients were categorized as either dead or surviving.Furthermore, data from MIMIC-IV (validation 1) and eICU-CRD (validation 2) were retrieved for external validation using the same criteria.The mortality and survival data for the training set, internal validation set, and external validation set were statistically analyzed a summarized in Table 1.The training set consisted of 7828 cases, of which 2273 resulted in mortality and 5555 in survival.The internal validation set consisted of 3354 cases with 840 deaths and 2514 survivors (see Supplementary Table 1).External validation set 1 (MIMIC-IV) included 7822 cases with 6705 deaths and 1117 survivors.External validation set 2 (eICU-CRD) consisted of 5928 cases, with 5403 survivors and 525

Table 2 .
Evaluation of models for predicting AKI with two-tier feature selection.AUC area under the curve.

Table 3 .
Model performance metrics.Internal and external validation of two-tier feature selection.SVM Support vector machine, NB Naïve Bayes, LR logistic regression, RF random forest, XGBoost extreme gradient boosting, LGBM light gradient boosting machine, AUC area under the curve.

Table 4 .
Model evaluation of predicted AKI using 95% CI for AUC.AUC Area under the curve, CI confidence interval. Single-