Development of a machine learning-based prediction model for sepsis-associated delirium in the intensive care unit

Septic patients in the intensive care unit (ICU) often develop sepsis-associated delirium (SAD), which is strongly associated with poor prognosis. The aim of this study was to develop a machine learning-based model for the early prediction of SAD. Patient data were extracted from the Medical Information Mart for Intensive Care IV (MIMIC-IV) database and the eICU Collaborative Research Database (eICU-CRD). The MIMIC-IV data were divided into a training set and an internal validation set, while the eICU-CRD data served as an external validation set. Feature variables were selected using least absolute shrinkage and selection operator (LASSO) regression, and prediction models were built using logistic regression, support vector machines, decision trees, random forests, extreme gradient boosting (XGBoost), k-nearest neighbors and naive Bayes methods. The performance of the models was evaluated in the validation sets. The final model was also applied to a group of patients who were not assessed or could not be assessed for delirium. The MIMIC-IV and eICU-CRD databases included 14,620 and 1723 patients, respectively, with a median time to diagnosis of SAD of 24 and 30 h. Compared with Non-SAD patients, SAD patients had higher 28-day ICU mortality and longer ICU stays. Among the models compared, the XGBoost model had the best performance and was selected as the final model (internal validation area under the receiver operating characteristic curve (AUROC) = 0.793, external validation AUROC = 0.701). The XGBoost model outperformed other models in predicting SAD. The establishment of this predictive model allows for earlier prediction of SAD compared to traditional delirium assessments and is applicable to patients who are difficult to assess with traditional methods.


Study population. The diagnosis of sepsis was based on the Third International Consensus Definitions for Sepsis and Septic Shock (Sepsis-3), which defines sepsis as a sequential organ failure assessment (SOFA) score ≥ 2 associated with infection or suspected infection. Suspected infection was defined as antibiotics given within 3 days or 24 h of culture collection 1. The following patients were excluded: (1) those aged < 18 years; (2) patients with multiple ICU admissions; (3) patients with an ICU stay of less than 24 h.
The presence of delirium was assessed using the CAM-ICU score, which consists of four features: (1) an acute onset of mental status changes or a fluctuating course; (2) inattention; (3) disorganized thinking; and (4) an altered level of consciousness. A patient is diagnosed as delirious (i.e., CAM-ICU positive) if they exhibit features 1 and 2, along with either feature 3 or 4 14.
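The CAM-ICU decision rule described above can be written as a short Boolean function. This is a minimal sketch for illustration only; the function and argument names are hypothetical and do not come from the study code.

```python
def cam_icu_positive(acute_or_fluctuating: bool,
                     inattention: bool,
                     disorganized_thinking: bool,
                     altered_consciousness: bool) -> bool:
    """Return True when features 1 and 2 are present with feature 3 or 4."""
    return (acute_or_fluctuating
            and inattention
            and (disorganized_thinking or altered_consciousness))


# Features 1 and 2 plus an altered level of consciousness: CAM-ICU positive.
example_positive = cam_icu_positive(True, True, False, True)
# Inattention is absent, so the rule is not met even with features 3 and 4.
example_negative = cam_icu_positive(True, False, True, True)
```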
We excluded septic patients without a documented delirium assessment and septic patients who could not be assessed (documented inability to assess any of the 4 characteristics of the CAM-ICU scale). In addition, patients with a positive delirium assessment before the onset of sepsis and outside the ICU were excluded.
Data extraction and processing. The following data were extracted from the MIMIC-IV and eICU-CRD databases: (1) demographic information; (2) type of initial ICU admission; (3) initial vital signs and laboratory test results within 24 h of ICU admission; (4) SOFA and Glasgow Coma Scale (GCS) scores within 24 h of ICU admission; (5) comorbidities (hypertension, diabetes, acute myocardial infarction, chronic obstructive pulmonary disease, stroke, chronic kidney disease, acute kidney injury); (6) use of mechanical ventilation (MV), continuous renal replacement therapy (CRRT), vasopressors, and sedatives within 24 h of ICU admission; (7) ICU length of stay, 28-day ICU mortality, and diagnosis times for delirium and sepsis. For continuous variables, outliers and obviously conflicting values were treated as missing values; for example, vital-sign values were screened using fixed plausibility rules (e.g., heart rate values should be between 0 and 300). Variables with more than 20% missing values were excluded from the analysis. Multiple imputation for missing values was performed using the "MICE" package 15. For unordered multicategorical variables, one-hot coding was used to represent them.
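The cleaning steps described above (out-of-range values set to missing, variables with more than 20% missing values dropped, one-hot coding of unordered categories) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' R pipeline: the variable names and helper functions are hypothetical, only the heart-rate range is taken from the text, the range is assumed to be inclusive, and the multiple imputation step is deliberately omitted.

```python
# Hypothetical plausibility limits; the text only states the heart-rate range.
VALID_RANGES = {"heart_rate": (0.0, 300.0)}
MAX_MISSING_FRACTION = 0.20


def mask_outliers(rows, valid_ranges=VALID_RANGES):
    """Replace values outside their plausible range with None (missing)."""
    cleaned = []
    for row in rows:
        new_row = dict(row)
        for name, (low, high) in valid_ranges.items():
            value = new_row.get(name)
            if value is not None and not (low <= value <= high):
                new_row[name] = None
        cleaned.append(new_row)
    return cleaned


def drop_sparse_variables(rows, max_missing=MAX_MISSING_FRACTION):
    """Keep only variables whose missing fraction does not exceed max_missing."""
    names = sorted({name for row in rows for name in row})
    keep = []
    for name in names:
        missing = sum(1 for row in rows if row.get(name) is None)
        if missing / len(rows) <= max_missing:
            keep.append(name)
    return [{name: row.get(name) for name in keep} for row in rows]


def one_hot(rows, name):
    """Expand one unordered categorical variable into 0/1 indicator columns."""
    levels = sorted({row[name] for row in rows if row[name] is not None})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != name}
        for level in levels:
            new_row[f"{name}_{level}"] = int(row[name] == level)
        encoded.append(new_row)
    return encoded


# Toy records (invented values, not patient data).
records = [
    {"heart_rate": 88, "lactate": None, "icu_type": "MICU"},
    {"heart_rate": 412, "lactate": None, "icu_type": "CVICU"},
    {"heart_rate": 101, "lactate": 2.1, "icu_type": "MICU"},
    {"heart_rate": 76, "lactate": None, "icu_type": "SICU"},
    {"heart_rate": 95, "lactate": None, "icu_type": "MICU"},
]
prepared = one_hot(drop_sparse_variables(mask_outliers(records)), "icu_type")
```

In this toy example the implausible heart rate of 412 becomes missing, the lactate column is dropped because four of five values are missing, and the ICU type is expanded into three indicator columns.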

Statistical analysis.
Continuous variables were expressed as median and interquartile range. The Mann-Whitney U test was used for statistical comparisons between two groups. Categorical variables were described as counts and percentages, and the Chi-squared test or Fisher's exact test was used for group comparisons. Kaplan-Meier survival curves were constructed and compared using the log-rank test.
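For readers unfamiliar with the Kaplan-Meier estimator mentioned above, the product-limit calculation can be sketched in a few lines. The data below are invented for illustration, the function name is hypothetical, and the log-rank comparison between groups is not shown.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times  : follow-up time of each patient
    events : 1 if death was observed at that time, 0 if the patient was censored
    """
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        at_risk = sum(1 for x in times if x >= t)
        deaths = sum(1 for x, e in zip(times, events) if x == t and e == 1)
        if deaths:
            # Multiply by the conditional probability of surviving past time t.
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
    return curve


# Five hypothetical patients: follow-up in days and whether death was observed.
toy_curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
```

Censored patients contribute to the number at risk until their last follow-up time but never trigger a drop in the curve.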
Model performance was evaluated using area under the receiver operating characteristic curve (AUROC), specificity, sensitivity, positive predictive value (PPV), negative predictive value (NPV), accuracy, and kappa coefficient, with AUROC serving as the primary performance metric. We also evaluated the change in PPV and NPV of the model at different prevalence rates. The model with optimal predictive performance was selected as the primary model for this study. Calibration curves were used to assess the degree of agreement between observed and predicted outcomes, and decision curve analysis (DCA) was used to assess net clinical benefit.
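The performance metrics listed above can all be derived from predicted risks and observed labels. The following sketch shows one way to compute them; the function names, the 0.5 cut-off and the toy data are assumptions made for illustration, and edge cases with empty cells (division by zero) are not handled.

```python
def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV, NPV, accuracy and Cohen's kappa."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    n = tp + tn + fp + fn
    accuracy = (tp + tn) / n
    # Agreement expected by chance, from the marginal totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": accuracy,
        "kappa": (accuracy - expected) / (1.0 - expected),
    }


def auroc(y_true, scores):
    """Probability that a random positive case outranks a random negative case."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))


# Invented labels and predicted risks for eight hypothetical patients.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
risks = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
predicted = [1 if r >= 0.5 else 0 for r in risks]  # assumed cut-off of 0.5
metrics = classification_metrics(labels, predicted)
area = auroc(labels, risks)
```

AUROC depends only on the ranking of the predicted risks, whereas the other metrics depend on the chosen cut-off, which is one reason it is commonly used as the primary metric.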
The Shapley Additive Explanations (SHAP) method was used to explore the interpretability of the final predictive model. Higher SHAP values indicated an increased likelihood of SAD 17. Partial dependence plots (PDPs) were used to display the SHAP values of each feature against its observed values, helping clinicians interpret individual predictions. PDPs can show the marginal effects of each feature on the predictions of the machine learning model.
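To make the idea behind SHAP values concrete, the sketch below computes exact Shapley values by brute force for a toy three-feature risk function. Everything here is hypothetical: the risk function, its coefficients, the baseline patient and the function names are invented, and features outside a coalition are simply replaced by the baseline value. Tree-based explainers use far more efficient algorithms; this only illustrates the definition.

```python
from itertools import combinations
from math import factorial


def toy_risk(features):
    """Invented risk function of (mechanical ventilation, GCS score, age)."""
    mv, gcs, age = features
    return 0.10 + 0.30 * mv + 0.02 * (15 - gcs) + 0.10 * mv * (age > 65)


def shapley_values(model, x, baseline):
    """Exact Shapley values of each feature of x relative to a baseline input."""
    n = len(x)

    def value(subset):
        # Features in the coalition take the patient's value, others the baseline.
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(mixed)

    contributions = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                total += weight * (value(coalition | {i}) - value(coalition))
        contributions.append(total)
    return contributions


patient = [1, 9, 70]      # ventilated, GCS 9, 70 years old (hypothetical)
reference = [0, 15, 50]   # assumed baseline patient
phi = shapley_values(toy_risk, patient, reference)
```

The contributions add up to the difference between the patient's predicted risk and the baseline risk, which is the additivity property that gives the method its name.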
To evaluate the application of the model, we applied the final model to another group of patients in the MIMIC-IV database who were not assessed or could not be assessed for delirium and predicted the occurrence of SAD in these individuals.
All statistical analyses were performed using R 4.2.3 (Vienna, Austria) and STATA 15.1 (College Station, Texas), with P < 0.05 considered statistically significant. The machine learning code and the raw patient data are available on GitHub (https://github.com/bbycat927/SAD).
Ethics approval and consent to participate. The MIMIC-IV database was approved by the Institutional Review Boards of Beth Israel Deaconess Medical Center and the Massachusetts Institute of Technology. Access to the eICU-CRD database was approved by the Institutional Review Board of the Massachusetts Institute of Technology. All protected health information in the database was de-identified, eliminating the need for individual patient consent. All methods were performed in accordance with relevant guidelines and regulations.

Results
Participants and baseline characteristics. After applying the exclusion criteria, a total of 14,620 patients from the MIMIC-IV database and 1723 patients from the eICU-CRD database were included (Fig. 1). Baseline characteristics of all patients are shown in Table 1. In the MIMIC-IV database, there were 5,390 cases of SAD (36.9%). Figure 2A shows the Kaplan-Meier curves for the two groups, with a higher 28-day ICU mortality rate in the SAD group than in the Non-SAD group (P < 0.01, log-rank test). Similarly, ICU length of stay was significantly longer in the SAD group than in the Non-SAD group (Fig. 2B, P < 0.01, Mann-Whitney U test).
Supplementary Table S1 shows that the median time to diagnosis of sepsis in the MIMIC-IV and eICU-CRD databases was 3 and 0 h, the median time to diagnosis of SAD was 24 and 30 h, and the mean time to diagnosis of SAD was 44.9 and 58.7 h.

Feature selection and model development.
Initially, 42 feature variables were identified (Table 1), and after one-hot coding of unordered multi-categorical variables, a total of 53 feature variables were obtained. LASSO regression was then performed. Figure 3A illustrates the cross-validation error for the penalty term. Using the lambda.1se criterion, we identified 43 variables with significant predictive ability. Figure 3B shows the coefficient profiles for these 53 features in LASSO regression, indicating the optimal point for retaining variables with non-zero coefficients. These 43 selected variables, along with their non-zero coefficient values, are presented in Supplementary Table S2. Based on the selected features, we built a traditional logistic regression model and six machine learning models: SVM, XGBoost, RF, KNN, DT, and NB.

Model performance.
Table 2 describes the predictive performance of these models on the internal validation set, while Table 3 describes their performance on the external validation set. The XGBoost model achieved an AUROC of 0.793 on the internal validation set and 0.701 on the external validation set. The ROC curves of all models are shown in Fig. 4A and Fig. 4B, highlighting the superior performance of the XGBoost model.
To examine the calibration of the models, calibration curves for the three best performing models (XGBoost, RF, SVM) were generated and compared (Fig. 4C). Among them, XGBoost showed the best fit between observed and predicted probabilities, indicating its superior calibration. Decision curve analysis (DCA) was performed on these three models and the results are shown in Fig. 4D. The analysis showed that using the XGBoost prediction model provided the highest net benefit for predicting SAD, outperforming both RF and SVM.
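The net benefit plotted in a decision curve can be computed with the standard formula, net benefit = TP/n - (FP/n) x pt/(1 - pt), where pt is the threshold probability. The sketch below applies it to invented data; the function names and values are hypothetical and are not taken from the study.

```python
def net_benefit(y_true, risks, threshold):
    """Net benefit of treating patients whose predicted risk is >= threshold."""
    n = len(y_true)
    tp = sum(1 for t, r in zip(y_true, risks) if r >= threshold and t == 1)
    fp = sum(1 for t, r in zip(y_true, risks) if r >= threshold and t == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy in which every patient is treated as high risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Invented outcomes and predicted risks for eight hypothetical patients.
outcomes = [1, 1, 1, 0, 0, 0, 0, 1]
predicted_risks = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
model_benefit = net_benefit(outcomes, predicted_risks, 0.5)
treat_all_benefit = net_benefit_treat_all(outcomes, 0.5)
```

A decision curve repeats this calculation over a range of thresholds and compares each model with the treat-all and treat-none (net benefit of zero) strategies.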
For further analysis, we evaluated the PPV and NPV of the models at different thresholds (prevalence rates). In the internal validation set, RF showed the highest PPV at a threshold of 0.3, while XGBoost and SVM maintained stable PPV with increasing thresholds. For the external validation set, RF and XGBoost showed superior PPV. However, XGBoost showed consistent PPV across all thresholds. While all models showed an increase in NPV with increasing thresholds, the NPV was generally lower compared to the internal validation set (Supplementary Tables S3 and S4). Overall, these results confirm the robustness of XGBoost, particularly its stability across different prevalence rates.
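As general background on prevalence-adjusted predictive values, PPV and NPV can be recomputed for any assumed prevalence from a fixed sensitivity and specificity using Bayes' rule. The sketch below shows that calculation; the sensitivity, specificity and prevalence grid are invented, and this is not necessarily how the values in Supplementary Tables S3 and S4 were produced.

```python
def ppv_npv_at_prevalence(sensitivity, specificity, prevalence):
    """Predictive values implied by Bayes' rule at a given prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Hypothetical operating point and prevalence grid.
ASSUMED_SENSITIVITY = 0.70
ASSUMED_SPECIFICITY = 0.80
table = {
    prevalence: ppv_npv_at_prevalence(ASSUMED_SENSITIVITY,
                                      ASSUMED_SPECIFICITY,
                                      prevalence)
    for prevalence in (0.1, 0.3, 0.5)
}
```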

Model interpretations.
To identify the most influential features in the model, we plotted the feature importance ranking for the XGBoost model (top 15 features, Fig. 5A). These features included mechanical ventilation, cardiovascular ICU (CVICU), GCS score, sedation, acute kidney injury (AKI), temperature, anion gap, blood sodium, vasopressors, respiratory rate, age, stroke, bicarbonate, platelets, and white blood cells. The SHAP summary plot (Fig. 5B) complements this ranking by illustrating the impact of each feature on the model's output. Each dot on the plot corresponds to a SHAP value for a feature in a given case. The y-axis represents a feature, and the x-axis location indicates the SHAP value or the magnitude of the feature's effect on the prediction. The color of the dots represents the actual value of the feature, with purple indicating low values and yellow indicating high values (e.g., for MV, yellow dots on the right side of the zero line indicate higher MV values contributing to a higher risk of SAD). Partial dependence plots (PDPs) provide a graphical depiction of the marginal effect of a feature on the predicted outcome of a machine learning model (Fig. 6). In these plots, the x-axis represents the actual values of the clinical parameters, while the y-axis represents the corresponding SHAP values. This provides a way to quantify the relationship between the feature and the risk. A key advantage of PDPs is their ability to highlight non-linear relationships between features and the outcome. If the plotted line is not straight, or changes direction, this suggests that the relationship between the feature and the outcome is not linear. Thus, PDPs provide a more nuanced understanding of the model's decision rules beyond what is captured by linear models. For binary features, such as sedation, AKI, and stroke, the two distinct states of the variable are represented along the x-axis. The y-axis shows the average predicted outcome for the instances at each state. For example, a higher average prediction at one state over the other indicates that this state has a higher likelihood of leading to the predicted outcome. It is also worth noting that curve fitting for binary variables in PDPs does not indicate a trend or gradient as it does for continuous variables, but simply connects the average predictions at the two states.
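The marginal-effect idea behind these plots can be illustrated with the classical partial dependence calculation, in which one feature is fixed at each grid value and the model's predictions are averaged over all patients. Note that this classical version averages predictions, whereas the plots described above place SHAP values on the y-axis, so it is a related but not identical quantity. The risk function, data and names below are invented for illustration.

```python
def toy_model(features):
    """Invented risk function of (sedation, temperature in degrees Celsius)."""
    sedation, temperature = features
    return 0.2 + 0.1 * sedation + 0.05 * (temperature - 37.0) ** 2


def partial_dependence(model, rows, feature_index, grid):
    """Average prediction when one feature is fixed at each grid value."""
    curve = []
    for value in grid:
        predictions = []
        for row in rows:
            modified = list(row)
            modified[feature_index] = value
            predictions.append(model(modified))
        curve.append((value, sum(predictions) / len(predictions)))
    return curve


# Three hypothetical patients: (sedation, temperature).
cohort = [[0, 36.0], [1, 37.0], [1, 39.0]]
sedation_curve = partial_dependence(toy_model, cohort, 0, [0, 1])
temperature_curve = partial_dependence(toy_model, cohort, 1, [36.0, 37.0, 38.0])
```

For the binary sedation feature the result is just two averages, while the temperature curve in this toy model is U-shaped, which is the kind of non-linear pattern such plots are meant to reveal.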
Application of the model. In the MIMIC-IV database, there were a total of 6625 patients who were either not assessed or unable to be assessed for delirium, with 330 patients falling into the latter category (Fig. 1). The baseline characteristics of these patients compared with those with sepsis included in the MIMIC-IV model are detailed in Supplementary Table S5. These patients had higher 28-day ICU mortality and in-hospital mortality compared with those in the model (P < 0.01). Using the XGBoost model, we predicted the occurrence of SAD in these patients. In the total group, 1833 patients (27.7%) were predicted to develop SAD. Furthermore, when comparing patients who were unassessed and those who could not be assessed, we found a higher predicted SAD incidence rate in the latter group, at 44.5% compared to 26.8% in the former group (P < 0.01). Mortality rate and ICU length of stay were also higher in the group of patients who could not be assessed than in those who were unassessed (P < 0.01) (Supplementary Table S5).

Discussion
In this investigation, we found that approximately 36.9% of sepsis patients in the ICU experienced delirium, with SAD patients having higher 28-day mortality rates and longer ICU stays compared to Non-SAD patients.
We then developed an XGBoost-based machine learning predictive model that demonstrated commendable predictive performance in both internal and external validation, enabling early prediction of SAD on ICU admission. To our knowledge, this is the first study to establish a predictive model for SAD, as previous research has primarily focused on constructing predictive models for delirium 18-20 or sepsis-associated encephalopathy 21-25.
Existing research on SAD has predominantly examined risk factors and typically included a limited number of study patients 4,5. Currently, the CAM-ICU score is the most commonly used method for diagnosing delirium, but it requires multiple assessments of the patient before a positive result is possible 9,14. In contrast, our machine learning prediction model, based on data from the first 24 h of the patient's ICU admission, is able to predict SAD much earlier, as confirmed by our study results. It is worth noting that the completion of the delirium assessment by ICU staff (mainly nurses) varies widely, from only 38% in usual care to 84-95% after rigorous intervention 26. Failure to complete has been attributed in part to patient-related factors such as age, language, sedation, and intubation, as well as staff-related issues such as inadequate training, difficulty using assessment tools, and heavy workload 27,28. Even when an assessment is completed, a proportion of CAM-ICU scores are recorded as "unable to assess" (UTA) due to sedation, neurological deficits, underlying dementia or speech/hearing impairment. Such unassessable cases have been reported to account for 19-30% of all score records 26,29. All of these factors can lead to underestimation of delirium in the ICU, and in our study we also found that many patients had no delirium assessment or were marked as UTA. Our predictive model yielded a predicted SAD incidence of 27.7% in the combined cohort of unassessed and unassessable patients, which was lower than the observed incidence of 36.9% in the model development cohort, while the predicted incidence in patients marked as UTA rose to 44.5%. Thus, by applying our machine learning prediction model to clinical data, clinicians may be able to identify potential SAD patients more comprehensively. However, it should be noted that further independent validation with different datasets with confirmed SAD diagnoses is needed to assess the generalizability and accuracy of this machine learning model in different clinical settings.
Our study identified mechanical ventilation as the strongest risk factor for SAD, with 50.6% of 6597 mechanically ventilated patients experiencing delirium, a finding consistent with many delirium-related studies 18,19. In a study of mechanically ventilated sepsis patients, the incidence of SAD reached 48% 5. In some partial dependence plots, we observed that sedation within 24 h of ICU admission was a protective factor against SAD, which differs from some research findings 18. Our sedatives included midazolam, dexmedetomidine, and propofol. Relevant studies have shown that the use of benzodiazepines and propofol may increase the risk of delirium 30,31, whereas dexmedetomidine may decrease it 32. However, the role of sedatives in SAD remains controversial; research by Yu Kawazoe et al. 33 found no significant differences in mortality, delirium-free days, and ventilator-free days between the dexmedetomidine group and other sedative groups (propofol, midazolam, fentanyl) in mechanically ventilated sepsis patients. A large randomized controlled trial showed similar results 34. We speculate that our finding may be related to early sedation: early sedation may shorten the duration of mechanical ventilation, which is the strongest risk factor for SAD, and this reduction would in turn lower the incidence of SAD.
Research by Stephens et al. 35 found that the use of light sedation within the first 48 h of mechanical ventilation could reduce mortality, mechanical ventilation duration, and ICU length of stay.Shehabi et al. 36 introduced the concept of early goal-directed sedation, implementing goal-directed sedation as soon as possible (12 h) after the initiation of mechanical ventilation, resulting in less benzodiazepine use, more delirium-free days, and less physical restraint in the early goal-directed therapy group.Notably, the impact of early sedation on patients is closely related to the depth of sedation; early deep sedation is associated with significantly increased rates of delirium, duration of mechanical ventilation, and mortality compared with early light sedation 35 .
Stroke is also a risk factor for SAD. Some of the current predictive models associated with delirium tend to list stroke among their exclusion criteria, possibly due to the difficulty in distinguishing overlapping symptoms between delirium and stroke. However, in recent years, there has been an increasing number of studies on delirium in stroke patients. A systematic review of delirium in neuro ICU (NICU) patients suggests the need for delirium assessment in stroke patients, with current tools being applicable for monitoring delirium in both stroke and brain injury patients 37. The CAM-ICU score can accurately diagnose delirium after stroke, with a study by Mitasova et al. finding a sensitivity of 76%, specificity of 98%, and accuracy of 94% for the CAM-ICU in diagnosing delirium in stroke patients 38. In addition, stroke-related delirium may interfere with the diagnosis of SAD, so we excluded patients with delirium before the onset of sepsis. Studies have shown that the incidence of delirium in stroke patients ranges from 10.7 to 16% 39,40, while the incidence of delirium in the NICU ranges from 12 to 43% 37. Infection is one of the risk factors for delirium in stroke patients 41. The incidence of delirium is higher in sepsis patients with concomitant stroke; in our study, the incidence of delirium reached 50% in sepsis patients with stroke, and the incidence of SAD in the NICU was 56.2%.
Our results indicate that admission to the CVICU is a protective factor against SAD, with an incidence rate of 19.9% in the CVICU, similar to some studies 42. The initial 24-h GCS score is also an important predictor of SAD, consistent with the results of the two most recent delirium prediction models 18,19. Other predictive factors such as AKI, temperature, anion gap, blood sodium, vasopressors, respiratory rate, age, bicarbonate, platelets, and white blood cells have also been validated by similar studies or predictive models 19-25. PDPs suggest that some of these predictors, for example GCS score, temperature, sodium, and bicarbonate, have a nonlinear relationship with the occurrence of SAD.
Our study has several limitations. First, there is currently no definitive diagnostic criterion for SAD. Although we established several inclusion and exclusion criteria, misdiagnosis and missed diagnoses remain inevitable. Second, we used LASSO regression for feature selection due to its efficiency in handling large numbers of variables, which may not be optimal for all models and may miss complex, non-linear relationships within the data. Third, it is important to note that the risk factor analysis based on PDPs may be subject to the assumption of feature independence. Finally, we did not further analyze the effects of sedative drug types, doses, and duration of use on SAD, which may complicate our predictive variables.

Conclusion
SAD is common in ICU sepsis patients, with higher mortality rates and longer ICU stays than sepsis alone. Using our machine learning-based early prediction model, we can predict the risk of SAD earlier than delirium can be detected by traditional tools such as CAM-ICU, and this model can be applied to patients who are difficult to assess conventionally. The establishment of this model facilitates early risk identification and the implementation of preventive measures, potentially reducing the incidence and mortality of SAD.

Figure 2. (A) Kaplan-Meier survival curves of 28-day ICU mortality for SAD and Non-SAD groups in the MIMIC-IV database. (B) Boxplots of ICU length of stay for SAD and Non-SAD groups in the MIMIC-IV database.

Figure 3. (A) Cross-validation plot for the penalty term. The dashed lines represent lambda.min and lambda.1se. (B) Plots for the LASSO regression coefficients over different values of the penalty parameter. The vertical dashed lines correspond to lambda.min and lambda.1se from the cross-validation.

Figure 4. (A) The receiver operating characteristic (ROC) curves of the LR, SVM, XGBoost, RF, KNN, DT, and NB models on the internal validation set. (B) The ROC curves of the LR, SVM, XGBoost, RF, KNN, DT, and NB models on the external validation set. (C) Calibration curves of the XGBoost, RF, and SVM models. (D) Decision curves of the XGBoost, RF, and SVM models.

Table 1. Baseline characteristics of SAD and Non-SAD patients. Continuous variables were expressed as median and interquartile range; the Mann-Whitney U test was used for statistical comparisons between two groups. Categorical variables were described as counts and percentages, and the Chi-squared test or Fisher's exact test was used for group comparisons. ICU intensive care unit, CCU coronary care unit, CVICU cardiovascular ICU, MICU medical ICU, SICU surgical ICU, NICU neuro ICU, TSICU trauma-neuro surgical ICU, BP blood pressure, WBC white blood cell count, BUN blood urea nitrogen, INR international normalized ratio, PTT partial thromboplastin time, GCS Glasgow Coma Scale, SOFA sequential organ failure assessment, MV mechanical ventilation, CRRT continuous renal replacement therapy, AMI acute myocardial infarction, CKD chronic kidney disease, COPD chronic obstructive pulmonary disease, AKI acute kidney injury.

Table 2. Model performance on the internal validation set.

Table 3. Model performance on the external validation set.