Introduction

Sepsis is a severe organ dysfunction caused by a dysregulated host response to infection, with high incidence and mortality, and is a common critical illness1. Approximately 48 million people worldwide suffer from sepsis each year and approximately 11 million people die from it2. Delirium is the most common manifestation of brain dysfunction in critically ill patients, characterized by symptoms such as altered consciousness, impaired attention, disorientation, hallucinations and delusions3. Delirium is a common neurological complication in septic patients in the intensive care unit (ICU), with reported incidence rates ranging from 17.7 to 48%, and its severity is closely associated with patient prognosis4,5. Furthermore, sepsis-induced delirium is also associated with long-term cognitive dysfunction after discharge, causing physical discomfort and pain to patients and a burden to families and the economy6,7.

Sepsis-associated delirium (SAD) is a complex clinical syndrome whose mechanism is not fully understood. It may be related to several factors, including neuroinflammation, cerebral perfusion abnormalities, blood–brain barrier damage, and neurotransmitter imbalances8. Currently, there is no definitive diagnostic criterion for SAD; according to the 2013 Society of Critical Care Medicine guidelines for pain, agitation, and delirium, the Confusion Assessment Method for the ICU (CAM-ICU) is the most effective tool for diagnosing and assessing delirium in adult ICU patients9. There is still no specific treatment for SAD, so early detection and prevention in septic patients are critical to reducing its occurrence and improving prognosis10. Several studies have analyzed the risk factors for SAD in septic patients4,5,11, but there is still no early prediction tool for SAD.

The aim of this study is to develop an early prediction model for SAD using machine learning methods based on sepsis-related data from large public databases and to evaluate the clinical applicability of this model. Our ultimate goal is to provide clinicians with a tool to identify high-risk patients more quickly and comprehensively, allowing for earlier implementation of preventive measures and ultimately reducing the incidence and mortality of SAD.

Materials and methods

Data source

This is a retrospective cohort study based on the Medical Information Mart for Intensive Care-IV (MIMIC-IV, version 2.2) and the eICU Collaborative Research Database (eICU-CRD, version 2.0)12,13. The MIMIC-IV database contains information on all patients admitted to Beth Israel Deaconess Medical Center between 2008 and 2019, while the eICU-CRD is a multicenter telemedicine database containing data from more than 200,000 patients admitted to 335 ICUs in 208 hospitals across the United States between 2014 and 2015. Both databases include comprehensive information for each patient, such as length of stay, laboratory tests, medication management, and vital signs. To protect patient privacy, all personal information was de-identified and random codes were used instead of patient identifiers. Therefore, this study did not require patient consent or ethics approval. One of the researchers (Zhang) completed the training program provided by the collaborating institution (Certificate No. 53496787) and is qualified to use the databases and extract data.

Study population

The diagnosis of sepsis was based on the Third International Consensus Definitions for Sepsis and Septic Shock (Sepsis-3), which defines sepsis as a sequential organ failure assessment (SOFA) score ≥ 2 associated with infection or suspected infection. Suspected infection was defined as antibiotics given within 3 days or 24 h of culture collection1. The following patients were excluded: (1) those aged < 18 years; (2) patients with multiple ICU admissions; (3) patients with an ICU stay of less than 24 h.
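The inclusion and exclusion rules above can be written as a single predicate. The sketch below is only an illustration, not the authors' extraction code: the `IcuStay` record and its field names are hypothetical, and the suspected-infection flag is assumed to have been derived upstream from antibiotic and culture timestamps.

```python
from dataclasses import dataclass


@dataclass
class IcuStay:
    """Hypothetical per-patient record; field names are illustrative only."""
    age_years: float
    sofa_score: int
    suspected_infection: bool  # assumed to be computed upstream
    n_icu_admissions: int      # total ICU admissions for this patient
    icu_hours: float           # length of the ICU stay in hours


def is_eligible(stay: IcuStay) -> bool:
    """Apply the Sepsis-3 rule and the three exclusion criteria."""
    has_sepsis = stay.sofa_score >= 2 and stay.suspected_infection
    return (
        has_sepsis
        and stay.age_years >= 18         # exclude patients aged < 18 years
        and stay.n_icu_admissions == 1   # exclude multiple ICU admissions
        and stay.icu_hours >= 24         # exclude ICU stays shorter than 24 h
    )
```

Keeping the rule in one function makes it straightforward to count how many patients each criterion removes when building a flowchart.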

The presence of delirium was assessed using the CAM-ICU score14, which consists of four features: (1) an acute onset of mental status changes or a fluctuating course; (2) inattention; (3) disorganized thinking; and (4) an altered level of consciousness. A patient is diagnosed as delirious (i.e., CAM-ICU positive) if they exhibit features 1 and 2, along with either feature 3 or feature 4.
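The CAM-ICU decision rule described above reduces to a short Boolean expression. The following sketch uses a hypothetical function name and argument names; it encodes only the rule as stated in the text and does not model how each feature is assessed at the bedside.

```python
def cam_icu_positive(
    acute_or_fluctuating: bool,    # feature 1
    inattention: bool,             # feature 2
    disorganized_thinking: bool,   # feature 3
    altered_consciousness: bool,   # feature 4
) -> bool:
    """Return True when features 1 and 2 are present plus feature 3 or 4."""
    return bool(
        acute_or_fluctuating
        and inattention
        and (disorganized_thinking or altered_consciousness)
    )
```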

We excluded septic patients without documented delirium assessment and septic patients who could not be assessed (documented inability to assess any of the 4 characteristics of the CAM-ICU scale). In addition, patients with a positive delirium assessment before the onset of sepsis and outside the ICU were excluded.

Data extraction and processing

The following data were extracted from the MIMIC-IV and eICU-CRD databases: (1) demographic information; (2) type of initial ICU admission; (3) initial vital signs and laboratory test results within 24 h of ICU admission; (4) SOFA and Glasgow Coma Scale (GCS) scores within 24 h of ICU admission; (5) comorbidities (hypertension, diabetes, acute myocardial infarction, chronic obstructive pulmonary disease, stroke, chronic kidney disease, acute kidney injury); (6) use of mechanical ventilation (MV), continuous renal replacement therapy (CRRT), vasopressors, and sedatives within 24 h of ICU admission; (7) ICU length of stay, 28-day ICU mortality, and time of diagnosis for delirium and sepsis. For continuous variables, outliers and obviously conflicting values were treated as missing values; for example, vital-sign values outside predefined plausible ranges were removed (e.g., heart rate values had to be between 0 and 300). Variables with more than 20% missing values were excluded from the analysis. Missing values were handled by multiple imputation using the “MICE” package15. Unordered multicategorical variables were represented using one-hot encoding.
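The cleaning steps described above (range checks, the 20% missingness cut-off, and one-hot encoding) can be sketched in a few lines. This is an illustration under stated assumptions: the function names are hypothetical, the heart-rate bounds are treated as inclusive, and the multiple-imputation step is not reproduced.

```python
from typing import Dict, List, Optional, Tuple

HEART_RATE_RANGE: Tuple[float, float] = (0.0, 300.0)  # bounds from the text


def clean_value(value: Optional[float],
                valid_range: Tuple[float, float]) -> Optional[float]:
    """Turn out-of-range measurements into missing values (None)."""
    if value is None:
        return None
    low, high = valid_range
    return value if low <= value <= high else None


def missing_fraction(column: List[Optional[float]]) -> float:
    """Share of entries in a column that are missing."""
    return sum(v is None for v in column) / len(column)


def drop_sparse_columns(table: Dict[str, list],
                        max_missing: float = 0.20) -> Dict[str, list]:
    """Keep only variables with at most `max_missing` missing values."""
    return {name: col for name, col in table.items()
            if missing_fraction(col) <= max_missing}


def one_hot(values: List[str], categories: List[str]) -> Dict[str, List[int]]:
    """Encode an unordered categorical variable as one 0/1 column per category."""
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}
```

For instance, `one_hot(["MICU", "CVICU"], ["CVICU", "MICU"])` yields one indicator column for each ICU type, which is how a single admission-type variable expands into several model features.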

Statistical analysis

Continuous variables were expressed as median and interquartile range. The Mann–Whitney U test was used for statistical comparisons between two groups. Categorical variables were described as counts and percentages, and the Chi-squared test or Fisher's exact test was used for group comparisons. Kaplan–Meier survival curves were constructed and compared using the log-rank test.
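For readers unfamiliar with the Kaplan–Meier estimator, the sketch below shows the product-limit calculation on a toy sample. The function name and the example data are hypothetical, and the log-rank comparison between groups is not included.

```python
from typing import List, Tuple


def kaplan_meier(times: List[float], events: List[int]) -> List[Tuple[float, float]]:
    """Product-limit survival estimate.

    times  : follow-up time for each patient
    events : 1 if death was observed at that time, 0 if censored
    Returns (time, survival probability) at each distinct death time.
    """
    survival = 1.0
    curve = []
    death_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in death_times:
        at_risk = sum(1 for ti in times if ti >= t)
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Toy data: five patients, two of them censored.
toy_curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
```

Censored patients stay in the risk set until their last follow-up time, which is why the denominator shrinks even when no death occurs.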

MIMIC-IV data were randomly divided into training and internal validation sets in a 7:3 ratio, with eICU-CRD data serving as the external validation set. Least absolute shrinkage and selection operator (LASSO) regression was used for dimensionality reduction and feature selection16. After feature selection, predictive models were built using the following methods: (1) logistic regression (LR); (2) support vector machine (SVM); (3) decision tree (DT); (4) random forest (RF); (5) extreme gradient boosting (XGBoost); (6) k-nearest neighbors (KNN); and (7) naive Bayes (NB).
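A 7:3 random split can be sketched as follows. The function name and the fixed seed are assumptions made for reproducibility of the example; the text does not say how the randomization was seeded or whether it was stratified by outcome.

```python
import random
from typing import Iterable, List, Tuple


def train_validation_split(ids: Iterable[int],
                           train_fraction: float = 0.7,
                           seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle patient identifiers and cut them into two disjoint sets."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # local generator, global state untouched
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train_ids, validation_ids = train_validation_split(range(10))
```

Splitting on patient identifiers rather than on rows keeps all data from one patient on the same side of the split.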

Model performance was evaluated using area under the receiver operating characteristic curve (AUROC), specificity, sensitivity, positive predictive value (PPV), negative predictive value (NPV), accuracy, and kappa coefficient, with AUROC serving as the primary performance metric. We also evaluated the change in PPV and NPV of the model at different prevalence rates. The model with optimal predictive performance was selected as the primary model for this study. Calibration curves were used to assess the degree of agreement between observed and predicted outcomes, and decision curve analysis (DCA) was used to assess net clinical benefit.
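Apart from AUROC, the metrics listed above all follow from the four cells of a confusion matrix. The sketch below uses a hypothetical function name and toy counts; the kappa shown is Cohen's kappa, which is assumed to be the coefficient meant here.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Derive threshold-dependent metrics from confusion-matrix counts."""
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    # Agreement expected by chance, used by Cohen's kappa.
    p_yes = ((tp + fp) / n) * ((tp + fn) / n)
    p_no = ((tn + fn) / n) * ((tn + fp) / n)
    expected = p_yes + p_no
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": accuracy,
        "kappa": (accuracy - expected) / (1 - expected),
    }


toy_metrics = classification_metrics(tp=40, fp=10, tn=40, fn=10)
```

With these toy counts every metric except kappa equals 0.8, while kappa is 0.6 because half of the agreement would be expected by chance alone.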

The Shapley Additive Explanations (SHAP) method was used to explore the interpretability of the final predictive model. Higher SHAP values indicated an increased likelihood of SAD17. Partial dependence plots (PDPs) were used to display the SHAP values of each feature across its observed range, showing the marginal effect of each feature on the model's predictions and helping clinicians interpret them.
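To make the idea behind SHAP concrete, the sketch below computes exact Shapley values for a toy three-feature risk function by averaging marginal contributions over all feature orderings. Everything here is hypothetical: the `toy_risk` function and its coefficients are invented for illustration, and a single all-zero baseline replaces the background data set that a tree-based SHAP implementation would use.

```python
from itertools import permutations
from typing import Callable, Dict


def shapley_values(predict: Callable[[Dict[str, float]], float],
                   instance: Dict[str, float],
                   baseline: Dict[str, float]) -> Dict[str, float]:
    """Exact Shapley values relative to a single baseline input."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = predict(current)
        for name in order:
            current[name] = instance[name]   # switch this feature "on"
            value = predict(current)
            totals[name] += value - previous  # marginal contribution
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_risk(x: Dict[str, float]) -> float:
    """Invented risk function with one interaction term (not the paper's model)."""
    return 0.10 + 0.30 * x["mv"] + 0.05 * x["aki"] + 0.20 * x["mv"] * x["stroke"]


patient = {"mv": 1, "aki": 1, "stroke": 1}
reference = {"mv": 0, "aki": 0, "stroke": 0}
contributions = shapley_values(toy_risk, patient, reference)
```

The contributions sum to the difference between the patient's prediction and the baseline prediction, and the interaction term is shared equally between the two features involved. Enumerating all orderings is only feasible for a handful of features, which is why practical SHAP implementations rely on model-specific shortcuts.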

To evaluate the application of the model, we applied the final model to another group of patients in the MIMIC-IV database who were not assessed or could not be assessed for delirium and predicted the occurrence of SAD in these individuals.

All statistical analyses were performed using R 4.2.3 (Vienna, Austria) and STATA 15.1 (College Station, Texas), with P < 0.05 considered statistically significant. The machine learning code and the raw patient data are available on GitHub (https://github.com/bbycat927/SAD).

Ethics approval and consent to participate

The MIMIC-IV database was approved by the Institutional Review Boards of Beth Israel Deaconess Medical Center and the Massachusetts Institute of Technology. Access to the eICU-CRD database was approved by the Institutional Review Board of the Massachusetts Institute of Technology. All protected health information in the database was de-identified, eliminating the need for individual patient consent. All methods were performed in accordance with relevant guidelines and regulations.

Results

Participants and baseline characteristics

After applying the exclusion criteria, a total of 14,620 patients from the MIMIC-IV database and 1,723 patients from the eICU-CRD database were included (Fig. 1). Baseline characteristics of all patients are shown in Table 1. In the MIMIC-IV database, there were 5,390 cases of SAD (36.9%). Figure 2A shows the Kaplan–Meier curves for the two groups; 28-day ICU mortality was higher in the SAD group than in the Non-SAD group (P < 0.01, log-rank test). Similarly, ICU length of stay was significantly longer in the SAD group than in the Non-SAD group (Fig. 2B; P < 0.01, Mann–Whitney U test).

Figure 1. Research flowchart. n1, patients excluded in MIMIC-IV database. n2, patients excluded in eICU-CRD database.

Table 1 Baseline Characteristics of SAD and Non-SAD Patients.

Figure 2. (A) Kaplan–Meier survival curves of 28-day ICU mortality for SAD and Non-SAD groups in the MIMIC-IV database. (B) Boxplots of ICU length of stay for SAD and Non-SAD groups in the MIMIC-IV database.

Supplementary Table S1 shows that, in the MIMIC-IV and eICU-CRD databases respectively, the median time to diagnosis of sepsis was 3 and 0 h, the median time to diagnosis of SAD was 24 and 30 h, and the mean time to diagnosis of SAD was 44.9 and 58.7 h.

Feature selection and model development

Initially, 42 feature variables were identified (Table 1), and after one-hot encoding of unordered multi-categorical variables, a total of 53 feature variables were obtained. LASSO regression was then performed. Figure 3A illustrates the cross-validation error as a function of the penalty term. Using the lambda.1se criterion, we retained 43 variables with non-zero coefficients. Figure 3B shows the coefficient profiles of the 53 features across values of the penalty parameter. The 43 selected variables, along with their non-zero coefficient values, are presented in Supplementary Table S2. Based on the selected features, we built a traditional logistic regression model and six machine learning models: SVM, XGBoost, RF, KNN, DT, and NB.
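The lambda.min and lambda.1se criteria can be illustrated with a small helper. The names appear to follow the convention of the R glmnet package, although the text does not name the package used. The sketch assumes the cross-validation means and standard errors have already been computed for each candidate penalty, and the numbers below are invented.

```python
from typing import List, Tuple


def select_lambdas(lambdas: List[float],
                   cv_means: List[float],
                   cv_ses: List[float]) -> Tuple[float, float]:
    """Return (lambda_min, lambda_1se) from cross-validation results.

    lambda_min : penalty with the lowest mean cross-validation error
    lambda_1se : largest penalty whose mean error is within one standard
                 error of that minimum (a sparser, more conservative model)
    """
    best = min(range(len(lambdas)), key=lambda i: cv_means[i])
    limit = cv_means[best] + cv_ses[best]
    eligible = [lam for lam, err in zip(lambdas, cv_means) if err <= limit]
    return lambdas[best], max(eligible)


lambda_min, lambda_1se = select_lambdas(
    lambdas=[0.001, 0.01, 0.05, 0.1],
    cv_means=[0.52, 0.48, 0.49, 0.55],
    cv_ses=[0.02, 0.02, 0.02, 0.02],
)
```

Choosing the larger penalty trades a small, statistically indistinguishable loss in cross-validated error for fewer retained variables.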

Figure 3. (A) Cross-validation plot for the penalty term. The dashed lines represent the lambda.min and lambda.1se. (B) Plots for the LASSO regression coefficients over different values of the penalty parameter. The vertical dashed lines correspond to the lambda.min and lambda.1se from the cross-validation.

Model performance

Table 2 describes the predictive performance of these models on the internal validation set, while Table 3 describes their performance on the external validation set. In terms of AUROC, the XGBoost model outperformed the other models, with an AUROC of 0.793 on the internal validation set and 0.701 on the external validation set. The ROC curves of all seven models are shown in Fig. 4A (internal validation) and Fig. 4B (external validation), which illustrate the superior performance of the XGBoost model.
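AUROC can be read as the probability that a randomly chosen SAD patient receives a higher predicted risk than a randomly chosen Non-SAD patient. The sketch below computes it directly from that definition on invented data; the function name is hypothetical, and the pairwise loop is meant for clarity rather than speed.

```python
from typing import List


def auroc(y_true: List[int], scores: List[float]) -> float:
    """Fraction of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


toy_auroc = auroc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

In the toy example three of the four positive/negative pairs are ordered correctly, so the AUROC is 0.75.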

Table 2 Model performance on the internal validation set.

Table 3 Model performance on the external validation set.

Figure 4. (A) The receiver operating characteristic (ROC) curves of the LR, SVM, XGBoost, RF, KNN, DT, and NB models on the internal validation set. (B) The ROC curves of the LR, SVM, XGBoost, RF, KNN, DT, and NB models on the external validation set. (C) Calibration curves of the XGBoost, RF, SVM models. (D) Decision curves of the XGBoost, RF, SVM models.

To examine the calibration of the models, calibration curves for the three best performing models (XGBoost, RF, SVM) were generated and compared (Fig. 4C). Among them, XGBoost showed the best fit between observed and predicted probabilities, indicating its superior calibration. Decision curve analysis (DCA) was performed on these three models and the results are shown in Fig. 4D. The analysis showed that using the XGBoost prediction model provided the highest net benefit for predicting SAD, outperforming both RF and SVM.
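The net benefit plotted in a decision curve is commonly computed as the true-positive rate minus the false-positive rate weighted by the odds of the threshold probability. The sketch below applies that standard formula to invented data; the function names are hypothetical, and the text does not describe the exact implementation used for Fig. 4D.

```python
from typing import List


def net_benefit(y_true: List[int], risks: List[float], threshold: float) -> float:
    """Net benefit of treating patients whose predicted risk >= threshold."""
    n = len(y_true)
    tp = sum(1 for t, r in zip(y_true, risks) if r >= threshold and t == 1)
    fp = sum(1 for t, r in zip(y_true, risks) if r >= threshold and t == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(y_true: List[int], threshold: float) -> float:
    """Reference strategy: treat every patient regardless of predicted risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


outcomes = [1, 1, 0, 0, 0]
predicted = [0.8, 0.3, 0.6, 0.2, 0.1]
model_nb = net_benefit(outcomes, predicted, threshold=0.5)
treat_all_nb = net_benefit_treat_all(outcomes, threshold=0.5)
```

A model is useful at a given threshold only if its net benefit exceeds both the treat-all value and zero, which is the net benefit of treating no one.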

For further analysis, we evaluated the PPV and NPV of the models at different thresholds (prevalence rates). In the internal validation set, RF showed the highest PPV at a threshold of 0.3, while XGBoost and SVM maintained stable PPV with increasing thresholds. For the external validation set, RF and XGBoost showed superior PPV. However, XGBoost showed consistent PPV across all thresholds. While all models showed an increase in NPV with increasing thresholds, the NPV was generally lower compared to the internal validation set (Supplementary Tables S3 and S4). Overall, these results confirm the robustness of XGBoost, particularly its stability across different prevalence rates.
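One common way to project PPV and NPV onto a different prevalence is Bayes' rule with sensitivity and specificity held fixed. The sketch below shows that calculation with invented values; it is an assumption that this matches the approach behind Supplementary Tables S3 and S4, since the text does not spell out the method.

```python
from typing import Tuple


def ppv_npv_at_prevalence(sensitivity: float,
                          specificity: float,
                          prevalence: float) -> Tuple[float, float]:
    """Project PPV and NPV onto a given prevalence via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Same hypothetical test applied at two different prevalence rates.
balanced = ppv_npv_at_prevalence(0.8, 0.8, prevalence=0.5)
rare = ppv_npv_at_prevalence(0.8, 0.8, prevalence=0.2)
```

Under this fixed-accuracy assumption, PPV falls and NPV rises as the condition becomes rarer, so predictive values from one cohort should not be carried over unchanged to a cohort with a different SAD incidence.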

Model interpretations

To identify the most influential features in the model, we plotted the feature importance ranking for the XGBoost model (top 15 features, Fig. 5A). These features included mechanical ventilation, cardiovascular ICU (CVICU), GCS score, sedation, acute kidney injury (AKI), temperature, anion gap, blood sodium, vasopressors, respiratory rate, age, stroke, bicarbonate, platelets, and white blood cells. The SHAP summary plot (Fig. 5B) complements this ranking by illustrating the impact of each feature on the model's output. Each dot on the plot corresponds to the SHAP value of a feature in a given case. Each row on the y-axis represents a feature, and the position on the x-axis indicates the SHAP value, that is, the direction and magnitude of the feature's effect on the prediction. The color of the dots represents the actual value of the feature, with purple indicating low values and yellow indicating high values (e.g., for MV, yellow dots on the right side of the zero line indicate that the use of MV contributes to a higher risk of SAD).

Figure 5. (A) Feature importance ranking plot of the XGBoost model (top 15 features). (B) SHAP summary plot of the XGBoost model (top 15 features). mv: mechanical ventilation, CVICU: cardiovascular ICU, wbc: white blood cell count, gcs: Glasgow Coma Scale, aki: acute kidney injury.

Partial dependence plots (PDPs) provide a graphical depiction of the marginal effect of a feature on the predicted outcome of a machine learning model (Fig. 6). In these plots, the x-axis represents the actual values of the clinical parameters, while the y-axis represents the corresponding SHAP values. This provides a way to quantify the relationship between the feature and the risk. A key advantage of PDPs is their ability to highlight non-linear relationships between features and the outcome. If the plotted line is not straight, or changes direction, this suggests that the relationship between the feature and the outcome is not linear. Thus, PDPs provide a more nuanced understanding of the model's decision rules beyond what is captured by linear models. For binary features, such as sedation, AKI, and stroke, the two distinct states of the variable are represented along the x-axis. The y-axis shows the average predicted outcome for the instances at each state. For example, a higher average prediction at one state than at the other indicates that this state is more likely to lead to the predicted outcome. It is also worth noting that the fitted curve for binary variables in PDPs does not indicate a trend or gradient as it does for continuous variables, but simply connects the average predictions at the two states.
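The classical partial dependence calculation can be sketched as follows: for each grid value, the feature is overwritten in every row and the model's predictions are averaged. The toy model, its coefficients, and the three example rows are invented. The plots in this study place SHAP values on the y-axis, so this sketch shows the general idea of a marginal effect rather than the exact quantity plotted in Fig. 6.

```python
from typing import Callable, Dict, List


def partial_dependence(predict: Callable[[Dict[str, float]], float],
                       rows: List[Dict[str, float]],
                       feature: str,
                       grid: List[float]) -> List[float]:
    """Average prediction when `feature` is forced to each grid value."""
    curve = []
    for value in grid:
        predictions = [predict({**row, feature: value}) for row in rows]
        curve.append(sum(predictions) / len(predictions))
    return curve


def toy_model(row: Dict[str, float]) -> float:
    """Invented risk model: MV raises risk; temperature has a U-shaped effect."""
    return 0.2 + 0.3 * row["mv"] + 0.02 * abs(row["temperature"] - 37.0)


toy_rows = [
    {"mv": 0, "temperature": 36.5},
    {"mv": 1, "temperature": 38.5},
    {"mv": 1, "temperature": 37.0},
]
mv_curve = partial_dependence(toy_model, toy_rows, "mv", [0, 1])
temp_curve = partial_dependence(toy_model, toy_rows, "temperature", [35.0, 37.0, 39.0])
```

For the binary feature the curve has only two points, and their difference is the average effect of switching the feature on. For temperature the curve dips at 37 and rises on both sides, which is the kind of non-linear pattern a PDP is meant to reveal.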

Figure 6. Partial dependence plots of features. Y-axis represents SHAP values; X-axis represents actual clinical parameters for continuous variables, and for binary variables (e.g., AKI, MV, sedation, stroke), ‘0’ indicates absence and ‘1’ indicates presence of the condition.

Application of the model

In the MIMIC-IV database, a total of 6,625 patients were either not assessed for delirium or could not be assessed, with 330 patients falling into the latter category (Fig. 1). The baseline characteristics of these patients, compared with those of the sepsis patients included in the MIMIC-IV model, are detailed in Supplementary Table S5. These patients had higher 28-day ICU mortality and in-hospital mortality than those included in the model (P < 0.01). Using the XGBoost model, we predicted the occurrence of SAD in these patients. In the total group, 1,833 patients (27.7%) were predicted to develop SAD. Furthermore, when comparing patients who were unassessed with those who could not be assessed, we found a higher predicted SAD incidence in the latter group (44.5% versus 26.8%, P < 0.01). Mortality and ICU length of stay were also higher in the patients who could not be assessed than in those who were unassessed (P < 0.01) (Supplementary Table S5).

Discussion

In this investigation, we found that approximately 36.9% of sepsis patients in the ICU experienced delirium, with SAD patients having higher 28-day ICU mortality and longer ICU stays than Non-SAD patients. We then developed an XGBoost-based machine learning predictive model that demonstrated commendable predictive performance in both internal and external validation, enabling early prediction of SAD on ICU admission. To our knowledge, this is the first study to establish a predictive model for SAD, as previous research has primarily focused on constructing predictive models for delirium18,19,20 or sepsis-associated encephalopathy21,22,23,24,25. Existing research on SAD has predominantly examined risk factors and typically included a limited number of study patients4,5.

Currently, the CAM-ICU score is the most commonly used method for diagnosing delirium, but it requires multiple assessments of the patient before a positive result is possible9,14. In contrast, our machine learning prediction model, based on data from the first 24 h of the patient's ICU admission, is able to predict SAD much earlier, as confirmed by our study results. It is worth noting that the completion rate of delirium assessment by ICU staff (mainly nurses) varies widely, from only 38% in usual care to 84–95% after rigorous intervention26. Failure to complete the assessment has been attributed in part to patient-related factors such as age, language, sedation, and intubation, as well as staff-related issues such as inadequate training, difficulty using assessment tools, and heavy workload27,28. Even when an assessment is completed, a proportion of CAM-ICU scores are recorded as “unable to assess” (UTA) due to sedation, neurological deficits, underlying dementia or speech/hearing impairment. Such unassessable cases have been reported to account for 19–30% of all score records26,29. All of these factors can lead to underestimation of delirium in the ICU, and in our study we also found that many patients had no delirium assessment or were marked as UTA. Our predictive model yielded a predicted SAD incidence of 27.7% in the combined cohort of patients who were unassessed or marked as UTA, which was lower than the observed incidence of 36.9% in the model development cohort, while the predicted SAD incidence in patients marked as UTA rose to 44.5%. Thus, by applying our machine learning prediction model to clinical data, clinicians may be able to identify potential SAD patients more comprehensively. However, further independent validation on different datasets with confirmed SAD diagnoses is needed to assess the generalizability and accuracy of this machine learning model in different clinical settings.

Our study identified mechanical ventilation as the strongest risk factor for SAD, with 50.6% of 6,597 mechanically ventilated patients experiencing delirium, a finding consistent with many delirium-related studies18,19. In a study of mechanically ventilated sepsis patients5, the incidence of SAD reached 48%. In the partial dependence plots, we observed that sedation within 24 h of ICU admission was a protective factor against SAD, which differs from some research findings18. The sedatives in our study included midazolam, dexmedetomidine, and propofol. Relevant studies have shown that the use of benzodiazepines and propofol may increase the risk of delirium30,31, whereas dexmedetomidine may decrease it32. However, the role of sedatives in SAD remains controversial; Kawazoe et al.33 found no significant differences in mortality, delirium-free days, and ventilator-free days between the dexmedetomidine group and other sedative groups (propofol, midazolam, fentanyl) in mechanically ventilated sepsis patients. A large randomized controlled trial showed similar results34. We speculate that our finding may be related to early sedation: early sedation may reduce the duration of mechanical ventilation, which is the strongest risk factor for SAD, and a shorter duration of ventilation would in turn help reduce the incidence of SAD. Stephens et al.35 found that the use of light sedation within the first 48 h of mechanical ventilation could reduce mortality, mechanical ventilation duration, and ICU length of stay. Shehabi et al.36 introduced the concept of early goal-directed sedation, implementing goal-directed sedation as soon as possible (within 12 h) after the initiation of mechanical ventilation, which resulted in less benzodiazepine use, more delirium-free days, and less physical restraint in the early goal-directed sedation group. Notably, the impact of early sedation on patients is closely related to the depth of sedation; early deep sedation is associated with significantly increased rates of delirium, duration of mechanical ventilation, and mortality compared with early light sedation35.

Stroke is also a risk factor for SAD. Some current delirium prediction models list stroke among their exclusion criteria, possibly because of the difficulty in distinguishing overlapping symptoms of delirium and stroke. However, in recent years, there has been an increasing number of studies on delirium in stroke patients. A systematic review of delirium in neuro ICU (NICU) patients suggests the need for delirium assessment in stroke patients, with current tools being applicable for monitoring delirium in both stroke and brain injury patients37. The CAM-ICU score can accurately diagnose delirium after stroke, with a study by Mitasova et al. finding a sensitivity of 76%, specificity of 98%, and accuracy of 94% for the CAM-ICU in diagnosing delirium in stroke patients38. In addition, stroke-related delirium may interfere with the diagnosis of SAD, so we excluded patients whose delirium preceded the onset of sepsis. Reported incidences of delirium range from 10.7 to 16% in stroke patients39,40 and from 12 to 43% in the NICU37. Infection is one of the risk factors for delirium in stroke patients41. The incidence of delirium is higher in sepsis patients with concomitant stroke; in our study, the incidence of delirium reached 50% in sepsis patients with stroke, and the incidence of SAD in the NICU was 56.2%.

Our results indicate that admission to the CVICU is a protective factor against SAD, with an incidence rate of 19.9% in the CVICU, similar to some studies42. The initial 24-h GCS score is also an important predictor of SAD, consistent with the results of the two most recent delirium prediction models18,19. Other predictive factors such as AKI, temperature, anion gap, blood sodium, vasopressors, respiratory rate, age, bicarbonate, platelets, and white blood cells have also been validated by similar studies or predictive models19,20,21,22,23,24,25. PDPs suggest that some of these predictors, for example GCS score, temperature, sodium, and bicarbonate, have a nonlinear relationship with the occurrence of SAD.

Our study has several limitations. First, there is currently no definitive diagnostic criterion for SAD. Although we established several inclusion and exclusion criteria, misdiagnoses and missed diagnoses remain inevitable. Second, we used LASSO regression for feature selection because of its efficiency in handling large numbers of variables, but this approach may not be optimal for all models and may miss complex, non-linear relationships within the data. Third, the risk factor analysis based on PDPs relies on the assumption of feature independence, which may not hold. Finally, we did not further analyze the effects of sedative drug types, doses, and duration of use on SAD, which may confound the interpretation of our predictive variables.

Conclusion

SAD is common in ICU sepsis patients, with higher mortality rates and longer ICU stays than sepsis alone. Using our machine learning-based early prediction model, we can predict the risk of SAD earlier than delirium can be detected by traditional tools such as CAM-ICU, and this model can be applied to patients who are difficult to assess conventionally. The establishment of this model facilitates early risk identification and the implementation of preventive measures, potentially reducing the incidence and mortality of SAD.