Prediction models for postoperative delirium in elderly patients with machine-learning algorithms and SHapley Additive exPlanations

Postoperative delirium (POD) is a common and severe complication in elderly patients with hip fractures. Identifying patients at high risk of POD can help improve their outcomes. We conducted a retrospective study of elderly patients (≥65 years of age) who underwent orthopedic surgery for hip fracture between January 2014 and August 2019. Conventional logistic regression and five machine-learning algorithms were used to construct prediction models of POD. A nomogram for POD prediction was built with the logistic regression method. The area under the receiver operating characteristic curve (AUC-ROC), accuracy, sensitivity, and precision were calculated to evaluate the models. Feature importance for individual patients was interpreted using SHapley Additive exPlanations (SHAP). A total of 797 patients were enrolled in the study, and the incidence of POD was 9.28% (74/797). Age, renal insufficiency, chronic obstructive pulmonary disease (COPD), use of antipsychotics, lactate dehydrogenase (LDH), and C-reactive protein (CRP) were used to build a nomogram for POD with an AUC of 0.71. The AUCs of the five machine-learning models were 0.81 (Random Forest), 0.80 (GBM), 0.68 (AdaBoost), 0.77 (XGBoost), and 0.70 (SVM). The accuracies of the six models ranged from 68.8% (logistic regression and SVM) to 91.9% (Random Forest). The precisions of the six models ranged from 18.3% (logistic regression) to 67.8% (SVM). Six prediction models of POD in patients with hip fractures were constructed using logistic regression and five machine-learning algorithms. The application of machine-learning algorithms could provide convenient POD risk stratification to benefit elderly hip fracture patients.


INTRODUCTION
Hip fracture is a common type of fracture in elderly patients. By 2050, it is estimated that more than 50% of osteoporotic fractures will be hip fractures in Asia [1]. As life expectancy increases, more elderly patients choose surgery to treat hip fractures for a better prognosis. Postoperative delirium (POD) is a common and severe complication in patients with hip fractures [2][3][4]. POD commonly occurs 2-7 days after surgery. It is associated with loss of independence, increased morbidity and mortality, institutionalization, and a prolonged hospital stay with higher healthcare costs [3,5]. Researchers have found that multifactor prevention and treatment can benefit one-third of delirium cases [6]. By identifying high-risk patients, clinicians can improve the outcomes of patients with hip fractures through timely intervention.
In various clinical domains, machine-learning methods have proven helpful in predicting events of interest [7][8][9][10]. Some studies have developed POD prediction models for hip fracture patients with conventional logistic regression methods [11][12][13][14][15], but few have proposed prediction models based on machine learning. Furthermore, the results of these studies were not entirely satisfactory, with areas under the receiver operating characteristic curve (AUCs) of 0.779-0.79 [16,17]. More attempts are needed to better predict POD in hip fracture patients using machine-learning methods.
Thus, we aimed to develop prediction models of POD with conventional logistic regression and machine-learning algorithms to support clinical decision-making.

MATERIALS AND METHODS
Study design and patients
This was a retrospective study. We analyzed a cohort of patients who underwent hip fracture surgery at the Chinese PLA General Hospital from January 2014 to April 2019. The inclusion criteria were: (1) age ≥65 years; (2) surgery for hip fracture under anesthesia. The exclusion criteria were: (1) secondary surgery for hip fracture; (2) hip fracture caused by tumors.

Ethics statements
The study was conducted in accordance with the Declaration of Helsinki and was approved by the Ethics Committee Board of the First Medical Center of the Chinese PLA General Hospital (Number: S2019-311-03). All data were anonymized before analysis, and patient consent was waived due to the retrospective study design.

Data collection
The hip fracture dataset was established from the medical record system. We collected preoperative and intraoperative parameters. The basic characteristics of patients included age, sex, body mass index (BMI), smoking, alcohol use, and history of hypertension, diabetes, cardiovascular diseases (CHD), chronic obstructive pulmonary disease (COPD), renal insufficiency, cerebrovascular disease, depression, and anxiety. Medications prescribed before surgery were recorded, including anticholinergic drugs, non-steroidal anti-inflammatory drugs (NSAIDs), benzodiazepines, opioids, and antipsychotic drugs. The most recent laboratory test results before surgery were collected: complete blood cell count (CBC), arterial blood gas (ABG), clotting factors, and comprehensive metabolic panel (CMP). The following intraoperative data were recorded: American Society of Anesthesiologists (ASA) physical status classification, type of hip fracture, type of surgery and anesthesia, duration of surgery and anesthesia, urine output, blood loss, fluid management (crystalloid and colloid), blood transfusion, use of glucocorticoids (dexamethasone and methylprednisolone), dexmedetomidine, droperidol, and vasoactive drugs, preoperative hospital stay, duration of systolic blood pressure (SBP) ≥140 mmHg, and duration of mean arterial pressure (MAP) ≤60 mmHg.

Definitions of POD
The incidence of POD within 7 consecutive days after surgery was recorded. First, patients whose postoperative medical records contained characteristic words of delirium were identified by computer search. All the characteristic words of delirium were chosen according to the Confusion Assessment Method (CAM) scale [18,19]. Second, patients who received drugs for delirium postoperatively were also added. Third, patients whose preoperative medical records contained the characteristic words of delirium and drugs for delirium were excluded. Finally, all patients preliminarily identified by computer were rechecked by neurologists using the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition (DSM-IV) criteria [20].

Model building strategy
A predictive model was developed using logistic regression. The patients were randomly divided into training and validation datasets at a ratio of 3:1. The variables in the model were selected using forward and backward stepwise methods. The nomogram of the prediction model was then established. Patients from the validation dataset were used to evaluate the prediction model. The area under the receiver operating characteristic curve (AUC) was calculated to assess the discrimination ability of the prediction model. Hosmer-Lemeshow goodness-of-fit testing was used to assess the model's calibration. A decision curve analysis (DCA) was used to show the net benefit at each threshold probability [21].
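The AUC used for discrimination can be read as the probability that a randomly chosen POD patient receives a higher predicted risk than a randomly chosen non-POD patient. The following is a minimal sketch of that definition in plain Python; the function name `roc_auc` and the toy labels and scores are illustrative assumptions, not the study's code or data.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    A tie between a positive and a negative score counts as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: 1 = POD, 0 = no POD, with made-up predicted risks.
auc_labels = [0, 0, 1, 0, 1, 0]
auc_scores = [0.05, 0.20, 0.35, 0.40, 0.80, 0.10]
toy_auc = roc_auc(auc_labels, auc_scores)
print(toy_auc)  # 7 of the 8 positive-negative pairs are ranked correctly: 0.875
```

The pairwise loop is quadratic in the number of patients, which is acceptable for a cohort of a few hundred; rank-based formulas give the same value more efficiently.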
We developed five machine-learning models with different algorithms: random forest (RF), support vector machine (SVM), adaptive boosting with classification trees (AdaBoost), extreme gradient boosting with classification trees (XGBoost), and gradient boosting machine (GBM). Five-fold cross-validation (k = 5) was used for training because it is simple to understand and generally gives a less biased or less optimistic estimate of model skill than other methods [22]. Because the dataset was imbalanced (many negative and very few positive patients), an over-sampling method was used to improve the performance of the machine-learning models. We used an improved over-sampling algorithm named borderline SMOTE in constructing our machine-learning models. The algorithm uses only minority class samples on the border to synthesize new samples, thereby improving the class distribution of the samples. Model performance was best after borderline SMOTE was applied.
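The idea behind borderline SMOTE can be sketched in a few lines of plain Python. The text does not state which implementation was used, so everything below (function names, the choice of k = 3, the toy coordinates) is an assumption for illustration. A minority sample is treated as a border ("danger") point when at least half, but not all, of its k nearest neighbours belong to the majority class; new samples are then interpolated between a danger point and one of its nearest minority neighbours.

```python
import math
import random


def nearest(i, points, k):
    """Indices of the k points closest to points[i] (Euclidean), excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]


def danger_points(X, y, k):
    """Minority samples (label 1) whose k nearest neighbours are mostly, but not all, majority."""
    danger = []
    for i, label in enumerate(y):
        if label != 1:
            continue
        n_majority = sum(1 for j in nearest(i, X, k) if y[j] == 0)
        if k / 2 <= n_majority < k:
            danger.append(i)
    return danger


def borderline_smote(X, y, k=3, n_new=4, seed=0):
    """Create n_new synthetic minority samples by interpolating from danger points only."""
    rng = random.Random(seed)
    minority = [X[i] for i, label in enumerate(y) if label == 1]
    danger = danger_points(X, y, k)
    synthetic = []
    if not danger or len(minority) < 2:
        return synthetic
    for _ in range(n_new):
        base = X[rng.choice(danger)]
        pos = minority.index(base)
        neighbours = nearest(pos, minority, min(k, len(minority) - 1))
        target = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment from base to target
        synthetic.append([b + gap * (t - b) for b, t in zip(base, target)])
    return synthetic


# Hypothetical two-feature toy data: five majority cases (0) and four minority cases (1).
X_toy = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.3], [0.5, 0.4],
         [0.6, 0.6], [1.0, 1.0], [1.1, 0.9], [0.9, 1.2]]
y_toy = [0, 0, 0, 0, 0, 1, 1, 1, 1]
new_samples = borderline_smote(X_toy, y_toy)
print(danger_points(X_toy, y_toy, 3), len(new_samples))
```

In this toy set only the minority case at (0.6, 0.6) sits next to the majority cluster, so all synthetic samples start from it. When over-sampling is combined with cross-validation, it is normally applied to the training folds only so that synthetic cases do not leak into the evaluation fold; the text does not say how this was handled in the study.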
Model interpretability was assessed using SHapley Additive exPlanations (SHAP). Feature importance for individual patients was shown in SHAP figures.
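SHAP is built on Shapley values, which attribute the difference between one patient's prediction and a reference prediction to the individual features. The sketch below computes exact Shapley values by enumerating every feature subset for a hypothetical three-feature risk function. It is not the SHAP library and not the study's RF model: `toy_risk`, the coefficients, and the patient and reference values are all made up for illustration, and absent features are simply set to their reference value.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values for one case by enumerating all feature subsets.

    Features outside a subset are replaced by their baseline value.
    """
    n = len(x)

    def value(subset):
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(mixed)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi += weight * (value(s | {i}) - value(s))
        phis.append(phi)
    return phis


def toy_risk(features):
    """Hypothetical risk score with a BNP-CRP interaction; not the study's model."""
    age, bnp, crp = features
    return 0.02 * age + 0.001 * bnp + 0.01 * crp + 0.0001 * bnp * crp


patient = [85, 400, 30]    # made-up age, BNP, CRP
reference = [79, 100, 10]  # made-up reference patient
phi = shapley_values(toy_risk, patient, reference)
print(phi)
```

The contributions always sum to the difference between the patient's score and the reference score, which is the property that lets per-patient plots rank features by their share of the prediction. Exhaustive enumeration grows exponentially with the number of features, so practical tools rely on model-specific or sampling-based approximations.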

Statistical analysis
In this study, Student's t-tests were used to compare normally distributed continuous variables, expressed as mean (standard deviation). The Mann-Whitney U test was used to compare non-normally distributed continuous variables, expressed as median and interquartile range. The χ² test or Fisher's exact test was used to compare categorical variables, expressed as frequency or percentage. The significance level was set at 0.05, and all tests were two-tailed. The logistic regression model was developed with R 4.0.1 (R Foundation for Statistical Computing, Vienna, Austria). Machine-learning models were constructed with PyCharm 11.0.14.1 (JetBrains s.r.o., Prague, Czech Republic).

RESULTS
Baseline characteristics of patients
From January 2014 to August 2019, 812 elderly patients (≥65 years old) underwent surgery for hip fractures at the First Medical Center of Chinese PLA General Hospital. We excluded 14 patients whose hip fractures were caused by tumors and one patient who underwent surgery for a hip fracture for the second time. Finally, 797 patients were enrolled in the final analysis. The incidence of delirium was 9.28% (74/797). Males comprised 23.7% of the enrolled patients (189/797). The POD patients were older than the non-POD patients (83 vs. 79, P < 0.001).
Tables 1 and 2 show the characteristics and perioperative variables of the 797 patients. The median age of POD patients was significantly higher than that of non-POD patients [83 (76.25, 87) vs. 79 (73, 84)]. The incidence of depression/anxiety, renal insufficiency, and COPD was higher in POD patients than in non-POD patients. The use of benzodiazepines and antipsychotics was more common in POD patients than in non-POD patients (32.4% vs. 20.1% and 17.6% vs. 2.1%, respectively). The median duration of surgery was 100 (80, 120) min. Compared with non-POD patients, the POD patients had higher Troponin T, Myoglobin, Brain Natriuretic Peptide (BNP), and Creatine Kinase-MB (CK-MB) (P ≤ 0.001).

Development of a nomogram with logistic regression
The 557 patients in the training dataset were used to develop the logistic regression model. In the Supplementary File, Table S1 shows the univariate logistic regression analysis results. Variables that were statistically significant in the univariate analysis were included in the multivariate logistic regression analysis. Among elderly patients with hip fractures, age, renal insufficiency, antipsychotics, COPD, LDH, and CRP were independent risk factors for POD (shown in Table 3). Collinearity diagnostics were performed to assess multicollinearity among the risk factors. The variance inflation factors of the independent risk factors were all <2. Neutrophils, lymphocytes, inorganic phosphorus, myoglobin, lipase, direct bilirubin, AST, SpO2, PT, PTA, INR, and use of intraoperative vasoactive drugs were statistically significant in the univariate model but not in the multivariate model.
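For reference, the variance inflation factor of a predictor is conventionally computed from the R² obtained when that predictor is regressed on all the other predictors. The notation below is the standard textbook form and is not taken from the study:

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
\qquad
\mathrm{VIF}_j < 2 \iff R_j^2 < 0.5
```

Under this definition, a variance inflation factor below 2 means that less than half of the variance of each risk factor is explained by the other five.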
The prediction model was evaluated on the 240 patients in the validation dataset. The AUCs in the training dataset and the validation dataset were 0.77 (0.696-0.845) and 0.71 (0.593-0.827), respectively (Fig. 1A). The accuracy, recall, and precision of the logistic regression model were 68.8%, 65.2%, and 18.3% (Table 4). The nomogram of the prediction model was developed with the six variables and their points (Fig. 1B). The calibration plot showed good agreement between the actual and predicted probabilities (Hosmer-Lemeshow test, P = 0.749) (Fig. 1C). According to the DCA of the training dataset, except for a small range of low threshold probabilities, intervening on the basis of the prediction model produced excellent outcomes (Fig. 1D).
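The net benefit plotted in a decision curve weighs true positives against false positives, with the exchange rate set by the threshold probability: net benefit = TP/n − (FP/n) × pt/(1 − pt). The sketch below evaluates this for a model and for the "treat everyone" reference strategy. The function names and the ten toy patients are illustrative assumptions, not the study's data.

```python
def net_benefit(labels, risks, threshold):
    """Net benefit of intervening on everyone whose predicted risk is >= threshold."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must be strictly between 0 and 1")
    n = len(labels)
    tp = sum(1 for y, r in zip(labels, risks) if r >= threshold and y == 1)
    fp = sum(1 for y, r in zip(labels, risks) if r >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(labels, threshold):
    """Net benefit of intervening on every patient regardless of predicted risk."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Hypothetical toy cohort: 1 = POD, 0 = no POD, with made-up predicted risks.
dca_labels = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
dca_risks = [0.60, 0.10, 0.30, 0.40, 0.05, 0.15, 0.15, 0.02, 0.08, 0.25]
nb_model = net_benefit(dca_labels, dca_risks, 0.2)
nb_all = net_benefit_treat_all(dca_labels, 0.2)
print(nb_model, nb_all)  # about 0.15 for the model and 0.125 for treat-all
```

Repeating the calculation over a grid of thresholds and plotting both curves, together with the zero line for "treat no one", gives the decision curve.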

Development of prediction models with machine-learning algorithms
All variables were preprocessed before the machine-learning models were constructed. The top variables by normalized importance were BNP, troponin T, CRP, and CK-MB. Table S2 and Fig. S1 of the Supplementary File show the quantified importance of the variables. The correlations among the variables were also calculated and are displayed in Fig. S2 (Supplementary File).
The AUCs of the models built with different machine-learning algorithms are shown in Fig. 2. The RF model performed best of the five models, with an AUC of 0.81. The accuracy, sensitivity, precision, and F1 of the models were calculated from a confusion matrix (Table 4). The accuracy of the five models ranged from 68.8% to 91.9%. RF achieved the highest sensitivity (95.9%). The precision of SVM was the highest (67.8%).
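The four summary measures follow directly from the cells of a confusion matrix. As a worked illustration, the sketch below uses one set of counts for a 240-patient validation set that reproduces the percentages reported for the logistic regression model (68.8%, 65.2%, 18.3%). These counts are an assumption chosen to match those percentages; the confusion matrix itself is not given in the text.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, sensitivity (recall), precision, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    if precision + sensitivity:
        f1 = 2 * precision * sensitivity / (precision + sensitivity)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "precision": precision,
        "f1": f1,
    }


# Assumed counts (not reported in the text): 23 POD and 217 non-POD patients.
metrics = classification_metrics(tp=15, fp=67, fn=8, tn=150)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With so few positive cases, a model can reach moderate sensitivity while most of its positive calls are false alarms, which is why precision and F1 are worth reading alongside accuracy and AUC.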
Model interpretation at the individual level was performed using the SHAP algorithm. We entered the information of four different patients into the model, and the RF model provided a ranking of the importance of variables for each patient (Fig. 3A-D). The contributions of the predictors, expressed as SHAP values, differed among individuals. BNP level was the top variable in three of the four patients. The result was similar to the importance plots of all the models. Although causality could not be established based on the current study design, it is conceivable that individualized modification of these factors (lowering BNP and lowering amylase) may help to reduce the risk of POD.

NSAIDs non-steroidal anti-inflammatory drugs, RBC red blood cell, WBC white blood cell, BUN blood urea nitrogen, Scr serum creatinine, BNP brain natriuretic peptide, ALT alanine aminotransferase, AST aspartate aminotransferase, LDH lactate dehydrogenase, CK creatine kinase, CK-MB creatine kinase-MB, GGT γ-glutamyl transferase, ALP alkaline phosphatase, CRP C-reactive protein, PaO2 oxygen partial pressure, PaCO2 partial pressure of carbon dioxide, SpO2 pulse oxygen saturation, BE base excess, TT thrombin time, APTT activated partial thromboplastin time, PT prothrombin time, PTA plasma prothrombin activity, INR international normalized ratio, FIB plasma fibrinogen.

DISCUSSION
Hip fractures have a devastating effect on quality of life and function, with a high risk of death within one year. Timely surgery is the primary treatment for elderly patients after a hip fracture [1]. However, the incidence of delirium in patients after hip arthroplasty surgeries can range from 4% to 53% [23]. Screening for high-risk patients using preoperative and intraoperative factors is a crucial first step toward effective management. We therefore developed one logistic regression model and five machine-learning models for POD prediction in our retrospective cohort study. The AUCs of the logistic regression model were 0.77 in the training dataset and 0.71 in the validation dataset. These results were almost identical to those of the risk score for POD prediction developed by Kim E.M. [13]. That risk score, developed for predicting postoperative delirium in patients undergoing hip arthroplasty surgery, includes nine variables, whereas our logistic regression model included only six parameters and achieved an AUC of 0.77 in the training dataset. Similar studies using logistic regression have reported AUC values ranging from 0.67 to 0.79 [11,14,24].
With the growing application of machine-learning algorithms in medicine, some researchers have tried to develop POD prediction models for hip fracture patients with these algorithms. Oosterhoff et al. developed five POD prediction models using different machine-learning algorithms for hip fracture patients, with the neural network and elastic-net penalized logistic regression models performing best, achieving an AUC of 0.79 [17]. Zhao H. et al. also used four machine-learning algorithms to construct POD prediction models for hip fracture in a cohort of 245 patients, with an AUC of 0.779 [16]. In our study, we developed five different machine-learning models for predicting POD in hip fracture patients. Among these models, the random forest model achieved the best performance, with an AUC of 0.81. Interestingly, the random forest model also performed best in our previous study on POD prediction [10]. Shen J. et al. developed a risk score for predicting POD in hip fracture patients using nine variables and achieved an AUC of 0.833 [25]. Yang Y. et al. constructed a nomogram for POD prediction using only three variables and achieved an AUC of 0.84. Notably, these studies achieved high AUCs by including patients who had delirium before surgery. Preoperative delirium has been identified as an independent risk factor for POD in previous studies [26]. However, our study excluded patients with delirium before surgery, as they had received effective delirium management preoperatively. Our prediction model aims to help clinicians identify patients at high risk of POD who may not have been recognized before surgery.
Our machine-learning models identified BNP, Troponin T, CRP, CK-MB, and other laboratory markers as the most important predictors of POD in hip fracture patients in the whole dataset. Intervening on these biomarkers may help reduce the incidence of POD in high-risk patients. In contrast, other machine-learning studies have identified well-known risk factors such as a history of stroke, preoperative delirium, preoperative dementia, preoperative mobility aid, and advanced age (older than 90) as important predictors of POD [16,17]. These factors have been widely studied and cannot be modified [1,2,23,26]. Therefore, our findings may have more practical implications for preventing POD in hip fracture patients because they point to modifiable biomarkers that can be targeted to reduce the risk of POD. In addition, we introduced SHAP to increase the interpretability of the model. SHAP provides feature rankings for individual cases. It may help clinicians target specific interventions for patients at high risk of delirium, rather than employing a comprehensive approach for all patients. This individualized approach allows for a more efficient allocation of medical resources, as interventions can be tailored to address the specific contributing factors for each patient.
Despite its strengths, several limitations of our study should be acknowledged. First, it is a retrospective study. We identified POD according to the DSM-IV criteria by retrieving medical and nursing records [20]. Because identification of POD based on the confusion assessment method (CAM) or 3D-CAM was not available in a retrospective study, this method may miss some patients with hypoactive POD. Nevertheless, patients with mixed and hyperactive POD always need urgent intervention because of their poor prognosis [27]. The incidence of POD was 9.28%, which is relatively low because we identified only new-onset delirium after surgery. Second, it is a single-center study, and only internal validation was performed. Therefore, the extensive application of the model results may be limited. Third, although the AUC of our best machine-learning model (0.81) is acceptable compared with other machine-learning studies [16,17], the performance of such machine-learning models can still be improved by exploring new algorithms.
In conclusion, we constructed six POD prediction models for patients with hip fractures using logistic regression, RF, AdaBoost, XGBoost, GBM, and SVM. RF, one of the five machine-learning models, achieved the best AUC (0.81). By providing convenient POD risk stratification, the application of machine-learning models can improve outcomes for elderly patients with hip fractures.

Fig. 1 ROC curve, nomogram, calibration curve, and DCA curve of the logistic regression model. A ROC curves of the logistic regression model in the training dataset and validation dataset. B The nomogram of the logistic regression model. This nomogram was developed with six perioperative predictors. Find each predictor's points on the uppermost point scale and add them up. The total points projected onto the bottom scale indicate the % probability of POD. C The calibration curve of the logistic regression model. D The DCA of the logistic regression model for the training dataset. DCA decision curve analysis.

Fig. 2 The ROCs and AUCs of POD prediction models using the various machine-learning algorithms. ROC receiver operating characteristic curve, AUC area under the ROC curve, RF random forest, GBM gradient boosting machine, XGB XGBoost, SVM support vector machine, ADA AdaBoost.

Fig. 3 The SHAP values of the top 9 variables for four patients.A Patient 1. B Patient 2. C Patient 3. D Patient 4. SHAP Shapley additive explanations.

Table 1. Patient characteristics and baseline variables. Data are mean (standard deviation), n (%), or median (interquartile range). COPD chronic obstructive pulmonary disease, ASA American Society of Anesthesiologists physical status classification system, SBP systolic blood pressure, MAP mean arterial pressure.

Table 3. Multivariable logistic regression model of study variables vs. POD in the training dataset.

Table 4. Comparison of the parameters of models for prediction of POD. AUC area under the ROC curve, RF random forest, GBM gradient boosting machine, AdaBoost adaptive boosting, XGBoost eXtreme gradient boosting, SVM support vector machine, LR logistic regression.