Machine learning model to predict hypotension after starting continuous renal replacement therapy

Hypotension after starting continuous renal replacement therapy (CRRT) is associated with worse outcomes compared with normotension, but it is difficult to predict because several factors have interactive and complex effects on the risk. The present study applied machine learning algorithms to develop models to predict hypotension after initiating CRRT. Among 2349 adult patients who started CRRT due to acute kidney injury, 70% and 30% were randomly assigned to the training and testing sets, respectively. Hypotension was defined as a reduction in mean arterial pressure (MAP) ≥ 20 mmHg from the initial value within 6 h. The areas under the receiver operating characteristic curves (AUROCs) of machine learning models, such as support vector machine (SVM), deep neural network (DNN), light gradient boosting machine (LGBM), and extreme gradient boosting machine (XGB), were compared with those of disease-severity scores such as the Sequential Organ Failure Assessment and Acute Physiology and Chronic Health Evaluation II. The XGB model showed the highest AUROC (0.828 [0.796–0.861]), and the DNN and LGBM models followed with AUROCs of 0.822 (0.789–0.856) and 0.813 (0.780–0.847), respectively; all machine learning AUROC values were higher than those obtained from disease-severity scores (AUROCs < 0.6). When other definitions of hypotension were used, such as a reduction in MAP ≥ 30 mmHg or a reduction occurring within 1 h, the AUROCs of machine learning models remained higher than those of disease-severity scores. Machine learning models successfully predict hypotension after starting CRRT and can serve as the basis of systems to predict hypotension before starting CRRT.


Results
Baseline characteristics. The mean age of all patients was 64 ± 15 years, and 61.4% were male. Their systolic blood pressure (SBP), diastolic blood pressure (DBP), and mean arterial pressure (MAP) values were 114 ± 28, 59 ± 16, and 77 ± 17 mmHg, respectively. The target dose of CRRT was 40.7 ± 13.1 ml/kg/hr. Information on other features is shown in Table S1. None of the features differed between the training and testing sets.
Association between hypotension and mortality. The prevalence of hypotension, defined as a reduction in MAP ≥ 20 mmHg and ≥ 30 mmHg within 6 h, was 29% (n = 673) and 14% (n = 335), respectively. When the timeframe was within 1 h, the prevalence of a reduction in MAP ≥ 20 mmHg and ≥ 30 mmHg was 10% (n = 238) and 4% (n = 97), respectively. Figure S1 shows the nonlinear relationship between the odds ratio for intensive care unit (ICU) mortality and the reduction in MAP after CRRT. Patients with a larger decrease in MAP within 6 h or 1 h showed a higher risk of ICU mortality than their counterparts.
Performance of machine learning models. When the machine learning models for a reduction in MAP ≥ 20 mmHg within 6 h were evaluated by area under the receiver operating characteristic curves (AUROCs), the extreme gradient boosting machine (XGB) model had the highest value of 0.828 (0.796–0.861), and the deep neural network (DNN) model had the second highest with an AUROC of 0.822 (0.789–0.856) (Table 1). All of the AUROC values of the machine learning models were higher than those obtained from the SOFA, APACHE II, and MOSAIC scores (Ps < 0.001). When the outcome was defined as a reduction in MAP ≥ 30 mmHg within 6 h, the best model was the XGB with an AUROC of 0.861 (0.822–0.900). The light gradient boosting machine (LGBM) model achieved the next highest AUROC value of 0.845 (0.802–0.888). For this outcome as well, the machine learning models demonstrated superior performance to the SOFA, APACHE II, and MOSAIC scores (Ps < 0.001). The plots of AUROCs support these results (Fig. 1). When other outcomes were used, such as setting the timeframe to within 1 h or a nadir MAP of 65 or 55 mmHg, the XGB model had higher AUROC values than the SOFA, APACHE II, and MOSAIC scores (Ps < 0.001) (hypotension within 1 h in Table S2; nadir MAP in Table S3).
Other performance indices such as accuracy, F1 score, recall, precision, F2 score, specificity, and Matthews correlation coefficient (MCC) for predicting a decrease in MAP within 6 h are shown in Table 2. For the outcome of a reduction in MAP ≥ 20 mmHg, the LGBM model achieved the highest accuracy. The support vector machine (SVM) model showed the highest accuracy for predicting a reduction in MAP ≥ 30 mmHg. Among machine learning models, the XGB models showed the highest F1 score and MCC in predicting a reduction in MAP ≥ 20 mmHg and ≥ 30 mmHg. All of these indices were higher in the machine learning models than in the conventional scoring models. When the outcome was defined using other criteria, the machine learning models, in particular the XGB model, also outperformed the SOFA, APACHE II, and MOSAIC scores, both when the timeframe was 1 h (Table S4) and when nadir MAP was used (Table S5). XGB models showed significantly higher AUROC values than logistic regression models for the outcomes of a reduction in MAP ≥ 30 mmHg within 1 h, nadir MAP < 65 mmHg, and nadir MAP < 55 mmHg. In addition, XGB models showed higher F1 score and MCC values than logistic regression models for all outcomes except a reduction in MAP ≥ 30 mmHg within 6 h.

Rank of features in machine learning model. To estimate the contribution of each feature to predicting the risk of hypotension, a feature ranking analysis was performed. The features contributing most to the LGBM and XGB models were laboratory findings and vital signs (Figs. 2 and 3). Among laboratory findings, pH was the most important predictor, followed by serum protein and albumin. Among vital signs, MAP was the strongest contributor. In the SVM model, blood pressure values were the most important in predicting a MAP drop within 6 h, and some medications were important in predicting a MAP drop within 1 h (Fig. S2). In the DNN model, blood pressure values were the most important to model performance, and other vital signs, pH, and some medications were also determined to be important (Fig. S3).
The change in the model performance of XGB and DNN was evaluated by adding each of the top 10 features in order of the ranking results of each model (Tables 3 and S6). In the XGB model, the AUROC values increased as features were added, whereas the accuracy, F1 score, and MCC showed an increasing trend from 5 to 10 features (Table 3). In the DNN model, performance increased across the top 10 features used in the model (Table S6). These results indicate that at least 10 features were needed to precisely predict the hypotension risk in the above machine learning models.
Calibration of models. When Brier's scores were calculated for calibration, the XGB model had the lowest value for most outcomes, and other models had relatively low values (Table S7). All machine learning models had lower Brier's scores than conventional scores such as SOFA, APACHE II, and MOSAIC. The XGB models showed the lowest Brier's score among machine learning models for all outcomes except a reduction in MAP ≥ 20 mmHg within 6 h. The XGB model had a lower Brier's score than the logistic regression model for the outcomes of a reduction in MAP ≥ 20 mmHg and ≥ 30 mmHg within 1 h, and MAP < 65 mmHg and MAP < 55 mmHg within 6 h. Table S8 shows the AUROCs of the logistic regression and XGB models using SOFA, APACHE II, and MOSAIC scores as predictors.
Nested tenfold cross-validation. The AUROC values of machine learning models with the nested tenfold cross-validation were lower than the previous results (Table S9).

Discussion
Unexpected hypotensive events after starting CRRT are a critical issue because they contribute to worse outcomes, as noted in the above association with high ICU mortality 5,6. Machine learning models such as XGB, LGBM, and DNN successfully predicted the risk of hypotension and performed better than conventional scoring models such as SOFA, APACHE II, and MOSAIC. The XGB model had the best performance among all models. AUROC was significantly higher in the XGB model than in the logistic regression model only for the outcomes of MAP < 65 mmHg and MAP < 55 mmHg within 6 h, and MAP Δ30 within 1 h. However, the XGB model had a higher F1 score and MCC than the logistic regression model for all outcomes except MAP Δ30 within 6 h. These results indicate that precise prediction of CRRT-related hypotension is achievable by machine learning algorithms, especially XGB, although complex and interactive relationships among several features exist. Based on the ranking analysis, at least 10 features were required to develop machine learning models, and the corresponding 10 features are shown in Figs. 3 and S3, including MAP, SBP, DBP, heart rate, pH, serum protein, and prothrombin time-international normalized ratio (PT INR). The value of pH is important for predicting hypotension because it is well known that metabolic acidosis frequently causes hypotension 14. Because patients with prolonged coagulation time due to sepsis or acute liver failure have a high risk of hypotension, PT INR was an important predictor 15,16.
Critically ill patients undergoing CRRT are in a complex clinical situation, which frequently makes it difficult for clinicians to anticipate outcomes. Machine learning may overcome the difficulty of considering complex and numerous clinical situations. Several studies have applied machine learning algorithms to critically ill patients and have shown superior performance compared to existing models or scoring systems in predicting outcomes 17. Our previous study also demonstrated that machine learning had better performance than conventional scoring systems, such as SOFA and APACHE II, in predicting mortality of CRRT patients 10. The present study extends the utility of machine learning to predicting hypotension as another outcome of CRRT and provides a clue for advanced management before the occurrence of hypotension.
Excessive ultrafiltration is thought to significantly affect hypotension during CRRT 13. Other conditions such as reduced cardiac preload resulting from defective vasoconstriction and redistribution of fluids resulting from sepsis or inflammation also contribute to hypotension during CRRT 18,19. Rapid clearance of plasma solutes by convection results in osmolar reduction and shifts water from the intravascular to the interstitial compartment, consequently causing decreased effective arterial blood volume and hypotension 13. Concurrent cardiac dysfunction can be aggravated by ultrafiltration or the blood flow of CRRT, resulting in hypotension 20. However, precise prediction of CRRT-related hypotension could not be obtained by this theoretical approach alone in real clinical practice. The present feature ranking analysis demonstrated that vital signs at the time of CRRT are the most important contributors to hypotension and should be assessed before starting CRRT.
Although the results are informative, there are certain limitations to be discussed. Because of the single-center design, external validation was not available. The sample size of the cohort was modest. The advantage of machine learning is its high performance, particularly with extremely large sample sizes. However, there is no specific cutoff on the sample size in machine learning algorithms, and the present sample size of 2349 with ≥ 90 features was greater than the sample size (n = 488) of the previous 258 studies which used machine learning algorithms to analyze ICU data 21. Because the study analyzed a retrospective cohort, prospective validation is needed. The study identified the most important features with respect to predicting hypotension, but certain degrees of risk, such as the relative risk, could not be obtained. This is a common limitation of machine learning algorithms. Concerns could be raised regarding other issues such as overfitting and the effects of unidentified factors such as response to time-varying vasoactive support and ultrafiltration. The present non-nested cross-validation method could have introduced overfitting.
The SOFA, APACHE II, and MOSAIC scores were developed to predict mortality rather than hypotension after CRRT, which might explain their low performance for this outcome.

Conclusions
The application of machine learning algorithms improves the predictability of hypotension after starting CRRT, and machine learning performs better than conventional scoring models used in critically ill patients. If the machine learning-based prediction models are successfully applied to clinical practice, the overall patient outcomes will improve by proactive management of hypotension. Future studies will explore whether machine learning can predict other outcomes of CRRT and will validate results in an independent cohort.

Method
Data source and study subjects. A total of 2756 adult patients (≥ 18 years old) who started CRRT due to acute kidney injury at Seoul National University Hospital from June 2010 to February 2020 were retrospectively reviewed. Patients who had underlying end-stage renal disease (n = 344), stopped CRRT within 1 h after initiation (n = 49), or had no information on comorbidities or laboratory data (n = 14) were excluded. Accordingly, 2349 patients were analyzed in the present study. The patients were randomly divided into a training set (70%) to develop the models and a testing set (30%) to test and calibrate their performance. The study was approved by the institutional review board of the Seoul National University Hospital (no. H-2003-024-1106). All methods were carried out in accordance with the guidelines, relevant regulations, and ethical principles for medical research guided by the Declaration of Helsinki. The requirement of informed consent was waived by the board.
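As an illustration of the cohort flow and the random 70/30 split described above, the following minimal Python sketch reproduces the exclusion arithmetic and partitions patient indices. The function name `split_indices`, the seed, and the choice to round the training set size down are assumptions made for this example; the source does not report how the split was implemented or the exact set sizes.

```python
import random

# Counts reported in the text.
screened = 2756
excluded = {
    "end-stage renal disease": 344,
    "CRRT stopped within 1 h": 49,
    "missing comorbidity or laboratory data": 14,
}
analyzed = screened - sum(excluded.values())  # 2349, as stated in the text


def split_indices(n, train_fraction=0.7, seed=42):
    """Shuffle indices 0..n-1 and cut them into training and testing parts.

    The seed and the floor rounding are illustrative assumptions.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(n * train_fraction)
    return idx[:cut], idx[cut:]


train_idx, test_idx = split_indices(analyzed)
print(analyzed, len(train_idx), len(test_idx))
```

A fixed seed keeps the partition reproducible, so the same testing set is held out every time the sketch is rerun.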
Study variables and outcomes. A total of 92 features obtained from the electronic medical record system were used to develop machine learning models. We used features collected before and at the time of starting CRRT during model development. The features collected within 24 h prior to starting CRRT were medications, infusion rate of fluids, and laboratory findings. Other features were measured at the time of starting CRRT. Clinical features included age, sex, weight, application of the mechanical ventilator, and comorbidities, such as diabetes mellitus, hypertension, ischemic heart disease, chronic heart failure, stroke, peripheral vascular disease, dementia, chronic kidney disease including diabetic nephropathy, chronic obstructive pulmonary disease, connective tissue disease, peptic ulcer disease, cancer, and arrhythmia including atrial fibrillation, atrioventricular block, ventricular tachycardia, tachycardia-bradycardia syndrome, and total left bundle branch block. Vital signs such as SBP, DBP, MAP, heart rate, respiratory rate, and body temperature were measured at the time of initiating CRRT. The blood pressure values were collected at intervals of 1 h or less after starting CRRT. The laboratory data included white blood cell counts, hemoglobin, hematocrit, platelet, total bilirubin, blood urea nitrogen, creatinine, total protein, albumin, pH, sodium, potassium, calcium, phosphate, uric acid, prothrombin time-international normalized ratio, activated partial thromboplastin time, partial pressures of arterial carbon dioxide and oxygen, partial pressure to fractional inspired oxygen, alveolar to arterial oxygen gradient, and the presence of bacteremia. As CRRT setting values, the target dose, blood flow rate, amount of dialysate and replacement fluids (pre- and post-dilution), target amount of input and output, the number of bicarbonate ampules mixed in dialysate and replacement fluids, and catheter type were collected.
The information on the infused medications or fluids and their infusion rates was obtained, as shown in Table S1. The number of bicarbonate ampules mixed in these fluids was calculated. The Glasgow coma scale scores were calculated. The SOFA, APACHE II, and MOSAIC scores were measured based on the methods presented in the original studies 22–24. Hypotension was defined as a reduction in MAP ≥ 20 mmHg from the initial value within 6 h. Additionally, other definitions were used, such as a reduction in MAP ≥ 30 mmHg from the initial value, setting the timeframe to within 1 h, or nadir MAP < 55 or 65 mmHg. The ICU mortality, which was defined as all-cause death during the ICU admission, was estimated.
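The outcome definitions above can be expressed compactly in code. The following is a minimal Python sketch, assuming MAP readings are available as (hours since CRRT start, MAP in mmHg) pairs; that data layout, the function names, and the inclusive handling of the window boundary are assumptions for illustration and are not taken from the source.

```python
from typing import Sequence, Tuple

Reading = Tuple[float, float]  # (hours since CRRT start, MAP in mmHg); hypothetical layout


def label_map_drop(
    baseline_map: float,
    readings: Sequence[Reading],
    drop_mmhg: float = 20.0,
    window_h: float = 6.0,
) -> bool:
    """True if MAP falls at least drop_mmhg below the initial value within window_h.

    Readings at exactly window_h are counted as inside the window (an assumption).
    """
    return any(
        0 < hours <= window_h and baseline_map - map_value >= drop_mmhg
        for hours, map_value in readings
    )


def label_nadir(
    readings: Sequence[Reading],
    threshold_mmhg: float = 65.0,
    window_h: float = 6.0,
) -> bool:
    """True if the lowest MAP inside the window is below threshold_mmhg."""
    in_window = [m for h, m in readings if 0 < h <= window_h]
    return bool(in_window) and min(in_window) < threshold_mmhg


# Toy patient: initial MAP 77 mmHg, lowest in-window MAP 55 mmHg at 3 h.
example = [(1.0, 70.0), (2.0, 62.0), (3.0, 55.0), (7.0, 40.0)]
print(label_map_drop(77.0, example))                # primary definition: >= 20 mmHg within 6 h
print(label_map_drop(77.0, example, 30.0))          # >= 30 mmHg within 6 h
print(label_map_drop(77.0, example, 20.0, 1.0))     # >= 20 mmHg within 1 h
print(label_nadir(example, 65.0), label_nadir(example, 55.0))
```

Keeping the drop size and the window as parameters lets one function cover the primary definition and the alternative definitions, while the nadir-based outcomes use an absolute threshold instead of a change from baseline.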
Statistical analysis and development of machine learning models. The machine learning models were developed using the SVM, DNN, LGBM, and XGB algorithms. We developed machine learning models using tenfold cross-validation in the training dataset, and the models were evaluated using the test dataset to identify their performance. The SVM models used four kernels: linear, polynomial, sigmoid, and radial basis functions. For each kernel, tenfold cross-validation with grid search was performed to determine the best set of hyperparameters (cost, gamma, degree, and coefficients). The kernel corresponding to the highest AUROC was used in the final model. In the DNN model (i.e., an artificial neural network with multiple layers between the input and output layers), optimal hyperparameters consisting of the size (number of hidden nodes) and decay (parameter for weight decay) were determined with tenfold cross-validation and grid search. When developing the SVM and DNN models, the continuous features were normalized, and categorical features were one-hot encoded. In the LGBM model, hyperparameters (max_bin, learning rate, and nrounds) were adjusted, and the model with the highest AUROC was selected for comparison. In the XGB model, hyperparameters (eta, gamma, max depth, and nrounds) were adjusted, and the model with the highest AUROC was selected for comparison. For comparison with machine learning models, we developed logistic regression models to predict the outcomes. Machine learning models using SOFA, APACHE II, and MOSAIC scores as predictors were also developed and evaluated. To evaluate the suitability of machine learning algorithms for our data and to compare machine learning models, nested tenfold cross-validation was additionally conducted with the total study data for predicting a reduction in MAP ≥ 20 mmHg and ≥ 30 mmHg from the initial value within 6 h, with an inner tenfold loop for hyperparameter tuning and an outer tenfold loop for validation of models.
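The nested cross-validation scheme described above (an inner loop for hyperparameter tuning by grid search and an outer loop for validation) can be sketched in a library-independent way. The Python example below is a minimal illustration under stated assumptions: `nested_cv`, `kfold_indices`, and the caller-supplied `fit` and `score` functions are hypothetical names, and the toy threshold "model" stands in for the actual SVM, DNN, LGBM, and XGB learners, whose implementation is not given in the source.

```python
import random
from statistics import mean


def kfold_indices(n, k, rng):
    """Shuffle 0..n-1 and deal the indices into k nearly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def nested_cv(X, y, grid, fit, score, k_outer=10, k_inner=10, seed=0):
    """Return one held-out score per outer fold.

    fit(X, y, params) -> model and score(model, X, y) -> float are
    caller-supplied placeholders, not the learners used in the study.
    """
    rng = random.Random(seed)
    outer_scores = []
    for test_idx in kfold_indices(len(X), k_outer, rng):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        inner_folds = kfold_indices(len(train_idx), k_inner, rng)

        def inner_score(params):
            # Mean validation score of one hyperparameter setting (inner loop).
            scores = []
            for fold in inner_folds:
                val = [train_idx[j] for j in fold]
                val_set = set(val)
                fit_idx = [i for i in train_idx if i not in val_set]
                model = fit([X[i] for i in fit_idx], [y[i] for i in fit_idx], params)
                scores.append(score(model, [X[i] for i in val], [y[i] for i in val]))
            return mean(scores)

        best = max(grid, key=inner_score)  # grid search over candidate settings
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx], best)
        outer_scores.append(
            score(model, [X[i] for i in test_idx], [y[i] for i in test_idx])
        )
    return outer_scores


# Toy demonstration: the "model" is just a cutoff and the score is accuracy.
def fit(X, y, params):
    return params["threshold"]


def score(model, X, y):
    return mean(int((x > model) == bool(label)) for x, label in zip(X, y))


rng = random.Random(1)
X = [rng.random() for _ in range(200)]
y = [int(x > 0.5) for x in X]
grid = [{"threshold": t} for t in (0.2, 0.5, 0.8)]
scores = nested_cv(X, y, grid, fit, score)
print(len(scores), mean(scores))
```

Because the hyperparameters are chosen using only the inner folds, each outer-fold score comes from data that played no part in tuning, which is why nested estimates tend to be less optimistic than non-nested ones.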
For performance indices, AUROC, F1 score, recall, precision, F2 score, specificity, and MCC were measured in the testing set. The AUROCs were compared between models using the DeLong test. The confidence intervals of AUROCs were estimated using the DeLong method 25,26. MCC is a more informative and truthful score than accuracy and F1 score for evaluating binary classification 27. MCC values of +1, 0, and −1 represent perfect prediction, average random prediction, and inverse prediction, respectively. The classification threshold was set at the value that maximized the F1 score. For calibration, Brier's scores were calculated, with those closer to 0 indicating good calibration. We ranked the importance of features in the SVM with weight vectors, the DNN with weight values, and the LGBM and XGB models with SHapley Additive exPlanations (SHAP) 28–30. The performance of machine learning models with variable numbers of features in order of ranking was also evaluated. P values less than 0.05 were considered significant.
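To make the performance indices above concrete, the following Python sketch computes AUROC (as the probability that a positive case is ranked above a negative case), the F-beta score (F1 and F2), MCC, and the Brier score from their standard definitions, and selects the threshold that maximizes F1. The function names and the toy labels and probabilities are hypothetical and serve only to illustrate the formulas; they are not data or code from the study.

```python
from math import sqrt


def confusion(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def f_beta(tp, fp, fn, beta=1.0):
    """F-beta score; beta=1 gives F1, beta=2 gives F2 (recall weighted more)."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, ranging from -1 to +1."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def brier(y_true, prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, prob)) / len(y_true)


def auroc(y_true, prob):
    """Share of positive-negative pairs ranked correctly; ties count as 0.5."""
    pos = [p for t, p in zip(y_true, prob) if t == 1]
    neg = [p for t, p in zip(y_true, prob) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def best_f1_threshold(y_true, prob):
    """Scan observed probabilities and keep the cutoff with the highest F1."""
    best_t, best_f1 = None, -1.0
    for t in sorted(set(prob)):
        tp, tn, fp, fn = confusion(y_true, [int(p >= t) for p in prob])
        f1 = f_beta(tp, fp, fn)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Hypothetical labels (1 = hypotension) and predicted probabilities.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9]

threshold, f1 = best_f1_threshold(y_true, prob)
tp, tn, fp, fn = confusion(y_true, [int(p >= threshold) for p in prob])
print("AUROC", auroc(y_true, prob))
print("threshold", threshold, "F1", round(f1, 3), "F2", round(f_beta(tp, fp, fn, 2.0), 3))
print("MCC", round(mcc(tp, tn, fp, fn), 3), "Brier", round(brier(y_true, prob), 4))
```

AUROC and the Brier score are computed from the predicted probabilities directly, whereas F1, F2, and MCC depend on the chosen threshold, which is why the threshold rule needs to be fixed before those indices are compared across models.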