Machine learning-based mortality prediction model for heat-related illness

In this study, we aimed to develop and validate a machine learning-based mortality prediction model for hospitalized heat-related illness patients. After 2393 hospitalized patients were extracted from a multicentered heat-related illness registry in Japan, the subjects were divided into a training set for development (n = 1516, data from 2014 and 2017–2019) and a test set for validation (n = 877, data from 2020). Twenty-four variables, including patient characteristics, vital signs, and laboratory test data at hospital arrival, were used as predictor features for machine learning. The outcome was death during hospital stay. In validation, the developed machine learning models (logistic regression, support vector machine, random forest, XGBoost) demonstrated favorable performance for outcome prediction, with significantly higher values of the area under the precision-recall curve (AUPR) of 0.415 [95% confidence interval (CI) 0.336–0.494], 0.395 [CI 0.318–0.472], 0.426 [CI 0.346–0.506], and 0.528 [CI 0.442–0.614], respectively, compared with 0.287 [CI 0.222–0.351] for the conventional acute physiology and chronic health evaluation (APACHE)-II score as the reference standard. The area under the receiver operating characteristic curve (AUROC) exceeded 0.92 in all models, although the differences from APACHE-II were not statistically significant. This is the first demonstration of the potential of machine learning-based mortality prediction models for heat-related illness.

www.nature.com/scientificreports/

During 2014-2018, deaths due to heat-related illness in the United States averaged 702 per year4. Against this background, medical practitioners are continuously challenged to deliver high-quality care for heat-related illness. The most important treatment for heat-related illness is rapid and effective cooling. Various cooling strategies exist, such as cold-water immersion, administration of cold fluids, application of ice packs or wet gauze sheets, fanning, and cooling suits2,5. In addition, more invasive methods, such as intravascular cooling devices or extracorporeal circulatory support systems, are selected for critical patients6,7. Occasionally, artificial ventilation, hemodialysis, or liver transplantation might be necessary for organ support8,9. However, it is difficult for clinicians to optimize therapeutic interventions according to individual patient conditions. The availability of clinical prognostic tools could help in deciding among these treatment options. Furthermore, a prognostic model could be used retrospectively to assess the quality of care for heat-related illness.
In recent years, prognostic tools using machine learning have been widely developed and applied in medicine, as they often outperform conventional prediction methods10. In contrast, a machine learning-based mortality prediction model for heat-related illness has not previously been developed. In this study, we aimed to develop and validate machine learning-based mortality prediction models for use in hospitalized patients with heat-related illness.

Methods
Data sources and ethical approval. The data for this retrospective cohort study were obtained from the "Heatstroke study" database in Japan. The heatstroke study was undertaken by the Japanese Association for Acute Medicine (JAAM) to clarify the epidemiology of heat-related illness in Japan. The data were manually recorded by a staff member or medical doctor at each participating hospital using specific record sheets. From 2014, patients with heat-related illness who were admitted to hospitals were included in the heatstroke study, except for 2015-2016, when the study was not conducted. Diagnosis of heat-related illness was based on the judgement of the clinician at each participating hospital. Thus, data from the heatstroke studies in 2014 and 2017-2020, collected from 109 to 142 participating hospitals, were extracted for our study. The heatstroke study has been described elsewhere11,12.
The heatstroke study protocol was approved by the ethics committee of Showa University Hospital. Patient information was de-identified before being provided for use in this study. The requirement for patient informed consent was waived, as this was an observational study using anonymous data. The current study was conducted in accordance with the Declaration of Helsinki.

Study population.
Overall, 2855 patients with heat-related illness were identified from the heatstroke study data for 2014 and 2017-2020. Of these, 285 patients were excluded because they were not hospitalized or no information was available regarding their hospitalization. Cases with cardiac arrest at hospital arrival and cases with incomplete survival-outcome data were also excluded. In total, 2393 patients hospitalized with heat-related illness met the inclusion criteria. Finally, the subjects were divided into two groups: a training set (n = 1516, data from 2014 and 2017-2019) and a test set (n = 877, data from 2020) (Fig. 1).
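The temporal split described above (training on 2014 and 2017-2019, testing on 2020) can be sketched as follows; the dataframe and its `year`/`death` columns are hypothetical stand-ins for the registry, not the actual data:

```python
import pandas as pd

# Hypothetical registry extract: one row per hospitalized patient, with the
# admission year and in-hospital death flag (0 = survived, 1 = died).
df = pd.DataFrame({
    "year":  [2014, 2017, 2018, 2019, 2020, 2020],
    "death": [0, 0, 1, 0, 0, 1],
})

# Temporal split: 2014 and 2017-2019 form the training set, and 2020 is held
# out as the test set, mirroring the study's split (n = 1516 vs n = 877).
train = df[df["year"] < 2020]
test = df[df["year"] == 2020]
```

Splitting by calendar year rather than at random keeps the validation temporally independent of training, which is closer to prospective deployment.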
Outcome and variable selection. In this study, the outcome was death during hospital stay. From the heatstroke study database, 24 variables with missing values below 25% of all samples were extracted as predictor features for the outcome. These variables were age, sex, location at onset (indoor or outdoor), vital signs (systolic blood pressure, diastolic blood pressure, heart rate, respiratory rate, and body temperature), total Glasgow coma scale (GCS), peripheral oxygen saturation (SpO2), and laboratory data [pH, base excess, hematocrit, platelet count, blood urea nitrogen (BUN), creatinine, total bilirubin, aspartate aminotransferase (AST), alanine aminotransferase (ALT), creatine kinase, sodium, potassium, glucose, and prothrombin time/international normalized ratio (PT-INR)] at hospital arrival. Missing data were imputed with the median of each variable.

Development of machine learning models. Four machine learning models, logistic regression, support vector machine, random forest, and XGBoost, were trained on the selected variables for mortality prediction in the training set. First, feature scaling was performed to normalize the ranges of the independent variables. During training, tenfold stratified cross-validation was used to avoid overfitting. In short, the training data were partitioned into 10 stratified subsets; 9 subsets (90% of the training data) were used to train the model, and the remaining subset (10%) was used for validation. This process was repeated 10 times, with each subset used once as the validation dataset, yielding 10 estimates of predictive accuracy that were averaged into a single estimate. Because our data were imbalanced for the outcome, we used cost-sensitive learning.
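The preprocessing and training steps described above (median imputation, feature scaling, cost-sensitive learning, and tenfold stratified cross-validation) can be sketched with scikit-learn; the synthetic data below are a stand-in for the 24 registry features, and only the logistic regression model is shown:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 24 predictor features, with a rare positive
# class mimicking the ~5% in-hospital mortality in the registry.
X, y = make_classification(n_samples=1500, n_features=24, weights=[0.95],
                           random_state=0)
# Introduce some missing values; each feature is later imputed with its
# median, as in the study.
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

# Pipeline: median imputation -> feature scaling -> cost-sensitive logistic
# regression (class_weight="balanced" up-weights the rare death class).
model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# Tenfold stratified cross-validation: 10 AUPR-style estimates, averaged
# into a single estimate of predictive accuracy.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="average_precision")
mean_aupr = scores.mean()
```

Wrapping imputation and scaling inside the pipeline ensures both are fitted only on each fold's training portion, avoiding leakage into the validation folds.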
In addition, optimization of hyperparameters (values that control the machine learning process) was performed for each model (Supplementary Table 1).
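Hyperparameter optimization of this kind can be sketched as a grid search over tenfold stratified cross-validation; the grid below is purely illustrative and is not the study's actual search space (which is given in its Supplementary Table 1):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data standing in for the training set.
X, y = make_classification(n_samples=400, n_features=24, weights=[0.9],
                           random_state=0)

# Illustrative hyperparameter grid (placeholder values only).
grid = {"n_estimators": [100, 200], "max_depth": [3, None]}

# Each candidate configuration is scored by tenfold stratified
# cross-validation; AUPR-oriented scoring suits the rare outcome.
search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid=grid,
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    scoring="average_precision",
)
search.fit(X, y)
best = search.best_params_  # configuration with the highest mean CV score
```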
To assess feature importance for model development, Gini importances, computed as the normalized total reduction of the split criterion brought by each feature, were calculated for the random forest and XGBoost models. For the logistic regression model, absolute values of the standardized beta coefficients were reported.
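Both importance measures can be computed with scikit-learn; a minimal sketch on synthetic data (the study's XGBoost model is omitted here, but its importances are obtained analogously from the fitted booster):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 24 predictor features.
X, y = make_classification(n_samples=500, n_features=24, random_state=0)

# Gini importances: normalized total impurity reduction contributed by each
# feature across all trees; by construction they sum to 1.
rf = RandomForestClassifier(random_state=0).fit(X, y)
gini = rf.feature_importances_

# Standardized betas: standardize the inputs first, so the absolute
# logistic regression coefficients are comparable across features.
lr = LogisticRegression(max_iter=1000).fit(StandardScaler().fit_transform(X), y)
std_beta = np.abs(lr.coef_.ravel())
```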
Validation of developed machine learning models. The performance of the developed machine learning models was validated using the test data; this process was independent of the algorithm training process. We compared these models with the conventional acute physiology and chronic health evaluation (APACHE)-II score as the reference standard for prediction of the outcome. The area under the receiver operating characteristic curve (AUROC), the area under the precision-recall curve (AUPR), sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and accuracy were measured as performance indicators. To examine the correlation between predicted and observed probabilities of mortality during hospital stay, we created calibration plots in the test set.
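The performance indicators listed above can all be derived from predicted probabilities and a decision threshold; a small sketch with hypothetical test-set labels and probabilities:

```python
import numpy as np
from sklearn.metrics import (average_precision_score, confusion_matrix,
                             roc_auc_score)

# Hypothetical held-out labels and predicted death probabilities.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.2, 0.05, 0.3, 0.15, 0.4, 0.9, 0.7, 0.6, 0.8])
y_pred = (y_prob >= 0.5).astype(int)  # illustrative 0.5 threshold

auroc = roc_auc_score(y_true, y_prob)            # threshold-free discrimination
aupr = average_precision_score(y_true, y_prob)   # precision-recall summary

# Threshold-dependent indicators from the confusion matrix.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)
npv = tn / (tn + fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)
```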

Libraries for data analyses and machine learning.
To present the patient data, the mean with standard deviation (SD) or the median with interquartile range (IQR) was used for numerical variables. For categorical variables, counts with percentages were reported. For comparisons between two samples, the t-test and the Mann-Whitney U test were used for means and medians, respectively. Frequencies were compared using the chi-square test. The two-sided significance level for all tests was set at 5% (p < 0.05). Patient characteristics were analyzed using the SciPy (version 1.5.
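These statistical comparisons can be reproduced with SciPy, which the study reports using; the sample values and contingency counts below are illustrative, not the registry's data:

```python
import numpy as np
from scipy import stats

# Illustrative synthetic ages for the training and test sets.
rng = np.random.default_rng(0)
train_age = rng.normal(65, 22, 200)
test_age = rng.normal(66, 22, 100)

# Means compared with the t-test, medians with the Mann-Whitney U test.
t_stat, t_p = stats.ttest_ind(train_age, test_age)
u_stat, u_p = stats.mannwhitneyu(train_age, test_age)

# Frequencies (e.g., indoor vs outdoor onset by dataset) compared with the
# chi-square test on a 2x2 contingency table of illustrative counts.
table = np.array([[700, 816], [481, 396]])
chi2, chi_p, dof, expected = stats.chi2_contingency(table)

# Two-sided 5% significance level, as in the study.
significant = chi_p < 0.05
```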

Results
Characteristics of study subjects. The baseline characteristics of the included patients are shown in Table 1. The mean age of all included patients was 65 ± 22 years, and 70.4% of the patients were men. Outdoor onset accounted for 54.9% of all patients. The mortality rate during hospital stay was only 5.2%, indicating that the analyzed dataset was highly imbalanced for the outcome. In the comparison between the training and test datasets, there were significant differences in age, location at onset, body temperature, SpO2, pH, BUN, creatinine, total bilirubin, creatine kinase, and sodium. However, most of these differences appear clinically irrelevant.

Assessment of variable importances for the model development.
Absolute values of the standardized beta coefficients for logistic regression, as well as feature importances for the random forest and XGBoost models, were assessed; the results are shown in Fig. 2. In all machine learning models assessed, the total GCS score at hospital arrival was the most essential variable for predicting mortality during hospital stay. Both AST and ALT levels in blood were ranked among the top five important features in all models. The other key variables

Performance analysis of the developed models and the reference standard in the test dataset. Figure 3 presents the receiver operating characteristic (ROC) curves and the precision-recall (PR) curves.

Probability calibration curves. Probability calibration curves of the prediction models in validation are described in Supplementary Fig. 1. None of the models was well calibrated, indicating uncertainty in the predicted probabilities. XGBoost underestimated, whereas APACHE-II, logistic regression, support vector machine, and random forest overestimated, the outcome probabilities.
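The calibration assessment reported here bins predicted probabilities and compares them with observed event fractions; a minimal sketch using scikit-learn's `calibration_curve` on synthetic data that are well calibrated by construction:

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Synthetic predicted probabilities and outcomes: each outcome is drawn with
# the predicted probability, so the model is well calibrated by construction.
rng = np.random.default_rng(0)
y_prob = rng.uniform(0, 1, 1000)
y_true = (rng.uniform(0, 1, 1000) < y_prob).astype(int)

# Bin predictions and compare the mean predicted probability per bin with the
# observed event fraction; deviations from the diagonal indicate over- or
# underestimation, as seen in the study's Supplementary Fig. 1.
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=5)
```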

Discussion
To our knowledge, the current study is the first to develop and evaluate a machine learning-based prediction model for the prognosis of heat-related illness. In summary, we selected 24 clinical predictors of mortality in heat-related illness from the Japanese heatstroke database. After training on these variables with several machine learning algorithms, logistic regression, support vector machine, random forest, and XGBoost, validation of the developed models demonstrated reliable performance with reasonably high AUROC. In the comparison of AUPR, all models showed significantly superior performance to APACHE-II as the reference standard.

Heat-related illness can be severe, as in heatstroke, and is induced by an excessively hot and humid environment2. Therefore, avoiding such an environment is clearly the best strategy for reducing poor outcomes of this disease. In fact, there is growing evidence that the environment predisposes people to heat-related illness; in addition, risk factors for heatstroke have been identified13,14. On the other hand, there are few studies on the prognosis of patients who actually develop heatstroke15,16. Owing to the lack of a specific mortality prediction tool for heat-related illness, general scoring systems for critically ill patients, such as the sequential organ failure assessment (SOFA) and APACHE-II scores, have commonly been used to estimate the severity of this disease12,17. The development of specific and reliable prognostic models for heat-related illness is anticipated so that clinicians can make informed decisions about optimal treatment. In this context, the current study shows its importance and strength.
Recent evidence has shown the effectiveness of machine learning methods in the development of predictive models in medicine18,19. Similarly, in this study we successfully developed a good prognostic model for heat-related illness using machine learning algorithms. Regarding the AUROC values, our developed models did not show statistical superiority over the conventional APACHE-II score, even though the models demonstrated AUROC values above 0.92 compared with 0.87 for the APACHE-II score. However, the current study included only 877 patients in the validation cohort; the limited sample size and lack of statistical power might explain why we were unable to detect statistical differences in AUROC. More importantly, our data were imbalanced for the outcome, with an event rate of only 5.4% in validation. For evaluating performance on an imbalanced dataset, AUPR is more appropriate than AUROC because it focuses on the detection of rare events. Thus, the significantly higher AUPR values of the developed models compared with APACHE-II support the effectiveness of machine learning for detecting rare cases of mortality in heat-related illness. However, the calibration plots showed underestimated or overestimated predictions of outcome probability, indicating that these models should be used only for classification, not for probability estimation.
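The argument for AUPR over AUROC on imbalanced data can be illustrated numerically: for uninformative scores the AUROC stays near 0.5 regardless of prevalence, while the AUPR collapses to the event rate (here ~5%, as in the validation cohort), so AUPR exposes weak rare-event detection that AUROC masks. A small sketch:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n = 20000
y = (rng.uniform(size=n) < 0.05).astype(int)  # ~5% event rate
scores = rng.uniform(size=n)                  # uninformative random scores

auroc = roc_auc_score(y, scores)              # near 0.5 regardless of imbalance
aupr = average_precision_score(y, scores)     # near the prevalence (~0.05)
```

Because the AUPR baseline tracks prevalence, an AUPR of 0.3-0.5 against a ~0.05 baseline is a far larger relative gain than the same models' AUROC margin over APACHE-II suggests.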
Our prediction model has the potential to be used in clinical practice. Given that we used only laboratory data and clinical findings at the time of hospital presentation as predictor variables, the prediction might serve clinicians as a reference tool for early treatment selection, including internal cooling and cardiopulmonary bypass for severe heat-related illness, which incur substantial medical costs. Furthermore, the model might be used retrospectively to assess the quality of care for the treatment of heat-related illness. However, the machine learning model should not be used as a definitive tool for deciding on treatment withdrawal. Notably, body temperature at hospital arrival was not ranked among the top five mortality predictors selected during machine learning development. In contrast, indicators of multiple organ dysfunction were widely chosen, namely, the Glasgow coma scale for dysfunction of the central nervous system, systolic blood pressure for circulatory dysfunction, SpO2 for respiratory dysfunction, AST and ALT for hepatic failure, PT-INR for coagulopathy, and base excess for metabolic disorders. Inclusion of multiple organ injury markers as parameters is similar to general severity scoring models such as the SOFA and APACHE-II scores20,21; however, variables specifically selected for mortality prediction in heat-related illness might improve predictive performance beyond the conventional methods. For example, the liver is a common site of tissue injury in heatstroke and is associated with poor outcomes22,23. In our machine learning models, AST and ALT levels at hospital arrival were regarded as important predictive values, whereas SOFA includes only total bilirubin as a hepatic injury indicator and APACHE-II includes no hepatic injury information; this difference may affect predictive ability. In addition, renal dysfunction is relatively common in heatstroke17,24.
Creatinine level is included in the SOFA and APACHE-II scores; however, it was not regarded as one of the important predictors of mortality in our machine learning models, suggesting that renal dysfunction complicating heat-related illness might not be a strong factor for poor outcome.
Although several variables, such as preexisting medical conditions and coagulation abnormalities, have been recognized as risk factors for the occurrence or poor outcome of heatstroke25-27, they were not used in the development of our machine learning models because of the large amount of missing data in the dataset. The performance of the models might improve if these variables become available in a future structured dataset.
Our study has several limitations. First, our prediction model cannot be generalized for application on a global scale. Heat acclimatization can occur in response to heat stress; thus, vulnerability to and severity of heat-related illness can differ depending on the climate of different countries. As we used the Japanese registry database for both training and validation of the model, external validation using databases from other countries should be performed in the future. Second, we imputed missing values with the median of each variable. This method is widely used and simple; however, it could introduce bias. Third, the evaluation measures for our prediction model showed wide confidence intervals, indicating uncertainty in the model. This can be attributed to the limited total sample size and the rarity of the outcome (death during hospital stay). However, it is difficult to accumulate data on heat-related illness owing to its seasonal and geographic characteristics. In fact, to our knowledge, there are no databases with clinical parameters, including laboratory test data, for heat-related illness larger than our heatstroke study registry. Further accumulation of data on this illness is crucial to increase the certainty of the machine learning prediction model. Fourth, we did not examine the neurologic sequelae of surviving heatstroke patients, which are an important complication of the disease28. Although information on neurological prognosis was not available for assessment, survival without sequelae should be the primary goal of treatment in real-world practice and thus might represent a more meaningful outcome for prediction. Fifth, the APACHE-II score is not specific to heat-related illness; therefore, our study does not establish the superiority of machine learning models over simple statistical models developed specifically for heat-related illness.
Finally, one might argue that machine learning models require a computing device to calculate their results, and that a separate model solely for patients with heat-related illness would not be realistic. As our selected features were mostly vital signs, laboratory data, and patient background, we suggest using the machine learning model as a plugin to the electronic health record, after further improvement of its performance and prospective studies for external validation have been completed.

Conclusions
In conclusion, a novel mortality prediction model for patients hospitalized with heat-related illness was developed using machine learning techniques. Although further improvement in performance, through an increased sample size or the inclusion of additional important variables, as well as prospective validation in a clinical setting, is needed, our study demonstrated for the first time the potential of machine learning-based prediction models for heat-related illness.