Development of a deep learning model that predicts critical events of pediatric patients admitted to general wards

Early detection of deteriorating patients is important to prevent life-threatening events and improve clinical outcomes. Efforts have been made to detect or prevent major events such as cardiopulmonary resuscitation, but previously developed tools are often complicated and time-consuming, rendering them impractical. To overcome this problem, we designed this study to create a deep learning prediction model that predicts critical events with simplified variables. This retrospective observational study included patients under the age of 18 who were admitted to the general ward of a tertiary children’s hospital between 2020 and 2022. A critical event was defined as cardiopulmonary resuscitation, unplanned transfer to the intensive care unit, or mortality. The vital signs measured during hospitalization, their measurement intervals, sex, and age were used to train a critical event prediction model. Age-specific z-scores were used to normalize the variability of the normal range by age. The entire dataset was classified into a training dataset and a test dataset at an 8:2 ratio, and model learning and testing were performed on each dataset. The predictive performance of the developed model showed excellent results, with an area under the receiver operating characteristics curve of 0.986 and an area under the precision-recall curve of 0.896. We developed a deep learning model with outstanding predictive power using simplified variables to effectively predict critical events while reducing the workload of medical staff. Nevertheless, because this was a single-center trial, no external validation was carried out, prompting further investigation.


Data preprocessing
The pseudonymized identification code and hospitalization date were combined to create a unique classification code for each individual hospitalization, which was defined as the individual hospitalization identification code (IHID). The collected data were grouped by IHID and sorted in ascending order of vital sign measurement time, and missing values among SBP, DBP, HR, RR, BT, and SpO2 were replaced with the immediately preceding values. In addition, the interval between vital sign measurements was calculated within the same IHID (each vital sign measurement time minus the previous measurement time, in minutes), and this was defined as the measurement interval. Since the normal ranges of BP, HR, and RR in children differ according to age, z-scores for each age were calculated and used for analysis. Centile charts of vital signs for each age developed in a previous study were used for z-score conversion [17].
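As an illustration only, the grouping, forward-filling, and interval steps described above could be sketched in Python as follows. The study itself performed preprocessing in R, so this is not the authors' code; the field names (`patient_code`, `admission_date`, `measured_at`, and the vital sign keys), the IHID string format, and the demo values are hypothetical. The age-specific z-score conversion is not shown because it depends on the centile charts of the cited study.

```python
from datetime import datetime
from itertools import groupby

VITALS = ("sbp", "dbp", "hr", "rr", "bt", "spo2")  # hypothetical field names


def preprocess(records):
    """Return records sorted by IHID and time, forward-filled, with intervals."""
    keyed = []
    for rec in records:
        row = dict(rec)  # copy so the caller's data is not modified
        # IHID: pseudonymized code combined with the hospitalization date
        # (the exact string format is an assumption).
        row["ihid"] = f"{rec['patient_code']}_{rec['admission_date']}"
        keyed.append(row)
    keyed.sort(key=lambda r: (r["ihid"], r["measured_at"]))

    out = []
    for _, stay in groupby(keyed, key=lambda r: r["ihid"]):
        last_seen = {}    # most recent non-missing value of each vital sign
        prev_time = None  # time of the previous measurement in this stay
        for row in stay:
            for name in VITALS:
                if row.get(name) is None:
                    # Replace a missing value with the immediately preceding one;
                    # it stays None when nothing precedes it.
                    row[name] = last_seen.get(name)
                else:
                    last_seen[name] = row[name]
            # Interval in minutes; undefined (None) for the first record of a stay.
            row["interval_min"] = (
                None if prev_time is None
                else (row["measured_at"] - prev_time).total_seconds() / 60
            )
            prev_time = row["measured_at"]
            out.append(row)
    return out


# Made-up example: two records of one hospitalization, given out of order.
demo = [
    {"patient_code": "P1", "admission_date": "2021-03-01",
     "measured_at": datetime(2021, 3, 1, 12, 0), "hr": None, "spo2": 97},
    {"patient_code": "P1", "admission_date": "2021-03-01",
     "measured_at": datetime(2021, 3, 1, 8, 0), "hr": 110, "spo2": 98},
]
cleaned = preprocess(demo)
```

In this sketch the first record of each hospitalization keeps an undefined interval, which would then be imputed in a later step.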
Critical events were defined as CPR performed in the general ward, unexpected transfer to the ICU, or mortality (following CPR or discontinuation of life-sustaining treatment) [18,19]. Critical records were defined as the data measured from 6 h before a critical event until its occurrence in the case of unexpected ICU transfer or mortality, and from 6 h before until 30 min after the occurrence in the case of CPR (from 6 h before CPR until death in the case of mortality after CPR). To perform deep learning on critical records, the total records were divided into two groups: a critical group and a non-critical group. Because the records of individuals who experienced a critical event contain a mixture of critical and non-critical records, the non-critical records of IHIDs with critical events were excluded from the non-critical group. In addition, because the dataset was expected to be imbalanced, with the non-critical group substantially larger than the critical group, only the last record for each IHID in the non-critical group was used for deep learning. Each IHID typically has numerous vital sign records measured during hospitalization rather than just one. Therefore, we anticipated that retaining only the last record per IHID in the non-critical group, while using all records in the critical group, would partly alleviate the imbalance between the two groups. R version 4.3.1 (R Foundation for Statistical Computing, Vienna, Austria; https://www.r-project.org) was used for data preprocessing, and open packages such as generalized additive models for location, scale and shape and sitar were used in this process [20-22].
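The labeling and sampling rules above can be made concrete with a small Python sketch. It is a simplified illustration under stated assumptions, not the authors' R code: each hospitalization is assumed to have at most one event, the event type names (`cpr`, `icu`, `death`) and record fields are hypothetical, and the special case of mortality after CPR (window extended until death) is omitted for brevity.

```python
from datetime import datetime, timedelta

WINDOW_BEFORE = timedelta(hours=6)  # critical window opens 6 h before the event
CPR_AFTER = timedelta(minutes=30)   # for CPR the window extends 30 min after


def build_training_rows(records, events):
    """Label critical records and keep one record per non-critical stay.

    records: list of dicts with 'ihid' and 'measured_at' (any order).
    events:  dict mapping ihid -> (event_type, event_time); hypothetical layout.
    """
    critical, last_noncritical = [], {}
    for rec in records:
        ihid, t = rec["ihid"], rec["measured_at"]
        if ihid in events:
            kind, when = events[ihid]
            end = when + CPR_AFTER if kind == "cpr" else when
            if when - WINDOW_BEFORE <= t <= end:
                critical.append({**rec, "label": 1})
            # Records outside the window are dropped rather than being
            # counted as non-critical.
        else:
            prev = last_noncritical.get(ihid)
            if prev is None or t > prev["measured_at"]:
                last_noncritical[ihid] = {**rec, "label": 0}  # keep the latest
    return critical + list(last_noncritical.values())


# Made-up example: stay A has CPR 10 h after t0, stay B has no event.
t0 = datetime(2021, 5, 1, 0, 0)
events = {"A": ("cpr", t0 + timedelta(hours=10))}
records = [
    {"ihid": "A", "measured_at": t0},                                    # 10 h before: dropped
    {"ihid": "A", "measured_at": t0 + timedelta(hours=5)},               # 5 h before: critical
    {"ihid": "A", "measured_at": t0 + timedelta(hours=10, minutes=20)},  # 20 min after: critical
    {"ihid": "A", "measured_at": t0 + timedelta(hours=11)},              # 1 h after: dropped
    {"ihid": "B", "measured_at": t0},
    {"ihid": "B", "measured_at": t0 + timedelta(hours=4)},               # last record of B: kept
]
rows = build_training_rows(records, events)
```

Keeping every critical record but only one record per non-critical hospitalization is what shifts the class ratio toward the critical group.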

Deep learning and data analysis
The preprocessed dataset was divided into a training set and a test set at a ratio of 8:2, which were used for model training and testing, respectively. A simple artificial neural network (ANN) algorithm based on the multilayer perceptron was used for deep learning. The nine parameters used for learning were age, sex, z-score of SBP, z-score of DBP, z-score of HR, z-score of RR, BT, SpO2, and the measurement interval. These features were normalized to values between 0 and 1. The ANN model was composed of 3 hidden layers (with 128, 128, and 64 nodes, respectively), and a 30% dropout was applied after each layer. The Adam optimizer and the rectified linear unit activation function were used in the process [23]. The model was trained for 10,000 epochs with a learning rate of 0.0001 using Python version 3.8.10 (Python Software Foundation, Beaverton, OR, USA; https://www.python.org). The Scikit-learn library was used for normalization [24], PyTorch was used for model training and testing [25], and matplotlib and the Shapley additive explanation (SHAP) library were used for visualization [26]. Since the measurement interval of the first record for each IHID cannot be calculated (missing value), the average value of all measurement intervals was imputed. Continuous variables were described as median (interquartile range) and categorical variables as number (%).
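For readers who want to see the architecture spelled out, a minimal PyTorch sketch consistent with the description above might look like the following. Several details are assumptions because the text does not state them: the single-logit output layer, the binary cross-entropy loss, full-batch updates, and the random seed. The input here is synthetic random data already scaled to the 0-1 range, and the loop runs for only a few steps rather than 10,000 epochs.

```python
import torch
from torch import nn


def build_model(n_features: int = 9) -> nn.Sequential:
    """MLP with hidden layers of 128, 128, and 64 nodes, ReLU, and 30% dropout."""
    return nn.Sequential(
        nn.Linear(n_features, 128), nn.ReLU(), nn.Dropout(0.3),
        nn.Linear(128, 128), nn.ReLU(), nn.Dropout(0.3),
        nn.Linear(128, 64), nn.ReLU(), nn.Dropout(0.3),
        nn.Linear(64, 1),  # assumption: one logit for critical vs non-critical
    )


torch.manual_seed(0)  # arbitrary seed, for repeatability of the sketch only
model = build_model()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # learning rate 0.0001
loss_fn = nn.BCEWithLogitsLoss()  # assumption: the loss is not named in the text

# Synthetic stand-in data: 32 records, 9 features already in [0, 1).
x = torch.rand(32, 9)
y = torch.randint(0, 2, (32, 1)).float()

model.train()  # dropout active during training
for _ in range(5):  # a few illustrative steps only
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

model.eval()  # dropout disabled for inference
with torch.no_grad():
    prob = torch.sigmoid(model(x))  # predicted probability of a critical record
```

With nine inputs this network has 26,113 trainable parameters, which is small enough to train on a CPU.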

Outcomes
The primary outcome of this study was the overall predictive performance of the developed model. Accuracy, AUROC, and area under the precision-recall curve (AUPRC) were used to evaluate the predictive performance of the model. The secondary outcomes included the performance of the developed model for each type of critical event: CPR occurrence, unexpected ICU transfer, and mortality. Additionally, based on the time elapsed before a critical event occurred, measurements were divided into six subgroups: 0-1 h, 1-2 h, 2-3 h, 3-4 h, 4-5 h, and 5-6 h. The predictive performance of the model was evaluated for each subgroup. The secondary outcomes also included an assessment of the importance of each feature used in learning to the prediction process and of the correlation between features.
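To make the two curve-based metrics concrete, the following self-contained Python sketch computes AUROC as the probability that a randomly chosen critical record scores higher than a randomly chosen non-critical one, and approximates AUPRC by average precision. This is an illustration on made-up scores; the text does not specify how the study computed AUPRC, so the step-wise average precision used here is an assumption, and this simple version does not treat tied scores specially.

```python
def auroc(labels, scores):
    """Chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    true_pos, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            true_pos += 1
            total += true_pos / rank  # precision at each recovered positive
    return total / n_pos


# Made-up example: 1 = critical record, 0 = non-critical record.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
roc = auroc(labels, scores)             # 3 of 4 positive-negative pairs ordered correctly
ap = average_precision(labels, scores)  # (1/1 + 2/3) / 2
```

Because AUPRC depends on the proportion of positive records while AUROC does not, the two metrics can diverge on imbalanced data.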

Baseline characteristics
During the study period, 13,787 patients were hospitalized a total of 22,184 times, and 1,039,070 vital sign records were analyzed. When analyzed by IHID, the age at admission was 69.0 (23.0-135.0) months, and 9,485 (42.8%) were girls. The duration of hospitalization was 3.0 (2.0-7.0) days.
Of the total records of vital signs, 632 (0.1%) were critical records, and the median measurement interval was 161.0 min. Detailed descriptions of SBP, DBP, HR, RR, BT, and SpO2 are summarized in Table 1. There were 14,227 records remaining after data preprocessing; the age was 74.0 (22.0-139.0) months, and 6,041 (42.5%) were girls. The critical group comprised 632 (4.4%) of these records; among the critical records, 261 involved CPR, 238 involved unplanned ICU transfers, and 141 involved fatalities. There were 8 records of patients who died as a result of CPR. Additional information is provided in Table 2. The mean value used to impute the missing first measurement interval for each IHID was 276.17 min.
Among the features used to predict the outcomes, the measurement interval had the highest impact, followed by SpO2 and the z-score of RR (Fig. 3). How the model prediction impact changes according to the high and low values of each feature is shown in Fig. 4. The lower the measurement interval (blue), the higher the impact on the model output, and the higher the measurement interval (red), the lower the impact. SpO2 showed the same pattern as the measurement interval. On the other hand, greater z-scores for RR and HR had a greater impact on outcomes, while lower z-scores for RR and HR had a lesser effect on outcomes (Fig. 4).
The correlation between the features was studied to further characterize the model. The SHAP value (the impact on the model output) increased with a smaller measurement interval, as in the prior results, but in this analysis the z-score of HR had no discernible impact on the value (Supplementary Fig. S4). Regardless of whether the measurement interval was high or low, SpO2 and SHAP values consistently had an inverse correlation, and this tendency was more pronounced when the measurement interval was smaller (Supplementary Fig. S5). The supplementary figures provide a summary of the inter-feature influence of parameters that are not mentioned above (z-score of RR, Supplementary Fig. S6; z-score of HR, Supplementary Fig. S7; sex, Supplementary Fig. S8; age, Supplementary Fig. S9; z-score of SBP, Supplementary Fig. S10; z-score of DBP, Supplementary Fig. S11; and body temperature, Supplementary Fig. S12).

Discussion
Through this study, we created a deep learning model that uses simplified variables, including vital signs, age, sex, and measurement interval, to predict the need for intervention in pediatric patients who are deteriorating. Our approach, in contrast to earlier studies, predicts the probability of transfer to the ICU using only a handful of variables without the need for accumulated measurements. Furthermore, the model demonstrated an AUROC of 0.986 and an AUPRC of 0.896, which were significantly better than those of earlier studies [15,16]. Numerous studies on previously developed PEWS have reported outstanding AUROC values of around 0.9, but the process of collecting and calculating the parameters for the scoring system is complex and time-consuming, which can significantly increase the workload of the medical staff. Even when the efficacy of a prediction model is high, its impracticality can become an obstacle in clinical settings. It is important to evaluate the workload of medical staff, especially in an environment with limited medical resources [27-29]. The prediction model created in this study can decrease this workload because it uses vital signs, sex, and age as parameters, which are expressed in plain values and are easy to access because they are collected in the hospital electronic medical record system. Moreover, predictions with a deep learning model can be generated automatically without manually entering values into a computer, which can eliminate the workload of the medical staff entirely. In this investigation, the measurement interval was used as a learning parameter, in contrast to the LSTM model study, which needs consecutive measurement results. Vital signs are typically not monitored as regularly in general wards as they are in ICUs, but the frequency increases if a patient's condition deteriorates. Because our prediction model was built to reflect this idea, we were able to create it without the necessity of 20 consecutive observations. As a result, predictions can be made before a series of subsequent measurements is complete.
In the detailed analysis of critical events, AUROC consistently exceeded 0.96 for all CPR occurrences, ICU transfers, and deaths, mirroring the performance in predicting overall critical events. However, AUPRC exhibited a notable decline, possibly stemming from the model's lack of specialized training for individual events. Subsequent subgroup analysis by time interval yielded unexpected results. Contrary to expectations, proximity to critical events did not necessarily enhance prediction performance. Remarkably, the model demonstrated superior results across all time periods compared to the overall critical events prediction. The black box nature of deep learning made it challenging for the authors to provide a definitive explanation for these results. Yet, upon reflection, it was noted that the model was developed without the intention of making predictions based on a series of continuous measurements; instead, it analyzed only measurements from a single timestamp. Another crucial point to consider was that the parameters used for learning did not incorporate information capable of estimating the time from measurement to event occurrence, which was deemed a significant explanatory factor.
The persisting question surrounded the superior results observed in the time-specific subgroups compared to the overall performance. It was hypothesized that as measurements corresponding to critical events were divided into subgroups, the imbalance between the non-critical group and critical subgroups increased, thus maintaining an excellent AUROC. Additionally, to explain the enhanced AUPRC, the authors considered the homogeneity of the data. The non-critical group in the study comprised the last vital sign measurements taken before discharge from patients without a critical event, making it a relatively stable and homogeneous group. Conversely, the critical group, subject to medical interventions, naturally exhibited diversity in collected measurement values. It was reasoned that the longer the collection time, the greater the diversity, and narrowing the collection time window would decrease this diversity. Therefore, as the time window for measurement value collection narrowed, the homogeneity of the collected measurement values increased. Even if measurements at 5-6 h were relatively stable compared to those at 0-1 h, the existence of characteristics clearly distinguishable from the non-critical group just before discharge could contribute to elevated AUROC and AUPRC levels. Still, it is crucial to acknowledge that this explanation is rooted in assumptions and hypotheses, lacking concrete, objective evidence. Therefore, the interpretation and judgment of these findings are ultimately left to the readers.
This study has several limitations. The first is that no external validation was done, as the study was conducted at only one center. During the early stages of development, the PEWS performed outstandingly, but validation tests conducted in diverse settings had mixed results. Although the AUROC and AUPRC of our predictive model were high, we cannot ensure that the performance can be duplicated in other hospitals or in other target populations, as in the case of PEWS. Although overfitting was minimized by applying a 30% dropout to each layer, the possibility of overfitting the dataset in this study cannot be ruled out. Therefore, it is necessary to conduct follow-up studies for external validation in collaboration with other hospitals. Another limitation is that in the first measurement for each IHID, the measurement interval is inevitably missing, and in this case, the average of all measurement intervals was substituted. Considering that the factor with the most influence on our predictive model is the measurement interval (Fig. 3), it may be difficult to guarantee predictive performance when only the first measurement is available. However, the overall measurement interval was 240 (162.0-480.0) minutes (Table 2), and the SHAP value changed rapidly when the measurement interval was low (Fig. 4). Therefore, the possibility of significantly changing the risk can be considered sufficiently low even if the average value of the interval was used for the first measurement taken in the general ward. In addition, an essential aspect to address in this study is that, even though deep learning models exhibit proficiency in predicting critical events, it is imperative to closely monitor a patient's organ function preceding major occurrences such as CPR or mortality. Despite the strong predictive capabilities of these models, the meticulous monitoring of a patient's organ function by medical staff remains indispensable for gaining insights into the patient's dynamic health status, allowing timely interventions and personalized care. We believe that the synergistic use of predictive models and continuous monitoring can ensure a comprehensive and proactive approach to patient care in critical situations.

Figure 4. SHAP values for each feature used in the model. Shows the change in the impact value for the model output depending on whether the value of each feature is high (red) or low (blue). For example, when the measurement interval is low (blue), the SHAP value is higher than when the measurement interval is high (red); thus, it can be interpreted that a short measurement interval is important in predicting the patient's deterioration. SpO2 = oxygen saturation, RR = respiratory rate, HR = heart rate, SBP = systolic blood pressure, DBP = diastolic blood pressure, SHAP = Shapley additive explanations.

Conclusion
Herein, we developed a deep learning model that predicts critical events using simplified variables. The model performed excellently and worked without consecutive serial measurements. A well-designed follow-up multicenter study is needed for external validation.

Figure 1. Receiver operating characteristic curve and precision-recall curve. AUROC = area under the receiver operating characteristic curve, AUPRC = area under the precision-recall curve.

Figure 3. Impact on the output of each variable used in the model. The higher the mean SHAP value (the longer the blue bar to the right), the greater the impact on the predictive model. SpO2 = oxygen saturation, RR = respiratory rate, HR = heart rate, SBP = systolic blood pressure, DBP = diastolic blood pressure, SHAP = Shapley additive explanations.

Table 1. Baseline characteristics of all vital sign records. Values are presented as median (interquartile range) or number (%).
*The centile chart developed in the previous paper was used to calculate the z-score by age.

Table 2. Characteristics of datasets used to develop deep learning models. Values are presented as median (interquartile range) or number (%). VS, vital sign; CPR, cardiopulmonary resuscitation; ICU, intensive care unit. *The centile chart developed in the previous paper was used to calculate the z-score by age.