Introduction

Traumatic injuries have long been scaled to assess the severity of patients and especially to predict mortality. To quantify the severity of such injuries, various metrics using anatomic and physiologic elements have been established based on conventional statistical tools such as logistic regression (LR)1,2,3,4,5,6,7. Since the 1970s, the injury severities have been classified through an anatomical-based coding system named the abbreviated injury scale (AIS). The AIS classifies individual injuries by body region from one to six. The system has been constantly revised and improved, and the latest revision (AIS 2005—Update 2008 and AIS 2015), including 1999 injury descriptors, was recently announced1.

The most popular model to assess the trauma severity is the injury severity score (ISS), which pertains to the sum of the squares of the three highest AIS scores2. Similarly, the new ISS (NISS), which is designated to replace the ISS, pertains to the squares of the three highest AIS scores regardless of the body region3,4,5. Using the ISS and NISS, mortality has been predicted based on a certain score; however, the associated mortality prediction accuracy is low for patients with multiple injuries in different areas6,7.

As a more complex combination model including the physiological status, the trauma and ISS (TRISS) has been widely used8. In addition, other combination models, such as A Severity Characterization of Trauma (ASCOT)9, the Harborview Assessment for Risk of Mortality (HARM)10, the Trauma Risk Adjustment Model (TRAM)11, and the Kampala Trauma Score (KTS)12 were developed. However, these scoring systems are overly complex to be applied in actual practice. Thus, the anatomic injury scoring systems based on ISS and NISS are still widely used despite their low prediction accuracy, and it is necessary to develop a simple and accurate prediction method.

Recently, artificial intelligence (AI) technologies have been applied in various medical domains to predict the patients’ outcome with a high accuracy13,14,15. However, in the trauma field, an effective AI prediction model has not been established yet. Therefore, this study was aimed at developing an AI model to accurately predict mortality as an alternative to ISS and NISS.

Methods

Patients and datasets

This study was approved by the four institutional review boards of Wonkwang University Hospital (WKUH), Chonnam National University Hospital (CNUH), Wonju Severance Christian Hospital (WSCH), and Gachon University Gil Hospital (GUGH). Informed consent was waived. All methods were performed in accordance with the relevant guidelines and regulations. These four hospitals include the regional trauma centers of South Korea, corresponding to the level-1 trauma centers in the United States. In each regional trauma center, three or more dedicated trauma coordinators and surgeons review the medical records of the trauma patients to be input to the Korean trauma data bank (KTDB). To be registered in the KTDB as a trauma patient, the following eligibility criteria must be satisfied: (1) discharged with hopelessness, died in the trauma bay, or transferred to other hospitals, and (2) admitted in a trauma center16. However, among the patients satisfying the above conditions, patients with only a superficial injury, first-degree burn and corrosive injury, frostbite, or intoxications are not registered in the KTDB. For this study, we considered the following inclusion criteria: (1) trauma patients registered in the KTDB, (2) adult patients aged over 19, (3) patients who stayed more than 6 h in the trauma center.

In the KTDB, the AIS 2005—Update 2008 data of 42,933 patients17 registered in four trauma centers (WKUH [from Jan 2017 to Dec 2019], CNUH [from Jan 2014 to Dec 2019], WSCH [from Jan 2014 to Dec 2019], and GUGH [from Jan 2014 to Dec 2019]) are available. We excluded the data of patients aged less than 19 (n = 4562) or those who spent less than 6 h in the hospital (n = 648). The patient information, including age, gender and the AIS scores, is summarized in Table 1. Among the included 37,762 patients, 36,493 (96.6%) survived (average age: 57.0 ± 17.80) and 1269 (3.4%) deceased (average age: 64.66 ± 16.42). More details of the age and gender according to the four trauma centers are summarized in Supplementary Table 1.

Table 1 Patient information of age, gender, and the AIS scores (0–6) for six body regions: head/neck, face, thorax, abdomen, extremity, and external, according to the survived and deceased groups.

The data of 37,762 patients were considered and divided into the training and testing cohort in a 9:1 ratio in a stratified fashion according to the trauma center and survival status. As presented in Table 2 and Fig. 1a, the training and testing datasets included 33,986 (32,844 survived and 1142 deceased) and 3776 (3649 survived and 127 deceased) patients, respectively. The testing dataset was used only to evaluate the performance.

Table 2 Summary of training, validation, and testing datasets.
Figure 1
figure 1

Overview of training and validation of the AI models to predict the in-hospital mortality in trauma patients: (a) Process flow of the AI model development, and (b) DNN (Region-46) architecture with a five-layer deep neural network consisting of an input layer, three fully connected (FC) layers, and output layer.

New trauma severity features

The original AIS codes have 1999 injury codes, which are excessively complex to be used as input features in AI models and may lead to model overfitting. Consequently, we established new anatomical categories named Region-46 and Region-6 by reducing the AIS codes into 46 categories and six categories, respectively, based on the organ locations.

To generate Region-46, we reduced the 1999 AIS codes to 46 features through organization according to the organ location, clinical significance, and frequency. The complete list of Region-46 is presented in Supplementary Table 1. In the head region, we categorized the codes by the cerebellum, cerebellum epidural, intracerebellar, cerebellum subdural, cerebrum, cerebrum epidural, intracerebral, cerebrum subdural, skull, and head. The relatively less important and less frequent injuries were categorized as head or skull. In the face region, no frequently occurring code represented a severe injury, and thus, all the corresponding codes were categorized as the face. In the neck region, we categorized the codes by the neck vessel, neck esophagus, neck trachea, and cervical spine. The other corresponding codes were categorized as the neck. In the thorax region, we categorized the codes by the thorax vessel, bronchus, diaphragm, thorax esophagus, heart, lung, pericardium, thorax trachea, rib, and thoracic spine. The other codes were categorized as the thorax. In the abdomen region, we categorized the codes by the abdomen vessel, bladder, colon, duodenum, small bowel, kidney, liver, mesentery, omentum, pancreas, rectum, spleen, stomach, and lumbar spine. The other codes were categorized as the abdomen. In the upper extremity region, all the codes were categorized as upper extremity because the scores of all the codes were low overall. In the lower extremity region, the pelvis was the most clinically significant organ, and the other injuries were less significant. Thus, we categorized the codes as the pelvis and lower extremity. In the external region, the AIS scoring codes were low, and thus, all the codes were categorized as external. In Supplementary Table 2, we also presented the example of how AIS codes were converted to Region-46 codes for a novice coder with trauma knowledge to understand the process.

When using the Region-46 categories as input features, each input feature was the sum of the squares of the AIS scores for the same feature. For instance, if two AIS scores (valued 2 and 3) corresponded to the neck among the Region-46, the AI model input for the neck feature was \(2^{2} + 3^{2} = 13\). In this manner, each patient data point pertained to 46 feature values used for the AI model input layer to predict mortality.

Region-6 was considerably simpler than Region-46, as it scored the body regions identical to the six regions of the ISS (head/neck, face, thorax, abdomen, extremity, and external). ISS takes the highest AIS severity code in each of the three most severely injured body regions. Each three AIS score is squared and added (i.e. ISS = 52 + 42 + 22).

Training and cross-validation of the DNN models

To develop the AI models for mortality prediction, a five-layer DNN was adopted. We developed three DNN models with variable input features, as follows: (1) DNN (Region-46) with 46 input features, (2) DNN (AIS) with the original 1999 AIS codes, and (3) DNN (Region-6) with six body regions.

Furthermore, up to five hidden layers were examined, along with each layer depth (node) up to the previous layer depth (node). For the fully connected layers (FCs), the dropout rate was varied from 0 to 0.5 in 0.1 increments. The last FC layer was fed to a sigmoid layer, providing the probabilities for the patient mortality. We implemented and trained the models using TensorFlow and Keras. Moreover, model training was realized using the ADAM optimizer and a binary cross-entropy cost function with a learning rate of 0.0001 and batch size of 64 on an NVIDIA GeForce GTX 1080 Ti GPU.

For the hyperparameter search and cross-validation, a grid search with a 10-repeated tenfold cross-validation was performed. Among the 10 folds, a single fold was retained as the validation data, and the remaining nine folds were used as the training data. Specifically, the training data (n = 33,986) were randomly shuffled for each trauma center and partitioned into 10 equal folds in a stratified manner: Each fold included 3284 survived and 114 deceased data. The process was repeated 10 times, with each of the 10 folds used once as the validation data. Because the amount of survived data overwhelmed the deceased data, we randomly downsampled the survived data. By matching the amount of data in the two groups, the model bias toward survived data could be minimized.

Considering the cross-validation analysis results, the five-layer DNN was selected. The architecture of DNN (Region-46) is illustrated in Fig. 1b. The proposed five-layer DNN included an input layer, three FC layers, and an output layer. The input layer containing 46 features was fed to three FC layers in series, consisting of 64, 32, and 1 node(s) with a leaky rectified linear unit. To alleviate the overfitting issue, dropout and L2 regularization were applied in the last two FC layers. For the second and third FC layers, the dropout rates were 0.5 and 0.3, respectively, and regularization parameters were 0.001. The last FC layer was fed to a sigmoid layer.

By repeating this stratified tenfold cross-validation process 10 times, 100 models from the five-layer DNN were derived. Based on these models, we ensembled the results by averaging the final outputs associated with the mortality probabilities.

Performance evaluation of AI models

The prediction performance of the DNN models was evaluated using the isolated testing dataset (n = 3776) and by applying the sensitivity, specificity, accuracy, and balanced accuracy metrics, defined as

$$Sensitivity = { }\frac{TP}{{TP + FN }},$$
(1)
$$Specificity = { }\frac{TN}{{TN + FP }},$$
(2)
$$Accuracy = { }\frac{TP + TN}{{TP + TN + FP + FN }},$$
(3)
$$Balanced \;Accuracy = { }\frac{Sensitivity + Specificity}{{2 }},$$
(4)

where TP, TN, FP, and FN represent the true positive, true negative, false positive, and false negative, respectively. The balanced accuracy was used to account for the considerable imbalance between the survived and deceased groups (29:1).

Next, we compared the performance of the proposed DNN model and other external AI models such as LR, random forest (RF), and support vector machine (SVM). The models were separately trained to perform cross-validation and determine the optimal hyperparameters.

Finally, we compared the results from the conventional ISS and NISS techniques, as follows:

  1. 1.

    ISS-16: mortality declared if \({\text{ISS}} \ge 16\)

  2. 2.

    ISS-25: mortality declared if \({\text{ISS}} \ge 25\)

  3. 3.

    NISS-16: mortality declared if \({\text{NISS}} \ge 16\)

  4. 4.

    NISS-25: mortality declared if \({\text{NISS}} \ge 25\)

Results

For all 37,762 patients, we counted the number of patients associated with each of the Region-46 features and summarized the counted numbers in Supplementary Table 3. The number of patients associated with lower extremity (n = 12,280) was the highest, followed by upper extremity (n = 11,548), face (n = 8310), head (n = 7200), rib (n = 6147) and thorax (n = 4763). Note that each patient may belong to multiple features due to multiple injuries.

Training and cross-validation of the DNN models

The results of the cross-validation accuracy with the metrics of sensitivity, specificity, accuracy and balanced accuracy are presented in Supplementary Table 4. The highest balanced accuracy (84.61%) corresponded to the proposed model, DNN (Region-46), followed by that of the DNN (Region-6) (83.32%) and DNN (AIS) (80.64%). DNN (AIS) achieved a smaller balanced accuracy than that of DNN (Region-46) and DNN (Region-6), indicating that overfitting may have occurred, as expected; 1999 input features are too many to learn patterns based on 33,986 training data.

Validation of DNN models in the testing dataset

The prediction performances of the DNN models evaluated using the independent testing dataset (n = 3776) are summarized in Table 3. The highest balanced accuracy (85.05%) corresponded to DNN (Region-46), followed by that of DNN (Region-6) (83.62%) and DNN (AIS) (81.27%). The highest area under receiver operating characteristic (AUROC) corresponded to DNN (Region-46) (0.9084), followed by that of DNN (Region-6) (0.9013) and DNN (AIS) (0.8819). Thus, we chose DNN (Region-46) as the proposed AI model to predict the in-hospital mortality of trauma patients. Notably, DNN (Region-46) demonstrated a high prediction performance in terms of sensitivity (83.46%), specificity (86.63%), and accuracy (86.52%).

Table 3 Comparison of the prediction performances of the prediction models on the test dataset.

Comparison with other prediction models

The prediction performances of the external AI models (LR, RF, and SVM models) and traditional prediction models (ISS-16, ISS-25, NISS-16, and NISS-25) are presented in Table 3. Details of the performance comparison according to the four trauma centers are summarized in Supplementary Table 5.

The highest balanced accuracy pertained to DNN (Region-46) (85.05%) and that of the other AI models ranged from 79.40 to 83.62%. Among the input features based on the DNN, the highest AUROC values pertained to Region-46 (0.9084), followed by those of Region-6 (0.9013), and AIS (0.8819); the corresponding values of ISS and NISS were 0.8709 and 0.8681, respectively (Fig. 2a). Among the different AI models based on Region-46, the highest AUROC corresponded to DNN (Region-46), followed by that of the LR (0.9013), RF (0.8853) and SVM (0.8829) (Fig. 2b).

Figure 2
figure 2

Receiver operating characteristic curves: (a) ISS, NISS, DNN (AIS), DNN (Region-6) and DNN (Region-46), and (b) four models of logistic regression (LR), random forest (RF), support vector machine (SVM) and DNN based on Region-46: LR (Region-46), RF (Region-46), SVM (Region-46) and DNN (Region-46), respectively.

Notably, the traditional prediction models based on the ISS and NISS exhibited a lower prediction performance (balanced accuracy ranging from 70.82 to 80.50) than AI models based on LR, RF, SVM, and DNN (balanced accuracy ranging from 79.40 to 85.05).

Discussion

We proposed a DNN model with 46 features, DNN (Region-46) as the AI prediction model to predict the in-hospital mortality of trauma patients, based on a comparison with DNN (AIS), DNN (Region-6), other AI models based on the LR, RF, and SVM methods, and traditional prediction models based on the ISS and NISS. In the independent testing datasets, DNN (Region-46) outperformed the ISS-16 as follows: sensitivity (83.46% vs. 80.32%), specificity (86.63% vs. 80.68%), accuracy (86.52% vs. 80.67%), balanced accuracy (85.05% vs. 80.50%), and AUROC (0.9084 vs. 0.8709).

The AIS-based ISS has been widely used to evaluate the severity of trauma patients since its introduction in the 1970s1,2,18. Most subsequent systems such as the NISS19, exponential ISS (EISS)20, and anatomic profile score (APS)21 could not supplant the ISS4. Nevertheless, the ISS system based on the AIS involves several limitations22,23. First, a major severe injury may be underestimated because only the single highest score is selected in the same region. Second, the severity may differ if a different combination of injuries has the same ISS. Third, the ISS and mortality do not have a linear association, and a negative relationship may also occur in certain ranges. Finally, the AIS has the same weighting for different organs even though all organs have a different prognosis.

These limitations of the ISS can be alleviated by using AI models24. The DNN model can assign different weights to each input data. Thus, we fragmented the conventional ISS system and used the AI methodology to overcome the problem of nonlinearity and weights for each AIS score. In this study, a key challenge was to decide the predictors and features, as the input variables significantly affect the results of machine learning20.

A key advantage of DNN (Region-46) is that it was trained with a large training cohort (n = 37,762) with highly comprehensive data involving a reasonable number of AIS score data points. The Ministry of Health and Welfare established up to 17 regional trauma centers in Korea (by 2021). All the regional trauma centers were mandated to register severe trauma patients in the KTBD. Numerous variables, including the demographics, pre-hospital and in-hospital information, time factors, clinical characteristics, vital signs, trauma scores, and final outcome information, were recorded. Therefore, the quantity and quality of the KTDB data were satisfactory25.

In general, the DNN involves high computational complexity, although it operates more effectively and flexibly than the RF, LR, and SVM, owing to the learning of the nonlinear problem. Nevertheless, owing to the simplified input features, DNN (Region-46) required only 59.52 s for training and 2.19 ms for instance testing, indicating its suitability for practical mortality prediction. Moreover, DNN (Region-6) provided accurate prediction results, requiring only 50.39 s for training and 2.11 ms for instance testing. Thus, if Region-46 is not promptly available, DNN (Region-6) can be used as an alternative method.

The original AIS codes have 1999 injury codes, which are overly complex to be used as input features in AI models and may lead to model overfitting in the current training dataset involving 37,762 patients. The number of features considerably influences the performance of deep learning. The presence of excessively few or many features does not ensure high performance. In general, many data points are required for numerous features. Consequently, we established a new anatomical category, Region-46, by fragmenting the individual AIS codes and aggregating them into 46 anatomic regions. Our proposed Region-46 (46 input variables) could provide more accurate prediction results using more subdivided features than ISS (six input variables). In addition, it minimizes the overfitting issues using reasonably fewer features than AIS (1999 input variables). The better performance of the Region-46-based AI model was possible because 1999 AIS codes were with a sparse matrix, which contains very few non-zero values.

Another advantage of Region-46 is that we derived it from the conventional AIS coding system. It indicates that the Region-46-based AI model enables us to train the model accurately within the existing coding system, without significant changes in the system such as the coding method. Nevertheless, the advantages of Region-46 are based on the assumption that there is no massive data. It is obvious that we can train AI models more accurately using more features from 1999 AIS codes if we have massive data. We postulate that a considerably larger dataset is required to utilize 1999 AIS codes as input features without overfitting. Thus, future research will proceed in the direction of increasing the number of features by collecting more data, and the ultimate goal is to use all 1999 AIS codes.

In addition to the data sample size, this study involves several limitations to be solved. First, this study is retrospective, although the database was collected prospectively. In this study, we performed a retrospective study to investigate the potential of AI models based on our proposed new anatomical injury score. We believe that new prospective studies will be able to minimize the biases and establish the clinical impact of the prediction model. In addition, the prospective study should be able to update AI models in real-time in more regional trauma centers. For future work, we plan to develop a framework that can use real-time data in connection with all the regional trauma centers in Korea, which will improve the model for better generalization. Second, we excluded the patients who died within six hours from arrival because we assumed that the patients were likely not comprehensively evaluated owing to the insufficient time. This aspect may contribute to selection bias. Thus, our future work would include the patients irrespective of the time of death, which may provide a more robust AI model to predict the severity of trauma patients. Third, the input data did not include the physiologic and demographic parameters such as blood pressure, heart rate, mental status, respiration, age, sex, and comorbidities, while likely leading to an inherently limited accuracy. However, we noted the inherent potential of solely AIS code analysis that would be a basis of the future complex models. In particular, this study highlights the substantial scope for improvement in terms of the injury scale. Compared to ISS, more complex combination models such as TRISS8, ASCOT9, HARM10, TRAM11, and KTS12 use physiologic and demographic parameters to improve the performance. Although these complex combination models have not provided sufficient accuracy to replace the ISS, they have the potential to provide personalized prediction information of trauma severity. Thus, in our future work, we will extend our model to include the physiologic and demographic parameters towards personalized severity prediction. Finally, in this study, we considered mortality as an indicator of the severity of trauma patients as a conventional scoring system focused on mortality. However, regarding survivors, we also should be able to predict other outcomes such as morbidity, prevention effectiveness, or burden of cost. By developing new indicators that predict these outcomes, we could provide guidelines for entire trauma system including heath care provider, agencies, and institutions.

In conclusion, we developed a five-layer DNN model with an input layer involving 46 new trauma severity features (Region-46), which exhibited a higher prediction performance than the conventional ISS and NISS systems. To the best of our knowledge, this study represents the first attempt to employ AI prediction models using anatomic injury severity information. Furthermore, this study emphasizes the need for reinvention of a new injury severity scoring system. We expect that this research can provide reference to incorporate AI techniques in trauma care and improve our understanding of the determinants of injury severity. We believe our endeavor will contribute to reducing the societal burden of injury.