Predicting acute pancreatitis severity with enhanced computed tomography scans using convolutional neural networks

This study aimed to evaluate acute pancreatitis (AP) severity using convolutional neural network (CNN) models with enhanced computed tomography (CT) scans. Three-dimensional DenseNet CNN models were developed and trained using the enhanced CT scans labeled with two severity assessment methods: the computed tomography severity index (CTSI) and Atlanta classification. Each labeling method was used independently for model training and validation. Model performance was evaluated using confusion matrices, areas under the receiver operating characteristic curve (AUC-ROC), accuracy, precision, recall, F1 score, and respective macro-average metrics. A total of 1,798 enhanced CT scans met the inclusion criteria were included in this study. The dataset was randomly divided into a training dataset (n = 1618) and a test dataset (n = 180) with a ratio of 9:1. The DenseNet model demonstrated promising predictions for both CTSI and Atlanta classification-labeled CT scans, with accuracy greater than 0.7 and AUC-ROC greater than 0.8. Specifically, when trained with CT scans labeled using CTSI, the DenseNet model achieved good performance, with a macro-average F1 score of 0.835 and a macro-average AUC-ROC of 0.980. The findings of this study affirm the feasibility of employing CNN models to predict the severity of AP using enhanced CT scans.


Definition Diagnosis of AP
A diagnosis of AP was made according to the 2012 revised Atlanta classification and definitions of AP 2 ; patients had to meet any two of the following conditions: (1) abdominal pain consistent with the characteristics of AP; (2) serum amylase (or lipase) greater than three times the upper limit of normal; and (3) characteristic findings of AP on imaging.

Definition of CTSI
Balthazar proposed the CTSI score based on enhanced CT images, considering pancreatic inflammation and the area proportion of pancreatic necrosis 15 .The detailed scoring criteria are provided in Table 1.

Classification of AP severity
In this study, we used two approaches for assessing AP severity.Firstly, classification based on CTSI scores: a total CTSI of 0-3 indicated MAP, 4-6 indicated moderately severe AP (MSAP), and 7-10 indicated SAP.Secondly, the classification based on the 2012 revised Atlanta classification, which also defined three degrees of AP severity, as outlined in Table 2.

Pancreatic inflammatory
Normal pancreas 0 Focal or diffuse enlargement of the pancreas 1 Intrinsic pancreatic abnormalities and inflammatory changes in the peripancreatic fat 2 Single, ill-defined area of fluid collection 3 Two or multiple, poorly defined area of fluid collections 4

CT scans acquisition and labeling
All patients underwent standard contrast-enhanced abdominal CT examinations using a single-source, 64-multidetector CT scanner.Specific parameters were as follows: slice thickness of 1.0 mm and a matrix size of 512 × 512.CT scans usually consisted of 300-350 slices.Following non-enhanced CT acquisition, enhanced CT images were obtained after intravenous administration of nonionic iodinated contrast material (300 mg/mL of iodine) at a dose of 1.2 mL/kg and an injection rate of 2.5 mL/s using an automatic power injector.CT scans were saved in digital imaging and communications in medicine (DICOM) format to the picture archiving and communication system (PACS).
Portal venous phase CT images (50-70 s after contrast injection) were extracted from the PACS and used in this study.The CT image window width was adjusted to 200, and the window position was set at 45. Raw Hounsfield unit (HU) values were rescaled to a range of 0 to 1.Each CT scan included 256 manually selected slices encompassing the pancreas, and the CT scan image size was reshaped to 64 × 128 × 128.
CT scans were labeled for AP severity based on CTSI or the 2012 revised Atlanta classification.CTSI scores were determined by radiologists (Du and Geng, each with more than 10 years of experience) using the CTSI criteria according to the CTSI criteria.The radiologists were blinded to patient clinical symptoms and treatment.AP severity based on CTSI was determined by the calculated CTSI score, while severity based on the Atlanta classification was extracted from the database, recorded during patient hospitalization.These data were classified into MAP, MSAP, and SAP groups accordingly.Notably, the AP severity classification based on CTSI did not exactly match that based on the Atlanta classification.Each labeling method was used independently for model training and validation.

Training and test datasets
The 1,798 CT scans were randomized into the training dataset (n = 1,618) and the test dataset (n = 180) at a ratio of 9:1.To enhance the training process, data augmentation techniques such as random rotation and translation were applied to the CT scans.As the CT scans were represented as rank-3 shape tensors (samples, depth, height, width), an additional dimension of size 1 at axis 4 was added to enable 3D convolutions (samples, depth, height, width, 1).The training dataset included both the raw CT scans and augmented CT scans, while only the raw CT scans were used for model evaluation in the test dataset.

DenseNet model
A three-dimensional DenseNet CNN model was developed for this study, utilizing the network architecture presented in Fig. 1.The model consisted of four modules, each comprising a dense block and a transition block.Within the dense block, the output Xi of layer i satisfied expression (1), where the nonlinear transformation function Hi(•) incorporated batch normalization and convolution.The last module connected the fully connected layers and applied the Softmax function to produce the final predictions.

Model evaluation
Confusion matrices were used to assess the accuracy of pairwise classification between different categories of patients.Since the task involved multiclassification prediction, the metrics such as AUC-ROC, precision, recall, and F1 score were calculated.The macro-average values of these metrics, computed as the arithmetic mean across individual classes, were used to evaluate the model's performance.In this study, macro-average metrics were employed instead of micro-average metrics to evaluate model performance in the triple classification task 25 .Macro-average metrics, which assign equal importance to each class, are considered more suitable for imbalanced datasets compared to micro-average metrics 26 .By giving equal weight to each class, macro-average metrics provide objective results for imbalanced datasets, allowing for reliable evaluation.

Visual interpretation of the models
The interpretation of model predictions was achieved by employing Gradient-weighted Class Activation Mappings (Grad-CAMs) extended to the 3D setting 27 .These visual explanations represent heat maps superimposed on each slice, providing insights into the model's decision-making process.To visualize the Grad-CAMs, we overlay the Grad-CAMs on each input slice, offering a comprehensive view of the prediction rationale. (1)

Ethics statement and informed consent statement
The study was approved by the Institutional Ethics Committee of the General Hospital of Western Theater Command (No. A20200212008).The requirement for obtaining written informed consent from patients was waived by the Institutional Ethics Committee of the General Hospital of Western Theater Command due to the retrospective nature of this study.Our study was conducted according to the ethical standards of the 1964 Declaration of Helsinki and its later amendments.

Methods statement
All methods were carried out in accordance with relevant guidelines and regulations.

Experimental environment and statistical analysis
This study was conducted on a computer with an NVIDIA(R) RTX(R) 3090 TI GPU and Intel(R) Core(R) CPU i9-12900 K processor.Python 3.9.0(Python Software Foundation, Wilmington, DE, USA) was used for data extraction and preprocessing, model development and validation, and visualization and statistical analysis.To calculate the Ninety-five percent confidence intervals (CIs) for performance evaluation metrics such as accuracy and F1 score, we implemented bootstrapping with 1,000 iterations 28 .This allowed us to derive values from these iterations, upon which the CIs were computed.Statistical significance was computed with the same bootstrapping method 29 .P < 0.05 was considered statistically significant.

Results
In this study, a total of 1,798 enhanced CT scans from 1,561 patients were included (Fig. 2).These CT scans were labeled according to both the CTSI (MAP: 769, 42  3.There were no significant differences between the two groups in terms of age, gender, etiology, length of hospital stay, and mortality rate (P > 0.05).Furthermore, the time interval from the onset of symptoms to the CT examination for both groups was 5.4 ± 0.9 days and 5.4 ± 0.8 days, respectively, demonstrating no significant variance (P > 0.05).
The performance of the trained models are summarized in Table 4, and the confusion matrices depicting the prediction results are presented in Fig. 3.The results revealed that the DenseNet model achieved favorable predictions for both the CTSI-and Atlanta classification-labeled CT scans, with accuracy exceeding 0.7 and  AUC-ROC exceeding 0.7.Notably, the model trained with CTSI-labeled CT scans demonstrated particularly favorable performance, with a macro-average accuracy of 0.899, macro-average F1 score of 0.835, and macroaverage AUC-ROC of 0.980.The ROC curves of the DenseNet model predictions for the two different labeling methods are illustrated in Fig. 4.Both methods exhibited the best prediction performance for SAP, likely due to the more pronounced CT image changes observed in patients with SAP, making them more easily distinguishable by the model.Furthermore, the model demonstrated superior predictions with CTSI labeling compared to the Atlanta classification (macro-average AUC-ROC: 0.980 vs. 0.864, P < 0.05; macro-average F1 score: 0.835 vs. 0.670, P < 0.05).
The visualization results of our model can be seen in Fig. 5.We selected three representative CT scan slices of AP and employed Grad-CAMs to visualize the regions influencing the decision-making process in our trained DenseNet models.Our findings reveal that in the case of MAP with a pancreatic enlargement (Fig. 5A), both DenseNet models trained with the two annotation methods focused on the pancreas and the surrounding peripancreatic region.However, in instances with more noticeable pancreatic morphological changes (Fig. 5B  and C), the DenseNet model trained using CTSI-labeled CT scans more effectively accentuated the areas corresponding to pancreatic necrosis and peripancreatic accumulation.

Discussion
Our study confirms the feasibility of using the CNN models, grounded on 3D DenseNet, to predict the severity of AP using enhanced CT scans.Regardless of whether the models were trained on CT scans labeled CTSI or Atlanta classification, the trained model consistently yielded robust classification performance, with a macroaverage AUC-ROC score surpassing 0.8.
In recent years, advancements in AI and deep learning have enabled the application of CNN models, such as ResNet, DenseNet, Inception, and VGG, in the automatic analysis of CT scans [30][31][32][33] .DenseNet has demonstrated favorable predictive performance in studies on the prediction of various conditions, including COVID-19 34 .Previous studies predominantly utilized 2D CT slices for training and testing CNN models 35 .However, since CT scans inherently provide 3D information, processing them using 2D models may lead to the loss of valuable information and compromise predictive efficacy.With the advancement of computing capabilities, the use of 3D models for direct processing of CT scans has become feasible.Studies involving the automated identification of COVID-19 patients have shown promising results using 3D CNN models 36 .Therefore, in this study, we employed a 3D DenseNet model based on these previous findings.
To the best of our knowledge, there is limited research on using deep learning models to predict the severity of AP from CT scans.In a recent study, Chen et al. employed an image-deep learning model based on MobileNetV2 and trained it on non-enhanced CT scans obtained within 72 h of onset 20 .In this study, we trained a CNN model based on 3D DenseNet using enhanced CT scans from patients with AP, typically taken around 5.4 days after symptom onset.The focus of the two studies varies, and the superior predictive performance of our models demonstrates the potential advantage of using enhanced CT scans for severity prediction of AP through CNN models, aligning with pancreatitis treatment guidelines and clinical experience.However, both studies preliminarily suggest the promising potential of CNN models in predicting the severity of AP using CT scans.However, there remains a considerable gap between the current capabilities of these models and their potential clinical applications, emphasizing the need for continued research.
The Atlanta classification is currently the most prevalent method for severity classification in AP.While there is a correlation between the CTSI and Atlanta classification, they are not identical 37,38 .The primary rationale for employing the CTSI in our research stems from its exclusive reliance on CT imaging, which sets it apart from the Atlanta Classification which also factors in additional data beyond the scope of CT imaging.Given these distinctions, CT images may not carry sufficient information for a precise Atlanta classification, and the utilization of Atlanta classification labels in training the CNN model may compromise the model's generalization capability.www.nature.com/scientificreports/Indeed, in this research, we amassed as many enhanced CT scans of patients with AP as possible.Several patients underwent multiple CT scans, and if these scans met the study's inclusion and exclusion criteria, they were included.As pancreatic necrosis can develop during the early stages of AP, the CTSI classification for these patients may fluctuate over different time points.However, their classifications according to the Atlanta criteria remained consistent.Training the model on distinct CT images carrying identical labels may detrimentally impact the model's predictive capacity.Our results showed that the predictive performance of the model trained on CT scans labeled with CTSI outperformed the model trained on CT scans labeled with the Atlanta classification (macro-average AUC-ROC: 0.980 vs. 0.864).These findings indicated that training a CNN model using enhanced CT images based on the CTSI can achieve better predictive performance, and it demonstrates the ability of the CNN model to capture information about changes in the severity of AP from CT images.
In this study, we did not perform an a priori extraction of the region of interest (ROI) pertaining to the pancreas.Currently, there are no mature algorithms for accurately classifying pancreatic necrosis, peripancreatic necrotic accumulation, and normal pancreas 39 , and manual ROI labeling or training a separate model for automated segmentation would require substantial time and effort 40 .
One of the key advantages of deep learning models is their ability to automatically extract quantitative features from high-throughput images, analyze image data in-depth, and translate microscopic lesion changes into quantitative measures 41 .In this study, referred to some previous CNN model research [42][43][44] , we adopted an endto-end approach that bypasses ROI extraction and directly employs the entire CT scan for model development.Fortunately, our results affirmed the feasibility and effectiveness of this direct approach in predicting the severity of pancreatitis.Regardless, employing an appropriate pancreatic segmentation algorithm or extracting ROI could potentially enhance the predictive performance of the model.This issue can be further explored in subsequent research.
To further evaluate whether our model effectively focuses on the pancreas and peripancreatic necrosis, we employed the Grad-CAMs visualization technique to highlight potentially decision-related areas in CT scan slices 27 .The Grad-CAM results confirmed that the model successfully attends to areas of both pancreatic and extrapancreatic necrosis, further supporting the model's applicability.
In the field of artificial intelligence for pancreatitis, one of the ultimate goals might be to dynamically predict the severity of a patient's condition and their clinical prognosis based on various collected data, thereby guiding clinical diagnosis and treatment.In future research, through model improvements, such as adopting attentionbased transformer models and incorporating time-series based recurrent neural networks, we may not only further enhance the predictive performance of the model and reduce computational load, but also potentially achieve a dynamic evaluation of pancreatitis severity using heterogeneous data from different time points.This could enable us to rapidly and accurately determine the severity of AP using available data.
However, this study has certain limitations inherent to its design and objective conditions.Firstly, it was a single-center study, which limits its generalizability, although sample consistency was high.Conducting multicenter studies and external validation would further strengthen the predictive efficacy of the model.Secondly, as mentioned earlier, ROI extraction was not attempted in this study, and it remains unclear whether performing ROI extraction would improve the model's prediction performance.This aspect can be explored in future research.Thirdly, in this study, we focused on developing a 3D DenseNet CNN model.Future studies can investigate additional CNN models and cross-modal hybrid models that integrate both imaging information and clinical data to enhance the model's performance in predicting AP severity.
In summary, our findings demonstrate that the constructed 3D DenseNet CNN model exhibits reliable predictive capability in classifying AP severity after training with enhanced CT scans, highlighting the feasibility of using CNN models for automatic AP severity classification based on imaging data.Moreover, this study provides insights for the development of more comprehensive models that incorporate both imaging information and clinical data for predicting the severity of pancreatitis.Further advancements in this area can lead to improved clinical decision-making and better patient outcomes.

Figure 3 .
Figure 3. Confusion matrices of the model prediction.The DenseNet model was used to estimate the severity of acute pancreatitis based on (A) CTSI-labelled CT scans and (B) Atlanta classification-labelled CT scans.Note: 1 CTSI, computed tomography severity index; 2 Atlanta, 2012 revised Atlanta classification of acute pancreatitis; 3 MAP, severe acute pancreatitis; 4 MSAP, moderately severe acute pancreatitis; and 5 SAP, severe acute pancreatitis.

Figure 5 .
Figure 5. Visual explanations of model predictions using Grad-CAMs.The left column shows original CT scan images, the middle column displays visual explanations obtained from the model trained with CTSI-labeled CT scans, and the right column shows visual explanations from the model trained with Atlanta-labeled CT scans.(A) CT scan slice with mild pancreatic enlargement; (B) CT scan slice with peripancreatic accumulation; (C) CT scan slice with extensive pancreatic necrosis and peripancreatic accumulation.

Table 2 .
Grades of acute pancreatitis severity based on the 2012 revised Atlanta classification.

Table 3 .
Demographic characteristics and clinical outcomes of the training and validation cohort.LOS length of stay.*P < 0•05.

Table 4 .
Predictive performance of the models trained using CT scans labeled with CTSI and Atlanta classification.CTSI computed tomography severity index, MAP severe acute pancreatitis, CI confidence interval, MSAP moderately severe acute pancreatitis, SAP severe acute pancreatitis, Atlanta 2012 revised Atlanta classification of acute pancreatitis.