Intestinal perforation (IP) in preterm infants is a life-threatening condition that requires urgent surgical intervention. Delayed diagnosis of IP can lead to serious complications and even death in certain patients. There are two major causes of IP in premature infants: necrotizing enterocolitis (NEC) and spontaneous intestinal perforation (SIP)1,2,3,4,5. NEC and SIP are acquired diseases, which means that IP occurs due to the actions of various risk factors. IP associated with NEC (NEC-IP) is an advanced stage of NEC in which NEC has progressed to a severe state. On the other hand, SIP can occur unexpectedly without preceding symptoms or signs6. Many previous studies have investigated the risk factors and pathogenesis of these two types of IP. According to the literature, possible risk factors include prematurity, low birth weight, perinatal use of medications (ibuprofen, steroids, indomethacin, and inotropic agents), maternal chorioamnionitis, patent ductus arteriosus (PDA), and high-grade intraventricular hemorrhage (IVH grade III or IV)5,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22. Despite these efforts, due to the nature of retrospective studies with a small number of patients, there is thus far no consensus concerning risk factors. Furthermore, the pathogenesis of IP in premature infants is still not clearly understood; as a result, neonatologists and pediatric surgeons have difficulty predicting and preventing the occurrence of NEC-IP and SIP23,24,25,26.

Recently, artificial intelligence (AI) technologies have been used to solve various medical problems, including diagnosing diseases, choosing treatments, and predicting the risks and progression of diseases27,28,29,30. Machine learning (ML) algorithms, specifically artificial neural networks (ANNs), allow more accurate estimation of complex patterns through learning from training data and are able to predict which patients will develop a disease as an early and supportive diagnostic tool for clinicians. To date, only a few studies have used ML algorithms to predict IP in premature infants. Irles et al.31 reported an ML model predicting NEC-IP, and Lure et al.32 reported an ML model to differentiate NEC-IP and SIP prior to surgical interventions. Lin33 et al. reported an ML technique for individualized NEC risk scores using intestinal microbiota data. However, despite these efforts, studies concerning ML technologies in neonatal IP are still lacking, and there are no definitive tools to predict NEC-IP and SIP in advance. Therefore, our study aimed to develop our own ML models for predicting IP in very low birth weight (VLBW) infants using a nationwide cohort dataset and to compare the performance of the new models with that of classic ML models. Furthermore, to ensure that our novel predictive ML models were valid in real neonatal intensive care (NICU) settings, we tested new ML models with individual cases from our institution and assessed whether these new models achieved high predictive accuracy.


Baseline characteristics of the patients

The baseline characteristics of the infants are shown in Table 1. Of the 12,555 patients, 521 patients had NEC-IP (4.1%), and 208 patients had SIP (1.7%). Three groups of VLBW infants were compared for analysis: (A) control group without NEC-IP or SIP, (B) patients with NEC-IP, and (C) patients with SIP. Factors such as low gestational age, low birth weight, need for resuscitation at delivery, respiratory distress syndrome (RDS) and administration of surfactant, use of certain medications (steroids, ibuprofen, and inotropes), hypotension, IVH grade III/IV, sepsis, and PDA on treatment were significantly associated with IP, including both NEC-IP and SIP.

Table 1 Clinical characteristics of the patients.

Proposed machine learning algorithms

In the work of Irles et al.31, ML-based approaches were proposed to predict NEC values, and our neural approaches produced promising NEC prediction results using a dataset that included 852 positive cases. However, it is more difficult to collect data for NEC-IP and SIP than for NEC; thus, the number of datasets, especially for positive cases, is very limited (521 and 208 positive cases for NEC-IP and SIP, respectively). To solve this problem, we utilized relevant information from a neural network trained for NEC prediction to elevate the diagnostic accuracy of SIP and NEC-IP.

In this study, we introduced several deep neural networks for predicting NEC, NEC-IP, and SIP in infants by taking a 54-dimensional input vector as the network input, and the output of each network was a single value for binary classification problems. Specifically, we first developed our baseline neural network (Model 1) based on the conventional multilayer perceptron (MLP) architecture and then trained Model 1 to predict NEC, NEC-IP, and SIP separately. For ANNs, it is well known that simply adding layers can lead to performance improvement by enforcing more nonlinearity; however, it causes the overfitting problem when the training dataset is insufficient. Therefore, we stacked the layers more deeply than the conventional models in diagnosis, as introduced in a previously published study31, but determined the hyperparameters (e.g., the number of channels) carefully and added more advanced techniques, such as batch normalization34 and drop-out35, to avoid the overfitting problem and facilitate stable training. Specifically, Model 1 is a binary classifier that is composed of 5 hidden layers; each layer is formed from a block (Fig. 1a) arranged with an activation function, batch normalization and dropout. It takes a 54-dimensional vector as an input and, after the application of all layers, renders feature vectors of dimension. Model 1 is a typical neural approach and can produce promising results when a large number of datasets (e.g., NEC) are available. However, the outcomes from Model 1 can be vulnerable in real-world scenarios where the training dataset is insufficient.

Figure 1
figure 1

Architecture of machine learning models. (a) Model 1 is a baseline neural network based on the conventional multilayer perceptron (MLP) architecture. (b) Model 2 is composed of two different MLPs. One branch predicts NEC, and the other branch predicts either NEC-IP or SIP. The feature vectors from the 3rd layer of the network for NEC are concatenated with the 4th layer of the MLP branch for NEC-IP and SIP. (c) Model 3 has the same network architecture as that of Model 1. Pretrained Model 1 for NEC was further fine-tuned to estimate NEC-IP/SIP.

Therefore, we attempted to further improve our baseline Model 1 to predict NEC-IP and SIP more accurately since the lack of data problem for NEC-IP and SIP is more serious than that for NEC. To alleviate this problem, we developed a new approach that transferred information from the network to predict NEC values to help estimate NEC-IP and SIP. Note that transfer learning in the field of deep learning is one of the most widely used approaches to solve lack of data problems36,37,38,39,40. Therefore, we present additional models (Model 2 and Model 3) that can exploit information achieved from the NEC dataset to predict SIP and NEC-IP.

Specifically, Model 2 is composed of two different MLPs. One branch predicts NEC, and the other branch predicts either SIP or NEC-IP (Fig. 1b). Notably, at the 4th layer of the MLP branch for NEC-IP and SIP in Model 2, feature vectors from the third layer of the network for NEC are fed as an additional input. By concatenating the feature vectors from NEC, we can utilize the information for NEC in predicting NEC-IP/SIP.

Unlike that of Model 2, the network architecture of Model 3 is the same as that of Model 1. We employed conventional transfer learning to utilize information from NEC to estimate NEC-IP/SIP and fine-tuned the pretrained Model 1 to the specific NEC-IP/SIP datasets (Fig. 1c).

Comparison of performance between classic ML models and proposed ANN models

We provide prediction results in Table 2 to compare traditional ML models with our neural approaches. We observed that the proposed neural approach (Model 1) outperformed traditional learning-based methods in terms of area under the receiver operating characteristic curve (AUROC) scores for all cases (NEC, NEC-IP, and SIP).

Table 2 Model performance of classic ML models for predicting NEC, NEC-IP, and SIP.

Moreover, our extended networks (i.e., Model 2 and Model 3), which we designed to mitigate the lack of data problem for NEC-IP (521 positive cases) and SIP (208 positive cases), showed improved results in predicting NEC-IP and SIP over the baseline network (Model 1).

In particular, compared to Model 1, which was trained on the NEC dataset, our proposed methods for NEC-IP and SIP (Model 2 and Model 3) exhibited improvements. Notably, Model 2 directly utilizes features distilled from Model 1, Model 3 fine-tunes Model 1 for the prediction of either NEC-IP or SIP, and Model 2 outperforms Model 3 in terms of AUROC, as shown in Table 3 and Fig. 2. The performance metrics of these models using balanced validation dataset are described in Supplementary Tables 1 and 2. This performance improvement achieved in Model 2 and Model 3 indicates that information extracted to predict NEC can also be used to predict NEC-IP and SIP more accurately. Moreover, the proposed direct feature distillation of Model 2 rather than the conventional fine-tuning approach (i.e., Model 3) can be a recommendable option for addressing problems with limited data.

Table 3 Performance of the proposed ANN models in predicting NEC, NEC-IP, and SIP.
Figure 2
figure 2

Receiver operating characteristic curves of proposed ML models for (a) NEC prediction, (b) NEC-IP prediction, and (c) SIP prediction.

Application of the new ANN models in a real clinical environment

To assess the feasibility of the algorithm in a real clinical environment, we tested our newly developed algorithms using the patient data from our institution, which were not included in the training dataset. A total of 57 VLBW infants who were born at our hospital between 2019 and 2020 were included in this test analysis. Among the three ANN models, Model 2 achieved the highest AUROC scores: 1.0000 for the prediction of NEC-IP and 0.9364 for the prediction of SIP (Table 4 and Fig. 3).

Table 4 Test results for 57 cases within a real NICU environment.
Figure 3
figure 3

Receiver operating characteristic curves of proposed ML models from 57 test cases. (a) NEC prediction, (b) NEC-IP prediction, and (c) SIP prediction.


Our novel ML algorithms predicted NEC-IP and SIP in VLBW infants with favorable AUROC scores, outperforming all other classic ML algorithms. One of our algorithms exhibited an AUROC score of 1.0000 for predicting NEC-IP and 0.9364 for predicting SIP in real clinical settings. Our study shows the integration of a vast nationwide dataset with ML, and the resulting model can be used to predict the possibility of specific medical conditions in patients who may not perfectly represent the signs and symptoms of the disease.

Among preterm infants in the NICU, various medical problems exist at the same time, and multidisciplinary collaboration is required to make medical decisions for these patients. IP is one of the most devastating medical conditions that occurs in the NICU. Early diagnosis, swift judgment, and prompt surgical intervention are required to prevent severe complications and poor outcomes41. However, as we explained earlier, predicting the occurrence of IP is difficult. By running this algorithm, it is possible to analyze every preterm infant who is admitted to the NICU and predict each patient’s likelihood of developing NEC-IP and SIP. Thus, early predictions of these serious medical conditions could provide clinicians with a much more stable management environment, enabling them to make better treatment decisions.

In recent years, AI and big data have been increasingly integrated into medicine because it is difficult for the unaided human clinician to acquire all the latest published knowledge, as is required by modern evidence-based medicine30,42. AI is an important resource for medical research in that it can efficiently process large amounts of data. Additionally, AI can produce consistent and unbiased results without fatigue. Several studies in healthcare research have reported sufficient or even better risk prediction by AI methods compared to existing models43,44,45,46,47. Especially these days, in solving difficult problems with big data which have complex distribution, neural approaches exhibit excellent performance. Since neural networks can have non-linearity from adding layers and fine-tune parameters by transfer learning36,40, they can get the upper hand in complex tasks. To sum up, ANN is an appropriate model for IP prediction, as it can handle imbalanced big data efficiently. However, even if the overall cohort is large, diseases with low prevalence always suffer from a lack of data. Due to the character of ML, it can produce excellent results only with a large amount of training data. Thus, applying ML to diseases with low prevalence remains a challenge48,49. To overcome the data imbalance problem, several studies have applied data processing techniques such as oversampling50 and undersampling51 as we did in our ANN models.

To further improve the performance of our models, we modified them by adding another branch or pretraining it based on an algorithm that predicts other relevant disease with higher prevalence in order to help the model predict the target disease more accurately. As a result of these adjustments, the modified algorithms (Model 2 and 3) achieved better performance than the original model. According to previous studies, NEC-IP and SIP are regarded as separate disease entities, and the pathogenesis of SIP does not appear to correlate with that of NEC5,6,52,53,54,55. Notably, however, training of NEC prediction improved not only the accuracy of NEC-IP prediction but also the accuracy of SIP prediction. These results show that pre-ML training with more prevalent medical conditions can help AI predict the occurrence of the target disease more accurately. Our study also highlights that it is necessary to customize the algorithm for each disease to apply an ML model in real clinical settings, especially if the disease is rare.

Although our study showed a favorable outcome, it had its share of limitations. First, a limited number of factors are included because only data collected from the Korean Neonatal Network (KNN) were used. It is expected that the collection of further IP-related data, such as clinical symptoms, vital signs, and radiologic findings, will enable the model to produce better results. In addition, the limitations of AI studies, such as representation, homogeneity, and accuracy, were observed in this study. Another limitation is that it is difficult to determine how AI methods generate results due to the nature of self-extracted data from large datasets49,56,57.

In conclusion, we developed our own ANN models to predict IP early in VLBW infants, and these new models achieved higher accuracy than classic ML algorithms. To our knowledge, this is the first study to develop an ML model to predict both NEC-IP and SIP using nationwide VLBW infant data. In addition, the newly proposed ANN models showed excellent performance within real NICU clinical settings. When more clinical data, such as vital signs, radiologic findings, biomarkers, and laboratory results, are gathered, we believe that a more accurate ML model will be developed, thereby achieving early prediction of these serious medical conditions and better clinical outcomes for VLBW infants.


Data collection

We derived data from infants registered in the KNN, a nationwide prospective cohort registry of VLBW infants58. Their clinical data were collected from 74 participating NICUs across the country and analyzed retrospectively for this study. Prior to participation in the KNN registry, informed consent was obtained from the parents of each infant, and all methods were carried out following relevant guidelines. This study was approved by the Hanyang University Institutional Review Board (IRB No. 2013-06-025-043).

The cohort comprised 12,555 VLBW infants born between January 5, 2013, and December 31, 2018, weighing less than 1500×g.

Disease definitions

NEC was defined according to Bell's modified staging grade ≥ II. NEC-IP was diagnosed when patients with NEC underwent any kind of abdominal surgical intervention (peritoneal drainage or laparotomy). SIP was defined when the patients underwent surgical intervention due to IP and the surgeon found no predisposing causes, such as NEC, intestinal atresia, or meconium plug. The full list of 54 variables used in ML analysis is shown in Supplementary Table 3.

Comparisons of baseline characteristics

A total of 54 variables, including various maternal and perinatal factors, were collected for ML. Among them, 18 clinical factors that were proposed as possible risk factors for either NEC-IP or SIP in previous studies were analyzed using conventional statistical methods. Student’s t-test was performed to analyze the continuous variables, and the chi-squared test was used to analyze categorical variables. Statistical significance was set at P < 0.05. The Statistical Package for the Social Sciences version 22.0 for Windows software program (IBM Corp., Armonk, NY, USA) was used in all statistical analyses.

Data preprocessing

Our dataset was composed of 12,555 infants in total; we divided them into training and evaluation datasets (Table 5). Moreover, to facilitate the network training procedure, a data preprocessing step was applied (Fig. 4). First, to solve the missing data problem in the given dataset, we imputed the missing (null) values with plausible values. Specifically, input values were categorized into ordinal, continuous and categorical types. In the case of ordinal inputs, we imputed the null values with the mode (most frequently occurring) values, and in the case of continuous inputs, we imputed the null values with the mean values. Finally, missing values of categorical inputs were replaced with the median values. After recovering the data, we mapped the dataset between 0 and 1 by using min–max normalization. Finally, to mitigate the data imbalance problem (e.g., 11,703 negative and 852 positive cases for NEC in Table 5), which is a common intrinsic feature of disease datasets, we oversampled the smaller category (positive cases) and undersampled the bigger category (negative cases), as suggested in previous studies59,60.

Table 5 Size of training and evaluation datasets.
Figure 4
figure 4

Flowchart of data processing. Input values were categorized into ordinal, continuous and categorical types. To solve the data imbalance problem, oversampling and undersampling technique were applied. Then, the data were normalized.


We used the binary cross entropy (BCE) loss to train Model 1, and Model 1 was separately trained to predict NEC, NEC-IP, and SIP. For Model 3, pretrained Model 1 for NEC was further fine-tuned, and the separately updated parameters using the BCE loss were used to predict NEC-IP and SIP.

Unlike Model 1 and Model 3, Model 2 was trained in a two-stage manner. In the first stage, the left branch of the network (Fig. 1b) was trained for NEC with the BCE loss. Then, the BCE loss as well as the BCE for NEC in the left branch were used to jointly optimize the network for either NEC-IP or SIP. In both training steps, oversampling and undersampling technique were used to alleviate the data imbalance problem. We employed the two-stage training scheme rather than using the one-shot joint training approach, as we could retain more stability during the training from the two-stage approach.

To train the three proposed models (i.e., Model 1, Model 2, Model 3), the Adam optimizer61 with a learning rate of 0.0001 was used for loss minimization. We used a dropout rate of 0.235 and a batch size of 12834. Instead of fixing the number of iterations, we stopped the training using the early stopping technique to avoid the overfitting problem. When the loss increased seven times in a row, we stopped the training (i.e., early stop) and took the parameters just prior to the loss increase. The models were implemented in PyTorch with Python62, and the Scikit-learn library63 was used to evaluate the results (e.g., AUROC, F1-Score, and ROC curve).