Introduction

The first week of life is considered the most vulnerable period for newborns in terms of mortality, particularly among very low birth weight (VLBW) preterm infants1. Despite surviving these critical initial weeks following birth, VLBW preterm infants remain at a heightened risk for adverse long-term neurodevelopmental outcomes2. This risk can be primarily attributed to intraventricular hemorrhage (IVH), with approximately 95% of IVH cases occurring during this period3.

The significance of a dependable and timely risk assessment tool for early mortality and incidence of severe IVH cannot be overstated. This tool could not only provide a structured framework for parents and healthcare providers during the decision-making process but also offer valuable insights into recommending appropriate levels of care based on estimations of mortality and poor outcomes. For example, patients with a high probability of severe IVH may require tailored circulatory management strategies4. Moreover, the early identification of infants at the highest risk of developing severe IVH holds promise for enhancing the design of future clinical studies and optimizing the selection of participants for trials5.

In Taiwan, the incidence of preterm births gradually increased from 8.85% in 2004 to 10.73% in 2014, a trend observed on a global scale6. Notably, the preterm birth rate in Taiwan has surpassed that in most OECD countries7. However, to the best of our knowledge, the existing literature has only identified certain risk factors associated with mortality and severe IVH in Taiwan8,9,10. The establishment of a nationwide outcome predictor applicable to the Taiwanese population remains an unmet need.

Therefore, this study aimed to develop and validate a straightforward machine learning (ML)-based outcome estimator, utilizing readily available data shortly after birth, to predict the probability of early mortality and development of severe IVH in VLBW preterm infants.

Methods

Study design and population cohorts

In this retrospective observational study, cohort data of VLBW preterm infants were obtained from the Taiwan Neonatal Network, established in 2016 to compile nationwide clinical data of preterm infants delivered in Taiwan from 33 medical centers. The enrollment criteria outlined by the Taiwan Neonatal Network include live-born infants born in Taiwan, with birth weights ranging from 401 to 1500 g or gestational ages ranging from 22 weeks 0 days to 29 weeks 6 days. These data were then used to establish and investigate two cohorts.

Cohort 1 comprised infants born between 2016 and 2021. Their data were collected for subsequent model development, internal validation and model comparison. Cohort 2 comprised infants born in 2022 and was included in the external validation.

The inclusion criteria were a gestational age (GA) between 22 weeks 0 days and 36 weeks 6 days and a birth body weight (BBW) of less than 1500 g. Infants with missing data were excluded.

This study was approved by the National Cheng Kung University Hospital Institutional Review Board (A-ER-111–115). The need for informed consent was waived by the National Cheng Kung University Hospital Institutional Review Board because the data were anonymized and de-identified. All methods were performed in accordance with the relevant guidelines and regulations.

Outcomes

The primary outcomes of the study were early mortality, severe IVH, and early poor outcomes (early mortality or severe IVH). Early mortality was defined as death within the first week of life, and severe IVH was defined as IVH grade III or IV on cranial ultrasound, graded using Volpe’s grading system11.

Data preprocessing

We collected essential data as variables for each enrolled infant, resulting in a total of 23 variables. These variables included the following: antenatal steroid use; prenatal magnesium sulphate (MgSO4) use; pregnancy-induced hypertension; chorioamnionitis; GA; BBW; multiple births; Cesarean section; small for GA (defined as birth weight below the 10th percentile for GA, referencing values for birth weight distributions from a previous study of the Taiwanese population)12; sex; 1-min Apgar score; 5-min Apgar score; body temperature (defined as the first rectal temperature measured within the first hour of birth); early-onset sepsis (defined as culture-proven sepsis occurring within 72 h of birth); respiratory distress syndrome; congenital anomalies (including chromosomal anomalies, skeletal dysplasia, inborn errors of metabolism, lethal or life-threatening anomalies in the cardiovascular, gastrointestinal, genitourinary, or pulmonary system, and other lethal or life-threatening anomalies); and seven delivery room resuscitation measures: neonatal resuscitation, oxygen supplementation, delivery room continuous positive airway pressure ventilation, positive pressure ventilation, endotracheal tube ventilation, chest compressions, and epinephrine administration. RapidMiner software version 10.0 (Altair Engineering, Troy, MI, USA; www.rapidminer.com) was used for data input and the cleaning of missing data.

Selection of variables

To facilitate practical applicability, we conducted variable selection using the information gain attribute evaluator provided by Weka software version 3.8.6 (Waikato Environment for Knowledge Analysis, Hamilton, New Zealand). The evaluator measures the entropy gain of each variable in relation to the outcomes and was used to evaluate the significance of each of the 23 variables13. Additionally, we evaluated collinearity between the variables. In the interest of establishing a more streamlined model, we selected the top-ranked variables.
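The idea behind ranking variables by information gain can be sketched in a few lines of Python. This is a minimal illustration, not the Weka implementation: it assumes discrete variables (numeric variables would first need to be discretized), and the toy data and variable names below are hypothetical rather than drawn from the study cohort.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(feature_values, labels):
    """Reduction in outcome entropy after splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy data (assumption), not taken from the study cohort.
outcome = [1, 1, 0, 0, 0, 0, 0, 0]
intubated = [1, 1, 0, 0, 0, 0, 0, 0]  # separates the outcome perfectly
sex = [0, 1, 0, 1, 0, 1, 0, 1]        # carries no information here

scores = {
    "intubated": information_gain(intubated, outcome),
    "sex": information_gain(sex, outcome),
}
# Rank variables from most to least informative.
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

A variable that splits the infants into groups with purer outcomes receives a higher score, which is the basis for keeping only the top-ranked variables.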

ML algorithms and model building

The flow chart for building models using ML algorithms via Orange software version 3.34.0 (University of Ljubljana, Ljubljana, Slovenia)14 is shown in Fig. 1.

Figure 1

Flowchart of machine learning to build the predictive model.

These models were developed using six algorithms: k-nearest neighbor (kNN), decision tree, random forest, neural network, logistic regression, and gradient boosting.

Brief descriptions of the six ML models are as follows:

  • The kNN algorithm15 is an ML instance-based model that stores all instances of the training dataset and makes predictions based on neighborhood proximity, as defined by a similarity metric.

  • The decision tree algorithm16 is a tree-structured prediction model that starts with a root node and progresses to a leaf node. Each internal node represents a predictor variable, each internal node connection represents a choice, and each leaf node represents the outcome variable.

  • The random forest algorithm17 is an ML ensemble model that combines multiple decision trees to achieve increased prediction accuracy. Each uncorrelated decision tree in the random forest makes a prediction, and the prediction with the largest number of votes is used as the final prediction for the algorithm.

  • The neural network algorithm18 is an ML model that mimics the signal transmission through neurons in the human brain. The algorithm comprises multiple layers of nodes: an input layer, one or more hidden layers, and an output layer. Each node functions as a neuron with a threshold value. If the collected signal reaches this threshold, the node is activated and the signal is transferred to the next layer in the network. This continues until the signal reaches the output layer, where the prediction is generated.

  • The logistic regression algorithm19 is used for binary and multiclass classifications. It utilizes the logistic function, often known as a sigmoid function, to provide an estimate of probability values ranging from zero to one.

  • The gradient boosting algorithm20 is another ensemble model that incorporates a large number of ML models to provide strong predictors. The algorithm uses a gradient boosting technique to calculate the residual error by training a simple base learner on all the training datasets. A new learner is then created to forecast the prior residual error and increase the accuracy of the prediction model.

Internal evaluation

A tenfold cross-validation approach was employed for internal model validation. The dataset was randomly divided into 10 groups, with nine groups used for training and one for testing in each iteration. The average performance of the test results was subsequently used to assess the overall performance of the model across all the groups.
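The fold logic described above can be sketched as follows. This is a minimal, unstratified illustration using only the Python standard library; it is not the Orange implementation, and the function names and the `fit_and_score` callback are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k random folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k groups of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_samples, fit_and_score, k=10, seed=0):
    """Average a user-supplied score over the k held-out folds.

    fit_and_score(train, test) is a hypothetical callback that trains a
    model on the training indices and returns its score on the test indices.
    """
    scores = [fit_and_score(train, test)
              for train, test in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)
```

Each infant appears in exactly one test fold, so every record contributes once to the averaged test performance.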

Model comparison

The performance of all prediction models was assessed by comparing the area under the curve (AUC) using the Orange software. Additionally, calibration plots and mean Brier scores, calculated with the assistance of Python, were employed to evaluate the predictive ability and goodness of fit of the models. This facilitated the observation of agreement between the actual and predicted probabilities.
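For readers unfamiliar with these two metrics, the sketch below shows how each can be computed from observed outcomes and predicted probabilities. It is an illustration only and does not reproduce the Orange or Python code used in the study; the example predictions are hypothetical.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome (0/1)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def auc(y_true, y_prob):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as one half (the Mann-Whitney formulation).
    """
    positives = [p for y, p in zip(y_true, y_prob) if y == 1]
    negatives = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predictions (assumption), for illustration only.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

example_auc = auc(y_true, y_prob)
example_brier = brier_score(y_true, y_prob)
```

The AUC reflects discrimination (ranking infants with the outcome above those without), whereas a lower Brier score reflects predicted probabilities that lie closer to the observed outcomes. A mean Brier score across outcomes is simply the average of the per-outcome scores.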

External validation

The predictive models that exhibited outstanding performance, developed using the Cohort 1 dataset, were subsequently applied to the Cohort 2 dataset for external validation. Furthermore, the AUCs were computed to assess their performance in this independent dataset.

Equation development

The intercepts and coefficients for the selected attributes across different outcomes were calculated using Orange software. Subsequently, we formulated the corresponding equations and developed estimators to predict the probabilities of various target outcomes.

Results

Study population and patient characteristics

A total of 8531 newborns were enrolled during the study period. However, 711 newborns were excluded due to missing data and 349 were excluded because they died within 12 h of delivery. Consequently, 7471 newborns with complete records were included in the final study. Cohort 1 and 2 included 6558 and 913 infants, respectively.

In Cohort 1 (Table 1), each variable differed significantly (p < 0.05) between infants with and without each target outcome, except for: the use of prenatal MgSO4 between the groups with and without severe IVH (p = 0.157); multiple births, across all outcomes (p = 0.671 in early mortality, p = 0.32 in severe IVH, and p = 0.22 in early poor outcomes); and congenital anomalies between the groups with and without severe IVH (p = 0.76).

Table 1 Demographic data of the participants.

In Cohort 2 (Table 1), there were no significant differences in antenatal steroid use, prenatal MgSO4 use, pregnancy-induced hypertension, multiple births, Cesarean section, small for GA, sex, early-onset sepsis, congenital anomalies, neonatal resuscitation, oxygen supplementation, chest compressions, or epinephrine administration between infants with and without early mortality (p = 0.17, 0.19, 0.76, 0.38, 0.49, 0.38, 0.97, 0.29, 0.57, 0.64, 0.45, 0.16, 0.60, respectively). Similarly, there were no significant differences in multiple births between the groups with and without severe IVH (p = 0.29) and with and without early poor outcomes (p = 0.20). The discrepancy between cohorts, in which most variables differed significantly by outcome in Cohort 1 but not in Cohort 2, could potentially be attributed to the limited sample size of Cohort 2.

Selection of predictors

Attribute selection, based on the Weka information gain attribute evaluator, enabled the condensed and generic application of the prediction models. The values generated by the evaluator for each variable are listed in Fig. 2 and Supplementary Table S1, revealing notable distinctions between the top five ranked variables and those ranked sixth and beyond. Furthermore, variables ranked second to fifth exhibited similar scores. Consequently, the initial selection for model development included the top five variables: gestational age (GA), birth body weight (BBW), 1-min Apgar score, 5-min Apgar score, and endotracheal tube ventilation during initial resuscitation.

Figure 2

Radar charts of attribute selection with the information gain attribute evaluator. The top five critical variables on the radar chart are GA, BBW, endotracheal tube ventilation, 5-min Apgar score, and 1-min Apgar score. GA gestational age, BBW birth body weight, ETT endotracheal tube, Apgar 5 5-min Apgar score, Apgar 1 1-min Apgar score, RDS respiratory distress syndrome, BT body temperature, epinephrine epinephrine administration, PPV positive pressure ventilation, CPAP continuous positive airway pressure, E_sepsis early onset sepsis, NRP neonatal resuscitation, PIH pregnancy-induced hypertension, C/S Cesarean section, SGA small for gestational age.

Additionally, considering collinearity concerns, further analysis was conducted using Variance Inflation Factor (VIF) values21, as presented in Supplementary Table S2. This analysis indicated significant collinearity between the 1-min Apgar score and the 5-min Apgar score. Based on prior research22, the 5-min Apgar score is regarded as a more reliable predictor of neonatal outcomes than the 1-min Apgar score. Therefore, we opted to exclude the 1-min Apgar score from our prediction variables during model development.
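To illustrate what a VIF measures, the sketch below handles the simplest case of two predictors, where VIF = 1 / (1 − R²) and R² reduces to the squared Pearson correlation. This is an assumption-laden simplification: the study's VIF values were computed across the candidate variables (Supplementary Table S2), and the Apgar scores below are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def vif_two_predictors(x, y):
    """VIF = 1 / (1 - R^2); with only two predictors, R^2 equals r^2."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical Apgar scores (assumption), not drawn from the study data.
apgar_1 = [2, 3, 4, 5, 6, 7, 8, 8]
apgar_5 = [4, 5, 6, 7, 8, 8, 9, 9]

example_vif = vif_two_predictors(apgar_1, apgar_5)
```

A VIF of 1 indicates no linear relationship between the predictors, and the value grows as one predictor becomes more predictable from the other, which is the pattern expected for two closely related Apgar scores.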

Model development and comparison

The four most crucial variables, which were top-ranked and showed no significant collinearity, were utilized in the development of prediction models using Orange software. The internally validated receiver operating characteristic (ROC) curve results (Fig. 3) indicated that the neural network, logistic regression, and gradient boosting models were the best-performing predictive models for all target outcomes, with AUC values of 0.87, 0.86, and 0.86, respectively, for the prediction of early mortality; 0.82, 0.82, and 0.81, respectively, for severe IVH; and 0.84, 0.84, and 0.83, respectively, for early poor outcomes. The calibration plots (Fig. 4) illustrate the agreement between predictions and observations across different percentiles of predicted values. Both the logistic regression and neural network models demonstrated superior calibration performance. Furthermore, the logistic regression model achieved the best (lowest) mean Brier score across the three predictive outcomes, with a score of 0.0685, followed closely by the neural network model, with the second-lowest mean Brier score of 0.06906. In contrast, the kNN and decision tree models exhibited less favorable calibration performance, with the highest mean Brier scores recorded at 0.0811 and 0.08123, respectively.

Figure 3

ROC curve analysis of six prediction models in the internal validation set. (a) ROC of early mortality; (b) ROC of severe IVH; (c) ROC of early poor outcomes. ROC receiver operating characteristic.

Figure 4

Calibration plot and mean Brier score of six prediction models in the internal validation set. (a) Calibration plot of early mortality. (b) Calibration plot of severe IVH. (c) Calibration plot of early poor outcomes. (d) Mean Brier score of three target outcomes.

For external validation using Cohort 2, we applied the best-performing prediction models, namely the logistic regression and neural network models. The results of the ROC curve analysis (Fig. 5) indicated strong predictive capabilities across all outcomes. Specifically, the AUC values for the logistic regression and neural network models were 0.90 and 0.89, respectively, for early mortality prediction; 0.84 and 0.83, respectively, for severe IVH prediction; and 0.86 and 0.84, respectively, for early poor outcome prediction.

Figure 5

ROC curve analysis of three prediction models in the external validation set. (a) ROC of early mortality; (b) ROC of severe IVH; (c) ROC of early poor outcomes. ROC receiver operating characteristic.

Equation development

We used Orange software to calculate the intercepts and coefficients necessary for constructing the prediction models through logistic regression. The results are summarized in Table 2. An equation was formulated for each target outcome, and outcome estimators suitable for clinical applications were developed using Microsoft Excel 2016.
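As a rough guide to how such an estimator turns an intercept and coefficients into a probability, the sketch below applies the standard logistic regression form, p = 1 / (1 + exp(−(b0 + Σ bi·xi))). The coefficient values, variable names, and variable encodings shown here are placeholders chosen for illustration only; the fitted values are those reported in Table 2, and the encoding used by the study's estimator (for example, how BBW ranges are coded) is not reproduced here.

```python
import math


def predicted_probability(intercept, coefficients, values):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    linear = intercept + sum(
        coefficients[name] * values[name] for name in coefficients
    )
    return 1.0 / (1.0 + math.exp(-linear))


# Placeholder numbers (assumption): NOT the fitted values from Table 2.
example_intercept = 0.0
example_coefficients = {"GA": -0.1, "BBW": -0.1, "Apgar5": -0.1, "ETT": 0.5}
# Hypothetical encoding: GA in weeks, BBW as a category index,
# 5-min Apgar score, and ETT as 0 (not intubated) or 1 (intubated).
example_infant = {"GA": 24, "BBW": 3, "Apgar5": 6, "ETT": 0}

example_probability = predicted_probability(
    example_intercept, example_coefficients, example_infant
)
```

One such equation would be evaluated per target outcome, each with its own intercept and coefficients, which is straightforward to reproduce in a spreadsheet.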

Table 2 Intercept and coefficient values of the attributes in various models developed using logistic regression.

As an illustrative example, consider a premature male infant born with a GA of 24 weeks and a birth weight within the range of 601–700 g. The 5-min Apgar score was 6. Importantly, intubation was not required during initial neonatal resuscitation in the delivery room. By inputting these parameters into the outcome estimator, we obtained the following probabilities: 20% likelihood of early mortality, 35% likelihood of severe IVH, and 44% likelihood of early poor outcomes (Table 3).

Table 3 A table of the early poor outcomes estimator.

Discussion

In this study, we used a nationwide retrospective database comprising data on VLBW preterm infants and their associated variables collected immediately after their initial management in the delivery room. Our objective was to develop a predictive model for early mortality, severe IVH, and early poor outcomes using an ML approach. Following the application of this approach, we identified GA, BBW, 5-min Apgar score, and intubation in the delivery room as the four most crucial factors for constructing prediction models. Notably, we found that both the logistic regression and neural network models demonstrated superior performance, as indicated by their higher AUC values. This suggests that they have better discriminative ability in distinguishing between different outcomes. Additionally, these models are well-calibrated, meaning that the predicted probabilities align closely with the observed frequencies of outcomes. Moreover, they were validated in both the development cohort and an independent cohort within this study, which supports their robustness. Overall, the high AUC values, good calibration, and successful validation of the logistic regression and neural network models make them reliable predictors of outcomes in this study.

Currently available scoring systems for predicting early mortality in neonates include the Clinical Risk Index for Babies (CRIB) II23, the Score for Neonatal Acute Physiology Perinatal Extension II (SNAPPE-II)24, and the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD) calculator25 for neonatal conditions or outcomes. These prediction models have been widely employed and subjected to external validation in multiple studies26.

In our research, similar to CRIB II and NICHD, we identified GA and BBW as significant risk factors. A systematic review underscored the significance of these risk factors in neonatal mortality in neonatal intensive care units, with GA and BBW emerging as the most frequently cited contributors to neonatal mortality27. Additionally, an investigation conducted on the Taiwanese population, using data from birth certificates and death registries, established a robust correlation between GA, BBW, and the incidence of early mortality28.

In 1952, Dr. Virginia Apgar pioneered the development of a scoring system designed to evaluate the physical condition of newborns and gauge their need for resuscitative interventions. Her groundbreaking work revealed a significant correlation between neonatal survival up to 28 days of age and the infant’s condition at delivery29. Notably, contemporary research has substantiated the enduring relevance of the Apgar score, reaffirming its significance nearly five decades later30.

Although the Apgar score was initially conceived to assess term infants during an era characterized by high neonatal mortality rates among preterm infants, a recent investigation showed that the relative risk of neonatal mortality consistently escalates as the Apgar score diminishes across all GA categories31. Accordingly, we included the Apgar score as a pivotal variable for outcome prediction in our study.

In our study, intubation emerged as the most important variable among all initial management procedures conducted in the delivery room. Notably, corroborative research conducted in countries such as Korea32, Iran33, Thailand34, and Brazil35 has similarly identified intubation as a pivotal risk factor for neonatal outcomes.

In our study, antenatal steroid administration and multiple births did not demonstrate statistical significance as variables for outcome prediction despite their inclusion in the NICHD calculator. This discrepancy may be attributed to the high prevalence of antenatal steroid administration in Taiwan, where 85% of the patients in our study received this treatment, in contrast to the population encompassed by the NICHD calculation, where approximately 70% received antenatal steroids. These demographic differences within the study population may have attenuated the influence of these variables on study outcomes.

In contrast, Boghossian et al.36 reported that the beneficial effects of antenatal steroids on mortality were statistically significant, primarily in infants born between 24 and 25 weeks of gestation. This observation suggests that the efficacy of antenatal steroids in reducing mortality may be contingent on GA.

Multiple births were associated with a notably elevated risk of mortality, particularly among extremely premature infants born at 26 weeks of gestation or earlier, as indicated in prior research37. In our study cohort, the mean GA of the infants was 28.7 weeks, which may explain why antenatal steroid administration and multiple births were not significant factors in our analysis.

ML is a subset of artificial intelligence that has been extensively used in healthcare38. According to a recent systematic review39 concerning the deployment of ML models for forecasting neonatal mortality, prominent ML algorithms include neural networks, logistic regression, and random forests. The reviewed articles collectively reported a mean AUC range spanning from 58.3 to 97.0%, with the average exceeding 70%. These findings underscore the ability of ML models to predict neonatal mortality. The AUC values of our ML-based predictive models were comparable to those of other ML-based models.

In the context of predicting IVH, it is noteworthy that all four variables incorporated into our predictor have previously demonstrated strong predictive capabilities for IVH, with particular emphasis on GA. Furthermore, the significance of endotracheal tube ventilation has been underscored in the literature. Additionally, when compared with previous models (AUC 0.67–0.85 for severe IVH prediction), our IVH predictor, with AUC values of 0.82 to 0.84, performed near the upper end of that range40.

Notably, despite external validation of the CRIB II, SNAPPE-II, and NICHD prediction models in diverse study populations, none of these models incorporated data from the Taiwanese population into their assessments. Predictive methodologies rely heavily on epidemiological population data to predict specific outcomes41. It is important to emphasize that the utility of a predictive model may be compromised if the data on which it was built have become outdated by the time it undergoes validation.

To the best of our knowledge, this is the first initiative to construct outcome-predictive models based on the most current and comprehensive datasets available in Taiwan. Moreover, our model can predict early mortality, severe IVH, and early poor outcomes in VLBW preterm infants immediately following their initial management in the delivery room. Remarkably, this predictive capability was achieved using only four factors, eliminating the need for time-consuming blood sampling. These inherent advantages may facilitate widespread application in the Taiwanese population.

Limitations

This study had several limitations. First, restrictions imposed by the available databases impeded the collection of precise clinical data such as blood pressure, oxygen demand, and comprehensive laboratory data encompassing hemograms, biochemical markers, and blood gas analyses. The inclusion of these clinical parameters could potentially enhance the predictive performance of the model26,39. Second, for privacy protection, the Taiwan Neonatal Network database recorded anonymous information, with gestational age rounded down and birth body weight recorded in ranges. These unavoidable limitations may affect the collinearity between variables. Third, while our prediction models demonstrated a high degree of accuracy in forecasting outcomes, they lack adaptability over time. As clinical practice evolves, these models may experience a decline in predictive accuracy. Fourth, variations in management and procedures across institutions may have introduced biases that were unavoidable in our study. Fifth, it is important to acknowledge that ML models may inadvertently manifest bias and discriminatory tendencies. Therefore, additional external validations across diverse population groups are required. Such validation should explore whether the model can be applied with equal efficacy to populations other than Taiwanese cohorts to ensure a broader range of applicability.

Conclusions

In this study, we developed an outcome predictor designed to predict early mortality, severe IVH, and early poor outcomes in preterm VLBW infants. This predictive model relies on four factors readily available immediately after birth: GA, BBW, 5-min Apgar score, and endotracheal tube ventilation during initial resuscitation. Our analysis yielded a formula that performs well, as evidenced by the high AUC values in both the internal validation cohort and the independent external validation population. Furthermore, it is well-calibrated, as evaluated by calibration plots and mean Brier scores. This prediction formula may prove to be a valuable tool and provide essential prognostic information for parents, aiding them in making informed decisions regarding the care and future of VLBW preterm infants. It may also offer healthcare providers valuable guidance and facilitate the formulation of effective decision-making strategies for the clinical management of vulnerable infants. However, further validation across diverse populations is required to ensure broader applicability. Moreover, the inclusion of additional clinical parameters may further improve model accuracy.