Early prediction of severe retinopathy of prematurity requiring laser treatment using physiological data

Background Early risk stratification for developing retinopathy of prematurity (ROP) is essential for tailoring screening strategies and preventing abnormal retinal development. This study aims to examine the ability of physiological data during the first postnatal month to distinguish preterm infants with and without ROP requiring laser treatment. Methods In this cohort study, preterm infants with a gestational age <32 weeks and/or birth weight <1500 g, who were screened for ROP were included. Differences in the physiological data between the laser and non-laser group were identified, and tree-based classification models were trained and independently tested to predict ROP requiring laser treatment. Results In total, 208 preterm infants were included in the analysis of whom 30 infants (14%) required laser treatment. Significant differences were identified in the level of hypoxia and hyperoxia, oxygen requirement, and skewness of heart rate. The best model had a balanced accuracy of 0.81 (0.72–0.87), a sensitivity of 0.73 (0.64–0.81), and a specificity of 0.88 (0.80–0.93) and included the SpO2/FiO2 ratio and baseline demographics (including gestational age and birth weight). Conclusions Routinely monitored physiological data from preterm infants in the first postnatal month are already predictive of later development of ROP requiring laser treatment, although validation is required in larger cohorts. Impact Routinely monitored physiological data from the first postnatal month are predictive of later development of ROP requiring laser treatment, although model performance was not significantly better than baseline characteristics (gestational age, birth weight, sex, multiple birth, prenatal glucocorticosteroids, route of delivery, and Apgar scores) alone. A balanced accuracy of 0.81 (0.72–0.87), a sensitivity of 0.73 (0.64–0.81), and a specificity of 0.88 (0.80–0.93) was achieved with a model including the SpO2/FiO2 ratio and baseline characteristics. Physiological data have potential to play a significant role for future ROP prediction and provide opportunities for early interventions to protect infants from abnormal retinal development.


INTRODUCTION
Retinopathy of prematurity (ROP) is a common disease in preterm infants, and accounts for 5-20% of childhood blindness in developed countries. 1 The number of infants at risk of ROP has increased worldwide due to advances in perinatal and neonatal care with improved neonatal survival rates. Timely screening and intervention of high-risk infants for ROP is essential to prevent development of severe ROP. 2 However, screening infants for ROP is thought to be painful, stressful and causes physiological instability. 3,4 Classifying infants according to risk of developing ROP could be useful to identify those where ROP progression should be monitored carefully through extra screening and conversely avoid unnecessary screening examinations in infants who are predicted to be low risk.
However, the American Academy of Ophthalmology recently assessed the accuracy of available prediction models for clinically significant ROP and reported that model optimization is needed before clinical application can be reached. 5 Many putative risk factors for ROP have been described, including gestational age and birth weight, perinatal and postnatal inflammation, pulmonary complications, and anemia. 6 Most guidelines use birth weight and gestational age to identify infants in need of ROP screening, which is highly sensitive, but results in screening of many infants who will not develop severe disease. 5 Fluctuations in oxygenation, resulting in oxidative stress, are implicated in ROP. 7,8 Several large randomized controlled trials reported a higher incidence of ROP when high oxygen saturation (SpO 2 ) targets were used, 9,10 although this association was not found in other trials. 11,12 Hyperoxia related factors such as the use of supplemental oxygen, oxygen concentration, duration, and prolonged mechanical ventilation are among the most frequently described risk factors for the development of severe and treatment-requiring ROP. 6 Most preterm infants in the neonatal intensive care unit are continuously monitored. Although the obtained physiological monitor data are used to detect trends and identify severe episodes of physiological instability, this snapshot method means that these data are often not used to their full potential and opportunities to improve clinical management may be missed. [13][14][15] Analyzing physiological data could potentially lead to the identification of meaningful patterns that would otherwise be difficult or even impossible to identify, and may enable the prediction of individual risk for ROP. Previously Sullivan et al. investigated whether measures of heart rate and SpO 2 in the first 7 days of life can improve predictions of mortality and morbidity in very low birth weight infants. 16 Whilst physiological data improved prediction of death, severe intraventricular hemorrhage and bronchopulmonary dysplasia, it did not improve prediction of severe ROP beyond that made from demographics. A more detailed analysis of physiological data, including measures of oxygen requirement, may shed light on the predictive ability of these data for ROP and whether a particular timeframe has greater utility.
The aim of this study was to investigate if physiological monitor data and oxygen requirement during the first 30 days after birththe period before the first ROP screening-differ between preterm infants with and without severe ROP requiring laser treatment, and to examine the ability of these data to predict infants who will develop severe ROP requiring laser treatment.

Study design and population
In this retrospective cohort study, physiological monitor data from the first postnatal month were analyzed and used to build classification models to distinguish infants with and without severe ROP requiring laser treatment. Preterm infants who were screened for ROP at the level III Neonatal Intensive Care Unit (NICU) of the Erasmus MC Sophia Children's Hospital between August 2016 and August 2020 were eligible for inclusion. Exclusion criteria were a gestational age >32 weeks, a birth weight >1500 g, and <80% availability of the physiological data in the first two weeks after birth. Before the exclusion criteria were applied, the infants were randomly 1:1 divided using R Software (version 4.1.1, R Foundation for Statistical Computing, Vienna, Austria) into a training dataset and a test dataset. The medical ethics committee of the Erasmus University Medical Center granted a waiver from approval for this study according to the Dutch Medical Research Involving Human Subjects Act (MEC-2018-1106).

Data acquisition
Demographic and clinical characteristics were collected from the electronic medical records. The DIGIROP-Birth was calculated using the website that the investigators of the score provided (https://www.digirop.com/ index.html). 17 Data on the fraction of inspired oxygen (FiO 2 ), SpO 2 , and heart rate were collected from bedside monitors. Further details are given in the Supplementary Methods.

ROP screening and outcome
The outcome of the study (i.e., the class for the machine learning) was whether infants did or did not develop severe ROP requiring laser coagulation. Laser coagulation was administered according to the ETROP criteria. 18 The incidental use of anti-VEGF (bevacizumab) was not taken into account in this study. All preterm infants who were born with a gestational age <32 weeks and/or birth weight <1500 g were, according to local protocol, screened for ROP. The first screening was scheduled from 5 weeks postnatally, but not before 31 weeks of postmenstrual age. Data on ROP screenings were collected from the reports of the experienced ophthalmologists (S.E.L., A.M.T.) who performed the screening.

Data processing
Features were derived from the data on the SpO 2 , heart rate, and FiO 2 and calculated using the R Software (version 4.1.1, R Foundation for Statistical Computing, Vienna, Austria) or MATLAB (R2021a). 19 The methods to calculate these features are presented in Supplemental Table 1. The skewness of the SpO 2 and heart rate were calculated with the "skewness" function from the R package "e1071" as a measure of symmetry in the distribution of the data, with a value of 0 indicating that the data has a symmetric (normal) distribution. 20 The mean SpO 2 , heart rate, FiO 2 , and SpO 2 /FiO 2 ratio were calculated per day. The skewness of the SpO 2 and heart rate, area under the 80% SpO 2 curve, area above the 95% SpO 2 curve, number of desaturations, number of bradycardia, and number of tachycardia were processed as total amount per day. Measurements marked as invalid by the bedside monitors were excluded from the analysis.

Statistical analysis
Baseline characteristics were analyzed using the Wilcoxon rank sum test, X 2 test, and Fisher's exact test. Time periods with differences in each physiological feature between the laser and the non-laser group in the training data were identified using a non-parametric cluster analysis described by Maris and Oostenveld, 21 see Supplementary Methods.
To explore the predictive ability of physiological data, random forest classification models were built on the training dataset to distinguish between the laser and the non-laser group, see Supplementary Methods. Model performance was measured using balanced accuracy, sensitivity, specificity and the Matthew's correlation coefficient (MCC) with leave-oneout cross-validation in the training set. 95% confidence intervals were calculated with the Agresti-Coull interval.
To compare the predictive ability of physiological features at different time periods and with demographic features, 10 different models were built in the training dataset. Model 1 was trained on all physiological data from the first 30 days after birth. Model 2 was trained on the physiological features that contained significant differences between the laser and non-laser group, based on the non-parametric cluster analysis. Model 3 was trained on physiological features in a selected time period where most significant clusters were identified. To compare different physiological features, univariate analyses were performed including the physiological features from the first 30 days after birth; model 4 presents the best performing model from these analyses. After this, model 5 was trained on demographic features around birth only to compare the predictive ability of physiological data with demographic features. Then, the demographic features were added to the best performing models including physiological data (model 6 and 7). Model 8 included clinical features only (data on administered treatments and comorbidities; of note some of these occurred after the first 30 postnatal days) and model 9 and 10 also included the features of the best performing physiological models. As most infants who required laser treatment were born before 28 weeks of gestation a subgroup analysis was performed including infants born ≤28 weeks of gestation by retraining the two best performing models to increase the comparability between the laser and non-laser group.
Two-sided mid-p value McNemar's tests 22 were used to compare models. Significance was set to a p value <0.05; however, statistical comparisons were calculated as a guide only (without correction for multiple comparisons) and should not be interpreted as clearly indicating a better model as this is context dependent. 23 The probability that an observation comes from a particular class was calculated by averaging over all trees in the ensemble using the MATLAB "predict" function. Differences in the sensitivity and specificity of the best performing models were calculated for different probability thresholds.
The two models with the highest performance scores were trained on the whole training set and externally validated by applying the models with the original probability threshold of 0.5 to the independent test set. These models were also trained on the training and test set combined to investigate if model performance improved by increasing sample size.

Study population
A total of 269 preterm infants with a gestational age below 32 weeks were screened at least once for ROP. Overall, 61 (23%) infants were excluded from the analyses-34 in the training dataset and 27 in the test dataset. Infants were excluded due to a birth weight >1500 g (N = 9) or <80% availability of monitor data in the first 2 weeks after birth (N = 52, infants were admitted elsewhere first (N = 50) or due to technical problems (N = 2)). Of the remaining 208 infants, 100 were included in the training set and 108 in the independent test set. Thirty (14%) infants required laser treatment, of whom 15 infants were in the training dataset and 15 infants in the test dataset. The incidence of laser treatment did not differ between the training and the test dataset (p = 1.00), although the infants in the training dataset were more often male (p < 0.01) and had a longer NICU admission time (p = 0.04) (Supplemental Table 2).
Demographic and clinical differences between the laser and non-laser group Infants who required laser treatment (assessed in the training dataset) had a lower gestational age (p < 0.01) and birth weight (p < 0.01) compared to infants who did not require laser treatment (Table 1). Infants who required laser treatment more often received treatment with inotropes (p = 0.01) and dexamethasone (p < 0.01) during NICU admission. The DIGIROP probability was higher in the infants with laser treatment compared to the infants without laser treatment (p < 0.01).
Physiological data significantly differ between infants requiring laser treatment and those who do not In the training dataset, time periods were identified with a significantly higher FiO 2 level, SpO 2 /FiO 2 ratio, time spent at an SpO 2 > 95%, and incidence of desaturations in the laser group compared to the non-laser group (Fig. 1). The area under the 80% SpO 2 , the percentage of time spent at an SpO 2 ≤ 80%, and the skewness of the heart rate had significant time periods with lower levels in the laser group. No significant time periods were identified in the mean SpO 2 , the skewness in the SpO 2 , the area above the 95% SpO 2 , the mean heart rate, and the incidence of bradycardia and tachycardia.
Physiological data are predictive of ROP treatment A tree-based classification model including all physiological data from the first 30 postnatal days had a balanced accuracy of 0.67 (0.57-0.76), a sensitivity of 0.60 (0.50-0.69), and a specificity of 0.74 (0.65-0.82) in the training set (Table 2, model 1, and Fig. 2). Model performance was higher, though not significantly different, when including the physiological data with significant time periods only, based on the non-parametric cluster analysis, from the first 30 days (Table 2, model 2). The highest t statistic levels in most features were found between day 5-15 and day 25-30 (Supplemental Fig. 1). The balanced accuracy and specificity of   Fig. 1 Graphs representing the physiological features for the group with and without laser treatment. Physiological features included the fraction of inspired oxygen (FiO 2 ), oxygen saturation (SpO 2 )/FiO 2 ratio, SpO 2 , skewness of the SpO 2 , area <80% SpO 2 curve, time percentage <80% SpO 2 , area >95% SpO 2 curve, time percentage >95% SpO 2 , incidence of desaturations, heart rate, skewness of the heart rate, and incidence of bradycardia and tachycardia in the first 30 postnatal days for the group with (red) and without (blue) laser treatment. Data are presented as median (IQR). The gray shaded areas mark time periods with significant differences (p < 0.05, non-parametric cluster analysis) between the laser and the non-laser groups. Table 2. Performance scores of the decision tree classification models, obtained in the training set.   Table 3).

Combining physiological with demographic and clinical features
The balanced accuracy, sensitivity, and specificity of the model including demographic features (gestational age, birth weight, sex, multiple birth, amount of prenatal glucocorticosteroids doses, route of delivery, and Apgar score at 1 and 5 min) only (Table 2, model 5) were lower, although not significantly, compared to models 3 and 4 with physiological data only. We next investigated whether combining demographic and physiological data improved prediction. Adding the demographics to the model including the incidence of desaturations, area under the 80% SpO 2 curve, percentage of time below 80% SpO 2 , percentage of time above 95% SpO 2 , FiO 2 , SpO 2 /FiO 2 ratio, and skewness heart rate from day 5 to day 15, did not significantly improve model performance ( As most infants who required laser treatment were born before 28 weeks' gestation models may be particularly applicable in the youngest infants. The subgroup analysis including infants with a gestational age ≤28 weeks only (N = 79) was performed by retraining model 6 and 7 as they were the highest performing models using data collected in the first 30 postnatal days before ROP screening is conducted (unlike the models including clinical data). Model

Improved identification of infants requiring laser treatment and model validation
To determine whether a model can identify all infants requiring laser treatment, the probability thresholds of the best performing models were varied (Fig. 2c) Models 6 and 7 were validated on the data of the independent test set (Table 3). To assess whether sample size had an effect on performance the models were also retrained on data from the training and test set combined (Table 3).

DISCUSSION
This study investigated whether routinely monitored physiological data from the first 30 postnatal days (a clinically relevant period before the first ROP screening) could be used for the early identification of preterm infants at risk of severe ROP requiring laser treatment. Differences were observed in the level of hypoxia and hyperoxia, the fractional inspired oxygen requirement, and skewness of the heart rate; and the physiological data, especially the SpO 2 /FiO 2 ratio, could be used to achieve good predictions of ROP requiring laser treatment. Prediction performance was increased, though not significantly, by including baseline demographics and clinical factors. Importantly, good predictions were achieved using physiological data from only day 5 to day 15 after birth, highlighting both the early predictive ability of physiological data and a time window during which infants may be especially vulnerable to oxidative stress and so could particularly benefit from careful management of oxygenation and ventilation.
Periods with increased use of supplemental oxygen and a higher level of hypoxia and hyperoxia were detected in the first  Fig. 2 Graphs representing the performance of the classification models. a Confusion matrices of models including all physiological data from day 1 to day 30 after birth (model 1), including physiological data with significant clusters from day 1 to day 30 (model 2), including physiological data with significant clusters from day 5 to day 15 (model 3), including the SpO 2 /FiO 2 ratio from day 1 to day 30 (model 4), including demographic features (model 5), including physiological data with significant clusters from day 5 to day 15 and demographic features (model 6), and including the SpO 2 /FiO 2 ratio from day 1 to day 30 and demographic features (model 7). The green areas include the true positive and negative rates (TPR; TNR), and the red areas include the false positive and negative rates (FPR; FNR). Class 0 = no laser treatment, class 1 = laser treatment. b Receiving operating characteristics (ROC) curves of models 1-7 illustrating the discriminative ability of the models. A point at the diagonal dotted line means that it is not better than a random guess, a point above the dotted line means better than a random guess, and a point below the dotted line means worse than a random guess. c The sensitivity and specificity of the two best performing models (models 6 and 7) with a varying probability threshold set between 0 and 1.
postnatal weeks in infants with ROP requiring laser treatment. The association between ROP and the use of supplemental oxygen, periods of hyperoxia and hypoxia and time outside the targeted saturation range have been described in multiple studies. 2,12,[24][25][26] The classic oxygen-induced retinopathy animal model describes the effect of hypoxia and hyperoxia in the pathophysiology of ROP with vaso-obliteration during hyperoxia exposure and vasoproliferation during relative hypoxia, although further research is still needed for detailed understanding of pathological changes. 27 The ability of physiological and demographic data to predict severe ROP has been previously investigated by Sullivan et al. 16 , although the pulse oximetry variables did not improve prediction in their analysis. Adding physiological data to the demographic data in our study did not significantly increase the performance of the models. However, 4 more patients out of 15 were correctly identified as high-risk patients for developing ROP compared with the model using demographic data only which is clinically important and the lack of statistical significance is likely limited by the small sample size. The use of supplemental oxygen, and consequently the SpO 2 /FiO 2 ratio, was higher in our study during the whole study period apart from the first few days after birth. The SpO 2 /FiO 2 ratio performed best in the univariate analyses and might be a sufficient reflection of the saturation profiles of patients at risk.
Recently, the DIGIROP-Birth and DIGIROP-Screen have been developed to predict ROP requiring treatment using birth demographics and ROP progression data. 17,28 The DIGIROP-Birth probability in our data was significantly higher in the laser group compared to the non-laser group. In initial model development we aimed to achieve high accuracy; however, clinically it is essential that no infants requiring treatment are missed. Varying the probability threshold of our model enabled us to achieve 100% sensitivity, with a specificity of 39%, similar to the results of DIGIROP-Screen at birth, and similar or higher compared to former developed models including the WINROP, G-ROP criteria, and a model based on Swiss data. [28][29][30][31] As physiological data from only the first 30 days after birth were included in our model, the number of infants who are unnecessarily screened could already be reduced by 40% prior to the first screening. We speculate that combining the physiological data with the ROP progression data could further improve prediction, especially when using photographic documentation and telemedicine. 32,33 A limitation of our study is the relatively low number of infants with ROP requiring laser treatment and the small overall data size. This likely explains the drop in performance of the classification models when applied to the independent test cohort. Intra-and inter-site differences should be assessed to further optimize the algorithm, using much larger sample sizes. 34 This highlights the need for sharing physiological and neonatal outcome data; developing a consensus of a standardized format for data sharing is important to facilitate these aims. 13 The NEDROP2 study and a large Swiss study reported that, respectively, 39 out of 1085 (3.6%) and 94 out of 7817 (1.2%) fully screened infants received treatment for developing ROP. 35,36 This is lower compared to the 14% of infants in our study who required laser treatment, probably because our study population is not fully comparable to these studies which included infants born at both high-care and regional centers and highlights the likely need to include infants from lower-care settings to create a model that will be accurate in all centers. Moreover, the relatively small sample size may have exacerbated problems caused by imbalance of the data. Training decision tree classifiers on an unbalanced dataset can lead to frequency bias and a poor accuracy, by learning from data observations that occur more frequently. To avoid bias and increase the model performance we used a random undersampling algorithm, and balanced accuracy was calculated to better reflect model performance.
In summary, routinely monitored physiological data is predictive of ROP requiring laser treatment in preterm infants, although it did not significantly improve prediction from baseline characteristics in this study. Continuous analysis of these data will improve our understanding of important time periods when infants are at risk for developing ROP to optimize preventive policies. Combining physiological data with existing risk scores and automatic photographic documentation needs further investigation with larger sample sizes but could potentially enhance objective assessment of ROP risk profiles in the future.

DATA AVAILABILITY
All data generated or analyzed during this study are included in this article and can be obtained from the corresponding author upon request. The model performance is reported as the balanced accuracy, sensitivity, specificity, and Matthew's correlations coefficient (MCC) with the 95% CI for the balanced accuracy, sensitivity, and specificity.