Applying a data-driven approach to quantify EEG maturational deviations in preterms with normal and abnormal neurodevelopmental outcomes

Premature babies are subjected to environmental stresses that can affect brain maturation and cause abnormal neurodevelopmental outcome later in life. Better understanding this link is crucial to developing a clinical tool for early outcome estimation. We defined maturational trajectories between the Electroencephalography (EEG)-derived ‘brain-age’ and postmenstrual age (the age since the last menstrual cycle of the mother) from longitudinal recordings during the baby’s stay in the Neonatal Intensive Care Unit. Data consisted of 224 recordings (65 patients) separated for normal and abnormal outcome at 9–24 months follow-up. Trajectory deviations were compared between outcome groups using the root mean squared error (RMSE) and maximum trajectory deviation (δmax). 113 features were extracted (per sleep state) to train a data-driven model that estimates brain-age, with the most prominent features identified as potential maturational and outcome-sensitive biomarkers. RMSE and δmax showed significant differences between outcome groups (cluster-based permutation test, p < 0.05). RMSE had a median (IQR) of 0.75 (0.60–1.35) weeks for normal outcome and 1.35 (1.15–1.55) for abnormal outcome, while δmax had a median of 0.90 (0.70–1.70) and 1.90 (1.20–2.90) weeks, respectively. Abnormal outcome trajectories were associated with clinically defined dysmature and disorganised EEG patterns, cementing the link between early maturational trajectories and neurodevelopmental outcome.


Methods
Datasets and outcome assessment. Data was recorded from the NICU of the University Hospitals, Leuven (Belgium) in accordance with the relevant guidelines and regulations and approved by the ethics committee of the University Hospitals, Leuven. All patients were recruited after informed consent from the parents. For each patient, between one and five EEGs were recorded at different PMAs over the period the patient remained in the NICU. All EEGs were measured using the standard 10-20 electrode system with nine channels (Fp1, Fp2, C3, C4, T3, T4, O1, O2 and reference Cz) at a sampling frequency of 250 Hz.
Two separate cohorts of patients were recruited, resulting in two separate EEG datasets referred to as 1  and 2  in this study, respectively. 1  consisted of 89 EEG recordings from 26 patients with PMA range 27.3-42.0 weeks (median recording duration (IQR): 4 h 23 m (3 h 53 m-5 h 46 m), median number of recordings per patient: 3 (3)(4)) and were recruited in the period February 2013 -September 2014. These were preselected based on a normal neurodevelopmental outcome at nine and 24 months of corrected age, clinically defined according to Pillay et al. 32 and Dereymaeker et al. 33 (2)(3)(4)) and had BSID-II scores at nine months of corrected age. 2  was recruited in the period July 2013 -December 2015. This dataset further included both normal and abnormal outcome patients organised into groups based on their severity of abnormality: 1. Normal outcome: 68 recordings (21 patients) selected using the same criteria as 1. 2. Mild abnormal outcome: 33 recordings (nine patients) with BSID-II scores (motor or mental) range 71-84. 3. Severe abnormal outcome: 20 recordings (six patients) with a BSID-II score (motor or mental) ≤70 and/or the presence of cerebral palsy. 4. Lost-to-follow-up/death: 14 recordings (three patients) who died before follow-up at nine months.
A combined clinical summary of both datasets, sorted by outcome group, is presented in Table 1. Appropriate statistical tests were performed to check if there were any unexpected significant differences in the measures/ treatments between outcome groups. eeG pre-processing. Each EEG recording was bandpass filtered in the range 0.5-40 Hz and visually labelled for periods of artefacts (such as electrode drop-offs and movement) by a trained clinician (AD).
Accounting for differences across sleep state was an important consideration 34 , with the previously developed CLASS algorithm for the robust detection of QS (see Dereymaeker et al. 33 ) applied to each recording to obtain QS and non-QS (comprising both AS and wakefulness) periods while excluding periods that were also labelled as artefact. To additionally compare the effects of separating for sleep state, clean, non-sleep state specific two-hour periods of EEG were also extracted (incorporating at least one full sleep cycle).
Within each QS, non-QS, and full-cycle period, 113 different features were extracted from each channel in 30 second epochs. The majority of this feature set have also been used and defined in a previous study on automated sleep staging in term-age babies (see Pillay et al. 32 ), and is summarised in Table 2. Further technical descriptions of these features are detailed in Appendix A.1.
Features were selected across a wide range of preterm, infant and adult EEG studies as part of a data-driven approach, which utilizes the data and subsequent analyses to identify important features rather than pre-selecting them based on prior clinical knowledge. Medians were taken across each channel (unless the feature utilised all channels) and epoch 30  Obtaining median values across the full recording ensured a robust approach for estimating the most representative features for the given sleep state and recording, reducing the risk of additional variability that may be brought about by any residual or missed short-duration artefacts (due to muscle twitch 33 etc.).
A summary of this pre-processing procedure is presented in Fig. 1.
Brain-age prediction and deriving trajectories. With a 24-month normal neurodevelopment at follow-up, 1 provided the most reliable estimate of a normal developing EEG and has been used extensively in previous studies on preterm brain maturation and sleep staging 23,33,[35][36][37][38] . Consequently, PMA was considered a reasonable estimate of EEG brain-age in this dataset 30 and made it ideal as the training set for the brain-age prediction model. While 2  also included normal outcome data, this was assessed at only nine months follow-up so this could not be combined with 1 for training. Consequently, all of 2 (normal and abnormal outcome data) was kept separate from 1 throughout the analyses.
To perform brain-age prediction, we used a Random Forest (RF) regression model.
The RF regression model. The RF is ideally suited to data where the number of features is greater than the number of recordings, does not require prior standardization, and intrinsically selects those features that are most predictive without the need for additional feature selection. Briefly, an RF consists of a series of decision trees. Each tree is 'grown' by selecting a feature from the full feature set that best splits the training data, with the process then subsequently repeated on the split data and so on, until a stopping criterion is achieved. Note that the same feature can be selected multiple times in a single tree. Finally, a new test data point is then passed through this trained tree to generate a regression estimate.
In an RF, instead of choosing from all features for each split in the tree, a subset of N f features are selected at random each time. A total of N trees trees are trained, and the average obtained across the trees for the final estimate. Optimal feature choices at each split are assessed by maximising the drop in the Sum of Squared Error (∆SSE) between the estimates and true values. The overall ∆SSE for each feature (across all splits and trees) additionally provides a measure of the feature importance in the model.

Wavelet decomposition
Mean absolute value and SD of amplitudes in D3, D4, D5, A5 bands, Ratio of absolute mean values in adjacent sub bands, Sum of squared coefficients in D3, D4, D5, A5 bands.

Fourier-transform
Full band power (delta-beta range), Mean frequency, Spectral roll-off SD, centroid SD, flux SD, Power spectral entropy, 90% spectral edge frequency, Relative band power in delta, theta, alpha, beta bands, Mean frequency in delta, theta, alpha, beta bands, EEG spectral beta/alpha ratio, www.nature.com/scientificreports www.nature.com/scientificreports/ Deriving brain-age trajectories. RF brain-age models were trained using the PMA as the training labels and combining the features from both QS and non-QS (226 features per recording). Each brain-age trajectory was then derived by plotting the patient's brain age recording estimates against their PMA.
Typically, the choice of the RF parameter N trees is not very sensitive if set sufficiently high so we selected a value of 1500 throughout. Choosing N f , however, affects the choice and variation of features used by the RF. A larger number of N f reduces the variability of the features selected across trees, while smaller N f results in a wider range of the feature set represented 39 . As the aim of identifying abnormal deviations linked to neurodevelopmental outcome may be strongly dependent on this choice, training and testing was repeated and presented for all values of N f across the full range (i.e. N f = 1 to N f = 226).
Various studies have presented the effects of RF training at extreme values of N f , with differing conclusions [39][40][41] . Therefore, before comparing between the normal and abnormal outcomes, we performed a preliminary study to first assess the stability of the RF across the full N f range using only the normal-outcome data at 24 months follow up (1). 1 was used for training and testing the models by Leave-One-Subject-Out Cross-Validation (LOSO CV). This would also importantly confirm if a strong correlation was observed between the brain-age estimates and PMA.
We then trained RF models on the entirety of the feature data for 1  before testing this model on 2 to derive the brain age trajectories for each outcome group (across the full N f range). Although outcome was not explicitly used during training and testing, the deviation of these trajectories was hypothesized to increase in patients with abnormal outcome. From the trained models, the most prominently selected features were then identified.
As a supplementary analysis to confirm the benefits of combining both QS and non-QS features in the model, we also repeated the above analysis on various alternative models that were trained using features from a) only QS, b) only non-QS, and c) the full-cycle periods, for comparison. trajectory summary metrics. With brain-age trajectories estimated for each patient, metrics were next derived to summarize them. When comparing trajectories between the normal and abnormal outcome groups, the precise nature of potential deviations was unknown. Therefore, three different metrics were chosen to capture different aspects of the deviation: 1. The trajectory's Root Mean-Squared Error (RMSE) is a standard measure of the overall regression accuracy between brain-age and PMA and was calculated here across the recordings for each patient trajectory. 2. The adjusted correlation coefficient (r) was calculated to provide a measure of the trajectory correlation strength with PMA 30 . A linear mixed-effects model was fitted to the output, with brain-age as the response variable, PMA as the predictor variable and a random intercept to account for the repeated recordings in each trajectory. The square root of the adjusted R 2 resulted in r. Extremities of 1 denotes a perfect linear correlation and 0 denotes no correlation. 3. The maximum deviation (δ max ) measures the maximum deflection of each trajectory from the normal trend. A regression line was first fitted to the training output from 1 to define the normal trend, and the absolute difference between each test recording and this fitted line was determined. The maximum absolute difference was then chosen to summarize each trajectory.

Statistical tests.
When comparing the values of the trajectory summary metrics between the normal and abnormal outcome groups across the full range of N f , the Wilcoxon rank-sum test was calculated at each value of N f to determine statistically significant differences at p < 0.05. To correct for multiple comparisons, a cluster-based non-parametric permutation test was then applied by grouping significant values at adjacent N f to form clusters and using the total rank-sum values within each cluster as the cluster mass test statistic (see Maris & Oostenveld (2007) 42 ). Statistically significant clusters were identified at p < 0.05 (one-tailed test).
To further assess if the features that proved prominent in the trained models could be considered as potential biomarkers for maturation and outcome, a univariate linear mixed-effects model was fitted for each identified www.nature.com/scientificreports www.nature.com/scientificreports/ feature, with fixed effects for the PMA and outcome and random intercepts to account for the repeated recordings. The p-values for the fixed effects were assessed for significance at p < 0.05.
Results preliminary assessment of model stability and brain-age correlation. Figure 2 shows the results of the preliminary study to assess RF stability and brain-age correlation with PMA (using combined QS and non-QS feature sets). The trajectory RMSEs and r across N f are shown in Fig. 2a,b, respectively, on the left-out data after performing LOSO on 1.
After an initial under-fitting up to N f = 40, results showed an accurate and consistently stable model up to the maximum value of N f = 226 with RMSE (median (IQR)) in the range 0.90 (0.55-1.45) -1.05 (0.60-1.50) weeks. Values of r were also consistently stable in the range N f = 40 to N f = 226, remaining in the range 0.92-0.94. This confirmed that the model exhibited a stable behaviour up to the highest values of N f , with values of r suggesting a strong correlation between the estimated brain-age and PMA throughout. Figure 2c shows a scatter plot of the estimated age against PMA at the maximum N f = 226, further illustrating this strong positive correlation. Linking brain-age trajectories with neurodevelopmental outcome. Figure 3 shows the overall test performance on 2 after training on the entirety of 1  . The trajectory summary metrics are shown for all values of N f , and compared between the normal and abnormal outcome groups. Figure 3a shows the results of RMSE. After initial underfitting of the models, results showed a clear, stable separability between the outcome groups in the range N f = 20 to N f = 226, with deviations in brain-age trajectories consistently greater in the abnormal outcome group. After performing the cluster-based non-parametric permutation test (at p < 0.05), a single significant cluster was found in the range N f = 78 to N f = 226 (p = 0.015). Within this cluster, the location of the largest median difference in RMSE between the outcome groups had a median of 0.75 (0.60-1.35) weeks for the normal outcome group and 1.35 (1.15-1.55) for the abnormal outcome group. We also compared the abnormal subgroups of mild and severe outcome at the same location (ignoring the death subgroup due to limited patients) to assess if there were further differences. Mild abnormal outcome had a lower median RMSE (1.19 (0.75-1.42)) than severe abnormal outcome (1.36 (1.27-2.48)) although this was not statistically significant. Figure 3b shows similar results for δ max , with a significant cluster observed in the range N f = 114 to N f = 226 (p = 0.015) between normal and abnormal outcome. The location of largest difference within this cluster had a median δ max of 0.90 (0.70-1.70) weeks for normal outcome and 1.90 (1.20-2.90) weeks for abnormal outcome. As with RMSE, a further comparison between the mild and severe abnormal subgroups revealed no statistically significant difference, though the mild abnormal subgroup had a lower median δ max of 1.28 (1.10-2.25) than the severe abnormal subgroup of 1.80 (1.33-3.70). Figure 3c compares the values of r between the outcome groups. No cluster analysis was applied here as single values of r were compared over N f . Values of r remained very high at >0.9 across N f , with the difference between the outcome groups dropping to 0.01. These minor differences at higher N f (and the counter-intuitively higher correlation in abnormal outcome than normal outcome) suggested that these were only due to small discrepancies between the data distribution in the outcome groups rather than a meaningful difference.
Achieving similarly high results for r between groups, while still observing large differences between RMSE and δ max , suggested that the trajectories could be largely driven by single-recording deviations. To confirm this, we plotted the trajectories in Fig. 4 (separated for outcome) at N f = 226 as all significant clusters for the trajectory metrics incorporated the maximum value of N f . www.nature.com/scientificreports www.nature.com/scientificreports/ From Fig. 4, comparatively more deviant trajectories were emphasized by fitting a regression on the normal outcome trajectories (left) and calculating the 95% prediction interval as a simple threshold. Any patient trajectory where at least one recording exceeded this threshold was then highlighted in red. In general, this illustrated that while a few trajectories showed a systematic deviation, the majority were largely influenced by the effects of single recordings. Furthermore, deviant trajectories exhibited both accelerated (brain-age > PMA) and decelerated/delayed (brain-age <PMA) changes.
Finally, repeating the above analysis on comparatively simpler models that utilised QS only, non-QS only and the full-cycle periods, confirmed that combining both QS and non-QS was indeed the optimal choice. This additional study is presented in Appendix A.2.

Biomarkers of maturation & neurodevelopmental outcome. To identify which individual features
were potential biomarkers of maturation and outcome, Fig. 5 shows the top six features that were prominently selected by the RF models. The feature ranking was decided from the ∆SSE values at the maximum N f = 226, and ∆SSE then plotted for the full N f range. The shaded regions in each plot denote values of N f where the feature rank matched the rank at N f = 226. The p-values after fitting univariate regression models to each feature for PMA and outcome is also shown, along with a '+ve' or '−ve' to denote the direction of the feature's trend with PMA.
The Line Length Burst% (in non-QS) showed by far the largest increase with N f indicating that it was selected with increasing frequency by the model as a discriminating feature for predicting brain-age. It was the only feature that was consistently ranked as first across N f and by showing a high individual significance for both PMA and outcome, was the driving feature behind the better separation of normal and abnormal outcome trajectories at high N f . Line Length Burst% measures the total percentage duration of burst EEG activity (which is strongly www.nature.com/scientificreports www.nature.com/scientificreports/ characteristic of preterm EEG) that occurs within a 30 second window, using the line length burst detection algorithm developed by Koolen et al. 23 , and showed a positive trend with PMA.
Mean absolute ratio of IM6/IMF5 also showed some prominence and the negative trend with PMA described the shift towards higher frequencies and less slow-wave activity in the EEG, while the positive trend in the Hilbert median envelope amplitude in theta described the increase in the amplitudes of the EEG within the theta (3-8 Hz) band (see Appendix A.1 for more detailed definitions). Skewness measures in QS and non-QS (reflecting the shift of the EEG amplitude distribution away from the normal distribution) were the next common features but were no longer consistently ranked. These findings show the dominance of the first three features, but most notably Line Length Burst% (in non-QS), and its efficacy as a maturational and outcome-sensitive biomarker.
To better understand the behaviour of Line Length Burst% (non-QS) in relation to the EEG itself, Fig. 6 provides examples of recordings belonging to patients that presented normal and abnormal trajectories, focussing on non-QS EEG.
As the EEG matured normally (Fig. 6a), the significantly positive correlation of Line Length Burst% (in non-QS) and PMA (from Fig. 5) reflected the transition from clear discontinuous burst to inter-burst interval (periods of suppressed EEG between successive bursts) EEG behaviour of preterm babies, to a more continuous, less regular EEG activity towards term-age.
Resulting observed trajectory delays (as illustrated in Fig. 4) largely driven by Line Length Burst% (in non-QS) reflected changes in specific morphologies in the EEG, resulting in a delay for what was expected at the given age and a dysmature-labelled EEG (as shown in Fig. 6b).
Observed accelerated trajectories in this study were largely recording-specific and further inspection of the EEG recordings showed them to be predominantly disorganised (as opposed to dysmature), with one such recording shown in Fig. 6c. This was due to the EEG morphology exhibiting unusual deformed activity (e.g. 'cogwheel-shaped' brushes with high frequency components and repetitive rhythms), without signs of acute depressions (e.g. longer inter-burst intervals or periods of discontinuity appropriate for age) which is linked with abnormal neurodevelopmental outcome 43 . There were also abundant abnormal sharp transients, asymmetry, burst asynchrony and possible inter-ictal rhythmic activity 19,44,45 . This particular patient developed a severe brain lesion which was reflected through this abnormal evolution of their EEG behaviour.

Discussion
This study utilises a data-driven approach to link early maturational deviations to neurodevelopmental outcome. It is the first method that successfully incorporates objective features as part of a brain-age model to achieve this task, over a wide PMA range, identifying key biomarkers of maturation and outcome. The approach also considered differences across sleep state and avoids the need for a ground truth at the time of recording.
Increasing N f causes a rise in RMSE and δ max in those patients with abnormal neurodevelopmental outcome, while maintaining a consistently stable and excellent performance for normal outcome patients. Moreover, the majority of maturational deviations were driven by single-recording deviations in the trajectory rather than an overall change in trajectory correlation. Deviations also proved to exhibit both a decelerated and accelerated rate of maturation.
Trends in the behaviour with N f is linked to the fundamental processes of the RF and further helps identify key biomarkers driving these changes. At low N f values, few features are chosen randomly for selection for each split in a decision tree and this increases the range of features represented across the entire forest. The lack of significant differences between abnormal and normal outcomes at low N f are due to the reduced specificity to outcome-sensitive features highlighting that, while many features show strong trends with PMA, not all can quantify the differences across outcome. This is one of the key limitations of existing brain-age models, which are validated only on normal outcome patients. www.nature.com/scientificreports www.nature.com/scientificreports/ At high values of N f , the majority of the feature set is presented at each tree split increasing the chance of selecting the same feature multiple times. If there are features that are much more strongly correlated with PMA, this concentrates the model on this smaller subset of features. Here, we also observe the intriguing effect of improving the separation of outcome, as those features most strongly correlated to PMA also exhibit very strong outcome sensitivity.
More specifically, Line Length Burst% (in non-QS) provides the strongest correlations and is by far the most prevalent feature used in the brain-age models. It also proves to perform best within non-QS periods instead of the corresponding values for QS, which was previously used to demonstrate the efficacy of this feature 23 . This discovery motivates a more in-depth look into non-QS as a potentially more informative sleep state for assessing neurodevelopmental outcome.
By term age, subtle changes in the EEG background are apparent in QS by visual inspection (e.g. changes in inter-burst interval length and burst activity), and not solely due to the naturally reduced level of artefacts in this sleep state. However, the more severe the underlying functional disturbance, the more the EEG patterns will have changed over the whole sleep cycle (e.g. changes in asynchrony, amplitude and continuity) and hence also in non-QS 46 . During the preterm period, the well-described discontinuity (burst to inter-burst intervals) related to QS spans most of the preterm postnatal period before developing into continuous activity by term-age. On the other hand, active sleep (part of non-QS) shows an additional preceding period of semi-continuous activity. These additional transitionary phases during non-QS development may explain the better age discrimination of the Line Length Burst% in non-QS across PMA compared to other features in QS. It may also explain why a non-QS period of recording better highlights a deviating trajectory (see Fig. 6), as the corresponding discontinuity during QS is not yet sufficiently discriminative.
Linking delayed trajectory behaviour with the EEG identified a clear dysmaturity, while accelerated trajectories were most strongly linked with disorganised patterns. In the latter, we assume that such morphological changes to the EEG are reflected in the outcome-sensitive features and, consequently, it is more likely that the disorganised nature of the EEG activity results in key outcome-sensitive features exhibiting values not fitting with the age-specific values, but to a more 'continuous' EEG. This results in an over-estimation of the brain-age and accelerated trajectory that still correlates with abnormal outcome.
With these observations in mind, we note that delayed trajectories could also be caused by abnormal disorganised patterns (as well as delayed dysmaturity), such as due to an increased burst-suppression activity 47 . Similarly, accelerated trajectories may also correspond to an accelerated dysmature EEG (as well as disorganised activity).  confirms the presence of accelerated dysmaturity in preterm development that still results in an abnormal outcome due to the increased speed of maturation 18 . www.nature.com/scientificreports www.nature.com/scientificreports/ The presence of deviating trajectories could warrant extra EEG monitoring and/or brain imaging, as it may reflect altered developmentally regulated neuronal pathways 48,49 , which may be mediated by the effect of interventions/environmental stresses (e.g. deterioration after a sepsis episode, exposure to pain and extra-uterine environment, brain injury, episodes of cardiorespiratory instability etc.). Since babies had to be (respiratory and hemodynamically) stable during their EEG recordings, there was no intention to record EEG during acute adverse conditions. We did see changes in consecutive cranial brain ultrasound (see Table 1) that were more significant in the severe/death outcome group. We noticed that the EEG deviations frequently preceded persistent ultrasound brain abnormalities (if any) and therefore assume that the observed EEG changes reflect chronic stage abnormalities rather than acute stage abnormalities.
One of the limitations of this study remains that, while we introduce a serial analysis of EEG recordings per patient, there are only a limited number of recordings which are all measured at irregular age intervals. This is largely to maintain the delicate balance between accurate diagnostics and avoiding unnecessary disturbance to the vulnerable baby. It does, however, make it difficult to obtain a higher resolution timing of abnormal deviations. This said, Le Bihannic et al. (2012) and Périvier et al. (2016) showed in line with our results that transient moderate-to-severe EEG abnormalities (present on at least one EEG during consecutive measurements) based on visual inspection, were independent predictors of neurodevelopmental outcome 2,27 . Hayashi-Kurahashi et al. Figure 6. 20 second non-QS EEG epochs extracted from recordings that formed a patient's brain-age trajectory. On each EEG panel from top to bottom: First four EEG channels reflect the right hemisphere from anterior to posterior. Next four channels reflect left hemisphere from anterior to posterior. Channels 9-10 are right midline and channels 11-12 are left midline. The PMA and GA of the baby's recordings (in weeks) and the overall clinical assessment of the EEG morphology (normal, dysmature or disorganised) are also given. (a) EEGs from a patient born at 30 4/7 weeks GA with a normal estimated brain-age trajectory and normal (clinically defined) EEG characteristics. (b) A patient born at 25 4/7 weeks GA with a delayed estimated brain-age trajectory and abnormal, dysmature EEG. (c) A patient born at 25 5/7. The first recording at 27 6/7 weeks PMA had an accelerated estimated brain-age trajectory and abnormal, disorganised EEG. The second recording at 30 4/7 weeks PMA had both disorganized and dysmature behaviour.

Scientific RepoRtS |
(2020) 10:7288 | https://doi.org/10.1038/s41598-020-64211-0 www.nature.com/scientificreports www.nature.com/scientificreports/ (2012) conducted serial EEG recordings during three time periods in preterm infants <34 weeks. EEG abnormality grades (based on visual assessment) on at least one EEG during the first month of life, was significantly correlated with neurodevelopmental outcome at 12 to 18 months 50 . Based on these study results, we hypothesize that one abnormal EEG recording may already reflect impaired neuronal pathways, and this becomes more prominent when performing a serial recording study 50,51 .
It is difficult to quantitatively compare our data with other maturational studies since these did not include abnormal neurodevelopmental outcome in the same manner. This said, Tokariev et al. (2019) did find, based on quantitative analysis of one EEG at term equivalent age in extremely preterm infants (around 26 weeks GA), a difference in sleep-state-related connectivity in the posterior networks. This was significantly correlated with visual performance and showed a non-significant negative correlation with social-emotional performance at 24 months follow-up 52 .
Neurodevelopmental assessment at nine months of corrected age in one of the cohorts (i.e. 2  ) is a second limitation of this study and may underestimate more subtle motor or behavioural disturbances which may only become apparent at later follow-up. Furthermore, separating into outcome groups results in reduced data sizes, inhibiting robust pre-selection of the outcome-sensitive features before performing brain-age prediction. These limitations make it challenging to derive a precise and reliable threshold for identifying deviating trajectories, preventing the model from performing a fully automated classification of outcome at this point in time.
27 female and 38 male preterm infants were analysed in total, with significantly more preterm males in the abnormal outcome group. Male disadvantage in the preterm population has been a consistent finding over the past decades, including higher rates of preterm birth, increased mortality and (long-term) morbidity 53,54 . Clarifying the underlying mechanism of gender-unequal consequences of prematurity is an important topic as it might become a target for precision medicine. We have previously tested on 1  (normal outcome) the effect of gender on the maturation of some of the features defined in this study (including Line Length Burst%) and did not find a significant gender effect in this group with normal neurodevelopmental outcome 55 . Griesmaier et al. (2014) did find sex-related differences in the maturation of amplitude-integrated EEG signals during the first four weeks, however, these results were not correlated with outcome 56 . In future studies with larger data sizes, it would be of major interest to delineate whether sex-related differences in maturational trajectories can be observed and whether these altered neurophysiological mechanisms can be related to long-term outcomes.
We also acknowledge the important effect of commonly used drugs (such as fentanyl, morphine, caffeine and phenobarbital) in the NICU on early brain activity 12 . The complex interplay between optimal comfort/ pain management and exposure to pain-relieving sedative drugs (all with potential adverse long-term consequences 57,58 ) challenges current neonatal practice for compassionate treatment. Assessing the effect of both sedatives and pain on functional brain maturation is a complex but intriguing question. Since these premature infants were mainly recorded after the first week of life, only one baby received sedative drugs (Fentanyl) during the recording. Fentanyl can suppress the EEG and lengthen the inter-burst intervals which could theoretically influence Line Length Burst% and other features. Malk et al. (2014) showed how EEG bursts were also influenced by drugs (changes in duration and enhanced oscillation at higher frequencies) and thus presented altered EEG characteristics 12 .
All babies received caffeine until 35-36 weeks PMA. Caffeine has been reported to influence EEG continuity, so the effect is considered to be applicable to all. In-depth drug analysis is beyond the scope of this paper but is certainly an important direction to explore if we wish to achieve future bedside implementation of these monitoring techniques.
In summary, future work requires a more regular (ideally weekly) recording of EEG and a larger data size to train models specifically on outcome-sensitive features, additionally including corrections for gender and drug use. Increasing data size and resolution may produce more detailed trajectories that may better show subtler changes in behaviour due to altered neuronal pathways. It could also open the possibility of directly and reliably classifying the deviations as normal or abnormal, by introducing robust thresholds based on the magnitude of the trajectory deviations and/or latency. From this, the relationship between outcome estimation and number of EEGs recorded could also be investigated. Efforts are being made within the neonatal research community to better share and expand such databases.
This important study cements the link between brain-age trajectories and neurodevelopmental outcome. Furthermore, by presenting an approach that identifies the important sleep states, combines an objective set of data-driven features (of which key outcome-sensitive maturational biomarkers are identified) and avoids the need for visual grading at the recording level, it is an important next step towards the automated early identification of abnormal neurodevelopmental outcome.

Appendix
A.1 Feature extraction. As listed in Table 2, the data-driven feature set included amplitude-based features such as skewness (a measure of similarity of the amplitude distribution compared to a normal distribution, where negative values denote the a left-sided skew, and positive values a right-sided skew), as well as measures of signal complexity such as Multiscale Entropy (MSE) 35 and fractal dimension.
The Line Length method of Koolen et al. 23 measures changes in the amplitude and frequency of the signal across successive one-second windows, using a threshold to then classify high amplitude, high frequency periods as EEG bursts. The percentage of the total duration that are burst periods within a 30 second window is then defined as the Line Length Burst%.
The rEEG 24 similarly summarizes discontinuity by identifying the inter-burst intervals in the signal after converting the multichannel EEG to an approximation of amplitude-integrated EEG.
Features were also derived from EEG decompositions. Classical delta (0.5-3 Hz), theta (3-8 Hz), alpha (8-12 Hz) and beta (12-30 Hz) band powers were determined from the Fourier transform of the signal. Related decompositions based on the Discrete Wavelet Transform (DWT) 59 and Empirical Mode Decomposition (EMD) 60 were also calculated to better capture the signal's non-stationarity. The DWT decomposed the signal into Detail and Approximation bands up to the fifth level (denoted D3 to D5 and A5, respectively) 59 while the EMD produced six bands known as Intrinsic Mode Functions (denoted by the set IMF1-IMF6) 60 . IMF1 contains the highest frequency information and IMF6 the lowest frequencies. Variation coefficient of each IMF provided a mean-standardized measure of the signal variance and fluctuation index measured the intensity of the signal change.
Further applying the Hilbert transform to the IMFs and Fourier bands obtained band-specific measures of signal envelope and instantaneous frequency 61,62 . A.2 Comparison of results across sleep states. When comparing the models for QS, non-QS, combining QS and non-QS, and the full-cycle periods, all models were analysed in the same manner as for the combined QS and non-QS model presented in the main results. The stable range for each model was observed to be always from N f = 20 to the maximum N f value (113 for QS only, non-QS only and full-cycle models, and 226 when combining QS and non-QS). Results are presented in Table 3. Asterisks in the column headings denote that the values were part of a statistically significant cluster (p < 0.05) across N f . The maximum median differences between normal and abnormal groups within the significant cluster (or across the stable range if there was no significance) were calculated for each trajectory summary metric, as well as the median values at those locations for both normal and abnormal outcomes. As a further comparison, Cliff 's delta was also determined at these locations as a measure of effect size, quantifying the degree of overlap between the normal and abnormal groups 63 (values > 0.28 denote a medium to large effect size). Comparisons using r were not included here as it proved uninformative based on the results in the main body of text.
From the results in Table 3, all models had a statistically significant cluster, apart from the full-cycle periods, showing the importance of separating for sleep state. While combined QS & non-QS and non-QS only models showed better values for the median difference and Cliff 's delta (compared to QS only), QS & non-QS had a better RMSE and δ max in the normal outcome group than non-QS only. Overall, this showed that the more complex model which combined both QS and non-QS was indeed the most suitable choice.

Data availability
The research materials and data supporting this publication can be accessed, upon reasonable request, by contacting the corresponding author.  Table 3. Comparison of the trajectory summary metrics for models using features from only QS, only non-QS, QS and non-QS, and non-specific two-hour periods of EEG containing at least one full sleep cycle. Values for the normal and abnormal outcome trajectories are both given, with medians and the interquartile ranges (in brackets) provided (where appropriate). Differences between these values are also given, along with the Cliff 's delta. The results are presented at values of N f within each significant cluster where the differences were maximal. Asterisks in the column headings denote that the results were part of a statistically significant cluster.