Introduction

Follow-up assessments of extremely preterm (EP) infants are difficult to perform and interpret for multiple reasons. As for other assessments [1], the expectations or biases of unblinded examiners may have an important effect on the findings. This problem can be minimized by including a concurrently assessed reference group of term infants and assuring that the examiners are masked to gestational age, perinatal complications, and findings of any prior follow-up assessments [2,3,4,5].

Another issue is the appropriate comparison group of term infants. One approach is to compare EP infants to term infants matched for maternal age, ethnicity, income, education, marital status, insurance status, etc. This approach has been used in efforts to identify the independent effects of perinatal factors on outcomes. However, matching is logistically difficult, quite likely to be incomplete, and precludes assessment of how adverse socioeconomic factors and their interactions with biological or medical factors compromise the outcomes of EP infants. A better understanding of all these factors is needed to develop improved methods to reduce rates of impairment among EP born children. For these reasons, a comparison to healthy term infants may be preferred in deciding which EP infants should be considered to have a developmental impairment based on the child’s capabilities irrespective of the extent to which these impairments result from medical, socioeconomic, or other factors [6].

Additional issues include the choice of the developmental test and whether its norms are fully appropriate in designating which EP infants should be considered impaired [2, 7,8,9,10,11]. While the Bayley Scales of Infant and Toddler Development (Bayley III) have been widely used, multiple investigators have reported that the impairment rates are likely to be underestimated in applying its norms [2, 7,8,9,10,11,12,13,14]. Moreover, it is difficult to assure that the examiners in all centers perform the Bayley III assessments in the same way that the assessments were performed when the Bayley III was normed.

For all these reasons, the NICHD Neonatal Research Network (NRN) undertook the study described below to assess EP and a concurrent sample of healthy term reference (TR) infants examined by the same blinded and certified examiners in the same centers at two years corrected age. We hypothesized that the proportion of EP infants with developmental impairment based on standard deviations (SDs) from the mean for the TR sample would be higher than that based on Bayley III norms. If so, we hoped to identify threshold values for Bayley III scores based on our term reference infants that would be more appropriate than those based on the Bayley III norms for categorizing EP infants as impaired in NRN centers.

Methods

The study was conducted in 15 NRN centers between January 2017 and March 2020. The addition of the TR infants to the follow-up assessments was approved by each center’s Institutional Review Board (IRB). Consent was obtained in accordance with each study site’s IRB requirements.

Design

To augment the reliability of the assessments and reduce the likelihood that examiner expectations would affect the scores, the TR and EP infants in the study were assessed concurrently by examiners not informed of their gestational age at birth or their prior clinical or developmental findings.

Eligibility and sampling

The eligible EP infants were inborn at NICHD NRN centers and <27 weeks gestation by best obstetric estimate. Infants with at least one Bayley III composite score at the 24-month follow-up visit were included in the analysis.

Eligible TR infants met the following criteria assessed using the medical record: singleton birth at 39 0/7-40 6/7 weeks gestation by best obstetric estimate; birth weight appropriate for gestational age; no resuscitation at birth; absence of congenital anomalies or other abnormalities on physical examination; benign neonatal course with all care given in a low risk nursery and no neonatal problem delaying discharge home; and parent(s) willing and able to come into the clinic. Exclusion criteria included major central nervous system disorder (e.g., cerebral palsy, deafness, blindness or the effects of major insults identified by parent report or medical records [e.g. meningitis or traumatic brain injury before two years]), child protective services custody, parental incarceration, and parental psychosis.

Our goal was to assess one healthy term infant for every fifth EP survivor at 22–26 months corrected for prematurity in the same center to evaluate 180 total TR infants in a one-year study. (See Sample Size and Power.) Recruitment of each healthy TR infant began shortly (e.g. 1–2 months) before the corresponding EP infant’s scheduled assessment. If the EP infant was lost to follow-up, the TR infant was still to be assessed.

Center coordinators used medical records to identify and attempt to recruit the first healthy term infant born on or after the expected due date of the index EP infant. The potential value of developmental testing was emphasized in recruiting. The methods of contact (letter, text, phone call) and the incentives used to promote participation (e.g. up to $100 plus parking or $50 plus cab fare) varied as allowed by each site’s IRB. When a parent or guardian declined participation or missed two scheduled clinic visits, the coordinator contacted the parent or guardian of the next eligible infant, in order of delivery date and time, until one agreed to have the child assessed within the testing window.

To assess how well the TR sample represented all term births in the NRN centers, we requested demographic information for all term infants born in NRN hospitals during the study period. To further characterize the TR sample, we qualitatively compared it with the available data for the Bayley III normative sample.

Measurement and comparisons

Certified Bayley III examiners, trained to reliability and re-evaluated annually, provided assessments at each NRN center [15, 16]. The Bayley III was administered to Spanish-speaking children by either a Spanish-speaking evaluator or an English-speaking evaluator with a translator. Means and SDs of Bayley III scores among the TR infants were used to determine new thresholds for each of the Bayley III composites (cognitive, language, and motor) to indicate three levels of impairment: (1) Normal/mild, a score no more than 1 SD below the mean; (2) Moderate, a score between 1 and 2 SDs below the mean; and (3) Severe, a score more than 2 SDs below the mean. Application of these new cut points to the EP infants determined the proportion falling into each category [15, 16].
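The threshold rule described above can be sketched in a few lines of Python. This is only an illustration, not the study's analysis code: the names `Thresholds`, `reference_thresholds`, and `classify` are hypothetical, the treatment of a score falling exactly on the 2 SD boundary (counted here as moderate) is an assumption, and the TR inputs are the rounded cognitive mean and SD reported in the Results, so the cut points differ slightly from the two-decimal thresholds given there.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Cut points derived from a reference mean and SD (hypothetical helper)."""
    mild_cutoff: float    # mean - 1 SD: scores at or above are normal/mild
    severe_cutoff: float  # mean - 2 SD: scores below are severe


def reference_thresholds(mean: float, sd: float) -> Thresholds:
    """Derive impairment cut points 1 and 2 SDs below the reference mean."""
    return Thresholds(mild_cutoff=mean - sd, severe_cutoff=mean - 2 * sd)


def classify(score: float, t: Thresholds) -> str:
    """Assign one of the three impairment levels to a composite score."""
    if score >= t.mild_cutoff:
        return "normal/mild"
    # Assumption: a score exactly 2 SDs below the mean counts as moderate.
    if score >= t.severe_cutoff:
        return "moderate"
    return "severe"


# Bayley III manual norms: mean 100, SD 15 for every composite.
norm = reference_thresholds(100.0, 15.0)
# Rounded TR cognitive mean and SD from the Results (illustrative only).
tr_cog = reference_thresholds(97.5, 11.2)

for score in (90, 85, 72, 65):
    print(score, classify(score, norm), classify(score, tr_cog))
```

With these inputs, a cognitive score of 72 is classed as moderate under the manual norms but as severe under the TR-based cut points, which is the kind of reclassification the comparison is designed to quantify.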

Statistical analysis

Generalized linear multilevel models compared the proportion of infants in each category using the norm-based vs. TR thresholds, accounting for clustering of infants within centers. Levels of impairment were analyzed using an ordinal logistic model and dichotomous variables (moderate/severe vs. normal/mild) were analyzed using a binomial model. Analyses were conducted using SAS version 9.3.

Sample size and power

Assuming a 5% rate of impairment based on Bayley III manual norms, a 15% rate of impairment based on thresholds derived from the reference group [2], and an intraclass correlation of 0.05 due to center membership, a sample of N = 180 provided 91% power to detect a 10% absolute difference in impairments > 2 SDs below the mean with alpha = 0.05. Given prior annual rates of enrollment for EP infants, we anticipated that recruiting TR to EP infants in a 1:5 ratio would result in N = 180 within one year.
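To illustrate why the intraclass correlation matters for this calculation, the following sketch computes the standard design effect for clustered data and the resulting effective sample size. It does not reproduce the 91% power figure, since the exact test used in the power calculation is not stated. The function names are hypothetical, and the inputs (15 centers of equal size sharing N = 180) are simplifying assumptions.

```python
def design_effect(cluster_size: float, icc: float) -> float:
    """Standard design effect for equal-sized clusters: 1 + (m - 1) * ICC."""
    return 1.0 + (cluster_size - 1.0) * icc


def effective_sample_size(n: int, n_clusters: int, icc: float) -> float:
    """Sample size after discounting for within-cluster correlation."""
    # Assumption: participants are spread evenly across clusters.
    mean_cluster_size = n / n_clusters
    return n / design_effect(mean_cluster_size, icc)


# Assumed inputs: N = 180 spread evenly over 15 centers, ICC = 0.05.
deff = design_effect(180 / 15, 0.05)
n_eff = effective_sample_size(180, 15, 0.05)
print(f"design effect = {deff:.2f}, effective N = {n_eff:.0f}")
```

Under these assumptions, clustering within centers reduces the information in 180 infants to that of roughly 116 independent observations, which is why the intraclass correlation enters the power calculation.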

Results

A total of 1452 EP infants (86% of survivors at 2 years) were evaluated during the time required to accrue and successfully assess 183 TR infants (Fig. 1). This accrual of TR infants took longer than expected (38 versus 24 months, with an accrual ratio of 1:8 versus 1:5). Based on querying site coordinators, the reasons for slower than expected accrual varied among centers but included difficulty accessing the medical records in some hospitals that were not owned by the university, problems contacting the parents using letters (as required by some IRBs), variable incentives for participation allowed by the IRBs, parental inconvenience, transportation problems, and, in one center, contract negotiations between the university and an affiliated hospital.

Fig. 1: Sampling diagram. Sample selection for term-reference (a) and pre-term (b) infants.

Demographic comparison of the TR sample and EP infants

Mothers of TR infants were more often White, married and more highly educated. Mothers of EP infants were more often African-American (Table 1).

Table 1 Sociodemographic and medical characteristics of term and preterm infants.

Demographic comparison of the TR sample, term births at NRN hospitals and the Bayley III normative population

The information that NRN hospitals provided about their term births was incomplete and varied between hospitals, resulting in uncertainty in how the TR sample differed from all children born at term in these centers. Since the TR group included only healthy infants, modest differences would be expected. However, in the 11 centers where the information was provided (Table 2), there were 30% fewer TR children with Medicaid/public insurance and 24% more with private insurance.

Table 2 Demographic characteristics for all term births at participating sites.

The data for our TR sample were compared with the data provided for the normative Bayley III sample at age two years gathered by the test company (i.e. n = 100 children at 24 months). The TR sample differed from the Bayley normative sample with respect to the percentage who were Hispanic (20% vs. 16%) or African-American (28% vs. 14%) and the percentage of parents with ≥ 16 years of education (51% vs. 29%). Surprisingly, the Bayley III Technical Manual did not characterize the normative sample in terms of marital or insurance status, did not report the proportion of children approached for inclusion who did not participate, and did not indicate any measures to blind the evaluators to any unfavorable social, medical, or biologic factors that might influence scores [17].

Bayley III scores for TR and EP infants

The mean composite cognitive, motor, and language scores were 83.9, 83.3, and 80.2, respectively, for the EP infants and 97.5, 98.2, and 97.9, respectively, for the TR group (Table 3). As expected given the deliberate inclusion of children with developmental problems in the Bayley normative sample, the SDs were smaller for our healthy TR sample for the Cognitive Composite (11.2, 95% CI 10.2–12.5) and the Motor Composite (10.9, 95% CI 9.9–12.2) than for the Bayley normative sample (SD = 15 for all composites). The SD for the Language Composite in the TR sample was 16.0 (95% CI 14.5–17.9), similar to the manual-based SD (15). The composite score SDs for the EP infants ranged from 15.1 to 17.4.
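As a rough check on how confidence intervals of this kind can arise, the sketch below computes a chi-square-based interval for a standard deviation. It is not the authors' method, which is not described. It assumes approximately normal scores and n = 183 for the composite (the number assessed per composite may differ), and it uses the Wilson-Hilferty approximation to the chi-square quantile so that only the Python standard library is needed. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def chi2_quantile(p: float, df: int) -> float:
    """Approximate chi-square quantile (Wilson-Hilferty approximation)."""
    z = NormalDist().inv_cdf(p)
    a = 2.0 / (9.0 * df)
    return df * (1.0 - a + z * math.sqrt(a)) ** 3


def sd_confidence_interval(sd: float, n: int, level: float = 0.95):
    """Chi-square-based CI for a standard deviation, assuming normal data."""
    df = n - 1
    alpha = 1.0 - level
    # The upper chi-square quantile gives the lower SD bound, and vice versa.
    lower = sd * math.sqrt(df / chi2_quantile(1.0 - alpha / 2.0, df))
    upper = sd * math.sqrt(df / chi2_quantile(alpha / 2.0, df))
    return lower, upper


# Rounded Cognitive Composite SD from the text; n = 183 is an assumption.
low, high = sd_confidence_interval(11.2, 183)
print(f"95% CI for SD: {low:.1f} to {high:.1f}")
```

With these inputs the interval comes out close to the 10.2–12.5 reported for the Cognitive Composite, and it lies entirely below the manual SD of 15, consistent with the narrower spread of scores in a healthy reference sample.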

Table 3 Bayley III scores among term reference (TR) and extremely preterm (EP) infants.

The norm-based ranges for normal/mild, moderate, and severe impairment were ≥ 85, 70–84 and 55–69, respectively, for all three Bayley III composite scores. Using term-reference data resulted in ranges for normal/mild, moderate and severe impairment of ≥86.21, 75–86.20 and 63.73–74.97 for the Cognitive Composite, ≥87.31, 76.38–87.30 and 65.45–76.37 for the Motor Composite, and ≥81.91, 65.88–81.90 and 49.85–65.87 for the Language Composite. The Bayley III score thresholds for severe impairment (more than 2 SDs below the mean) for the Cognitive and Motor Composites were thus 5–6 points higher than for the Bayley normative sample. However, the Language Composite threshold was approximately 3 points lower.

Comparison of impairment rates

Term-reference-based impairment thresholds resulted in higher overall rates of moderate/severe impairment (i.e. impairment on any one of the Cognitive, Motor or Language Composite Scores) (Table 4 bottom). The same was true for impairment identified using just the Cognitive and Motor Composites. Given the larger term-reference-estimated SD for the Language Composite, the norm-based thresholds resulted in higher rates of moderate/severe language impairment (Table 4). As evident in Table 4, the differences between the manual-based and term-reference-based rates of moderate and severe impairment were largely due to the difference in severe impairment. A second set of post-hoc analyses adjusting for maternal education, language spoken at home and age at assessment did not substantially alter these results.

Table 4 Proportion of EP infants designated moderately/severely impaired using norm-based versus term reference based threshold scores for impairment.

Discussion

We assessed Bayley III scores at two years adjusted age for EP infants and TR infants born in the same NRN centers and examined by the same assessors who had been trained to reliability [18] and were blinded to gestational age at birth, perinatal events, and prior follow-up findings. The mean composite cognitive, motor, and language scores were 83.9, 83.3, and 80.2, respectively, for the EP infants and 97.5, 98.2, and 97.9, respectively, for the TR group.

The mean Bayley III composite scores for our TR group were lower than for term control infants in some other studies [2, 3, 12] despite the high proportion of well-educated TR mothers. This finding may be due to greater socioeconomic disadvantages; our sample contained a higher proportion of Hispanic, African American, and Medicaid-insured children than the term controls in most other studies.

More EP infants had moderate or severe cognitive and motor impairments (composite scores more than 1 or 2 SDs below the mean, respectively) using the scores for the TR sample (SD = 10.9–11.3) than using those for the Bayley III normative sample (SD = 15.0). These differences are likely due in part to the different referent populations assessed. To avoid under-identification of impaired EP children, children with major congenital anomalies, perinatal problems, or postnatal insults likely to affect development [2, 3, 6] were systematically excluded from our TR sample. A different approach was used for the Bayley III normative sample, in which 10% of the children had problems such as Down syndrome, cerebral palsy and language impairments [17]. While a reference population that includes the full spectrum of child development is desirable for some purposes [6], this approach would likely understate the proportion of impaired preterm infants when threshold scores 1 or 2 SDs below the mean for the Bayley III normative population are used to designate impairments. Accordingly, Sharp and DeMauro [7], among others, suggest that different and higher threshold Bayley III scores are needed.

As hypothesized, the overall proportion of EP infants with a cognitive, motor, or language impairment based on a composite score at least 1 SD below the mean for our TR group was higher than that based on the Bayley III normative population (68 vs. 57%, p < 0.01). The difference was particularly marked for severe impairment (one or more composite scores at least 2 SDs below the mean; 36 vs. 24%, p ≤ 0.001). An unexpected finding was that the proportion of EP infants with composite language scores lower than 1 SD below the mean based on our TR sample was not higher than for the Bayley normative sample. This finding reflects a relatively high SD (16.0) for the TR language scores, which may well be due to a high proportion of Hispanic children, marked heterogeneity in parental education among the TR parents, and a greater influence of education on language than on cognitive or motor scores.

Study limitations

The approach in most follow-up studies to assessing EP infants and designating their impairment rates involves some uncertainty about the reliability and inadvertent bias of the examiners as well as the appropriateness of the Bayley normative sample. While our study facilitated blinded Bayley III assessment of EP and TR infants by the same carefully trained and certified examiners, our sample of healthy TR infants was not sufficiently representative of healthy term infants in NRN centers to establish clear impairment thresholds for outcomes in the NRN. Our findings for insurance coverage and parental education indicate that attempts to recruit such infants two years or more after birth are difficult and prone to selection bias. Caregivers who had concerns about their child’s development may have been more likely to participate, a problem that would cause us to underestimate the degree to which impairment rates were underestimated using Bayley norms. Future efforts to recruit a representative sample of healthy term infants may be more successful if these infants are enrolled in the neonatal period with special measures to maintain rapport with the parents and achieve high follow-up rates through the age of assessment [19].

The need to minimize bias in assessing EP infants may be met more simply by including a convenience sample of term reference controls and blinding the evaluators to gestational age, medical history, and any prior follow-up assessments. However, it is unclear whether the Bayley IV Scales address the need emphasized by Sharp and DeMauro [7], among others, to establish higher impairment thresholds for the Bayley III Scales. While the Bayley IV has superseded the Bayley III, the current results are still informative. The Bayley IV Technical Manual states, “Because most of the Bayley-4 is a revision of the previous edition, most of the validity evidence reported in the research related to the Bayley-III is still relevant…” (p. 37) [20].

Accurate identification and monitoring of impairment rates in EP infants is critical for multiple reasons, including provision of appropriate services and parental counselling for individual infants, planning their long-term education and rehabilitation, testing perinatal interventions in proper clinical trials, and evaluating care and outcomes within and across different perinatal centers over time. The impairment rates identified among EP and other high-risk infants have almost always been assessed by examiners well aware of the infants’ risk factors and prior assessments. Yet, the need for blinded assessors and concurrently assessed control patients should not be assumed to be less important to assure unbiased assessments in follow-up clinics than in other settings. High priority should be given in neonatal follow-up programs to developing effective methods to meet this need and to define appropriate impairment thresholds for EP infants.