Attention-Deficit/Hyperactivity Disorder (ADHD) is one of the most prevalent mental health disorders in young children [1] and is associated with functional impairment in academic, social, and family settings [2] as well as sizeable social and economic costs [3, 4]. Children born preterm are at higher risk, experiencing rates of ADHD that are 2 to 4 times higher than the general population, with the risk increasing with each decreasing week of gestation at birth [5,6,7]. Despite this, little is known about the antecedents of attention problems, a predominant characteristic of ADHD, in children born very preterm.

Prior research has described a complex etiology underlying the development of attention problems, with both genetic and environmental factors thought to jointly contribute to risk [8]. More recently, epigenetics has been identified as an important biological domain that could predict risk for attention problems, serving as either a predictive biomarker or a causally implicated biological mechanism [9]. Specifically, the epigenetic mechanism of DNA methylation holds promise as a predictor of attention problems because the methylome is influenced by both genetic and environmental factors, including some of the environmental factors (e.g., smoking, alcohol, adversity, lead) that are implicated in the development of ADHD.

Early studies investigating DNA methylation and ADHD consisted of candidate gene studies that primarily targeted genes involved in the dopaminergic network (e.g., DRD4) [10,11,12]. In recent years epigenome-wide association studies (EWAS) have reported DNA methylation at other genetic loci associated with increased risk for attention problems in children [13,14,15,16,17,18,19]. Methylation of the VIPR2 gene—a gene that codes a receptor for a small neuropeptide with neurotransmitter and neuroendocrine functions—was shown to differentiate between ADHD cases and controls in boys age 7–12 [16], in a sample of twin pairs discordant for ADHD [17], and in the most recent case-control EWAS of approximately 600 children age 7–12 [15]. In prospective, longitudinal studies, DNA methylation at birth has been shown to be associated with later ADHD symptom severity, in genes such as ZNF544, ST3GAL3, ERC2, and CREB5 [13, 14]. Genetic variation within some of these genes has been implicated in ADHD in prior genome-wide (as opposed to epigenome-wide) association studies [20, 21]. Interestingly, studies with repeated measures of epigenetic data have failed to find concurrent associations between DNAm and ADHD symptoms measured in childhood [13, 14], suggesting DNAm in the neonatal period may be a particularly important predictor of later outcome.

While these prior studies underscore the potential utility of epigenetic studies for understanding the etiology of ADHD, they have not specifically investigated epigenetic precursors to attention problems in children born preterm. Additionally, many prior studies investigated ADHD as a dichotomy (i.e., cases versus controls) rather than measuring symptoms continuously, although the latter approach is gaining popularity [14] perhaps due to its consistency with recent framing of ADHD as a dimensional trait [22, 23]. Finally, prior studies have tended to assess symptoms of ADHD in school-age children, rather than in toddlerhood or early childhood, despite evidence that early attention problems quantified using validated assessments are associated with subsequent attention deficits at school age [24]. The current study aims to address these gaps by conducting an EWAS to examine epigenetic predictors of attention problems at age 2 years in a multi-site study of children born < 30 weeks gestational age (GA).

Methods

Participants

Participants were drawn from the Neonatal Neurobehavior and Outcomes in Very Preterm Infants (NOVI) Study, a multi-site study of infants born < 30 weeks GA. Participants were recruited from nine university-affiliated NICUs across six research sites from April 2014 to June 2016. Inclusion criteria were: (1) birth < 30 weeks GA, (2) parental ability to speak English or Spanish, (3) residence within 3 h of the NICU and follow-up clinic. Exclusion criteria included major congenital anomalies, maternal age < 18 years, cognitive impairment, and death. Parents of eligible infants were approached when infants were 31–32 weeks GA or when survival to discharge was deemed likely by the attending neonatologist. Researchers at each site obtained informed consent in line with each institution’s review board. This study followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guidelines.

Children were included in this analysis if they were enrolled in NOVI, had a neonatal buccal swab collected at NICU discharge, and had attention problems assessed at 24-month follow-up. The majority of infants enrolled in NOVI (651 of 704; 92%) had parental consent for buccal swab collection. Demographic information was collected at enrollment via maternal interview, and information about neonatal health was obtained via standardized medical record abstraction using Vermont-Oxford Network criteria [25].

Measures

Neonatal DNA methylation

Genomic DNA was extracted from buccal swab samples, collected near term-equivalent age, using the Isohelix Buccal Swab system (Boca Scientific), quantified using the Qubit Fluorometer (Thermo Fisher, Waltham, MA, USA) and aliquoted into a standardized concentration for subsequent analyses. DNA samples were plated randomly across 96-well plates and provided to the Emory University Integrated Genomics Core for bisulfite modification using the EZ DNA Methylation Kit (Zymo Research, Irvine, CA), and subsequent assessment of genome-wide DNAm using the Illumina MethylationEPIC Beadarray (Illumina, San Diego, CA) following standardized methods based on the manufacturer’s protocol.

Pre-processing of data followed a previously described workflow [26]. Array data underwent Noob normalization [27, 28]. Samples with poor detection p-values or sex-mismatch were excluded. We excluded probes with median detection p-values > 0.05, those on the X or Y chromosome, those with single nucleotide polymorphisms (SNP) within the binding region, and those that could cross-hybridize to other regions of the genome [29]. Array data were standardized across Type-I and Type-II probe designs with beta-mixture quantile normalization [30, 31].

We next took steps to decrease multiple testing burden and increase our power to detect meaningful associations. First, we implemented the CoMeBack pipeline [32] to identify co-methylated regions (CMRs), which are clusters of highly correlated, proximal CpG sites. Principal components analysis was performed for each CMR, and the first principal component was assigned to each cluster as a summary of DNAm levels at that CMR. The CoMeBack pipeline identified 73,746 CMRs representing the DNAm of 206,195 CpG sites; 500,128 CpG sites were not included in CMRs and were retained as individual CpG sites. Next, we excluded CpGs or CMRs with low variability (SD < 0.02); sites with low variability are more prone to measurement error and are less likely to result in reproducible findings [33]. To further decrease the likelihood of spurious or non-reproducible findings, we examined each CpG and CMR for outliers and recoded to missing any values that fell more than 3 interquartile ranges (IQR) below the 25th percentile or more than 3 IQR above the 75th percentile.
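The variability filter and outlier recoding described above can be illustrated with a minimal sketch. It is not the study's code: the function names, the toy methylation values, the use of the sample standard deviation, and the quartile definition (the "inclusive" method of Python's statistics.quantiles) are all assumptions made for illustration.

```python
import statistics


def quartiles(values):
    """Return (Q1, Q3) of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    # Assumption: "inclusive" quartiles; the paper does not state the method.
    q1, _, q3 = statistics.quantiles(observed, n=4, method="inclusive")
    return q1, q3


def is_variable(values, min_sd=0.02):
    """True if the sample SD of the observed values is at least min_sd."""
    observed = [v for v in values if v is not None]
    return statistics.stdev(observed) >= min_sd


def recode_outliers(values, k=3.0):
    """Replace values more than k IQRs beyond the quartiles with None."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v if v is not None and low <= v <= high else None for v in values]


# Made-up methylation values for three hypothetical loci (not study data).
loci = {
    "locus_flat": [0.50, 0.51, 0.50, 0.51, 0.50, 0.51, 0.50],
    "locus_outlier": [0.10, 0.12, 0.11, 0.13, 0.12, 0.11, 0.90],
    "locus_clean": [0.20, 0.35, 0.50, 0.65, 0.80, 0.40, 0.60],
}

# Step 1: drop low-variability loci; step 2: recode extreme values to missing.
retained = {
    name: recode_outliers(vals) for name, vals in loci.items() if is_variable(vals)
}
```

In this toy example the nearly constant locus is dropped by the SD filter, and the single extreme value in the second locus is set to missing while the rest of that locus is kept.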

After exclusions and data reduction, 452,453 loci (60,917 CMRs and 391,536 CpGs) were available from 542 samples for this study (83% of 651 with buccal swab consent; 77% of entire NOVI cohort). For simplicity in the results, we refer to each locus as a CpG but note where significant results were located in a CMR. These data are accessible through NCBI Gene Expression Omnibus (GEO) via accession series GSE128821.

Child Behavior Checklist 1 ½ - 5 years (CBCL)

The CBCL is a parent-report measure of child behavior problems. Caregivers rate the extent to which 99 specific child behaviors apply to their child on a scale of 0 (“Not True”), 1 (“Somewhat or Sometimes True”), or 2 (“Very True or Often True”). Individual items are summed into 7 symptom subscales which can be converted to norm-referenced T-scores (range = 50 to 100). Attention problem T-scores were the primary outcome in this analysis (M = 56.2; SD = 7.43, range = 50 to 80).

Covariates

As DNAm levels differ by cell type, estimating cell-type composition of mixed cell samples (e.g., buccal tissue) is important for addressing confounding. We estimated the proportion of epithelial, fibroblast, and immune cells in our buccal tissue using previously developed reference methylomes [34]. As reported in our prior work [35, 36], the majority of our samples were composed primarily of epithelial cells, with a smaller proportion of immune cells. Given the strong inverse association between epithelial and immune cell proportions in our data, we adjusted all analyses for epithelial cell proportion to address cellular heterogeneity. We also accounted for potential batch effects by adjusting for sample plate.

Besides these technical covariates, we additionally adjusted all EWAS models for study site, infant GA at birth, infant GA at buccal swab (i.e., time between conception and biosample collection), infant sex, and neonatal medical morbidities. In sensitivity analyses, we additionally adjusted for genetic confounding by re-running all models controlling for first-degree relative (e.g., parent, sibling) history of ADHD, as reported on maternal interviews. We also examined maternal prenatal smoking, maternal low socioeconomic status (i.e., Hollingshead level 5), and child birthweight as additional confounders in sensitivity analyses.

Statistical analysis

Epigenome-wide analyses were conducted to examine the association of DNAm at each of 452,453 CpG sites and attention problem T-scores. We used generalized estimating equation (GEE) models with robust standard errors to regress CBCL attention problem T-scores (dependent variable) on DNAm at each CpG site, accounting for nesting of children within families and covariates (study site, infant GA at birth, infant GA at buccal swab, infant sex, neonatal medical morbidities, cell type composition [proportion of epithelial cells], and sample plate). P-values were adjusted for multiple testing using the Benjamini-Hochberg false discovery rate (FDR) [37]. CpG sites associated with attention problems within a 5% FDR cutoff were considered significant. For ease of interpretation, we rescaled DNAm at each CpG site by dividing the raw data by the CpG-specific interquartile range (IQR) so that beta coefficients derived from the GEE models can be interpreted as the expected change in attention problem T-scores associated with a change in DNAm from the 25th to the 75th percentile of observed data.
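The IQR rescaling and the Benjamini-Hochberg adjustment described above can be sketched as follows. This is an illustrative sketch, not the study's code: the function names and the toy inputs are hypothetical, the quartile definition is an assumption, and the GEE model itself is not reproduced here.

```python
import statistics


def scale_by_iqr(values):
    """Divide values by their IQR so a one-unit change spans Q1 to Q3."""
    # Assumption: "inclusive" quartiles; the paper does not state the method.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [v / (q3 - q1) for v in values]


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy inputs (not study data).
scaled = scale_by_iqr([0.2, 0.4, 0.6, 0.8, 1.0])
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [q < 0.05 for q in adjusted]

# For comparison, a Bonferroni threshold across the 452,453 tested loci.
bonferroni_threshold = 0.05 / 452_453  # roughly 1.1e-7
```

After rescaling, the 25th and 75th percentiles of the predictor are exactly one unit apart, which is why a regression coefficient on the rescaled values reads as the expected difference in T-score across the interquartile range.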

Buccal swabs are a peripheral tissue, whereas the primary mechanistic effects of DNAm on attention problems are likely to be neural. To understand whether the sites we identify in peripheral buccal tissue could be representative of processes occurring in the central nervous system, we investigated whether the methylation levels at our identified CpGs were correlated between brain and buccal samples. For all CpGs significantly associated with attention problems in our EWAS, we estimated the correlation between DNAm of that CpG in brain and buccal tissue using an existing database [38]. To better understand the biological processes underlying the associations between DNAm and attention problems, we additionally conducted gene enrichment analyses using the gometh function in the missMethyl package [39] and tested for pathway-based gene set overrepresentation (KEGG and gene ontology [GO] terms). Pathways that were enriched within a 5% FDR were deemed significant. Statistical code for all analyses is available upon request from the first author.
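The idea behind a gene set overrepresentation test can be illustrated with a plain hypergeometric tail probability. This is a simplified stand-in, not the gometh procedure: gometh additionally accounts for the differing number of CpG probes per gene, which this sketch ignores. The function name and the toy counts are hypothetical.

```python
from math import comb


def overrepresentation_p(total_genes, pathway_genes, selected_genes, overlap):
    """P(X >= overlap) when selected_genes are drawn without replacement.

    total_genes:    number of genes in the background set
    pathway_genes:  number of background genes annotated to the pathway
    selected_genes: number of genes linked to significant CpGs
    overlap:        number of selected genes that fall in the pathway
    """
    upper = min(pathway_genes, selected_genes)
    favourable = sum(
        comb(pathway_genes, k)
        * comb(total_genes - pathway_genes, selected_genes - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / comb(total_genes, selected_genes)


# Toy counts (not study data): 10 background genes, 4 in the pathway,
# 3 selected genes, 2 of which fall in the pathway.
p_value = overrepresentation_p(10, 4, 3, 2)
```

In a full analysis one such p-value would be computed per pathway and the resulting set adjusted for multiple testing before applying the 5% FDR cutoff.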

We also examined whether any of the CpGs identified in our analysis annotated to genes that have previously been linked to phenotypic characteristics in genome-wide association studies (GWAS) using the NHGRI-EBI GWAS catalog [40]. Similarly, we examined overlap with published studies in the MRC-IEU EWAS catalog [41]. Finally, we examined whether any of the CpGs or genes identified in the current analysis have been identified in prior EWAS of attention problems in children [13,14,15,16,17,18,19].

Results

Descriptive statistics

Of the 704 infants enrolled in NOVI, 441 had both buccal swab and CBCL data and were included in these analyses (Fig. 1). The majority of the sample (79%) consisted of singleton births (350 children) with a smaller number of twins (80 children), triplets (3 children), and quadruplets (8 children). Characteristics of the full sample, as well as those included versus excluded in this analysis, are shown in Table 1. Included infants were more likely to be White (48% vs. 32%, p < 0.001) and less likely to be multiracial (18% vs. 29%, p < 0.001). There were no other maternal or neonatal characteristics that differed between included and excluded participants.

Fig. 1: Study flowchart showing participant inclusion and exclusion.

Table 1 Demographic and medical characteristics of the sample.

EWAS findings

DNA methylation at 33 CpG sites was associated with child attention problems (Table 2; Fig. 2). Of these, there were 6 positive associations (i.e., higher DNAm associated with more attention problems) and 27 negative associations (i.e., lower DNAm associated with more attention problems). Of the 33 significant results, 5 were located in CMRs (Table 2). Overall, the associations were small in magnitude: going from the 25th to 75th percentile of DNAm was associated with a 1.3 to 3.2 point change in attention problem T-scores.

Table 2 Epigenome-wide association study results for statistically significant CpG sites (FDR < 5%).
Fig. 2: Manhattan plot of epigenetic loci associated with 2 year attention problems.

Significant associations (FDR < 5%) are shown above the blue solid line (p < 3.5E-5). Bonferroni-significant CpG sites are shown above the red dashed line (p < 1.1E-7) and annotated in black. Highlighted in blue (with ) is one significant CpG site (FDR < 5%) located in a gene whose methylation has previously been shown to be associated with ADHD in a prior EWAS (TP73). Highlighted in purple (with Δ) are 4 significant CpG sites (FDR < 5%) located in genes that have been shown to be associated with ADHD in prior GWAS (FGFR1, NFIA, PITPNM3, PIK3R2). Three CpG sites, highlighted in red (with °), are located in genes we previously found to be associated with prenatal risk in this sample (POR; MIR4651; COG4; LPAR5). *Denotes CpG located in co-methylated region (CMR).

There were significant, positive brain-buccal correlations for 3 of the 33 identified CpG sites (cg25109393, cg05182265, cg10020385). These correlations were moderate to large in magnitude (r = 0.45 to 0.86, all p < 0.05). After FDR correction, we failed to identify any significantly enriched pathways using either the KEGG or GO methods.

There were several relevant phenotypes and traits associated with the genes annotated to the significant CpG sites from our EWAS (Table 3). Four CpGs (cg26385256, cg09062708, cg27648858, cg11237284) were located in genes that have been found to be associated with ADHD in prior GWAS (FGFR1, NFIA, PITPNM3, PIK3R2). Three CpGs (cg18773807, cg04468927, cg10457436) were located in genes we previously found to be associated with cumulative prenatal risk in this sample (POR; MIR4651; COG4; LPAR5) [36]. Three of the 33 CpGs met a strict Bonferroni adjustment for multiple testing (cg21415305, cg01132150, cg09297702). These CpGs are annotated to the TTLL3, C5orf56, and KCNJ5 genes. A comparison of our findings with the EWAS catalog (Table 3) uncovered that two of our significant CpGs (cg05182265, cg27648858) have previously been associated with maternal prenatal risk factors (i.e., smoking and hypertensive disorders of pregnancy).

Table 3 CpGs associated with child attention problems (FDR < 5%) are linked to genes, exposures, and outcomes in the GWAS and EWAS Catalog.

We examined whether any of the CpGs identified in our study were associated with methylation quantitative trait loci (mQTL) using the GoDMC database [42]. We found that 4 CpGs (cg01807408, cg01132150, cg05182265, cg11932091) have previously been identified as mQTLs.

Sensitivity Analyses

To address the potential for genetic confounding, we conducted sensitivity analyses that additionally adjusted for first-degree relative (e.g., parent, sibling) history of ADHD. Of the 33 CpGs identified as significant in the main EWAS, 28 remained significant (FDR < 5%) after this additional adjustment. The CpGs no longer significant after this additional adjustment are noted in Table 2 with a symbol (+). Overall, additional adjustment for familial confounding did not explain the majority of our significant findings.

We also examined the potential confounding effect of three additional covariates: maternal prenatal smoking, maternal low socioeconomic status (Hollingshead level 5), and child birthweight. Inclusion of these additional covariates did not substantively change the reported results. All 33 CpGs identified as significant in the main EWAS remained significant (FDR < 5%) after additional adjustment. Full results from all sensitivity models are presented as Supplementary Material.

Discussion

The purpose of this study was to conduct an EWAS to identify neonatal DNAm predictors of attention problems in infants born very preterm. We found 33 CpGs that were significantly associated with age 2 attention problems. Several of these CpGs annotated to genes previously found to be associated with ADHD. This study extends prior research by showing associations between DNAm at NICU discharge and attention problems, measured dimensionally, in toddlerhood, and is also the first EWAS investigating attention problems in children born very preterm.

Prior EWAS investigating attention problems, though not conducted specifically with preterm populations, have similarly found epigenetic signatures at birth associated with later ADHD diagnosis or symptom severity [13,14,15,16,17,18,19]. One of the CpGs identified in the current study (cg19418235) is located in the TP73 gene. Another CpG located in this gene (cg06996273) was identified in a prior study comparing DNAm of twin pairs discordant for ADHD diagnosis [17]. In the prior study, ADHD cases had higher DNAm of this CpG compared to controls, whereas in the current study, we found that lower DNAm of our CpG was associated with more parent-rated attention problems. The different direction of associations between these studies may be due to the different locations of these CpGs: cg19418235 is located 0–200 bases upstream of the transcription start site whereas cg06996273 is located in the gene body. While lower DNAm in the transcription start site is typically associated with increased transcriptional activity, the inverse is often true for gene body methylation, where DNAm is more frequently positively associated with transcription. Thus, the different directions of association between these two studies, at two different CpGs, may actually be reflective of similar epigenetic regulation of the TP73 gene. The TP73 gene (tumor protein p73) encodes one of a family of transcription factors involved in cellular response to development and stress, including apoptotic signaling in response to DNA damage. Although genetic variation in TP73 has been associated with various types of cancer [43, 44], differential methylation of this gene is not well studied and its potential role during early development is not clear.

There were no other CpGs or genes found in the current analysis that overlapped with previous ADHD or attention EWAS. This may be due to differences in the tissue type used (prior studies have not investigated DNAm from buccal swabs), outcome measures (attention problems measured dimensionally versus ADHD diagnosis or ADHD symptom severity), age at outcome (age 2 versus school-age children), unsystematic differences due to chance findings from limited study sample sizes, and our specific investigation of children born < 30 weeks GA. Our choice of covariates compared to prior studies may also have contributed to differences in our findings. For example, we controlled for GA because it has been shown to be associated with both attention problems and patterns of DNA methylation. By controlling for GA, we avoid confounding by this factor but also limit our ability to identify CpG sites that could explain associations between GA and attention problems. While other studies have included additional covariates such as child age [15], we chose not to control for age as our outcome assessments were conducted within a relatively narrow age window.

Considering overlap with genetic (rather than epigenetic) studies, four of the CpGs we found to be associated with attention problems in our study were located in genes that have been linked to ADHD in prior GWAS (FGFR1, NFIA, PIK3R2, PITPNM3) [45, 46]. In our study, increased DNAm at all four CpG sites was associated with lower attention problem scores. Interestingly, the CpG located in FGFR1 (cg26385256) was no longer significant after controlling for family history of ADHD. Another one of these CpGs (cg11237284) was located proximal (i.e., 500 bases upstream) to the ADHD-associated SNP (rs1105916) in PITPNM3. This overlap in findings from the current and prior EWAS and GWAS studies suggests that both genetic and epigenetic processes likely contribute to risk for attention problems, though their relative contributions are not yet known. Our mQTL search showed that four of our identified CpGs may be mQTLs. Thus, the methylation signals we found in some of our CpGs could represent both genetic and environmental influences on ADHD. We use caution in interpreting these mQTL findings given that the mQTL search was conducted using a database developed in a different tissue type (blood) and age range (primarily adults) compared to the current study.

We have previously conducted EWAS in this sample to investigate epigenetic associations with prenatal risk factors [36], neonatal neurobehavior [35, 47], and neonatal medical morbidities [26]. Interestingly, we found overlap in one specific CpG (cg18773807, annotated to POR and MIR4651) and two additional genes (COG4; LPAR5) that we previously found to be associated with cumulative prenatal risk [36]. The direction of associations for these overlapping findings suggests that an increase in prenatal risk is associated with decreased DNAm at NICU discharge, which in turn is associated with higher attention problem T-scores at age 2 years. One additional CpG (cg05182265) has previously been identified as differentially methylated in children exposed to prenatal maternal smoking [48], a putative risk factor for the development of ADHD [8]. These results are intriguing as they suggest that neonatal DNAm may be one mechanism underlying the well-documented links between prenatal environmental conditions and attention problems in children (for a meta-analysis, see Kim [49]). The majority of these overlapping genes (POR, COG4, LPAR5) have also previously been linked to markers of physical health and cognitive ability [50, 51].

Our findings are consistent with a growing body of literature linking both genetic and epigenetic variability to differences in attention-related phenotypes, whether measured as dimensional traits, disease symptoms, or ADHD diagnosis. It is important to consider the current findings in the context of our study’s limitations. First, although measuring attention problems in toddlerhood could open the door for early detection of children at higher risk for later impairment, we are not yet able to pinpoint children in our sample who will go on to have persistent attention problems or who will go on to receive an ADHD diagnosis. We also used a single caregiver report of attention problems, which may not be as reliable as having multiple informants or objective assessments. However, as our longitudinal study is ongoing, eventually we will have objective assessment data alongside reported ADHD diagnosis. At that point we plan to investigate whether the neonatal DNAm signal persists or whether there are specific CpGs implicated in later, persistent, and/or clinically relevant attention problems. Second, our investigation of a sample of children born < 30 weeks GA is a unique component of this study, as these children are both understudied and at increased risk for attention problems. As such, we cannot say whether the CpGs identified in this study would be expected to be associated with attention problems in other populations of children or are unique to prematurity. The uniqueness of our sample also means we were unable to identify an appropriate replication dataset. Therefore, further study into the epigenetic predictors of attention problems in early childhood, in both low- and high-risk populations, is warranted. A third limitation is that our DNAm data were obtained using buccal swabs, whereas the tissue that is likely to be causally implicated in attention-related phenotypes is located in the brain. We also observed few significant brain-buccal correlations in the identified CpGs from this study, though the database we used to investigate these correlations was based on a small number of highly selected patients (i.e., those undergoing surgery for epilepsy) with a great degree of variability in patient age and brain tissue location [38]. Nonetheless, it is worth noting that the biological pathways leading from differential DNAm of the identified CpGs to attention problems cannot be parsed out in the current study, nor can we infer causality. Importantly, identification of DNAm loci within buccal cells that are linked to attention problems could be more practically useful for future screening or translation efforts since peripheral tissues (unlike brain tissue) are easily accessible. Future studies that take a multi-omics approach (e.g., adding transcriptomics and/or proteomics) might move the field closer to understanding the underlying biological mechanisms, but these methods remain analytically- and resource-intensive in practice. Finally, although we tested the role of family history of ADHD as an additional covariate, our study currently lacks genomic data, a potentially important source of unmeasured confounding that should be further explored.

In summary, we found DNAm at NICU discharge predicted attention problems at age 2 in a large sample of children born very preterm. Further research should be done to investigate whether the same CpGs or genes remain associated with attention problems measured later in development as well as with formal diagnosis of ADHD in this population. Understanding how changes in DNAm predict later attention problems or attention-related trajectories is another critical next step. This information could be useful in identifying preterm children at risk for later ADHD, who could benefit from additional monitoring and/or targeted early intervention.