Introduction

Attention-deficit/hyperactivity disorder (ADHD) results in the majority of cases from numerous common genetic and environmental factors with mostly small effects.1 The association of any individual risk factor with ADHD will depend on other genetic polymorphisms and/or environmental factors that dampen or amplify its effect on the underlying neurobiological pathways. Their joint effect therefore shapes the clinical profile of an individual, such as number of symptoms of inattention and hyperactivity/impulsivity displayed and their persistence over time. Failing to take such interaction effects into account will lead to noisier estimates of the effect of individual polymorphisms, which may have contributed to the inconsistent findings from studies investigating the genetics of ADHD.

We and others have shown that stress exposure has a role in ADHD.2, 3 Individuals vary widely in their response to stressful stimuli, which can be partly attributed to differences in regulation of the hypothalamic–pituitary–adrenal (HPA) axis.4 Brain regions involved in perceiving threat, such as the prefrontal cortex, hippocampus and amygdala may stimulate HPA axis activity through the hypothalamus.5 This results in the release of a range of neurotransmitters, peptides and hormones such as cortisol that stimulate the sympathetic nervous system. The strength and duration of the stress response is determined by an intricate system of feedforward and feedback loops.6 HPA axis regulation is moderated by previous experiences, with stress exposure being particularly impactful during periods of heightened brain development, such as in adolescence.7

ADHD has been associated with altered cortisol levels, albeit with much heterogeneity between reports. While a meta-analysis has indicated that individuals with ADHD have a blunted cortisol response to acute stressors,8 higher cortisol levels, both at baseline and in response to stress, have also been reported repeatedly.9 These findings may be linked to the duration and extent of exposure to chronic stress.10 They may also relate to differences in ADHD symptom presentation and comorbidity; in particular, the heightened levels of conduct problems seen in individuals with ADHD have been coupled to low levels of cortisol, which have been suggested to be causally related to this behavior by reflecting underarousal.9, 10, 11 Further indirect indication of HPA axis involvement in ADHD comes from findings that stimulant medication normalizes patients’ cortisol levels,12 and from the role of the HPA axis in the regulation of emotion,13 sleep14 and circadian rhythms,15 which are often altered in ADHD.16, 17

Genetic determinants of HPA axis activity may contribute to the diversity of findings on the relationship between ADHD and the stress response. ADHD has been associated with polymorphisms in the glucocorticoid and mineralocorticoid receptor genes NR3C1 and NR3C2,18 which provide negative feedback to the HPA axis when activated by cortisol.19 We have found that NR3C1 interacts with psychosocial stress on ADHD severity, and that this gene–environment interaction (G × E) is further moderated by the serotonin transporter gene 5-HTT.20 Serotonin signaling is tightly coupled to the regulation of HPA axis activity,21 and 5-HTT is one of several serotonergic genes that have been repeatedly linked to ADHD.22, 23 The most extensively studied candidate genes for ADHD, the dopamine transporter (DAT1) and dopamine receptor D4 (DRD4), are also known to influence the effect of stressors on HPA axis activity.24, 25 Besides reports of G × E, stress-response genes have also been found to moderate each other’s effects on the HPA axis,26, 27 illustrating the complexity of the genetic architecture underlying the stress-response pathway.

Although conventional regression analyses have led to various interesting findings on ADHD genetics, they are limited in their ability to handle many predictors and interaction terms simultaneously. This undermines accurate estimation of the true contribution of a risk factor to ADHD, as its contributions through interactions with other factors are neglected.

Random forest regression (RFR) is well suited for investigating the etiology of complex traits using high-dimensional data.28 It allows for inclusion of many more predictors than there are respondents, and automatically incorporates all higher-order interactions between the predictors in its estimates.29 RFR has been praised for its robustness and predictive accuracy, particularly for noisy data containing many predictors with small effects.30, 31 Studies simulating complex genetic data sets have shown that it outperforms other techniques when it comes to detecting interacting single-nucleotide polymorphisms (SNPs) with small marginal effects.32

In this study, we used random forest regression to predict ADHD severity from SNPs in genes previously implicated in HPA axis activity, together with measures of long-term stress exposure. Machine-learning techniques, including tree-based techniques, have been used to predict ADHD diagnosis as accurately as possible, using neuropsychological and brain imaging data.33, 34, 35, 36 Our aim was not to optimize prediction per se, but to improve our understanding of the complicated relation between stress-response genetics and ADHD, by estimating the contributions of thousands of HPA-axis-related SNPs plus exposure to stressors simultaneously. We thereby sought to illustrate the complex genetic architecture of this disorder and to identify those factors that are of particular interest for follow-up research. Given the intricacy of the stress response,6 and the heterogeneity of findings in the literature regarding the relation between the HPA axis and ADHD,9 we hypothesized that many factors with small effects are involved. The strengths of random forest regression, particularly its ability to take into account higher-order interactions between many predictors, may therefore make it particularly well suited for this task. In addition, based on the same literature, we suspected that co-occurring conduct problems may be an important influence on the relation between ADHD and the HPA axis; we therefore also sought to investigate their role in our findings. The analyses were carried out in a sample of adolescents and young adults (mean age 17.1 years) consisting of individuals with ADHD and healthy controls, as well as individuals with subthreshold ADHD. This sample composition thus enabled analysis within a wide range of ADHD severity, in accordance with the contribution of genetic and environmental variation to the continuous distribution of ADHD traits in the general population.37

Materials and methods

The participants were selected from the NeuroIMAGE study, a follow-up of the Dutch part of the International Multicenter ADHD Genetics (IMAGE) study.38 NeuroIMAGE includes 365 families with at least one child with ADHD and at least one biological sibling (regardless of ADHD diagnosis) and 148 control families with at least one child, without any formal or suspected ADHD diagnosis in any of the first-degree family members. The ADHD families were recruited through ADHD outpatient clinics in the regions Amsterdam, Groningen and Nijmegen (The Netherlands). The control families were recruited through primary and high schools in the same geographical regions. To be included in NeuroIMAGE, the participants had to be of European Caucasian descent, between the ages of 5 and 30, have an intelligence quotient ≥70 and no diagnosis of autism, epilepsy, a general learning difficulty, a brain disorder or a known genetic disorder. The study was approved by the regional ethics committee (CMO Regio Arnhem-Nijmegen; 2008/163; ABR: NL23894.091.08) and the medical ethical committee of the VU University Medical Center. All the participants and their parents (if the participant was younger than 18 years) signed informed consent; parents signed informed consent for participants under 12 years of age.

For the analyses reported in this paper, 686 participants from 360 families had complete data. Of these, 281 participants had an ADHD diagnosis, 88 participants had subthreshold ADHD (that is, had elevated levels of ADHD symptoms without meeting the full criteria for an ADHD diagnosis) and 317 participants were healthy controls. ADHD diagnoses were made in accordance with DSM 5 criteria on the basis of a combination of a semi-structured interview and the Conners’ Rating Scales.39 The participants were asked to withhold the use of their stimulant medication or other psychoactive drugs for 48 h before measurement. The mean age of this sample was 17.1 years (s.d. 3.4) and 52.3% were males. In this sample, 95 participants had an oppositional defiant disorder or conduct disorder, 22 had an internalizing disorder and 79 had a reading disorder. More information on the NeuroIMAGE study, its diagnostic algorithm and its participants is presented in the Supplementary Information and in ref. 38.

ADHD outcome measure

To retain as much information on ADHD as possible, we used a continuous measure of ADHD severity, the raw score on subscale N of the CPRS (Conners’ Parent Rating Scale), which has been shown to have high test–retest reliability and strong discriminatory power.39 This measure consists of 18 items asking about the 18 DSM symptoms of inattention and hyperactivity/impulsivity, each rated on a four-point Likert scale (0: not at all true, to 3: very much true). In this sample, the score ranged from 0 to 53, with an average of 13.1 (s.d. 12.1). This measure was available for all the participants, from both ADHD families and control families.

Given the family design of NeuroIMAGE, we calculated the intraclass correlation for our outcome measure to estimate the degree of non-independence of the data.40 Using Searle’s exact confidence limit equation, we found a nonsignificant intraclass correlation of 0.088 with a 95% confidence interval ranging from −0.023 to 0.196, with an average cluster (family) size of 1.90, indicating the non-independence is rather low.
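As an illustration of what an intraclass correlation measures in a family design, the following Python sketch computes the standard one-way ANOVA point estimate for clusters of unequal size. It is a minimal sketch on made-up scores, not the procedure used in this study: it does not implement Searle's exact confidence limits, and the function name one_way_icc and the toy families are hypothetical.

```python
def one_way_icc(clusters):
    """One-way ANOVA estimate of the intraclass correlation for unbalanced clusters."""
    a = len(clusters)                                  # number of clusters (families)
    n_total = sum(len(c) for c in clusters)            # number of observations
    grand_mean = sum(sum(c) for c in clusters) / n_total
    means = [sum(c) / len(c) for c in clusters]
    ss_between = sum(len(c) * (m - grand_mean) ** 2 for c, m in zip(clusters, means))
    ss_within = sum((v - m) ** 2 for c, m in zip(clusters, means) for v in c)
    ms_between = ss_between / (a - 1)
    ms_within = ss_within / (n_total - a)
    # Effective cluster size, which adjusts for unequal family sizes.
    k0 = (n_total - sum(len(c) ** 2 for c in clusters) / n_total) / (a - 1)
    return (ms_between - ms_within) / (ms_between + (k0 - 1) * ms_within)

# Made-up severity scores: siblings identical within each family (maximal clustering).
identical_siblings = [[1, 1], [5, 5], [9, 9, 9]]
# Made-up scores where siblings differ more within families than between them.
unrelated_siblings = [[1, 2], [2, 1], [1, 2, 1]]

icc_high = one_way_icc(identical_siblings)
icc_low = one_way_icc(unrelated_siblings)
print(icc_high, icc_low)
```

Values near zero, as reported for the outcome measure here, indicate that scores of siblings resemble each other little more than scores of unrelated participants.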

Stress exposure

Two questionnaires were used to assess exposure to psychosocial stress. Parents filled in the long-term difficulties questionnaire,41 containing 13 items measuring whether their children have been exposed to chronic stressors such as a handicap, being bullied, having financial difficulties, or other persisting problems at home or school. They were asked to only report chronic, ongoing difficulties. Participants themselves filled in the stressful life events questionnaire,42, 43 containing 11 items on exposure to specific major stressful events in the past 5 years, such as death or serious illness of a loved one, physical or sexual abuse, or failure at something important to them. Scores on the long-term difficulties and stressful life events questionnaires have been shown to correlate with cortisol and other biological measures of stress, as well as to be predictive of later mental health problems, in large longitudinal cohort studies of child development.41, 42, 43 See the Supplementary Information for the full list of items, and van der Meer et al.2 for a more extensive description of their use in the NeuroIMAGE cohort.

Genetics

Given our hypothesis that many factors are involved, based on the intricacy of the stress response and the inconsistencies in the literature on the relation between ADHD and HPA-axis-related genes, we took an inclusive approach regarding the selection of SNPs. We included all the available SNPs in all genes coupled to the regulation of HPA axis activity, as indicated by reports from studies into genetic moderators of stress exposure in humans. This was done through a literature search in PubMed with the following search term: (“Gene-Environment Interaction”[Mesh] OR ((“Genes”[Mesh] OR “Polymorphism, Genetic”[Mesh] OR gene* OR polymorphism* OR SNP*) AND (“Stress, Psychological”[Mesh]) OR adversit* OR maltreatment OR psychosocial OR neglect OR abuse)) AND (“Hypothalamic Hormones”[Mesh] OR HPA OR hypothalamic pituitary adrenal OR cortisol OR ACTH). We made use of the wildcard symbol * and PubMed’s MeSH terms to find as many relevant articles as possible. After filtering for English language articles with full text available, this search generated 415 results, of which 95 were relevant original research articles using human samples investigating specific genetic polymorphisms; see Supplementary Table S1 for references to the articles on each gene. Together, these studies investigated 31 unique genes. Two of these genes, MAOA and HTR2C, were excluded because they were located on the X-chromosome, for which no genotyping data were available. All SNPs within 100 kilobase pairs (kb) of the location of the remaining 29 genes,44 as found in human assembly GRCh37, were included in the study, for a total of 17 374 SNPs. Table 1 lists details on these genes. We used LocusZoom (http://locuszoom.sph.umich.edu) to make plots of the linkage disequilibrium (LD) and recombination rate of regions that contained one of the SNPs among the top results, which are presented in the Supplementary Information.
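The positional selection rule described above (all SNPs within 100 kb of a gene) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name snps_in_window, the SNP identifiers and all coordinates are made up, and real use would require gene boundaries from GRCh37 and matching on chromosome as well as position.

```python
FLANK_BP = 100_000  # 100 kb flank on either side of the gene, as described in the text

def snps_in_window(snps, gene_start, gene_end, flank=FLANK_BP):
    """Return IDs of SNPs positioned within [gene_start - flank, gene_end + flank]."""
    lower = max(0, gene_start - flank)
    upper = gene_end + flank
    return [snp_id for snp_id, position in snps if lower <= position <= upper]

# Hypothetical SNPs on a single chromosome: (identifier, base-pair position).
toy_snps = [
    ("snp_a", 850_000),
    ("snp_b", 950_000),
    ("snp_c", 1_050_000),
    ("snp_d", 1_160_000),
    ("snp_e", 1_300_000),
]

# Hypothetical gene spanning 1.0 to 1.1 Mb, so the window runs from 0.9 to 1.2 Mb.
selected = snps_in_window(toy_snps, gene_start=1_000_000, gene_end=1_100_000)
print(selected)
```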

Table 1 List of genes based on our literature search

For the IMAGE sample, DNA was extracted from the blood samples or immortalized cell lines at Rutgers University Cell and DNA Repository, New Jersey, USA.45 DNA isolation for additional samples from the NeuroIMAGE study was performed at the department of Human Genetics of the Radboud University Medical Center in Nijmegen.38

Genome-wide genotyping was performed using the Infinium PsychArray-24 v1.1 BeadChip, containing 265 000 tag SNPs, 245 000 exome markers and 50 000 additional markers associated with common psychiatric disorders (http://www.illumina.com/products/psycharray.html). Genotypes were called using Illumina GenomeStudio software, excluding samples with a call rate <0.994. Clustering was done using GenTrain 2.0 (no-call threshold 0.15), excluding samples with call rate <0.98. Before quality control, the data set contained 594 663 SNPs. Basic quality control steps included checks for sex mismatches, visualization of sample relatedness and assessment of genetic homogeneity using multidimensional scaling. No individuals were removed based on sex mismatches or population structure. Four individuals were removed based on identity by descent estimation (two identical twin pairs and two duplicate sample pairs were detected). Further quality control included removal of SNPs with a call rate below 98% or call rate differences between cases and controls higher than 2%, or failing the Hardy–Weinberg equilibrium test at a threshold of P<10⁻⁶. Individuals with a call rate below 98% or heterozygosity rate of more than three standard deviations from the mean (n=33) were removed as well. After quality control, the data set contained 584 262 SNPs. A further 221 865 SNPs with a minor allele frequency of less than 1% were removed from the set before imputation. Imputation was carried out according to the protocol supplied by ENIGMA (http://enigma.ini.usc.edu/), using MaCH46 for haplotype phasing and minimac47 for imputation, with 1000 Genomes Phase 1 V3 reference data.48 We reasoned that imputation makes more genetic information available for the analysis49 and therefore allows for a more comprehensive assessment of the true relation between ADHD and variation in genes influencing the stress response. SNPs with low imputation quality (R²<0.8) were filtered out. Subsequently, hard calls, needed as input for the analysis, were made by converting to PLINK format,50 using GCTA software.51
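Two of the SNP-level filters mentioned above, call rate and minor allele frequency, can be illustrated with a short Python sketch. It assumes genotypes coded as minor-allele counts (0, 1 or 2) with None for a missing call; the function names and toy genotype vectors are hypothetical, and the Hardy–Weinberg and case-control call rate checks are deliberately left out. The actual quality control was carried out with dedicated genetics software, not with this code.

```python
def call_rate(genotypes):
    """Fraction of non-missing genotype calls (None marks a missing call)."""
    return sum(g is not None for g in genotypes) / len(genotypes)

def minor_allele_frequency(genotypes):
    """Minor allele frequency from 0/1/2 allele counts, ignoring missing calls."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    frequency = sum(called) / (2 * len(called))
    return min(frequency, 1 - frequency)

def passes_qc(genotypes, min_call_rate=0.98, min_maf=0.01):
    """Apply the 98% call rate and 1% minor allele frequency thresholds from the text."""
    return (call_rate(genotypes) >= min_call_rate
            and minor_allele_frequency(genotypes) >= min_maf)

# Made-up genotype vectors for three hypothetical SNPs.
common_snp = [0] * 60 + [1] * 30 + [2] * 10      # fully called, frequency 0.25
rare_snp = [0] * 199 + [1]                       # fully called, frequency 0.0025
patchy_snp = [0, 1] * 45 + [None] * 10           # 10% missing calls

for name, snp in [("common", common_snp), ("rare", rare_snp), ("patchy", patchy_snp)]:
    print(name, passes_qc(snp))
```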

Random forest regression analysis

RFR is a non-parametric ensemble learning method, aggregating the results from many individual decision trees. Overfitting is limited by growing each tree using a bootstrap sample and by selecting from a random subset of variables at each split.29 Observations not included in a tree’s sample due to the bootstrapping procedure, called out-of-bag (on average about 37%), serve as the tree’s test set and are used to measure prediction error. Importance of a predictor of interest can be estimated through permutation, by randomly shuffling its values in the out-of-bag samples and comparing the resulting prediction error to the error obtained before the shuffle.52 The so-called variable importance estimate VIMP derived in this way includes all interaction effects, as permuting a predictor will remove any influence it had on the selection of other variables deeper in the tree.
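The out-of-bag proportion follows from sampling with replacement: the chance that an observation is never drawn in a bootstrap sample of size n is (1 − 1/n)^n, which approaches 1/e, or roughly 0.37, as n grows. The Python sketch below checks this both analytically and by simulation for a sample of 686 observations. The function names, the number of simulated trees and the random seed are arbitrary choices made for illustration.

```python
import random

def expected_oob_fraction(n):
    """Probability that one observation is absent from a bootstrap sample of size n."""
    return (1 - 1 / n) ** n

def simulated_oob_fraction(n, n_trees, seed=0):
    """Average share of observations left out of n_trees simulated bootstrap samples."""
    rng = random.Random(seed)
    left_out = 0
    for _ in range(n_trees):
        in_bag = {rng.randrange(n) for _ in range(n)}  # indices drawn with replacement
        left_out += n - len(in_bag)
    return left_out / (n * n_trees)

n_observations = 686  # sample size reported in the text
analytic = expected_oob_fraction(n_observations)
simulated = simulated_oob_fraction(n_observations, n_trees=200)
print(round(analytic, 3), round(simulated, 3))
```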

All analyses were run in R v3.2.3,53 making use of the package randomForestSRC v2.2.0.54 The code used is available upon request from the corresponding author. The 17 374 SNPs were coded to reflect the participants’ number of minor alleles (‘0’, ‘1’ or ‘2’), entered as non-ordered factors to allow for all possible genetic models. The 24 stress items were coded as ‘0’ (absence) or ‘1’ (presence), and also entered in the analysis as individual predictors. This approach ensured that all information was maintained, that is, the marginal and interaction effects of each stressor. It also prevented the potential bias of RFR whereby continuous measures, or categorical ones with many levels, are more often selected than categorical factors with few levels.55

We grew 5000 trees fully and used the default value of p/3 for mtry, the size of the random subset of available predictors at each split, in this case 5800 (17 398/3 rounded up). These settings were chosen to identify important predictors while still allowing for the detection of true predictors with small effects and interactions, and in accordance with recommendations from simulation studies on complex genetic data with interacting SNPs.56 We further checked the stability of the results by rerunning the analysis twice, with different random seeds.
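The mtry value quoted above can be reproduced with a few lines of Python. The variable names are illustrative only; the arithmetic uses the predictor counts given in the text.

```python
import math

n_snps = 17_374          # SNPs included in the analysis
n_stress_items = 24      # 13 long-term difficulties plus 11 stressful events
n_predictors = n_snps + n_stress_items

# Default for regression forests as described in the text: one third of the
# predictors are candidates at each split, rounded up.
mtry = math.ceil(n_predictors / 3)

# Chance that any given predictor is among the candidates at a single split.
candidate_probability = mtry / n_predictors

print(n_predictors, mtry, round(candidate_probability, 3))
```

With about one in three predictors available at every split, each SNP or stress item is a split candidate many times over 5000 fully grown trees, which is what allows predictors with small effects to be selected at all.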

The reported percent variance explained is calculated as 1−(mean-squared error/variance of y), with mean-squared error calculated from the difference between the observed score and the predicted score, averaged over all trees where the observation was ‘out-of-bag’.57 As a measure of importance, we report the Breiman–Cutler permutation variable importance, referred to as VIMP. VIMP is calculated by permuting the variable of interest in each tree’s out-of-bag sample; the resulting increase in prediction error, averaged over all trees, is expressed as percent increase in mean-squared error.29, 57 Further, the increase in prediction error following simultaneous permutation of two variables minus the sum of their individual VIMPs may be used as a measure of interaction. The operating definition of interaction in this context is that a split by either of the predictors influences the likelihood of a subsequent split by the other predictor, with a negative numeral indicating an increased likelihood that one is selected in the subtree of the other and a positive numeral indicating a reduced likelihood, as explained fully elsewhere.52 The VIMP interaction measures reported in the results section were obtained through the ‘find.interaction’ function included in the randomForestSRC toolbox. We made use of the ‘corrplot’ package for visualization of these results for the most important predictors. The interaction estimates are multiplied by 100 for ease of display.
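The three quantities defined above can be illustrated with a toy Python sketch: variance explained as 1 − MSE/variance of y, permutation importance as the increase in error after shuffling a predictor, and the pairwise interaction measure as the joint importance minus the sum of the individual importances. Several simplifying assumptions apply: a fixed prediction function stands in for a fitted forest, the whole toy data set is used instead of per-tree out-of-bag samples, importance is reported as a raw rather than percent increase in MSE, and in the joint case the two columns are shuffled independently. None of this reproduces the randomForestSRC implementation, and all names and data are made up.

```python
import random

def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def predict(row):
    # Stand-in for a fitted forest: main effect of x1 plus an x1-by-x2 interaction.
    x1, x2, _x3 = row
    return 3 * x1 + 2 * x1 * x2

def permuted_error_increase(rows, y, columns, rng):
    """Increase in MSE after shuffling the values of the given predictor columns."""
    baseline = mse(y, [predict(r) for r in rows])
    shuffled = [list(r) for r in rows]
    for col in columns:
        values = [r[col] for r in shuffled]
        rng.shuffle(values)
        for r, v in zip(shuffled, values):
            r[col] = v
    return mse(y, [predict(r) for r in shuffled]) - baseline

# Deterministic toy data: three binary predictors (x3 is irrelevant) plus fixed noise.
n = 200
rows = [(i % 2, (i // 2) % 2, (i // 4) % 2) for i in range(n)]
noise = [0.5 if (i // 8) % 2 == 0 else -0.5 for i in range(n)]
y = [predict(r) + e for r, e in zip(rows, noise)]

rng = random.Random(1)
variance_explained = 1 - mse(y, [predict(r) for r in rows]) / variance(y)
vimp_x1 = permuted_error_increase(rows, y, [0], rng)
vimp_x2 = permuted_error_increase(rows, y, [1], rng)
vimp_x3 = permuted_error_increase(rows, y, [2], rng)
joint_x1_x2 = permuted_error_increase(rows, y, [0, 1], rng)
interaction_x1_x2 = joint_x1_x2 - (vimp_x1 + vimp_x2)

print(variance_explained, vimp_x1, vimp_x2, vimp_x3, interaction_x1_x2)
```

In this toy setting the irrelevant predictor x3 has an importance of exactly zero, because shuffling it leaves every prediction unchanged.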

Supplementary Figure S1 shows the Spearman’s rank correlation coefficient between each pair of the 25 highest-ranked predictors.
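For readers unfamiliar with the statistic, Spearman's rank correlation is the Pearson correlation of the ranks, with tied values given their average rank. Ties matter here because SNP genotypes and stress items take only a few distinct values. The Python sketch below is a minimal illustration on made-up vectors; the function names are hypothetical and it is not the code used to produce the supplementary figure.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def pearson(a, b):
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (var_a * var_b) ** 0.5

def spearman(a, b):
    return pearson(average_ranks(a), average_ranks(b))

# Made-up minor-allele counts for two hypothetical SNPs in eight participants.
snp_one = [0, 1, 2, 1, 0, 0, 2, 1]
snp_two = [0, 1, 2, 1, 0, 1, 2, 0]
rho = spearman(snp_one, snp_two)
print(rho)
```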

Supplementary analyses

Many studies on the HPA axis and related genes in ADHD have shown that especially co-occurring conduct problems drive HPA-axis-related differences with typically developing controls. Given that conduct disorder was also among the most common comorbidities in this sample, we ran two additional RFR analyses aimed at providing an indication of the role of co-occurring conduct problems in our findings. We used the score on the CPRS subscale A, which has been found to specifically measure conduct problems rather than externalizing behaviors associated with ADHD in general.39 We ran one analysis where we added this measure as a predictor to the original model, with ADHD severity as outcome, and a second one where we used the score on the CPRS subscale A as outcome, adding ADHD severity to the set of predictors from the main analysis. See the Supplementary Information for more information on these analyses.

Results

The model explained 12.5% of the variance in ADHD severity. Permuting all SNPs simultaneously led to an 8.3% increase in mean-squared error compared with the intact model. For all stress items together, this was 25.3%. The 25 most important individual predictors are listed in Table 2, containing 20 SNPs and five stress items from the long-term difficulties questionnaire. Figure 1 visualizes the variable importance of every SNP individually, grouped by gene. Figure 2 displays the estimated strength of interaction between each of the top predictors. Figure 3, for illustrative purposes, depicts the interaction of the highest-ranked SNP, rs4635969, with each of the five highest-ranked stress items.

Table 2 Top 25 most important predictors, based on the increase in prediction error following permutation
Figure 1

Variable importance for prediction, for all single-nucleotide polymorphisms (SNPs) included in the analysis. SNPs are ordered on the x axis based on their genomic position, from chromosome 1 to 22, with the labels and alternating red and black sections marking the gene they belong to. The y axis indicates the variable importance, as percent increase in mean-squared error (MSE) of the out-of-bag predictions when the SNP was permuted. Those above the dashed blue line are part of the top 25 most important predictors, listed in Table 2.

Figure 2

Interaction strengths for each pair of 25 top predictors from the random forest analysis. These were calculated by subtracting the sum of the pair’s individual importance estimates from their joint importance estimate. Negative numerals indicate that one predictor made it more likely that the other was selected for a split in its subtree, positive numerals indicate this was less likely. The predictors are sorted on the basis of the first principal component of their interaction strengths.

Figure 3

Visualization of the interaction between SLC6A3 rs4635969 and each of the five long-term difficulties among the top predictors. The participants are grouped based on their genotype and exposure to the individual long-term difficulty shown on the x axis. On the y axis is the observed score on the Conners’ Parent Rating Scale (CPRS), subscale N. The boxes show the median, and the first and third quantiles of each group. Their width is scaled by the number of participants. ADHD, attention-deficit/hyperactivity disorder.

For both additional analyses into the role of conduct problems, we found the same long-term difficulties and the same SNPs in PER3, ESR1 and NR3C2 that were also among the top predictors in the main analysis. SNPs in SLC6A3, NPSR1, DRD4 and GABRA6 remained among the top hits when CPRS subscale A was included as a predictor, but not when it was used as the outcome. Detailed output can be found in the Supplementary Information.

Discussion

In this study, we estimated the importance of stress-related genes, in interaction with stress exposure, for predicting ADHD severity through random forest regression. The strengths of this method, namely the ability to handle high-dimensional data and to take into account all possible interactions, align well with the complexity of stress-response genetics. We reasoned that this would enable us to identify important contributors to ADHD severity, and to document how a multitude of SNPs from genes involved in HPA axis activity combined with stress exposure relates to ADHD.

The SNP with the highest estimated importance for predicting ADHD severity in our analysis, rs4635969, also showed the strongest interaction with a stressor. Multiple genome-wide association studies, together with a meta-analysis, have provided strong cumulative evidence that this SNP is also associated with risk for several forms of cancer.58 Although we included rs4635969 as part of the 3′ end-flanking region of SLC6A3, it is possible that this finding is explained by its close proximity to other genes, such as micro-RNA (MIR4457) at the 5′ end of the telomerase reverse transcriptase (TERT) gene, known to regulate telomere length.59 Overexpression of TERT increases cell proliferation and resilience to oxidative stress,60 whereas glucocorticoid administration and chronic stress exposure have been shown to lower basal telomerase activity and shorten telomere length.61, 62 Therefore, while the C-allele of rs4635969 is linked to cancer, individuals carrying the T-allele may be more vulnerable to stress exposure through inhibition of telomerase activity by the HPA axis. Our finding, together with reports on children’s telomere length being related to early social deprivation63 and hyperactivity/impulsivity,64 suggests this SNP is of interest for ADHD and G × E research.

The other high-ranked SNPs were in or near NPSR1, ESR1, GABRA6, PER3, DRD4, NR3C2 and OPRK1. Besides their associations with HPA axis activity (references listed in Supplementary Table S1), polymorphisms in these genes have all been repeatedly, but inconsistently, associated with internalizing and externalizing behavior often co-occurring with ADHD.65, 66, 67, 68, 69, 70, 71, 72 This inconsistency mirrors the heterogeneity of findings on the relation of cortisol with ADHD as well as with internalizing and externalizing behavior, which have indicated that low reactivity of the HPA axis is most prominent in individuals with ADHD and co-occurring externalizing disorders while high HPA axis activity relates more to anxiety and depression.9, 73 If early splits in a tree form groups more homogeneous with regard to, for instance, externalizing behavior, they allow for detection of other SNPs that impact ADHD severity only in these individuals and not in, for example, more internalizing individuals. These differential effects, analogous to interactions, would increase error in straightforward association studies while they get incorporated in the importance estimates produced by RFR. The ability of this technique to capture shared genetics of psychiatric disorders74 is corroborated by our additional analyses, showing that the polymorphisms in ESR1, NR3C2 and PER3 were also among the top predictors for our measure of conduct problems in this sample. The other top hits appeared to be more specific to ADHD. The reported associations may still be influenced by any of the range of co-occurring problems seen in ADHD, which would contribute to inconsistent findings across studies. Follow-up studies investigating the relation between ADHD and HPA-axis-related factors should therefore carefully consider comorbid conditions.

We further found that particularly long-term difficulties, compared with stressful life events, are important for predicting ADHD severity as well as co-occurring conduct problems. This stronger influence of chronic stress may be explained by the principles of the allostatic load model and its implications for psychiatric disorders.75 Allostatic load refers to the detrimental consequences of repeated stress, mediated by the long-term effects of stress hormones such as cortisol. Prolonged exposure to glucocorticoids is known to be particularly damaging to the prefrontal cortex and hippocampus, which is thought to contribute to the relation of stress with a range of psychiatric disorders.5, 76 High allostatic load may result from impaired feedback to the HPA axis leading to an extended stress response, and/or from low reactivity of one component inducing hyperactivity of other components of the stress-response system.77 Interactions between stressors, or between stressors and genetic variants, may therefore relate to how they strengthen each other’s effects on this system, leading to dysregulation and increased allostatic load. Neuroimaging data may be used to study the relation of polymorphisms, stressors, and their interactions with brain structure and activity, providing clues on how they influence the stress system, why they interact, and what their role is in ADHD.78, 79

We included many predictors in this analysis that are correlated with each other. Whether correlation between predictors, such as SNPs in regions of high LD, or exposure to different concurrent stressors, helps or hinders random forests depends on the aim of the study.80 Individual importance estimates of correlated predictors will be lowered because a split on one will reduce the likelihood of the other subsequently being selected and vice versa. This also influences the measures of interaction, which are calculated by subtracting the sum of the individual importance estimates from their joint importance estimate; as correlation will make it more likely that the two predictors are part of different (sub)trees, the interaction measure may become less negative or even become positive.52 Correlation between predictors may, however, be beneficial for the analysis of the type of high-dimensional data encountered in genetic studies; while it may lower the estimated importance of the SNP best tagging the true locus of effect, the estimates of nearby SNPs in LD will be raised and therefore may aid in its identification. This pattern is clearly visible in Figure 1 as the streak of dots below the top hits. This inflation of importance estimates for predictors surrounding the true effect does not take place under the null hypothesis of no association with the outcome,80 and therefore signals the authenticity of this effect. Further, correlated SNPs will increase the odds that interacting SNPs from another region are included in the same tree, thereby increasing the ability of the forest to incorporate the impact of interactions.56 This is particularly relevant for the small effects encountered in genetics, as this raises the number of trees that contain both SNPs and contribute to the calculation of their interaction strength. Therefore, while correlation may lower the quantitative measure of importance for the strongest predictor, it strengthens the confidence in the findings and more accurately captures the impact of groups of predictors.

The approach taken in this study should be seen as complementary to the conventional statistical techniques used in ADHD etiological studies. Random forest regression has great potential as an exploratory tool, given its ability to handle high-dimensional data, and to produce measures of importance. However, the interpretability of its results has been criticized; whereas the findings from conventional regression analyses can be relatively easily probed, for example, by plotting the association on the basis of the regression coefficients, the importance of a predictor as estimated by random forests contains its complex interaction structure with all other predictors included. Simulation studies have further shown that small interaction effects contribute to the overall predictive accuracy, but that current measures are unable to identify them.81 While gene–gene interactions may explain a considerable amount of the heritability of ADHD that currently remains unaccounted for,82, 83 the effects of individual SNPs and their interactions are likely to be small, and their estimated size is further diminished by the LD between SNPs with the current inclusive approach. This may explain the lack of noteworthy gene–gene interactions shown in Figure 2, whereas interactions between the strongest predictors, predominantly the long-term difficulties, do get identified.

It should be noted that this was a cross-sectional study, which precludes statements on the causal nature of the relations between the SNPs, the stressors and ADHD; the reported associations may therefore include gene–environment correlations. For instance, a polymorphism may both influence the odds of experiencing a stressor such as having a chronic illness or handicap and contribute to ADHD severity, although this does not make it any less of an interesting target for further research. Other stressors, such as having few friends, may partly result from ADHD-related behavior; the direction of effects may be teased apart by longitudinal studies. We further were unable to correct for the presence of siblings in the sample. Although we showed that the degree of non-independence was low and we did not perform any inferential statistics, we cannot rule out that the family design influenced the pattern of the results. We also chose not to add an additional, external round of validation because of the relatively small sample size for a genetics study, which limits confidence in the findings. Depending on the goal of the study and the available sample size, future studies may choose other approaches, such as a discovery-replication approach and/or LD pruning of the SNP selection.

To summarize, in this exploratory study, we aimed to illustrate the strengths of random forest regression, an ensemble learning method that may be useful for exploring high-dimensional data to discover associations with ADHD. Besides documenting how many factors with small effects come together to predict ADHD, this method enables detection of risk factors that may get overlooked due to interaction effects and that contribute to the many differences between individuals with ADHD. We took a three-step approach beginning with the distribution of all individual importance estimates, followed by extracting measures of interaction between the top predictors, and subsequently visualizing the most interesting G × E. Inference on such a selection, however, should take place in independent samples. We identified a novel association between ADHD severity and a SNP that may relate to TERT, suggesting an influence on telomere length in relation to stress sensitivity. The importance of other SNPs among the top predictors may reflect the ability of random forests to capture effects of polymorphisms that are relevant for only a specific subset of individuals, such as those with conduct problems, thereby contributing to inconsistent association of stress-response genes with ADHD. Our results also illustrated the strong effects of chronic stress, not found for individual stressful events, in accordance with allostatic load models.75 This explorative study may best be followed up by selecting the strongest predictors, analyzing whether the effects of this selection replicate in independent samples, and investigating how and why these are dependent on each other.