## Introduction

Attention-deficit/hyperactivity disorder (ADHD) is a common disorder affecting 5% of children and 3% of adults1. It is associated with injuries2, traffic accidents3, increased health care utilization4,5, substance abuse6,7, criminality8, unemployment1, divorce9, suicide10,11, AIDS risk behaviors12, and premature mortality13. The cost of adult ADHD to society is between \$77.5 billion and \$115.9 billion each year14.

## Materials and methods

### MRI samples

The current study was approved by all contributing members of the ENIGMA-ADHD Working Group, which provided T1-weighted structural MRI (sMRI) data from 4183 subjects from 35 participating sites (as of August 2019). Each participating site had approval from its local ethics committee to perform the study and to share de-identified, anonymized individual data. Images were processed using the consortium’s standard segmentation algorithms in FreeSurfer (V5.1 and V5.3)31. A total of 151 variables were used: 34 cortical surface areas, 34 cortical thickness measurements, and 7 subcortical regions from each hemisphere, plus intracranial volume (ICV). Subjects missing more than 50% of variables were removed. Remaining missing values and outliers (values outside 1.5 times the interquartile range) were replaced with imputed values using multiple imputation with chained equations in Stata 15. The final ML dataset consisted of 4042 subjects from 35 sites, of whom 45.8% were non-ADHD controls (n = 1850, male-to-female ratio (m/f) = 1.42) and 54.2% were ADHD participants (n = 2192, m/f = 2.79). Ages ranged from 4 to 63 years; 60.7% were children (age <18 years, n = 2454) and 39.3% were adults (age ≥18 years, n = 1588). ADHD diagnosis was significantly associated with sex (χ²(1) = 66.9, p < 0.0001), site (χ²(1) = 146.73, p < 0.0001), and age (χ²(1) = 4.28, p = 0.04).
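The subject-level filtering and outlier flagging described above can be sketched in a few lines. This is a minimal illustration, not the study's Stata code: the function names and the toy column names are hypothetical, and the outlier rule is assumed to be the usual Tukey fences (below Q1 − 1.5·IQR or above Q3 + 1.5·IQR). The multiple-imputation step itself is not reproduced; flagged values are simply set to missing so that an imputation routine could fill them.

```python
import numpy as np
import pandas as pd

# Feature count stated in the text: (34 areas + 34 thicknesses + 7 subcortical) per hemisphere, plus ICV
n_features = 2 * (34 + 34 + 7) + 1  # 151


def drop_sparse_subjects(features: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    """Remove subjects (rows) missing more than `max_missing` of their variables."""
    missing_fraction = features.isna().mean(axis=1)
    return features.loc[missing_fraction <= max_missing]


def mask_iqr_outliers(features: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Set values outside [Q1 - k*IQR, Q3 + k*IQR] to NaN, column by column (assumed rule)."""
    q1 = features.quantile(0.25)
    q3 = features.quantile(0.75)
    iqr = q3 - q1
    inside = features.ge(q1 - k * iqr, axis=1) & features.le(q3 + k * iqr, axis=1)
    return features.where(inside)


# Toy data with hypothetical column names: one extreme value and one fully missing subject
demo = pd.DataFrame({
    "thick_a": [2.5, 2.6, 2.4, 2.5, 9.0, np.nan],
    "area_b": [100.0, 102.0, 98.0, 101.0, 99.0, np.nan],
    "vol_c": [5.0, 5.1, 4.9, 5.0, 5.2, np.nan],
})
cleaned = mask_iqr_outliers(drop_sparse_subjects(demo))
```

In this toy example the last subject is dropped for missingness and the extreme `thick_a` value is set to missing, leaving it to be filled by whatever imputation method follows.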

To balance the confounding factors, we took the following steps. First, we randomly assigned samples to training (~70%), validation (~15%), and test (~15%) subsets within each diagnosis, sex, age subgroup (child vs. adult), and site, so that the three subsets had the same composition with respect to these variables. Twelve sites that provided only cases or only controls (203 subjects in total) were excluded from the initial train/validation/test split because their samples cannot support unbiased learning during training and validation. These samples were instead added to the test set for the final evaluation. Supplementary Table 1 shows the sample splitting for each site. Next, we balanced the case and control groups in the training set within each sex, age, and site subgroup by randomly oversampling the under-represented diagnostic group, a procedure commonly used to deal with class imbalance. The resulting balanced training set is described in Table 1. The validation and test sets were not balanced by age, sex, and site; however, because of our sample-splitting procedure, they have the same demographic composition as the training set. In addition, the test set contains samples from the sites that were excluded from the training set because they lacked a site-specific control group.
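The two steps above (splitting within strata, then oversampling the smaller diagnostic group within each training stratum) might look roughly like the following sketch. The column names (`site`, `sex`, `age_group`, `diagnosis`), the function names, and the toy cell sizes are assumptions made for illustration; the study's actual implementation is not given in the text.

```python
import numpy as np
import pandas as pd

STRATA = ["site", "sex", "age_group"]  # hypothetical column names


def split_within_strata(df, seed=0, frac=(0.70, 0.15, 0.15)):
    """Label rows train/validation/test within each diagnosis x site x sex x age-group cell."""
    rng = np.random.default_rng(seed)
    labels = pd.Series("train", index=df.index)
    for _, idx in df.groupby(["diagnosis"] + STRATA).groups.items():
        idx = rng.permutation(np.asarray(idx))
        n_train = int(round(frac[0] * len(idx)))
        n_val = int(round(frac[1] * len(idx)))
        labels.loc[idx[n_train:n_train + n_val]] = "validation"
        labels.loc[idx[n_train + n_val:]] = "test"
    return labels


def oversample_minority(train, seed=0):
    """Within each site x sex x age-group cell, resample the smaller diagnostic group with replacement."""
    rng = np.random.default_rng(seed)
    parts = []
    for _, cell in train.groupby(STRATA):
        counts = cell["diagnosis"].value_counts()
        target = counts.max()
        for dx, n in counts.items():
            group = cell[cell["diagnosis"] == dx]
            parts.append(group)
            if n < target:
                extra = rng.choice(group.index.to_numpy(), size=target - n, replace=True)
                parts.append(train.loc[extra])
    return pd.concat(parts, ignore_index=True)


# Toy cohort: 8 cells, each with 20 cases and 10 controls
rows = []
for site in ["A", "B"]:
    for sex in ["F", "M"]:
        for age_group in ["child", "adult"]:
            for dx, n in (("ADHD", 20), ("control", 10)):
                rows += [dict(site=site, sex=sex, age_group=age_group, diagnosis=dx)] * n
demo = pd.DataFrame(rows)
demo["subset"] = split_within_strata(demo)
train = demo[demo["subset"] == "train"]
balanced = oversample_minority(train)
```

Only the training subset is passed to `oversample_minority`, which mirrors the text: the validation and test subsets keep their natural case/control proportions.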

### Feature preprocessing

The high correlation among the 151 MRI features suggested the need for feature dimension reduction. Many prior studies have opted for feature selection, in which only the most important features are retained rather than all MRI features. Although this approach reduces the number of input features, it does not remove the high correlations among the selected features. Because prior MRI studies also suggested small but widespread differences between children with and without ADHD, we chose to use principal factors factor analysis (PFFA) for dimension reduction. With varimax rotation, PFFA on the sMRI features of the training set identified 46 factors that explained >90% of the variance. In other words, 46 uncorrelated factors were able to represent the majority (>90%) of the variance within the training dataset. We then computed factor scores for subjects in the validation and test sets based on the training set PFFA. We compared the original MRI features and the PFFA factors in a screening pipeline for nine different ML models (see below) to determine which set of features was better for the classifiers.
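The key point of this step is that the factor model is estimated on the training set only and then applied unchanged to the validation and test sets. The sketch below illustrates that pattern on simulated data. It swaps in scikit-learn's `FactorAnalysis` (a maximum-likelihood factor model with varimax rotation) for the principal-factor method named in the text, so it is an analogue rather than a reproduction, and it fixes the number of factors in advance instead of choosing it from the explained variance. All variable names and the simulated dimensions are hypothetical.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_latent, n_features = 5, 30  # toy sizes; the study reports 46 factors from 151 features
loadings = rng.normal(size=(n_latent, n_features))


def simulate(n_subjects):
    """Generate correlated features from a small number of latent factors plus noise."""
    latent = rng.normal(size=(n_subjects, n_latent))
    noise = 0.3 * rng.normal(size=(n_subjects, n_features))
    return latent @ loadings + noise


X_train, X_val, X_test = simulate(400), simulate(80), simulate(80)

# Fit the scaler and the factor model on the training set only
scaler = StandardScaler().fit(X_train)
fa = FactorAnalysis(n_components=n_latent, rotation="varimax", random_state=0)
fa.fit(scaler.transform(X_train))

# Score every subset with the training-set solution
F_train = fa.transform(scaler.transform(X_train))
F_val = fa.transform(scaler.transform(X_val))
F_test = fa.transform(scaler.transform(X_test))
```

Fitting on the training set alone keeps information from the validation and test subjects out of the feature construction.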

### Machine learning framework

Our ML framework starts with a screening pipeline in which nine different ML models were thoroughly evaluated. We used only the training and validation sets for this purpose, and we compared the results obtained with the original MRI features and with the PFFA factors. Children and adults were combined for the screening analysis. The screening pipeline used Scikit‐Learn’s grid search algorithm44 to search a large hyperparameter space for each of the models (see Supplementary Fig. 1 for details on these models and their hyperparameter spaces). We then compared the training and validation scores of all possible combinations of the hyperparameter sets. We used the area under the receiver operating characteristic (ROC) curve (AUC) as a measure of accuracy. To avoid overfitting, we chose the model with the highest validation AUC and a smaller training AUC, that is, a model whose training performance did not greatly exceed its validation performance. Because multilayer perceptron (MLP) neural network models met this criterion better than the other models, we used an MLP in the subsequent analysis.
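A screening loop of this kind can be sketched as follows. The example uses a fixed validation set and scikit-learn's `ParameterGrid` to enumerate hyperparameter combinations, recording training and validation AUC for each. The two candidate models, their grids, and the simulated data are placeholders, not the nine models or the search spaces of the study, and the tie-breaking rule is one possible reading of the selection criterion.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import ParameterGrid, train_test_split

# Simulated stand-in for the factor-score features and diagnosis labels
X, y = make_classification(n_samples=600, n_features=20, n_informative=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Placeholder candidates; the study screened nine model families
candidates = {
    "logistic_regression": (LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1.0]}),
    "random_forest": (
        RandomForestClassifier(n_estimators=50, random_state=0),
        {"max_depth": [2, 4, None]},
    ),
}

results = []
for name, (estimator, grid) in candidates.items():
    for params in ParameterGrid(grid):
        model = clone(estimator).set_params(**params).fit(X_train, y_train)
        results.append({
            "model": name,
            "params": params,
            "train_auc": roc_auc_score(y_train, model.predict_proba(X_train)[:, 1]),
            "val_auc": roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]),
        })

# Highest validation AUC first; ties go to the configuration with the smaller training AUC
best = max(results, key=lambda r: (r["val_auc"], -r["train_auc"]))
```

Keeping both scores for every configuration makes it easy to spot models whose training AUC is far above their validation AUC, which is the overfitting pattern the selection rule is meant to avoid.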

More detailed hyperparameter tuning for the MLP was carried out using the Keras API (version 2.3.1), the TensorFlow library (version 1.14.0), and HyperOpt45. The neural network hyperparameters and their search spaces were: the number of layers (1–3; performance deteriorated quickly when more than 3 layers were used), the number of units in each layer (4–500), the dropout rate in each layer (0.1–0.9), the learning rate (0.00001–0.01), and the batch normalization size (4–256). These hyperparameters were chosen for HyperOpt tuning because of their important role in effective learning and in avoiding local minima and overfitting. The numbers of layers and units determine the complexity of the model. The ideal complexity yields a converging model that learns the predictive features without overfitting the training examples. Early stopping was also implemented to avoid overfitting. We tested different activation functions (relu, selu, tanh) and optimizers (Adam, SGD, RMSprop, Adagrad, Adamax, Nadam). We used binary cross-entropy as the loss function. The best model architecture and hyperparameters were chosen based on the lowest total validation loss. Final test scores were obtained on the test set with an ensemble learning approach46. All ML algorithms were written in Python 3.5.
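To make the search space concrete, the sketch below draws random configurations from the ranges listed above using only the Python standard library. It is a plain random-search stand-in and does not use HyperOpt or its tree-structured Parzen estimator. Several details are assumptions: the learning rate is sampled log-uniformly, the 4–256 range is treated as a mini-batch size under the hypothetical key `batch_size`, and the function and key names are invented for illustration.

```python
import random

ACTIVATIONS = ["relu", "selu", "tanh"]
OPTIMIZERS = ["Adam", "SGD", "RMSprop", "Adagrad", "Adamax", "Nadam"]


def sample_mlp_config(rng):
    """Draw one MLP configuration from the ranges stated in the text."""
    n_layers = rng.randint(1, 3)  # inclusive bounds
    return {
        "n_layers": n_layers,
        "units": [rng.randint(4, 500) for _ in range(n_layers)],
        "dropout": [rng.uniform(0.1, 0.9) for _ in range(n_layers)],
        # Assumption: log-uniform sampling between 1e-5 and 1e-2
        "learning_rate": 10 ** rng.uniform(-5, -2),
        # Assumption: the 4-256 range is interpreted as a mini-batch size
        "batch_size": rng.randint(4, 256),
        "activation": rng.choice(ACTIVATIONS),
        "optimizer": rng.choice(OPTIMIZERS),
    }


rng = random.Random(0)
configs = [sample_mlp_config(rng) for _ in range(200)]
```

Each sampled dictionary could then be passed to a model-building function and scored by validation loss, with the lowest-loss configuration retained, in line with the selection rule described in the text.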

### Analysis pipeline

Our main analysis pipeline starts with two base models, each trained, validated, and tested on data from a single age group. The child model used only child samples during model training, validation, and hyperparameter optimization, and was tested on the child test set. The adult model, similarly, was trained and validated on the adult samples and tested on the adult test set. We examined models using MRI features only, as well as models that also included age and sex information. We also trained a combined model that used all the training data from both the child and adult groups, and compared its performance with that of the age-specific models.

Next, we sought to determine if the model trained and validated on the adult samples, the adult model, could be used to predict child ADHD, and vice versa. We hypothesized that if the ADHD vs. control sMRI differences seen in children are also present in adult ADHD brains, then the base models for each age group should be able to predict ADHD in the other age group. To create the largest test sets possible, we tested the child model on all the adult samples, and the adult model on all the child samples.

### Model evaluation

The sigmoid function in the output layer of the neural network generates a continuous score that estimates the probability that an individual is classified as ADHD. We call this continuous output the brain risk score. Using the brain risk scores, we calculated Cohen’s d effect sizes for the child and adult test sets. We computed ROC curves and used the area under the ROC curve (AUC) as our primary measure of accuracy. The AUC and its confidence intervals were calculated in Stata 15 using the empirical method, and AUCs were compared using the nonparametric approach of DeLong et al.47. We also computed precision-recall (PR) curves and reported the area under the PR curve, as well as the Brier loss for the final models, as measures of accuracy and goodness of fit.
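As a rough illustration of these metrics, the sketch below computes Cohen's d, ROC AUC, an area under the PR curve, and the Brier score from a vector of risk scores and binary labels. It uses scikit-learn rather than Stata, does not reproduce the confidence intervals or the DeLong comparison, and uses `average_precision_score` as the PR summary, which is a step-wise average precision rather than a trapezoidal area. The function name `cohens_d`, the pooled-standard-deviation form, and the four-subject toy data are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import average_precision_score, brier_score_loss, roc_auc_score


def cohens_d(scores, labels):
    """Cohen's d for cases (label 1) vs. controls (label 0), using the pooled standard deviation."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    cases, controls = scores[labels == 1], scores[labels == 0]
    n1, n0 = len(cases), len(controls)
    pooled_var = (
        (n1 - 1) * cases.var(ddof=1) + (n0 - 1) * controls.var(ddof=1)
    ) / (n1 + n0 - 2)
    return (cases.mean() - controls.mean()) / np.sqrt(pooled_var)


# Toy example: two controls followed by two cases
y_true = np.array([0, 0, 1, 1])
risk = np.array([0.1, 0.2, 0.3, 0.4])  # hypothetical brain risk scores

metrics = {
    "cohens_d": cohens_d(risk, y_true),
    "auc": roc_auc_score(y_true, risk),
    "auprc": average_precision_score(y_true, risk),
    "brier": brier_score_loss(y_true, risk),
}
```

In this toy case the scores rank every case above every control, so the AUC and the average precision are both 1, while the Brier score (0.225) still penalizes the fact that the predicted probabilities for cases are well below 1.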

## Results

The screening results (Supplementary Fig. 1) showed that using principal factors as input features greatly improved the classifiers’ performance compared with the original MRI features, as demonstrated by the higher validation AUCs achieved in many models. Using principal factors, the MLP outperformed all other models; it was therefore chosen as the base model and used in the main analysis after additional fine-tuning of the hyperparameters. The final MLP models’ hyperparameters are listed in Supplementary Table 2.

Figure 1A (top portion) shows the test set AUCs (as dots) and their 95% confidence intervals (as horizontal lines) for the base models using only MRI factors. The model trained and validated on child data predicted child ADHD with a significant AUC of 0.64 (95%CI 0.58–0.69). In contrast, the model trained and validated on adult data did not achieve a significant AUC (0.56, 95%CI 0.49–0.62, p = 0.057). ROC curves for the two base models are shown in Supplementary Fig. 2A. The difference between the two base models’ AUCs was not significant (χ²(1) = 3.4, p = 0.065). The area under the precision-recall curve (AUPRC) was higher for the adult model (AUPRC = 0.74) than for the child model (AUPRC = 0.68). Using the model-predicted brain risk scores, we calculated the Cohen’s d effect sizes in the test set to be 0.47 (95%CI: 0.27–0.68) for the child samples and 0.15 (95%CI: −0.08–0.39) for the adult samples.

After age and sex were added as predictors, the AUC of the adult model (Fig. 1B, top) increased to 0.62 (95%CI 0.56–0.69, p = 0.002). Although the prediction AUC was now significant, the increase from the base model without age and sex was not statistically significant (χ²(1) = 2.01, p = 0.15). The AUPRC for the adult model also increased slightly, to 0.79. Adding age and sex as predictors to the child model affected neither the AUC nor the AUPRC. ROC curves of the two models are plotted in Supplementary Fig. 2B. The Cohen’s d effect sizes in the test set were 0.48 (95%CI: 0.27–0.69) for children and 0.39 (95%CI: 0.15–0.63) for adults. All of the above models had similarly small Brier scores (0.25).

The combined model with MRI features produced an overall test AUC of 0.60 (95%CI 0.55–0.64). The test AUC was 0.64 (95%CI 0.58–0.69) on the child subset and 0.54 (95%CI 0.47–0.60) on the adult subset, comparable to those from the age-specific models. Similarly, the combined model with MRI, age, and sex features produced an overall AUC of 0.63 (95%CI 0.59–0.67). The subset test AUC was 0.65 (95%CI 0.60–0.71) on the child subset and 0.56 (95%CI 0.49–0.63) on the adult subset, also statistically comparable to those of the age-specific models.

Because the training samples had been balanced for age and sex, these variables should not be predictive of ADHD in either the child or adult test sets. To verify this, we fit a linear regression using only age, sex, and their interactions to predict ADHD in the child and adult samples; it yielded non-significant AUCs (child AUC 0.51, 95%CI: 0.45–0.57; adult AUC 0.46, 95%CI: 0.39–0.53).

### Tests of hypotheses

For models using only MRI features, neither the adult nor the child model was successful at predicting ADHD in the other age group (Fig. 1A, bottom). However, the adult model that used MRI features together with age and sex significantly predicted ADHD in the child samples (AUC = 0.60, 95%CI: 0.58–0.62, Fig. 1B bottom). The Cohen’s d effect size for children, based on the adult model predictions, was 0.17 (95%CI: 0.10–0.24), smaller than the effect sizes obtained from the age-matched models. The child model that used MRI features together with age and sex did not significantly predict ADHD when applied to the adult samples (AUC = 0.53, 95%CI: 0.49–0.56, Fig. 1B bottom). ROC curves of both models tested on the other age group are plotted in Supplementary Fig. 2C.