Alcohol use effects on adolescent brain development revealed by simultaneously removing confounding factors, identifying morphometric patterns, and classifying individuals

Group analysis of brain magnetic resonance imaging (MRI) metrics frequently employs generalized additive models (GAM) to remove contributions of confounding factors before identifying cohort-specific characteristics. For example, the National Consortium on Alcohol and NeuroDevelopment in Adolescence (NCANDA) used such an approach to identify effects of alcohol misuse on the developing brain. Here, we hypothesized that removing confounding factors before group analysis also removes information relevant for distinguishing adolescents with a drinking history from those without. To test this hypothesis, we introduce a machine-learning model that identifies cohort-specific, neuromorphometric patterns by simultaneously training a GAM and a generic classifier on macrostructural MRI and microstructural diffusion tensor imaging (DTI) metrics, and compare it to more traditional group-analysis and machine-learning approaches. Using a baseline NCANDA MR dataset (N = 705), the proposed machine-learning approach identified a pattern of eight brain regions unique to adolescents who misuse alcohol. Classifying high-drinking adolescents was more accurate with that pattern than with regions identified by alternative approaches. The findings of the joint-model approach thus were (1) impartial to confounding factors; (2) relevant to drinking behaviors; and (3) in agreement with the alcohol literature.


Cross-Validation
The accuracy of each implementation was measured via cross-validation, a popular approach in the machine-learning community because it minimizes the risk of reporting overly optimistic accuracy scores by repeatedly training and testing an implementation on separate subsets of the data. Specifically, the data set was divided into two non-overlapping subsets such that each subset preserved most of the characteristics of the complete data set (e.g., within each subset, the ratio of samples between the two cohorts was consistent, and the two cohorts were matched with respect to ethnicity, sex, scanner type, and supratentorial volume). In neither subset were the cohorts matched with respect to age (p < 0.0001). With respect to socioeconomic status 11 , the cohorts matched in one subset (p = 0.138) but not in the other (p = 0.013). Given the small p-value of socioeconomic status on the entire data set (p = 0.0036) and the small number of regular drinkers (N = 34), matching both subsets with respect to socioeconomic status would have required omitting samples from the study, which would have compromised the integrity of this analysis.
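The two-fold split described above can be sketched as follows. This is a minimal illustration, not the NCANDA pipeline: the cohort labels and the single continuous confound are made-up stand-ins, and the matching check is reduced to a two-sample t-test on the confound across folds.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the samples: a binary cohort label
# (0 = minimal alcohol exposure, 1 = regular drinker) and one continuous
# confound (e.g., supratentorial volume) per subject.
cohort = np.array([0] * 20 + [1] * 20)
confound = rng.normal(1000.0, 50.0, size=40)

def stratified_two_fold(cohort, rng):
    """Split indices into two non-overlapping folds that preserve the cohort ratio."""
    folds = [[], []]
    for label in np.unique(cohort):
        idx = rng.permutation(np.flatnonzero(cohort == label))
        half = len(idx) // 2
        folds[0].extend(idx[:half])
        folds[1].extend(idx[half:])
    return [np.sort(np.array(f)) for f in folds]

fold_a, fold_b = stratified_two_fold(cohort, rng)

# Each fold keeps the 1:1 cohort ratio of the full data set ...
assert cohort[fold_a].mean() == cohort[fold_b].mean() == 0.5
# ... and a large p-value indicates the folds are matched on the confound.
_, p = ttest_ind(confound[fold_a], confound[fold_b])
print(f"confound match across folds: p = {p:.3f}")
```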
Each implementation was then trained on the first subset using a variety of algorithmic parameter settings. Specifically, the search space for the sparsity setting 'N_K' of the logistic classifier was bounded below by the smallest pattern consisting of more than one element and above by half the number of imaging scores, i.e., N_K ∈ {2, 4, . . . , 16} for JoiSTR-GAM-Class, N_K ∈ {2, 4, . . . , 56} for JoiDTI-GAM-Class, and N_K ∈ {2, 4, . . . , 72} for all other implementations. In addition, the robust regression of Seq-GAMRob-Class required setting the optimal 'scaling' parameter, for which the search range was {0, 0.5, . . . , 6}. Note that the classification accuracy of Seq-GAMRob-Class on the training data varied by almost 5% depending on the specific setting of that parameter. Finally, the joint implementations (i.e., JoiSTR-GAM-Class, JoiDTI-GAM-Class, and Joi-GAM-Class) weighted the importance of the GAM model relative to the logistic classifier through the weight 'γ', which varied over {0.1, 0.2, . . . , 0.9}, with γ = 0.1 focusing mostly on improving classification accuracy and γ = 0.9 aiming to determine the optimal GAM model.
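The search grids above can be enumerated mechanically. A small sketch, with two assumptions flagged: the implementation names are used only as dictionary keys, and the total numbers of imaging scores (32, 112, 144) are inferred from the stated upper bounds being "half of the imaging scores".

```python
# Sparsity N_K runs over even values from 2 up to half the number of
# imaging scores; gamma trades classification accuracy (gamma -> 0.1)
# against fitting the GAM (gamma -> 0.9).
def sparsity_grid(n_scores):
    """Even sparsity settings from 2 up to n_scores // 2 (inclusive)."""
    return list(range(2, n_scores // 2 + 1, 2))

grids = {
    "JoiSTR-GAM-Class": sparsity_grid(32),   # N_K in {2, 4, ..., 16}
    "JoiDTI-GAM-Class": sparsity_grid(112),  # N_K in {2, 4, ..., 56}
    "Joi-GAM-Class":    sparsity_grid(144),  # N_K in {2, 4, ..., 72}
}
gamma_grid = [round(0.1 * k, 1) for k in range(1, 10)]  # {0.1, 0.2, ..., 0.9}
scaling_grid = [0.5 * k for k in range(13)]             # {0, 0.5, ..., 6}
```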
For each implementation and parameter setting, the training determined the optimal values of the GAM variables α_{i,0}, . . . , α_{i,3} for each image score 'i' and selected the corresponding residual image scores (i.e., patterns) that led to the highest normalized-accuracy of the classifier on the training data. Computing the normalized-accuracy required first recording the accuracy of the classifier in correctly labeling subjects of the minimal alcohol exposed cohort and the accuracy in correctly labeling the regular drinkers, and then averaging the two resulting (cohort-specific) accuracy scores. For each implementation, the classifiers (and corresponding patterns) across all training runs (i.e., settings) were then combined into a single ensemble of classifiers 12 , which computed the weighted average across the decisions of all classifiers, with the weight of each classifier defined by its training accuracy.
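The two quantities above are easy to make concrete. The following sketch implements normalized-accuracy (the average of the two cohort-specific accuracies, i.e., balanced accuracy) and a training-accuracy-weighted ensemble decision; the toy labels, probabilities, and weights are made up for illustration.

```python
import numpy as np

def normalized_accuracy(y_true, y_pred):
    """Average of the two cohort-specific accuracies (balanced accuracy)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    acc_per_cohort = [np.mean(y_pred[y_true == c] == c) for c in (0, 1)]
    return float(np.mean(acc_per_cohort))

def ensemble_decision(probabilities, train_accuracies):
    """Weighted average of classifier decisions, weighted by training accuracy."""
    w = np.asarray(train_accuracies, dtype=float)
    p = np.average(np.asarray(probabilities, dtype=float), axis=0, weights=w)
    return (p >= 0.5).astype(int)

# Toy example: 6 subjects, imbalanced cohorts (4 controls, 2 drinkers).
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]                  # all controls right, one drinker missed
print(normalized_accuracy(y_true, y_pred))   # (1.0 + 0.5) / 2 = 0.75

# Two classifiers from different training runs, combined by training accuracy.
probs = [[0.9, 0.8, 0.2, 0.1, 0.7, 0.6],
         [0.4, 0.1, 0.3, 0.2, 0.9, 0.4]]
print(ensemble_decision(probs, train_accuracies=[0.6, 0.8]))
```

Note how normalized-accuracy rewards getting both cohorts right equally, so the small regular-drinking cohort cannot be ignored by a classifier that simply predicts the majority class.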
To 'test' the ensemble of classifiers of each implementation, it was applied to the second subset, and the labeling decisions (i.e., assigning a sample to the minimal alcohol exposed cohort or the regular drinking cohort) were recorded. The process of training and testing was repeated a second time using the second subset for training and the first subset for testing. The testing accuracy of each implementation was then summarized by computing a number of measures (i.e., sensitivity, specificity, Area Under the receiver operating characteristic Curve (AUC), normalized-accuracy, matched-accuracy, age-test, and socioeconomic status test) with respect to the cohort assignment of the testing data. To compute the matched-accuracy, 34 minimal alcohol exposed adolescents were matched to the 34 regular drinking subjects with respect to all confounding factors (age: p = 0.12; socioeconomic status: p = 0.2; supratentorial volume: p = 0.61; sex: p = 1.0; ethnicity: p = 1.0; scanner: p = 1.0), and the normalized-accuracy was then computed with respect to that matched set. To compute the age-test, the minimal alcohol exposed adolescents of the NCANDA data set were divided into an older (i.e., above the age of 15.4 years) and a younger (i.e., below the age of 15.5 years) cohort, so that the cohorts were almost equal in size (older cohort: N = 335; younger cohort: N = 336) and matched with respect to all confounding factors but age (supratentorial volume: p = 0.4410; socioeconomic status: p = 0.1277; sex: p = 0.2026; race: p = 0.1685; scanner: p = 0.1334). The two-tailed Fisher's exact test was then applied to the earlier recorded labelings to determine whether the labeling was significantly better than chance in assigning minimal alcohol exposed individuals to one of those two cohorts.
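The age-test reduces to a Fisher's exact test on a 2×2 contingency table. A sketch using `scipy.stats.fisher_exact`; the cell counts here are invented (they only respect the stated cohort sizes of 335 and 336), so the resulting p-value is illustrative, not a reported result.

```python
from scipy.stats import fisher_exact

# Hypothetical 2x2 table for the age-test: rows are the classifier's label
# (predicted "minimal exposure" vs. "regular drinker"), columns are the
# older/younger split of the minimal alcohol exposed adolescents.
table = [[170, 175],   # labeled minimal exposure: older, younger
         [165, 161]]   # labeled regular drinker:  older, younger
_, p = fisher_exact(table, alternative="two-sided")

# The implementation passes the age-test if p > 0.01, i.e., the labeling
# does not separate the two age cohorts better than chance.
print(f"age-test p = {p:.3f}, passed = {p > 0.01}")
```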
Implementations passed the age-test if p > 0.01, i.e., for those implementations that reported significant normalized-accuracy and significant matched-accuracy, the effect of age was magnitudes smaller than the effect of regular drinking.
Similar to the age-test, an implementation passed the socioeconomic status test if the two-tailed Fisher's exact test returned p > 0.01 with respect to the classification results correctly assigning minimal alcohol exposed individuals to the cohort with higher socioeconomic status (≥ 17) or lower socioeconomic status (< 17). These two cohorts, however, were only matched with respect to age (p = 0.26), sex (p = 0.18), and scanner (p = 0.24), as socioeconomic status was highly correlated with ethnicity and supratentorial volume in minimal alcohol exposed individuals (p < 0.001 according to Pearson's correlation).
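The correlation check that motivates the incomplete matching can be sketched with `scipy.stats.pearsonr`. The data here are synthetic with a built-in linear relationship, so a significant correlation is expected by construction; the variable names mirror the text but are stand-ins.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)

# Hypothetical check: is socioeconomic status correlated with a continuous
# confound (e.g., supratentorial volume)? Synthetic data with a linear link.
ses = rng.normal(17.0, 3.0, size=200)
volume = 1000.0 + 20.0 * ses + rng.normal(0.0, 30.0, size=200)

r, p = pearsonr(ses, volume)
# A significant correlation (p < 0.001) means the SES cohorts cannot
# simultaneously be matched on this confound.
print(f"r = {r:.2f}, p = {p:.1e}")
```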
For each implementation, we also recorded the frequency of patterns appearing across all training runs. The frequency of a pattern of size 'N_K' was defined as the number of times it appeared as part of a pattern selected by a training run divided by the number of training runs that searched for patterns of at least size 'N_K'. This computation thus accounts for larger patterns not being able to be part of patterns selected by training runs searching for smaller ones. Patterns of an implementation were then labeled as highly informative for separating the two cohorts if their frequency was higher than 50%, i.e., they appeared in a majority of patterns (of the same size or larger) selected by training runs.
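The frequency definition above can be written down directly. A minimal sketch, where patterns are represented as sets of region indices and the four selected patterns are invented for illustration:

```python
def pattern_frequency(pattern, selected_patterns):
    """Fraction of training runs whose selected pattern contains `pattern`,
    counting only runs that searched for patterns of at least its size."""
    size = len(pattern)
    eligible = [s for s in selected_patterns if len(s) >= size]
    hits = sum(1 for s in eligible if set(pattern) <= set(s))
    return hits / len(eligible)

# Toy example: patterns selected by four training runs (region indices).
selected = [{1, 2}, {1, 2, 3, 4}, {2, 5}, {1, 2, 3, 4, 5, 6}]
freq = pattern_frequency({1, 2}, selected)
print(freq)        # contained in 3 of the 4 eligible runs -> 0.75
print(freq > 0.5)  # "highly informative" under the 50% threshold -> True
```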

Notes on Training of Sequential and Joint Methods
The computational time associated with convergence (i.e., the training of the Matlab implementation of Joi-GAM-Class) was up to 5 minutes on a single-core PC. One way of speeding up the training is to replace Penalty Decomposition with a more commonly used sparse solver 14 that determines a sparse solution by relaxing the ℓ0-'norm' to the ℓ1-norm ‖·‖1. The corresponding joint implementation again outperformed the corresponding sequential one (i.e., sparse logistic classification based on the ℓ1-norm). However, this implementation was significantly less accurate than Joi-GAM-Class, so its results were omitted from Table 1 for clarity. Once the joint method converged, one can readily show that applying the resulting optimal parameter setting (Φ, ν, ω) to the joint or the sequential approach results in the same classification. In other words, the testing accuracy of the joint approach can be measured by first regressing out the effect of confounding factors from the raw imaging scores and then applying a logistic classifier solely to the residual image scores, i.e., the classifier's decision is made without knowledge of the confounding factors.
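The test-time equivalence claimed above can be demonstrated numerically for the linear case. In this sketch, a linear confound model stands in for the fitted GAM and a linear score for the logistic classifier's decision function; all names and dimensions are hypothetical, and the point is only that, with parameters fixed after training, "residualize then classify" and the combined joint decision function yield identical scores.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins: 8 subjects, 3 imaging scores, 2 confounding factors.
X = rng.normal(size=(8, 3))      # raw imaging scores
C = rng.normal(size=(8, 2))      # confounding factors
B = rng.normal(size=(2, 3))      # confound coefficients (fixed after training)
w = rng.normal(size=3)           # classifier weights
b = 0.1                          # classifier bias

# Sequential view: regress out the confounds, then classify the residuals
# without further access to the confounding factors.
residuals = X - C @ B
scores_seq = residuals @ w + b

# Joint view at test time: the same fixed parameters plugged into the
# combined decision function; by linearity the scores are identical.
scores_joint = X @ w - C @ (B @ w) + b

assert np.allclose(scores_seq, scores_joint)
labels = (scores_seq > 0).astype(int)
print(labels)
```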