## Introduction

Substance use disorders (SUD) and addiction represent a global public health problem with substantial socioeconomic implications1,2. In 2010, 147.5 million cases of alcohol and drug abuse were reported (Whiteford et al., 2015), and SUD prevalence is expected to increase over time. Genetic factors have been implicated in SUD etiology, with genes involved in the regulation of several neurobiological systems (including dopaminergic and glutamatergic) found to be important (for a review, see Prom-Wormley et al., 2017)3. However, limitations intrinsic to most genetic epidemiological studies support the search for additional risk genes.

## Subjects and methods

### Subjects

We used independent populations from disparate regions of the world (n = 2698) ascertained through patients affected with ADHD co-morbid with disruptive behaviors (Paisa, Spanish and MTA samples) or SUD (Spanish and Kentucky samples).

### Paisa sample

This population isolate is unique in that it was used to identify ADHD susceptibility genes by linkage and association strategies. Detailed clinical and demographic information on this sample has been published elsewhere23,25,29. The sample consists of 1176 people (adults, adolescents, and children), mean age 28 ± 17 years, ascertained from 18 extended multigenerational and 136 nuclear Paisa families inhabiting the Medellin metropolitan area in the State of Antioquia, Colombia. Initial coded pedigrees were obtained through a fixed sampling scheme from a parent or grandparent of an index proband after having collected written informed consent from all subjects or their parent/guardian, as approved by the University of Antioquia and the NIH Ethics Committees, and in accordance with the Helsinki Declaration. Patients were recruited under NHGRI protocol 00-HG-0058 (NCT00046059).

Exclusion criteria for ADHD participants were IQ < 80, or any autistic or psychotic disorders. Parents underwent a full psychiatric structured interview regarding their offspring, the Diagnostic Interview for Children and Adolescents-Revised-Parents version (DICA-IV-P; Spanish version translated with permission from Dr. Wendy Reich, Washington University, St. Louis). All adult participants were assessed using the Composite International Diagnostic Interview (CIDI), as well as the Disruptive Behavior Disorders module from the DICA-IV-P modified for retrospective use. The interview was conducted by a “blind” rater (either a psychologist, a neuropsychologist, or a psychiatrist) at the Neurosciences Clinic of the University of Antioquia, or during home visits. ADHD status was defined by the best estimate method. Specific information regarding clinical diagnoses and co-morbid disruptive disorders, affective disorders, anxiety, and substance use has been published elsewhere3.

Of the 1176 individuals in this cohort, only founder members were included in analyses (n = 472). This was done to avoid kinship relatedness bias and to exclude children and adolescents, as they may not yet have been exposed to substances of abuse. Of these 472 individuals, 17% (n = 79) fulfilled criteria for ADHD, 17% (n = 78) for ODD, 18% (n = 84) for CD, 22% for nicotine dependence (n = 102), 27% for alcohol dependence (n = 124), 3% for drug dependence (n = 12), 37% for social/simple phobia (n = 156), 13% for any other anxiety disorder (n = 58), and 25% for major depressive disorder (n = 117) (Table 1).

### Spanish sample

The SUD sample consisted of 494 adults (mean age 37 ± 9 years and 76% males, n = 376) recruited and evaluated at the Addiction and Dual Diagnosis Unit of the Psychiatry Department at the Hospital Universitari Vall d’Hebron with the Structured Clinical Interview for DSM-IV Axis I Disorders (SCID-I). All patients fulfilled DSM-IV criteria for drug dependence beyond nicotine dependence. None were evaluated for ADHD.

The control sample consisted of 483 blood donors (mean age 42 ± 20 years, 74% males) in which DSM-IV lifetime ADHD symptomatology was excluded under the following criteria: (1) not having been diagnosed with ADHD and (2) answering negatively to the lifetime presence of the following DSM-IV ADHD symptoms: (a) often has trouble keeping attention on tasks, (b) often loses things needed for tasks, (c) often fidgets with hands or feet or squirms in seat, and (d) often gets up from seat when remaining in seat is expected. Individuals affected with SUD were excluded from this sample. None of them had self-administered drugs intravenously. It is important to mention that the exposure criterion was not applied; therefore, this set cannot be classified as “pure” controls.

All patients and controls were Spanish of Caucasian descent. This study was approved by the ethics committee of the Hospital Universitari Vall d’Hebron and informed consent was obtained from all subjects in accordance with the Helsinki Declaration.

### MTA sample

DSM-IV abuse or dependence was based on a positive parent or child report with the Diagnostic Interview Schedule for Children version 2.3/3.0 (DISC)46 at the 6- and 8-year follow-up assessments. The DISC includes both lifetime and past year diagnoses. The Diagnostic Interview Schedule-IV47 was used at the 8-year follow-up for 18+ year-olds (n = 111). SUD was defined as the lifetime presence of any abuse or dependence (excluding tobacco dependence, due to differences in the meaning of abuse/dependence for tobacco versus other substances).

Additional analyses explored SUD for alcohol, tobacco, and cannabis/other drugs (recreational or misused prescription medications) separately10. All patients in this study provided informed written consent as approved by the NIH Ethics Committee.

### Kentucky sample

A sample of 560 inpatients and outpatients with severe SUD from Central Kentucky psychiatric facilities was collected during a pharmacogenetics investigation48. Patient interviews and medical record information (including urine drug screens and substance abuse counselor notes) were used by the research nurse to assess the Clinician Rating of Alcohol Use Disorder (CRAUD) and Drug Use Disorder (CRDUD)49,50, each of which provides a score from 1 = abstinence (not used in the assessed period) to 5 = severe dependence. Scores of 3 and higher are pathological and were considered positive in our analyses. All drugs were combined into one rating48. Descriptions of the training provided to research nurses to assess the CRAUD and CRDUD were published elsewhere48,51.

DNA was available from 533 of 560 study subjects. Of the 533 subjects with available DNA, 53% (n = 285) were male, 82% (n = 436) were Caucasian, 16% (n = 87) were African American, and 2% (n = 10) were from other ethnicities. Additional clinical information for this sample has been described elsewhere48,51 and included: (1) clinical diagnosis obtained from medical records, (2) prior psychiatric history, (3) history of daily smoking, (4) reviews of current and psychiatric medication use, and (5) body mass index (Supplemental Table 2). All participants in the Kentucky study provided informed written consent as approved by the University of Kentucky IRB.

### Genotyping

DNA was extracted from whole blood (Paisa, Spanish and MTA samples) or buccal swabs (Kentucky sample) using standard protocols. The Paisa sample was genotyped using the service provided by Illumina (San Diego, CA). The Spanish, MTA, and Kentucky samples were genotyped for select variants using pre-designed TaqMan® SNP genotyping assays (Thermo Fisher Scientific, Waltham, MA). Allelic discrimination real-time PCR reactions were performed in a 384-well plate format for each individual sample according to the manufacturer’s instructions. Briefly, 20 ng of genomic DNA were mixed with 2.5 μL of 2X TaqMan Universal PCR Master Mix and 0.25 μL of 20X SNP Genotyping Assay in a total volume of 5 μL per reaction. Assays were run in an ABI 7900HT Fast Real-Time PCR System (Thermo Fisher Scientific). Allele calling was made by end-point fluorescent signal analysis using the ABI’s SDS2.3 software. In addition, we had previously collected exome genotype data from the MTA sample26 using the Infinium® HumanExome-12 v1.2 BeadChip kit (Illumina), which covers putative functional exonic variants selected from over 12,000 individual exome and whole-genome sequences. Processed and raw intensity signals for the array data can be accessed at GEO (GSE112652). SNP markers located within the ADGRL3 gene were extracted from this dataset and added to those genotyped using TaqMan® assays.

### Dataset quality control and preparation for analysis

Genotype data were imported into Golden Helix® SVS 8.3.1 (Golden Helix, Bozeman, MT) for quality control analysis. Markers with a minor allele frequency (MAF) < 0.01 (rare variants), significant deviation from Hardy–Weinberg equilibrium (P-values < 0.0001), or a genotyping success rate < 90% were excluded. For the Paisa and Spanish samples, a subset of variants in the ADGRL3 minimal critical region (MCR), 5′UTR and 3′UTR were selected based on a previous ADHD association study30. Because the Paisa sample is a family-based cohort and recursive-partitioning analysis does not correct for kinship relatedness, only founder members from the pedigrees were included in the analyses. For the MTA sample, a total of 8568 markers with a MAF ≥ 1.0% from the 244,414 markers genotyped with the exome chip were retained after linkage disequilibrium (LD) pruning, and variants within ADGRL3 were selected for analyses. For the Kentucky sample, only four ADGRL3 variants were selected for analyses after LD pruning of a list of markers located within the ADGRL3 5′UTR and MCR regions that was available to us. Variants rs7659636 and rs5010235 had been imputed from ADHD genome-wide association data funded through the Genetics Analysis Information Network (GAIN) initiative, a public-private partnership between the NIH and the private sector (https://www.genome.gov/19518664/genetic-association-information-network-gain/#al-4). ADGRL3 variants used in this study for each cohort are presented in Supplemental Table 3.
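The marker-level exclusion criteria above can be illustrated with a minimal sketch. This is not the Golden Helix SVS implementation: the function names and the genotype-count inputs are hypothetical, and the chi-square test with one degree of freedom is an assumption, since the text does not state which Hardy–Weinberg test was used.

```python
import math

def maf(n_aa, n_ab, n_bb):
    """Minor allele frequency from genotype counts (AA, AB, BB)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    return min(p, 1 - p)

def hwe_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 df) test for Hardy-Weinberg equilibrium (assumed test)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        return 1.0  # monomorphic marker: the test is undefined
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square distribution with 1 df
    return math.erfc(math.sqrt(chi2 / 2))

def passes_qc(n_aa, n_ab, n_bb, n_missing,
              min_maf=0.01, min_hwe_p=1e-4, min_call_rate=0.90):
    """Keep a marker only if it fails none of the three exclusion criteria."""
    called = n_aa + n_ab + n_bb
    if called == 0:
        return False
    call_rate = called / (called + n_missing)
    return (call_rate >= min_call_rate
            and maf(n_aa, n_ab, n_bb) >= min_maf
            and hwe_pvalue(n_aa, n_ab, n_bb) >= min_hwe_p)
```

A marker is dropped as soon as any one criterion is violated, which is why the three conditions are combined with a logical "and" on the retention side.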

### Advanced recursive-partitioning (tree-based) approach (ARPA)

Association studies of ADGRL3 variants with ADHD, ODD, CD, response to stimulant treatment and severity outcome have been published elsewhere for the Paisa and Spanish populations24,29,32,52. We used ARPA to build a predictive framework to forecast the behavioral outcome of children with ADHD, suitable for translational applications. Our goal was to test the hypothesis that ADGRL3 variants predisposing to ADHD also increase the risk of co-morbid disruptive symptoms, including SUD.

ARPA is a tree-based method widely used in predictive analyses because it accounts for non-linear and interaction effects, offers fast solutions to reveal hidden complex substructures and provides truly non-biased statistically significant analyses of high-dimension, seemingly unrelated data53. In a visionary manuscript, D.C. Rao suggested that recursive-partitioning techniques could be useful for genetic dissection of complex traits54. ARPA accounts for the effect of hidden interactions better than alternative methods, and is independent of the type of data (i.e., categorical, continuous, ordinal, etc.) and of the type of data distribution (i.e., fitting or not fitting normality)54. Furthermore, results supplied by tree-based analytics are easy to interpret visually and logically53. Therefore, to generate the most comprehensive and parsimonious classificatory model to predict the susceptibility to disruptive behaviors, we applied ARPA using a set of different modules implemented in the Salford Predictive Modeler® (SPM) software, namely, Classification and Regression Trees (CART), Random Forest, and TreeNet (http://www.salford-systems.com). One important advantage of SPM when compared to other available data mining software is its ability to use raw data with sparse or empty cells, a problem frequently encountered in genetic data.

Briefly, CART is a non-parametric approach whereby a series of recursive subdivisions separate the data by dichotomization55. The aim is to identify, at each partition step, the best predictive variable and its best corresponding splitting value while optimizing a splitting statistical criterion, so that the dataset can be successfully split into increasingly homogeneous subgroups55. We used a battery of different statistical criteria as splitting rules (e.g., GINI Index, Entropy, and Twoing) to determine the splitting rule, maximally decreasing the relative cost of the tree while increasing the prediction accuracy of target variable categories55. The best split at each dichotomous node was chosen by either a measure of between-node dissimilarity or iterative hypothesis testing of all possible splits to find the most homogeneous split (lowest impurity). Similarly, we used a wide range of empirical probabilities (priors) to model numerous scenarios recreating the distribution of the targeted variable categories in the population55. Following this iterative process, each terminal node was assigned to a class outcome. To avoid finishing with an over-fitted CART predictive model (a common problem in CART analyses), and to ensure that the final splits were well substantiated, we applied tree pruning. During the procedure, predictor variables that were close competitors (surrogate predictors with comparable overall classification error to the optimal predictors) were pruned to eliminate redundant commonalities among variables, so the most parsimonious tree would have the lowest misclassification rate for an individual not included in the original data55.
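As a rough illustration of how a splitting criterion selects a partition, the sketch below scores every two-way grouping of a categorical predictor (for example, a three-level genotype) by its decrease in impurity. It is a simplified stand-in rather than the SPM implementation: priors, costs, surrogates, and pruning are omitted, and all function names and data are hypothetical.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity of a node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy (bits) of a node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_binary_split(values, labels, impurity=gini):
    """Return (impurity decrease, left-branch levels) for the best dichotomy."""
    levels = sorted(set(values))
    n = len(labels)
    parent = impurity(labels)
    best = None
    # Fix the first level on the left branch so mirrored splits are not repeated;
    # the last mask (all levels on the left) is skipped because it is not a split.
    for mask in range(2 ** (len(levels) - 1) - 1):
        left_levels = {levels[0]} | {
            lv for i, lv in enumerate(levels[1:]) if (mask >> i) & 1
        }
        left = [y for x, y in zip(values, labels) if x in left_levels]
        right = [y for x, y in zip(values, labels) if x not in left_levels]
        gain = (parent
                - len(left) / n * impurity(left)
                - len(right) / n * impurity(right))
        if best is None or gain > best[0]:
            best = (gain, left_levels)
    return best

# Toy data, invented purely for illustration
genotypes = ["G/G", "G/G", "G/T", "G/T", "T/T", "T/T"]
affected = [0, 0, 0, 0, 1, 1]
gain, left_levels = best_binary_split(genotypes, affected)
```

Passing `impurity=entropy` swaps the criterion without changing the search, which mirrors the idea of trying several splitting rules on the same data.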

Additionally, we applied the Random Forest (RF) methodology using a bagging strategy to identify the most important set of variables predicting disruptive behaviors56. The RF strategy differs from CART in the use of a limited number of variables to derive each node while creating hundreds to thousands of trees. This strategy has proved to be robust to the overfitting generated by CART56. In RF, variables that appeared repeatedly as predictors in the trees were identified. The misclassification rate was recorded for each approach.

The TreeNet strategy was used as a complement to the CART and RF strategies because it reaches a level of accuracy that is usually not attainable by single models such as CART or by ensembles such as bagging (i.e., RF)57. The TreeNet algorithm generates thousands of small decision trees built in a sequential error-correcting process converging on an accurate model57. The number of variables considered to derive each node with RF was $$\sqrt{n}$$, where n is the number of independent variables (either 3 or 4).

To derive honest assessments of the derived models and have a better view of their performance on future unseen data, we applied a cross-validation strategy where both training with all the data and then indirectly testing with all the data were performed. To do so, we randomly divided the data into separate partitions (folds) of different sizes. This strategy allowed us to review the stability of results across multiple replications55. We used a 10-fold cross-validation as implemented in the SPM software.
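The partitioning logic behind k-fold cross-validation can be sketched as follows. This is a generic illustration, not the SPM procedure: the helper names, the seed, and the `fit`/`predict` callables are hypothetical placeholders for whatever model is being evaluated.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # fold sizes differ by at most one
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

def cross_validated_error(X, y, fit, predict, k=10, seed=0):
    """Pooled misclassification rate over the k held-out folds."""
    errors = 0
    for train, test in k_fold_indices(len(y), k, seed):
        model = fit([X[i] for i in train], [y[i] for i in train])
        errors += sum(predict(model, X[i]) != y[i] for i in test)
    return errors / len(y)
```

Because each observation is held out exactly once, the pooled error is computed on data the corresponding model never saw during fitting.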

A fixed-effects meta-analysis of the overall fraction of correctly classified individuals (accuracy) using the derived models from each of the four samples was applied to derive a general perspective of the SUD predictive capacity of this demographic-clinical-genetic framework.
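One common way to pool proportions under a fixed-effects model is inverse-variance weighting, sketched below. The text does not specify the weighting scheme or any transformation of the accuracies, so the untransformed binomial variance used here is an assumption, and the function name is hypothetical.

```python
import math

def fixed_effects_pool(accuracies, sizes, z=1.96):
    """Inverse-variance pooled accuracy with a normal-approximation CI."""
    weights = []
    for p, n in zip(accuracies, sizes):
        if not 0.0 < p < 1.0:
            # The binomial variance p(1 - p)/n is zero at the boundaries
            raise ValueError("accuracy must lie strictly between 0 and 1")
        weights.append(n / (p * (1.0 - p)))  # weight = 1 / variance
    total = sum(weights)
    pooled = sum(w * p for w, p in zip(weights, accuracies)) / total
    se = math.sqrt(1.0 / total)
    return pooled, (pooled - z * se, pooled + z * se)
```

Under this scheme, larger samples and accuracies further from 0.5 receive more weight, and the width of the pooled interval shrinks roughly with the square root of the combined sample size.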

## Results

A series of predictive models were built on our data using combinations of the following criteria: (i) the rules of splitting (GINI index, twoing, order twoing, and entropy); (ii) the priors; (iii) the size of the terminal nodes; (iv) the costs; (v) the depth of branching; and (vi) the size of the folds for cross-validation, to maximize the accuracy of the derived classification tree while considering class assignment, tree pruning, testing and cross-validation.

A parsimonious and informative predictive tree reconstructed with CART for the Paisa sample revealed demographic (age), clinical (CD), and genetic variables (rs5010235 and rs4860437) (Fig. 1a). The importance of these variables was corroborated, and potential overfitting ruled out, by the TreeNet analyses, which revealed a set of predictors for SUD containing those derived by CART (Fig. 1b). This predictive model displays good sensitivity and specificity, as shown by areas under the receiver-operating characteristic (ROC) curve (0.954 and 0.87 for the learning and the test data, respectively) during TreeNet cross-validation using folding (Fig. 1c). The proportions of misclassification for SUD cases in the cross-validation experiment for the learning and testing data were 0.124 and 0.177, respectively (Fig. 1d).

In the case of the Spanish sample, a parsimonious and informative tree was reconstructed with CART revealing demographic (sex), clinical (CD, ODD, depression, and ADHD), and genetic variables (rs4860437 and rs1868790) (Fig. 2a). The TreeNet analysis revealed a set of predictors for SUD containing those derived by CART (Fig. 2b). This predictive model displayed good sensitivity and specificity as shown by areas under the ROC curve (AUC) of 0.911 and 0.897 for learning and testing samples, respectively, during TreeNet cross-validation using folding (Fig. 2c). The proportions of misclassification for SUD cases obtained by TreeNet analysis for learning and testing data were 0.151 and 0.175, respectively (Fig. 2d).

As in the previous cohorts, for the MTA sample we derived a parsimonious and informative predictive tree with CART depicting demographic (site of ascertainment) and genetic variables (rs2172802, rs61747658, rs12509110, and rs6856328) (Fig. 3a). The TreeNet analyses revealed a set of predictors for SUD containing those derived by CART (Fig. 3b). This predictive model displays good sensitivity and specificity, as shown by AUCs of 0.808 and 0.643 for learning and testing samples, respectively, during TreeNet cross-validation using folding (Fig. 3c). The proportions of misclassification for SUD cases obtained by TreeNet analysis for learning and testing data were 0.314 and 0.358, respectively (Fig. 3d).

Finally, for the Kentucky sample, we derived a parsimonious and informative predictive tree with CART involving demographic (sex), clinical (high body mass index (HBMI) and schizophrenia diagnosis), and genetic variables (rs4860437 and rs7659636) (Fig. 4a). The TreeNet analyses revealed a set of predictors for SUD containing those derived by CART (Fig. 4b). This predictive model displays good sensitivity and specificity, as shown by AUCs of 0.811 and 0.744 for learning and testing samples, respectively, during TreeNet cross-validation using folding (Fig. 4c). The proportions of misclassification for SUD cases obtained by TreeNet analysis for learning and testing data were 0.285 and 0.252, respectively (Fig. 4d). The results from the RF analysis were consistent with those produced by TreeNet cross-validation using folding.

A fixed-effects meta-analysis for overall accuracy returned a value of 0.727 (95% CI = 0.710–0.744) (Fig. 5), suggesting that these predictions may eventually have clinical utility. Overall, ADGRL3 marker rs4860437 was the most important variant predicting susceptibility to SUD, a commonality suggesting that these networks may be accurate in predicting the development of SUD based on ADGRL3 genotypes.

We conducted independent analyses for alcohol or nicotine dependence and compared these results with those of our composite SUD phenotype, as defined by the disjunctive presence of substance use phenotypes and explained by likely common neuropathophysiological mechanisms. In general, across cohorts, we found significant alcohol and nicotine risk variants, some of which have reasonably high odds ratios (OR). For instance, in the Spanish sample, marker rs2271339 conferred significant risk of nicotine use: the heterozygous A/G genotype confers a 43% increased risk of being diagnosed with nicotine use (OR = 1.43, 95% CI = 1.12–1.82). In the same vein, we found in the Paisa sample that the heterozygous A/T genotype for rs1456862 confers an 84% increased risk of nicotine use relative to the A/A genotype (corrected OR = 1.84, 95% CI = 1.03–3.38). Regarding alcohol use, we found in the Paisa sample that the heterozygous C/T genotype for rs2159140 confers susceptibility, whereas the C/C genotype does not (corrected OR = 1.64, 95% CI = 1.01–2.72). Supplemental Fig. 1 shows the ROC curves of nicotine and alcohol use prediction in the Paisa sample. Note that the AUC is greater than 0.7 in both cases, which suggests reasonable performance of markers rs1456862 and rs2159140 in predicting nicotine and alcohol use, respectively.
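For readers less familiar with how an odds ratio and its confidence interval relate to a genotype-by-diagnosis table, the sketch below uses the standard Woolf (log-scale) interval. It is an uncorrected textbook calculation with hypothetical function names; the correction applied to the ORs reported above is not described in the text and is not reproduced here.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """OR and Woolf CI for a 2x2 table.

    a: cases with the genotype      b: non-cases with the genotype
    c: cases without the genotype   d: non-cases without the genotype
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(OR)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

def percent_change_in_odds(or_):
    """Express an OR as a percentage change in odds, e.g. 1.43 -> 43%."""
    return (or_ - 1.0) * 100.0
```

Because the interval is symmetric on the log scale, the point estimate is the geometric mean of the two limits, which gives a quick consistency check on reported values.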

To determine the significance of improvement of prediction when genetic markers are introduced in the ARPA-based predictive model for SUD, we compared the performance measures (i.e., sensitivity, specificity, classification rate, and lift) across all cohorts under two disjunctive scenarios: inclusion of genetic markers or not. We found that including genetic markers improved the performance measures of the resulting ARPA-based predictive model of SUD, regardless of cohort (Supplemental Fig. 2 and Supplemental Table 1). For instance, the AUC for the Spanish sample was 81.6% (95% CI = 79.8–83.4) when genetic information was included, and 77.5% (95% CI = 75.9–79.1) when it was excluded. A bootstrap-based test with 10,000 replicates revealed that the former AUC was significantly greater than the latter (P < 0.0001, Supplemental Table 1). Similar results were obtained for the Paisa sample: the AUC was 90% (95% CI = 86.6–93.0) when genetic information was included versus 78.8% (95% CI = 75.8–81.7) when it was not (P < 0.0001, Supplemental Table 1). Improvements were also observed in the correct classification rate for the Spanish and Paisa samples, the sensitivity values in all samples, the specificity in the Spanish and Paisa samples, and the lift in the Paisa sample (Supplemental Table 1). Similar results were observed for the MTA and Kentucky samples, where including genetic information in the predictive model for SUD drastically improved these performance measures (Supplemental Table 1).
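A bootstrap comparison of two AUCs can be sketched as follows. The exact resampling scheme behind the reported test is not described, so the paired resampling of individuals and the one-sided percentile P-value used here are assumptions, and the function names are hypothetical. Labels are assumed to be coded 0 (unaffected) and 1 (affected).

```python
import random

def auc(scores, labels):
    """AUC as the probability that a case outscores a non-case (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_difference(scores_a, scores_b, labels, n_boot=10_000, seed=0):
    """Percentile CI and one-sided P-value for AUC(model A) - AUC(model B)."""
    n = len(labels)
    if not 0 < sum(labels) < n:
        raise ValueError("both classes must be present")
    rng = random.Random(seed)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample individuals
        y = [labels[i] for i in idx]
        if not 0 < sum(y) < n:
            continue  # AUC is undefined without both classes
        a = [scores_a[i] for i in idx]
        b = [scores_b[i] for i in idx]
        diffs.append(auc(a, y) - auc(b, y))
    diffs.sort()
    ci = (diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1])
    p_one_sided = sum(d <= 0 for d in diffs) / n_boot
    return ci, p_one_sided
```

Resampling the same individuals for both models preserves the correlation between their predictions, which matters when the two models are evaluated on the same cohort.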

## Discussion

SUD genetic epidemiological studies across multiple substances have been plagued with inconsistency in the replication of genetic association results. This may be due to reasons such as: (i) small effect size of variants expected to influence the SUD phenotype, as with any complex disease;58 (ii) insufficient power to detect significant associations due to small sample size;59 (iii) phenotypic heterogeneity of SUD across samples that may reflect different disease stages or multiple subtypes (i.e., single-drug versus poly-drug dependence/use); (iv) genetic heterogeneity arising from distinct risk gene sets; (v) ethnicity inconsistencies between discovery and replication samples;60,61 and (vi) comorbidity with other psychiatric conditions (e.g., ADHD) with shared genetic and environmental architecture62,63. Consequently, additional studies are required to identify new SUD candidate genes and to help dissect genetic contributions in the context of complex interactions with co-morbid conditions.

In this study, we present a demographic, clinical and genetic framework generated using ARPA that is able to predict the risk of developing SUD. Interestingly, marker rs4860437 showed a differential splitting pattern in the Paisa, Spain, and Kentucky cohorts. For instance, in Fig. 1a, rs4860437 splits into (G/G, G/T) and T/T; in Fig. 2a, the same variable splits into (G/G, T/T) and G/T; and in Fig. 4a, it splits into (G/T, T/T) and G/G. The most parsimonious and plausible explanation of this splitting pattern is the presence of genomic variability surrounding this proxy marker, reflecting ancestral composition. Future studies of genomic regions surrounding rs4860437 might reveal a cryptic mechanism. It is particularly compelling that ADGRL3 marker rs4860437, which is a major predictor variable component in the trees for SUD, is in complete LD with ADHD susceptibility markers rs6551665 and rs1947274 in Caucasians28,30,52, suggesting that the phenotype underpinning SUD is under the pleiotropic effect of ADGRL3 variants. Unfortunately, rs4860437 was not included in the exome chip used to genotype the MTA sample and, therefore, could not be included in the analyses for this sample. Given the limited overlap of markers across datasets and possible stratification differences among study populations, a gene- rather than a marker-level approach has been advocated64.

Adopting such a perspective, our results suggest that genetic variants harbored in the ADGRL3 locus confer susceptibility to SUD in populations from disparate regions of the world. These populations are from three different countries and involve different investigators, diverse inclusion criteria, and different clinical assessments, which suggests that our results may replicate in other settings and are likely to be clinically relevant. Of particular interest is the generalization of our findings to a longitudinal study (the MTA sample), where adding genetic information to baseline data predicted the development of SUD at later ages, as determined from information gathered over a period of more than 10 years. Additionally, our results generalized to a sample of patients with severe SUD from Kentucky (U.S.) that were not ascertained on the basis of ADHD diagnosis.

The first genome-wide significant ADHD risk loci were published recently65. Marker rs4860437 is not represented in this dataset; however, this study was not aimed at identifying loci shared between ADHD and SUD. In any case, while genome-wide association studies are a useful tool for discovering novel risk variants, as they involve a hypothesis-free interrogation of the entire genome, the lack of genetic association may reflect the polygenic, multifactorial nature of ADHD, with both common and rare variants likely contributing small effects to its etiology66,67,68. In addition, an important factor may be the genetic heterogeneity of ADHD subtypes, which may have different underlying genetic mechanisms. Therefore, genome-wide significance may identify loci with larger genetic effects, while others with smaller effects remain undetected for a given sample size.