Main

The lives lost, impacts on individuals and families, and socioeconomic costs attributable to substance use reflect a growing public health crisis1. For example, in the United States, 13.5% of deaths among young adults2 are attributable to alcohol, smoking is the leading risk factor for mortality in males3, and the odds of dying by opioid overdose are greater than those of dying in a motor vehicle crash4. Despite the large impact of substance use and substance use disorders5, there is limited knowledge of the molecular genetic underpinnings of addiction broadly.

Individual substance use disorders (SUDs) are heritable (h2, ~50–60%) and highly polygenic6,7. Recent large-scale genome-wide association studies (GWASs) have identified loci associated with problematic drinking8,9, alcohol use disorder (AUD)10,11, cigarettes smoked per day12, nicotine dependence13,14, cannabis use disorder (CUD)15 and opioid use disorder (OUD)16. Echoing evidence from twin and family studies17, these GWASs show that the genetic architecture of SUDs is characterized by a high degree of commonality18, that is, a general addiction genetic factor likely conveys vulnerability to multiple SUDs. Even after accounting for genetic correlations with non-problematic substance use and with other psychiatrically relevant traits and disorders, there is considerable variance that is unique to this general risk for addiction, indicating that a liability to addiction reflects more than just the combined genetic liability to substance use and psychopathology18,19,20,21.

We conducted a multivariate GWAS of the largest available discovery GWASs of SUDs, including problematic alcohol use (PAU: N = 435,563; continuous)8, problematic tobacco use (PTU: N = 270,120; continuous)12,13,18, CUD (N = 384,032, cases = 14,080)15 and OUD (N = 79,729, cases = 10,544)16. First, we partitioned single-nucleotide polymorphism (SNP) effects into five sources of variation: (1) a general addiction risk factor (referred to as the addiction-rf), and risks specific to (2) alcohol, (3) nicotine, (4) cannabis and (5) opioids. Second, we identified biological pathways underlying risk for these five SUD phenotypes using gene, expression quantitative trait locus (eQTL) and pathway enrichment analyses. Third, we examined whether currently available medications could potentially be repurposed to treat SUDs22. Fourth, we assessed the association of a polygenic risk score (PRS) derived from the addiction-rf with general SUD phenotypes in an independent case/control sample. Fifth, we examined the extent to which genetic liability to the addiction-rf is shared with other phenotypes (for example, physical and mental health outcomes). Sixth, we tested whether the addiction-rf PRS was associated with medical diagnoses derived from electronic health records (EHRs) and with behavioural phenotypes in largely substance-naive 9–10-year-old children.

Results

Addiction risk factor in European ancestry GWAS

As in our prior study18, we estimated a single factor model, scaled the variance of the addiction-rf to 1 and allowed loadings to be estimated freely. The single factor model that loaded on OUD (Neffective = 30,443), PAU (Neffective = 300,789), PTU (Neffective = 270,120) and CUD (Neffective = 46,351) fit the data well (χ2(1) = 0.017, P = 0.896, comparative fit index (CFI) = 1, standardized root mean square residual (SRMR) = 0.002). The latent factor loaded significantly on all indicators (standardized loadings on OUD = 0.83, PAU = 0.58, PTU = 0.36, CUD = 0.93; see Supplementary Fig. 1 for full model). The addiction-rf was associated with 19 independent (r2 < 0.1) genome-wide significant (GWS) SNPs that mapped to 17 genomic risk loci (Fig. 1; Table 1; Supplementary Table 1 for lead SNPs and Supplementary Table 2 for genomic risk loci). The most significant SNP (rs6589386, P = 2.9 × 10–12) was intergenic, but closest to DRD2, which was GWS in gene-based analyses (P = 7.9 × 10–12; Supplementary Table 3). Further, rs6589386 was an eQTL for DRD2 in the cerebellum, and Hi-C analyses (in FUMA)23 revealed that the variant made chromatin contact with the promoter of the gene (Supplementary Fig. 2).
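As a rough illustration of what the standardized loadings imply, the sketch below computes the genetic correlation that a single-factor model with unit factor variance would predict between each pair of indicators (the product of their loadings), together with the share of each indicator's standardized genetic variance left unexplained by the factor. It is a simplification that ignores any residual covariances in the full model (Supplementary Fig. 1); the function names are hypothetical and the loadings are the values reported above.

```python
from itertools import combinations

# Standardized loadings reported in the text for the single-factor model.
LOADINGS = {"OUD": 0.83, "PAU": 0.58, "PTU": 0.36, "CUD": 0.93}


def implied_correlation(loadings, trait_a, trait_b):
    """Model-implied correlation between two indicators when the factor
    variance is fixed to 1 and residuals are assumed uncorrelated."""
    return loadings[trait_a] * loadings[trait_b]


def residual_variance(loadings, trait):
    """Standardized variance of an indicator not explained by the factor."""
    return 1.0 - loadings[trait] ** 2


if __name__ == "__main__":
    for a, b in combinations(LOADINGS, 2):
        print(f"{a}-{b}: implied rG = {implied_correlation(LOADINGS, a, b):.3f}")
    for trait in LOADINGS:
        print(f"{trait}: residual variance = {residual_variance(LOADINGS, trait):.3f}")
```

Under these assumptions, the indicators with the highest loadings (CUD and OUD) have the least substance-specific variance, whereas most of the PTU variance remains outside the common factor.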

Fig. 1: Manhattan plot of the addiction-rf GWAS results.

The dotted line represents genome-wide significance at 5 × 10–8. Each SNP peak is annotated with the closest mapped gene from FUMA (Table 1). We have not included all SNPs in the credible set in Table 1, but they are shown in Supplementary Table 4. Significance is set at the Bonferroni-corrected genome-wide threshold of a two-sided test (P < 5 × 10–8).

Table 1 Lead GWAS significant variants

Gene-based analyses identified 42 significantly associated genes (Supplementary Table 3); the most significant signals were FTO (P = 1.86 × 10–13), DRD2 (P = 7.9 × 10–12) and PDE4B (P = 9.63 × 10–11). Fine-mapping identified 123 GWS SNPs (of 660 non-independent GWS SNPs) in credible sets as potential causal SNPs based on the posterior probability of inclusion (Supplementary Table 4). Mapping the lead independent SNPs in the credible sets to their nearest gene based on posterior probability of 1, the following SNPs showed the strongest causal potential: rs1937455 (PDE4B), rs3739095 (GTF3C2), rs6718128 (ZNF512), rs4143308 (RP11-89K21.1), rs4953152 (SIX3), rs41335055 (CTD-2026C7.1), rs2678900 (VRK2), rs7620024 (TCTA), rs283412 (ADH1C), rs901406 (BANK1), rs359590 (RABEPK), rs10083370 (LINC00637), rs1477196 (FTO) and rs291699 (CDK5RAP1) (Supplementary Table 4 and Fig. 1). Pathway analysis of gene-based results revealed several significant gene ontology (GO) terms including double-stranded DNA binding (PBonferroni = 0.005), sequence-specific double-stranded DNA binding (PBonferroni = 0.01), regulation of nervous system development (two terms: PBonferroni = 0.011–0.037), and positive regulation of transcription by RNA polymerase (PBonferroni = 0.038) (Supplementary Table 6).

Substance-specific risk in European ancestry GWAS

To identify loci associated with only a single substance (that is, not pleiotropic), we used ASSET (Association Analysis Based on Subsets24; one-sided P < 5 × 10–8). SNPs that were associated at GWS with only an individual substance (PAU, PTU, CUD or OUD) were considered substance-specific (for example, CHRNA5 SNPs were only associated with PTU; Supplementary Fig. 3b–e).

Problematic alcohol use

ASSET analyses revealed nine independent SNPs in six loci associated specifically with PAU (Supplementary Fig. 3b; Supplementary Tables 7 and 8). As expected8, the top signal was rs1229984 in ADH1B (P = 4.11 × 10–68). Gene-based enrichment analyses also implicated the alcohol dehydrogenase activity zinc-dependent pathway (PBonferroni = 0.035; Supplementary Table 9).

Problematic tobacco use

PTU was specifically associated with 32 independent SNPs in 12 loci (Supplementary Fig. 3c; Supplementary Tables 10 and 11). The top SNP was rs10519203 (P = 5.12 × 10–267) in HYKK which is also a robust eQTL for CHRNA5; the signal is likely driven by the CHRNA5 missense variant, rs16969968 (P = 2.79 × 10–175), which has previously been linked to tobacco use (r2 = 0.87)12. Several other SNPs were closest to genes encoding nicotinic acetylcholine receptors, including CHRNA4, CHRNB4, CHRNB3 and CHRNB2 (Supplementary Table 10). Gene-based enrichment implicated multiple pathways and gene sets related to nicotinic acetylcholine receptors (Supplementary Table 12). Specific dopamine-related associations were also noted (for example, PDE1C: rs215600; P = 2.35 × 10–18; DBH: rs1108581; P = 1.00 × 10–14).

Cannabis use disorder

ASSET identified five substance-specific loci for CUD (Supplementary Tables 13 and 14), with lead signals at rs11913634 (FAM19A5; P = 1.20 × 10–15), rs8104317 (CACNA1A; P = 1.17 × 10–13), rs72818514 (ATP10B; P = 1.57 × 10–9), rs11715758 (GNAI2/HYAL3; P = 4.84 × 10–8; Supplementary Fig. 3d) and rs11778040 (P = 1.77 × 10–9; annotated to the GULOP pseudogene). rs11778040 also mapped to the previously discovered signal for CUD near CHRNA2 and EPHX215 and is an eQTL for CHRNA2, EPHX2 and CCDC25. CUD-specific signals showed no significant gene-based enrichment.

Opioid use disorder

The only significant substance-specific signal for OUD was the well-characterized16 mu opioid receptor (OPRM1) SNP, rs1799971 (P = 1.63 × 10–8; Supplementary Fig. 3e). Gene-based analyses produced no significant findings.

Fig. 2: Manhattan plot of the transcriptome-wide association study results for addiction-rf.

a,b, TWAS of the addiction-rf, plotted as a Manhattan plot. The analyses in a were conducted in S-MultiXcan with GTEx v8 data. The analysis in b was run using S-PrediXcan with weights trained from PsychENCODE. The y-axis is presented as –log10(P); the colour of each data point represents the tissue in which the correlation between gene expression and outcome was the highest. The dotted black line represents Bonferroni-corrected TWAS significance of a two-sided test (a, 9,944 genes, PBonferroni = 5 × 10–6, line is at 5.3; b, 13,850 genes, PBonferroni = 3.6 × 10–6, line is at 5.4).

Cross-substance risk in African ancestry GWAS

The ASSET-based meta-analysis of GWAS data for AUD (N = 82,705)11, tobacco dependence (TD; based on the Fagerström Test for Nicotine Dependence, N = 9,925)13, CUD (N = 9,745)15 and OUD (N = 32,088)16 in individuals of African ancestry yielded only one GWS pleiotropic SNP, rs77193269 (P = 4.92 × 10–8); this SNP was GWS for AUD and TD when considering ASSET loci pleiotropic for two substances (Supplementary Fig. 4b). For substance-specific signals, only one SNP was GWS: rs2066702, an ADH1B variant that was alcohol-specific (Supplementary Fig. 4a).

Cross-substance risk in cross-ancestry GWAS

We found 68 GWS SNPs (Supplementary Fig. 5), which are challenging to map to nearby regions or candidate genes due to ancestral differences in LD structure. Table 2 lists the SNP with the lowest GWAS P value on each chromosome. The most significant association was noted near the FUT2 gene (rs507766, P = 3.47 × 10–19). Many GWS signals were consistent with genes found in the European GWAS, including FTO (rs9928094, P = 6.50 × 10–32) and PDE4B (rs1937439, P = 8.56 × 10–12). We also identified two SNPs in genes that have previously been implicated in SUDs, including CADM2 (rs62250713, P = 1.00 × 10–18) and FOXP2 (rs4727799, P = 3.90 × 10–15), both of which were within r2 = 0.6 of lead signals from the European GWAS.

Table 2 Top results from the cross-ancestry meta-analysis in METASOFT

Polygenic architecture and power

We used a likelihood estimation-based approach to calculate the probability distribution of effect sizes for the addiction-rf and each of the constituent input GWASs (that is, PAU, PTU, CUD and OUD) to examine relative differences in polygenicity (Methods). The addiction-rf showed a narrow distribution of small effect sizes with almost all values falling close to 0. In contrast, the original substance-specific GWASs were characterized by larger average effects (see Supplementary Fig. 6 for shape of probability density distribution). For example, only 26% of genes associated with PTU showed effect sizes as close to the mean threshold of the probability distribution as the addiction-rf did. These findings suggest that the addiction-rf is characterized by greater polygenicity than specific substances.

Transcriptome-wide association and drug repurposing

A transcriptome-wide association study (TWAS)25 of the addiction-rf using multiple tissues simultaneously from GTEx in MetaXcan (Methods) identified 35 genes in 13 brain regions (Fig. 2; Supplementary Table 15). Gene-set analysis using FUMA23 revealed that these genes were enriched for gene sets and pathways related to neural cells and T-cell processes (Supplementary Fig. 7; Supplementary Table 16). TWASs with PsychENCODE data found 29 significantly associated genes and 11 genes that overlapped with those identified in the GTEx analysis (AMT, DALRD3, GPX1, KLHDC8B, NCKIPSD, NICN1, P4HTM, PPP6C, RHOA, SNX17, WDR6; Fig. 2). Linking transcriptome-wide patterns from our GTEx MetaXcan analysis to perturbagens that cross the blood–brain barrier from the Library of Integrated Network-Based Cellular Signatures (LINCS)26 database identified 104 medications approved by the US Food and Drug Administration (FDA) that reverse the addiction-rf transcriptional profile (Supplementary Table 17). The identified medications included those currently used to treat SUDs (for example, varenicline for smoking cessation), those used for other psychiatric conditions (for example, reboxetine for depression) and those used for other purposes (for example, mifepristone, which is used for pregnancy termination and is under clinical investigation for treating AUD, and riluzole, a treatment for amyotrophic lateral sclerosis).

Linkage disequilibrium score regression and genetic correlations

After Bonferroni correction (P < 0.05/1,547 = 3.23 × 10–5), the addiction-rf was genetically correlated with 251 phenotypes (Fig. 3; Supplementary Table 18). Notably, 38 of these (15%) were somatic diseases linked to specific substances (for example, lung cancer with tobacco and pain-related conditions with opioids). As expected, we found significant genetic correlations (rG) between the addiction-rf and serious, transdiagnostic psychopathological behaviours, including suicide attempt (rG = 0.62, P = 2.89 × 10–33) and self-medication (for example, using non-prescribed drugs or alcohol for anxiety, rG = 0.64, P = 3.18 × 10–6). The addiction-rf was correlated with an externalizing factor27 that included similar indices of problematic substance use and behavioural measures, but remained separable from it based on 95% confidence intervals (rG = 0.63 ± 0.037, P = 2.33 × 10–231).
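The Bonferroni arithmetic used here (0.05 divided by the number of tests) and the corresponding height of a significance line on a –log10(P) axis can be checked with a few lines of Python. This is only a sketch; the function names are hypothetical, and the test counts are those given in the text for the genetic correlation analysis (1,547 traits) and in the Fig. 2 legend for the TWAS (9,944 and 13,850 genes).

```python
import math


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold after Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def manhattan_line_height(n_tests, alpha=0.05):
    """Position of the threshold on a -log10(P) axis."""
    return -math.log10(bonferroni_threshold(n_tests, alpha))


if __name__ == "__main__":
    analyses = [
        ("genetic correlations", 1547),
        ("TWAS, GTEx", 9944),
        ("TWAS, PsychENCODE", 13850),
    ]
    for label, n_tests in analyses:
        threshold = bonferroni_threshold(n_tests)
        height = manhattan_line_height(n_tests)
        print(f"{label}: P < {threshold:.3e} (line at {height:.1f})")
```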

Fig. 3: PheWAS of genetic correlations using MASSIVE.

Genetic correlations between 1,547 traits and the addiction-rf, calculated in MASSIVE, mapped by their statistical significance (−log10(P) on the y-axis), and broad category. The top 20 correlations are annotated; all results can be found in the Supplementary Results. The black dashed line represents Bonferroni significance for association of a two-sided test (PBonferroni = 0.05/1,547 = 3.232 × 10–5).

Latent causal variable analysis

We used MASSIVE to conduct latent causal variable (LCV)28 analyses on the same 251 phenotypes significant in our genetic correlation analyses (Supplementary Table 19). After correction for multiple testing (P = 0.05/250 = 1.98 × 10–4), the only significant causal processes were medication codes. Specifically, addiction-rf was estimated as a potential risk factor for “Medication for cholesterol, blood pressure or diabetes: cholesterol lowering medication” (genetic causality proportion = –0.739(0.078), P = 4.51 × 10–21), “treatment/medication code: atorvastatin” (genetic causality proportion = –0.373(0.050), P = 7.93 × 10–14) and “Medication for cholesterol, blood pressure, diabetes, or take exogenous hormones: cholesterol lowering medication” (genetic causality proportion = –0.315(0.071), P = 8.31 × 10–6). The negative genetic causality proportion estimates suggest a causal role of addiction on physical disease (addiction-rf is trait 2 in all instances).

Polygenic risk score analyses

PRS analyses with measures of addiction and SUDs

In the independent Yale–Penn 3 sample16 (European ancestry, N = 1,986), the addiction-rf PRS was significantly associated with a phenotypic factor loading on several SUDs (P < 0.001), polysubstance use disorder (two or more SUDs; P < 2 × 10–16), and each individual SUD (DSM-IV29: TD, cocaine use disorder (CoUD), AUD, CUD and OUD; all P < 7.71 × 10–6; Fig. 4; Supplementary Table 20). Nagelkerke’s R2 values ranged from 2.4% for CUD to 5.9% for TD, and 6.6% for a phenotype similar to the addiction-rf that represents phenotypic commonality across AUD, CUD, OUD, TD and CoUD. Odds ratios varied from 1.41 for CUD to 1.73 for OUD.

Fig. 4: Polygenic risk score prediction in Yale–Penn 3.

a, PRS of the addiction-rf predicts lifetime AUD, CUD, OUD, TD and CoUD, and variables representing more than one lifetime SUD diagnosis versus no SUD diagnosis (polysubstance use disorder, two level), more than one lifetime diagnosis versus one lifetime diagnosis (polysubstance versus unitary), as well as any SUD diagnosis (any addiction) in an independent sample (Yale–Penn 3; N = 1,986 individuals of European genetic ancestry). b, The addiction-rf PRS was associated with a comparable phenotypic SUD common factor in the Yale–Penn 3 sample. Analyses control for age, sex and 10 genetic principal components of ancestry; all path estimates are fully standardized. *, Estimates were significant at P < 0.001 of a two-sided test (lavaan does not report P-values lower than 0.001). CFI, comparative fit index; RMSEA, root mean square error of approximation.

Phenome-wide association studies in electronic health records data

In the BioVU sample (European ancestry, N = 66,914)30, the addiction-rf PRS was associated with SUDs (P = 3.31 × 10–29; Supplementary Fig. 8), various types of substance involvement (for example, tobacco use disorder, P = 9.79 × 10–24; alcoholism (so named in the EHR; we note that the term ‘alcohol use disorder’ is more appropriate), P = 1.12 × 10–21), chronic airway obstruction (P = 4.99 × 10–10) and several psychiatric disorders, with the strongest being bipolar disorder (P = 2.44 × 10–11). Controlling for any SUD diagnosis to account for causal effects found similar associations with ‘alcoholism’, mood disorders, respiratory disease and heart disease (Supplementary Fig. 9a). Controlling for tobacco use disorder diagnosis did not significantly modify associations (Supplementary Fig. 9b).

Behavioural phenotypes in substance-naive children

Among 4,491 substance-naive children aged 9–10 years who completed the baseline session of the Adolescent Brain and Cognitive Development (ABCD) Study31, the addiction-rf PRS was positively correlated (after Bonferroni correction) with Behavior Activation System Scale (BAS) fun-seeking (an aspect of externalizing behaviour; P = 2.09 × 10–5), family history of drug addiction (P = 7.04 × 10–7), family history of hospitalization due to mental health concerns (including suicidal behaviour; P = 4.64 × 10–6), childhood externalizing behaviours (for example, antisocial; P = 1.62 × 10–5), childhood thought problems (P = 3.51 × 10–6), sleep duration (P = 1.52 × 10–7), parental externalizing and substance use behaviours (for example, prenatal tobacco exposure; P = 2.87 × 10–11), maternal pregnancy characteristics (for example, urinary tract infection during pregnancy, P = 2.70 × 10–7), socioeconomic disadvantage (for example, child’s neighbourhood deprivation; P = 9.84 × 10–7) and child’s likeliness to play sports (P = 2.80 × 10–6) (Supplementary Fig. 10; Supplementary Table 21 for results from all phenotypes and Supplementary Table 23 for measure inclusion criteria).

Discussion

We found 17 genomic loci significantly associated with addiction-rf, and 47 substance-specific loci. Post-hoc fine-mapping, annotation, and exploratory drug repurposing analyses highlight the potential therapeutic relevance of the discovered loci. The addiction-rf PRS was associated with many medical conditions characterized by high morbidity and mortality rates, including psychiatric illnesses, self-harming behaviours, and somatic diseases that could be consequences of chronic substance use (for example, chronic airway obstruction) or precursors to heavy substance use (for example, chronic pain). Finally, in a sample of drug-naive children, the addiction-rf PRS was correlated with parental substance use problems and externalizing behaviour.

Our analyses suggest that the regulation or modulation of dopaminergic genes, rather than variation in dopaminergic genes themselves, is central to general addiction liability. DRD2 was the top gene signal, which was mapped via chromatin refolding, suggesting a regulatory mechanism. The role of striatal dopamine in positive drug reinforcement is well established32. DRD2 plays a role in reward sensitivity and may also be central to executive functioning33—the interplay of reward and cognition is likely relevant throughout the course of addiction. These complementary observations reinforce the role of dopamine signalling in addiction32.

Other regulatory effects on dopaminergic pathways were supported by the signal at PDE4B, which has been implicated in prior GWASs of disinhibition traits27. The phosphodiesterase (PDE) system has been proposed as a dopaminergic regulation mechanism34. Furthermore, animal studies suggest that the PDE system is associated with downregulation of drug-seeking behaviours across opioids, alcohol and psychostimulants35. Notably, the PDE4B antagonist ibudilast has been shown to reduce heavy drinking among patients with AUD36,37 and to reduce inflammation in methamphetamine use disorder38, and it was significant in our drug repurposing analysis.

The addiction-rf PRS was associated with general and specific SUD liabilities in an independent sample. The addiction-rf PRS predicted ~6% of OUD variance, which is nearly half the total SNP-heritability of OUD16. The addiction-rf PRS also predicted variance in cocaine use disorder (CoUD); as CoUD was not included in the development of the addiction-rf (due to the lack of a well-powered CoUD GWAS), these findings highlight the generalizability of the addiction-rf beyond alcohol, tobacco, cannabis and opioids.

Substance-specific genetic signals fell primarily into three broad categories: drug-specific metabolism (for example, ADH1B for PAU), drug receptors (for example, CHRNA5 for PTU, OPRM1 for OUD) and general neurotransmitter mechanisms (for example, CACNA1A for CUD). Surprisingly, even after accounting for the addiction-rf, dopaminergic genes (DBH and PDE1C in particular) were implicated in substance-specific effects for tobacco (PTU). In contrast, CUD-specific genes did not include well-studied receptor targets (for example, CNR1) or metabolic mechanisms (for example, cytochrome P450 genes).

The current addiction-rf is distinct from recent genetic factors21,27,39 that were based upon analyses of SUDs with other substance use, psychiatric and behavioural traits. We focus on SUDs rather than measures of substance use or other externalizing traits, which prior data indicate have differing aetiologies and relationships with psychiatric health9,40,41. Our study also parses substance-general (that is, addiction-rf) and substance-specific loci. This approach distinguishes the addiction-rf from other genetic factors that include substance use measures. For example, despite genetic overlap between the addiction-rf and a recent index of externalizing behaviours (rG = 0.63)27, a significant portion of the variance in the addiction-rf was distinct.

Our analyses highlight the robust genetic association of the addiction-rf with serious mental and somatic illness. The addiction-rf PRS was more strongly associated with using drugs to cope with internalizing disorder symptoms (anxiety, depression; rG = 0.60–0.62) than with the individual psychiatric traits and disorders themselves (rG = 0.3), suggesting that genetic correlations between SUDs and mood disorders may partially be attributable to a predisposition to use substances to alleviate negative mood states (‘self-medication’)42.

The phenome-wide association study (PheWAS) provided insight into potentially complex mechanisms of genetic liability to environmental pathways of risk. In addition to indices of socioeconomic status (SES), the addiction-rf was correlated with maternal tobacco smoking during pregnancy and with attention deficit hyperactivity disorder, in line with evidence that effects ascribed to the prenatal environment may also be mediated by the inheritance of risk loci43,44. The addiction-rf PRS was associated with a family history of serious mental illness, which likely represents an amalgam of genetic and environmental vulnerability45. Finally, disability and SES were also associated with polygenic risk, further supporting the association between environmental risk factors and common genetic effects on SUD liability9,41,46.

This study has limitations. First, our GWAS in individuals of African ancestry had few discoveries, underscoring the need for systematic data collection on SUDs in globally representative populations. Still, we chose to analyse and present these data as their exclusion only furthers disparities in genetic discoveries. Second, although we discovered many loci, they accounted for only a small proportion of the total variance. More samples, particularly from diverse populations, and the integration of rarer variants are needed to discover the biological pathways that fall below genome-wide significance or are missed in GWAS. Finally, despite interesting associations between our PRS and SUDs, our findings do not apply to prognostication of future disease risk.

Conclusion

A common and highly polygenic genetic architecture underlies multiple SUDs, a finding that merits integration into medical knowledge on addictions.

Methods

Summary statistics from each SUD-related GWAS

Summary statistics from the largest available discovery GWAS were used to represent genetic risk for each construct. These include four measures of problematic substance use or SUD: (1) PAU8, (2) PTU12,13,18, (3) CUD15, (4) OUD16. All GWAS summary statistics were filtered to retain variants with minor allele frequencies >0.01 and INFO score >0.90 for GSCAN12 and PGC15 and INFO score >0.70 for the MVP8,16.

For the current cross-trait GWAS, we maintained the same quality control (QC) metrics and only analysed SNPs that were present in all four input GWASs, that is, variants that passed QC thresholds at all levels, resulting in 3,513,381 SNPs in samples of European ancestry and 5,303,643 SNPs in samples of African American ancestry. The linkage disequilibrium (LD) scores used for the genomic structural equation modelling (GenomicSEM)47 were estimated in the European ancestry samples only using the 1000 Genomes European data48. We restricted analyses to HapMap3 SNPs49 as these tend to be well imputed and produce accurate estimates of heritability. We used the effective N that was estimated for each GWAS50. For traits with a binary distribution, the effective sample size for an equivalently powered case-control study under a 50–50 case-control balance was estimated using the equation Neffective = 4/((1/Ncase) + (1/Ncontrol))51, where N represents the sample size. For continuous and quasi-continuous traits, we used the given N or, if the summary statistics came from MTAG, the equation Neffective = ((Z/β)²)/(2 × MAF × (1 – MAF)), where MAF is the minor allele frequency, Z is the z-score of the effect size and β is the beta of the effect size8, to approximate an equivalently powered GWAS of a single trait. Effective N values ranged from 46,351 (CUD) to 300,789 (PAU) and are described for each substance-specific GWAS in the Results. Individual GWAS details can be found in the Supplementary Methods.
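The two effective sample size equations above translate directly into code. The following is a minimal sketch with hypothetical function names and purely illustrative inputs (not study data). Applying the first formula to pooled case and control totals need not reproduce the effective N values reported in the text, for instance if the original calculation was carried out per cohort and then summed.

```python
def n_effective_binary(n_cases, n_controls):
    """Effective N of an equivalently powered, 50-50 balanced case-control
    study: 4 / (1/N_case + 1/N_control)."""
    if n_cases <= 0 or n_controls <= 0:
        raise ValueError("case and control counts must be positive")
    return 4.0 / (1.0 / n_cases + 1.0 / n_controls)


def n_effective_mtag(z, beta, maf):
    """Effective N from a SNP's z-score, effect size and minor allele
    frequency: (Z / beta)^2 / (2 * MAF * (1 - MAF))."""
    if not 0.0 < maf < 1.0:
        raise ValueError("maf must lie strictly between 0 and 1")
    if beta == 0:
        raise ValueError("beta must be non-zero")
    return (z / beta) ** 2 / (2.0 * maf * (1.0 - maf))


if __name__ == "__main__":
    # Illustrative values only.
    print(n_effective_binary(n_cases=1_000, n_controls=9_000))  # unbalanced design
    print(n_effective_mtag(z=10.0, beta=0.05, maf=0.5))
```

The binary formula shows why a heavily unbalanced design contributes far less than its total N: with 1,000 cases and 9,000 controls, the effective N is 3,600 rather than 10,000.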

Genome-wide analyses in European ancestry

We conducted a GWAS of a unidimensional addiction risk factor (addiction-rf) underlying the genetic covariance among PAU, PTU, CUD and OUD by applying GenomicSEM47 to these European ancestry summary statistics. GenomicSEM conducts genome-wide association analyses in two stages. First, a multivariate version of LD score regression is used to estimate the genetic covariance matrix among all GWAS phenotypes, which is then combined with each individual SNP to calculate SNP-specific genetic covariance matrices. Second, these matrices are then used to estimate the SEM using the lavaan package in R52. Variable and unknown extents of sample overlap across contributing GWASs are automatically accounted for in the estimation procedure. The unifactor model fit the data well53 (χ2(1) = 0.017, P = 0.896, CFI = 1, SRMR = 0.002; residual r = 0.51, P = 0.016; Supplementary Fig. 1; see also our prior work18 and Methods).

As the sample size of summary data derived from African American samples (N range = 9,835–56,648) was not sufficient for LD score54 analyses, we used ASSET24 rather than GenomicSEM to conduct the addiction-rf GWAS, as described in the ASSET section below.

ASSET trans-ancestry analyses

ASSET24 was used to identify pleiotropic SNPs (that is, SNPs that show associations with more than one SUD) and substance-specific SNPs (that is, SNPs only associated with a single SUD) within the European and African American ancestry samples (in addition to GenomicSEM in Europeans). ASSET was used in our African American ancestry addiction-rf GWAS because the sample size was not sufficient for the genomic structural equation modelling (SEM) approach used in the European addiction-rf GWAS. As a result, there are important differences between the primary addiction-rf GWAS and the GWAS run in ASSET. In particular, the ASSET-based addiction-rf GWAS contains SNPs that may influence two, three, or all four individual SUDs, while the GenomicSEM-based addiction-rf GWAS in European ancestry samples includes SNPs associated with a common factor across all included SUDs. We used ASSET to identify pleiotropic SNPs in the European ancestry sample to facilitate a method-consistent cross-ancestry meta-analysis GWAS (see the ‘Cross-ancestry meta-analysis’ section below) and to cross-validate the primary GenomicSEM results.

ASSET does not leverage the genetic correlation to identify variants of interest (as GenomicSEM does); instead, subset searches scaffold effects into pleiotropic and non-pleiotropic variants based on effect size and standard error derivations that estimate the degree to which the SNP–trait association is due to pooled effects across the phenotypes, versus a single phenotype driving variant association. Loci were designated as substance specific when they were significantly associated with only one SUD. Because ASSET does not automatically account for sample overlap, we used the linkage disequilibrium score regression (LDSC) intercept to adjust for overlap within the European ancestry ASSET covariance term.

Cross-ancestry meta-analysis

We conducted a cross-ancestry meta-analysis of ASSET-derived (to maintain analytic consistency) European and African ancestry addiction-rf summary statistics. First, SNPs with evidence of SUD pleiotropy (that is, effects on two, three, or all four SUDs, including different sets of SUDs in each ancestry) in both ancestral groups were extracted. SNPs with evidence of cross-ancestral heterogeneity (that is, Cochran’s Q P < 5 × 10–8) were removed, leaving 317,447 SNPs. We then ran a random-effects meta-analysis in METASOFT55, with ancestry group as a random effect, to identify cross-ancestral effects. We report the random-effects beta and P-value as cross-ancestry effects.
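To make the heterogeneity filter and the random-effects pooling concrete, the sketch below computes Cochran's Q from per-ancestry effect sizes and standard errors and then a conventional DerSimonian–Laird random-effects estimate. This is an assumption made for illustration: the estimator implemented in METASOFT may differ in detail, the function names are hypothetical, and the heterogeneity P value shown is valid only for exactly two groups (1 d.f.).

```python
import math


def cochran_q(betas, ses):
    """Return (Q, fixed-effect estimate) using inverse-variance weights."""
    weights = [1.0 / se ** 2 for se in ses]
    fixed = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    q = sum(w * (b - fixed) ** 2 for w, b in zip(weights, betas))
    return q, fixed


def q_p_value_two_groups(q):
    """Heterogeneity P value for exactly two groups (chi-square, 1 d.f.)."""
    return math.erfc(math.sqrt(q / 2.0))


def random_effects(betas, ses):
    """DerSimonian-Laird random-effects estimate: (beta, se, two-sided P)."""
    if len(betas) < 2 or len(betas) != len(ses):
        raise ValueError("need matching betas and standard errors for >= 2 groups")
    weights = [1.0 / se ** 2 for se in ses]
    q, _ = cochran_q(betas, ses)
    # Method-of-moments estimate of the between-group variance (tau^2).
    c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
    tau2 = max(0.0, (q - (len(betas) - 1)) / c)
    re_weights = [1.0 / (se ** 2 + tau2) for se in ses]
    beta_re = sum(w * b for w, b in zip(re_weights, betas)) / sum(re_weights)
    se_re = math.sqrt(1.0 / sum(re_weights))
    p_value = math.erfc(abs(beta_re / se_re) / math.sqrt(2.0))
    return beta_re, se_re, p_value


if __name__ == "__main__":
    # Illustrative per-ancestry estimates for one SNP (not study data).
    betas, ses = [0.030, 0.042], [0.004, 0.011]
    q, _ = cochran_q(betas, ses)
    print("Q =", q, "heterogeneity P =", q_p_value_two_groups(q))
    print("random effects (beta, se, P):", random_effects(betas, ses))
```

When the two ancestry-specific estimates agree, the between-group variance is estimated as zero and the random-effects result collapses to the inverse-variance fixed-effect estimate; strong disagreement inflates the pooled standard error.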

Substance-specific genetics in European ancestry individuals

To validate substance-specific SNPs, we used ASSET for discovery of these variants and, in the European ancestry GWAS, also examined Q-SNP results derived from GenomicSEM. Q-SNP14 indexes violation of the null hypothesis that a SNP acts on a trait entirely through a common factor (for example, the addiction-rf). For example, if a SNP has a particular effect on one SUD trait (such as SNPs in CHRNA5 influencing PTU), then it should have a significant Q-SNP statistic because it violates the assumption that its effect on PTU is via the addiction-rf. To identify Q-SNPs, we first estimated the association between each SNP and the addiction-rf. We then fit a model in which the SNP predicted the indicators underlying the addiction-rf, that is, PAU, PTU, CUD and OUD. We compared the two models using the χ2 difference statistic; SNPs for which the model in which the SNP predicted the addiction-rf alone showed a significant decrement of fit (χ2 for Δd.f. = 4) relative to the model in which the SNP predicted the indicators themselves were considered significant Q-SNPs at GWS (that is, Q P < 5 × 10–8). SNPs with significant Q-SNP statistics were removed from the addiction-rf summary statistics for all post-hoc analyses, including fine-mapping, gene-based tests, transcriptome-wide association analyses, LD score genetic correlations and PRS analyses.
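The model comparison behind the Q-SNP test can be sketched as a χ2 difference test. The example below assumes that the χ2 fit statistics of the two models are already available for a SNP and uses the Δd.f. of 4 stated above; the function names are hypothetical, and the closed-form upper-tail probability used here holds only for even degrees of freedom.

```python
import math

GWS_THRESHOLD = 5e-8  # genome-wide significance threshold used in the text


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable with even d.f.:
    exp(-x/2) * sum_{i < df/2} (x/2)^i / i!"""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    if x < 0:
        raise ValueError("x must be non-negative")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def q_snp_p_value(chisq_common, chisq_independent, delta_df=4):
    """P value for the loss of fit when a SNP is forced to act only through
    the common factor rather than directly on each indicator."""
    difference = max(chisq_common - chisq_independent, 0.0)
    return chi2_sf_even_df(difference, delta_df)


def is_q_snp(chisq_common, chisq_independent, delta_df=4):
    """True when the decrement in fit is genome-wide significant."""
    return q_snp_p_value(chisq_common, chisq_independent, delta_df) < GWS_THRESHOLD


if __name__ == "__main__":
    # Illustrative fit statistics for two SNPs (not study data).
    print(is_q_snp(chisq_common=60.0, chisq_independent=10.0))
    print(is_q_snp(chisq_common=40.0, chisq_independent=10.0))
```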

Q-SNP analysis also identified several SNPs that appeared to be specific to a single substance. However, as Q-SNP cannot be used for precise identification of substance-specific (trait-specific) SNPs, we relied on ASSET analyses (with a one-sided P-value) to identify the subset of SNPs with effects (at GWS, P < 5 × 10–8) limited to only one SUD-related trait (for example, PAU-specific versus PAU in common with OUD). The ASSET analysis identifies both common addiction SNPs and substance-specific SNPs. The common addiction SNPs from ASSET were used for our cross-ancestry analysis, whereas substance-specific SNPs are described separately for each population.

Post-hoc analyses of European ancestry GWAS results

Estimation of expected SNP effect sizes

We estimated the distribution of genetic effect-sizes of the addiction-rf (GenomicSEM) and the four input GWASs (PAU, PTU, CUD, OUD) using genetic effect-size distribution inference from summary-level data (GENESIS). GENESIS is a likelihood-based approach56. In this approach, GWAS summary statistics and an external panel of LD (in our case, the 1000 Genomes Phase 3 reference panel) are used to estimate a projected distribution of SNP effect sizes. A flexible normal mixture model based on the number of tagged SNPs and LD scores is estimated. A three-component model is fit, where SNP effect sizes are estimated to belong to one of three components based on bins of effect sizes (large, medium and small). If the distribution of SNPs is multivariate normal, the estimation of the SNPs with large and medium effect sizes can be done via their independent effect sizes. The third component represents SNPs with null and small effect sizes, and these should follow a similar distribution. Therefore, this model reweights SNPs and generates a projected distribution of effect sizes, and from this projection, we can draw conclusions about the distribution of effect sizes54.

Biological characterization

FUMA23 was used for post-hoc bioinformatic analyses of our five GWASs (that is, the addiction-rf (from GenomicSEM) and the PAU-specific, PTU-specific, CUD-specific and OUD-specific loci (from ASSET)) in European ancestry samples and to determine lead and independent variants. Within FUMA, gene-based tests and gene-set enrichment were conducted via MAGMA57; gene annotation and identification of SNP-to-gene associations were based on eQTLs and/or chromatin interactions (via Hi-C data) in PsychENCODE58 and Roadmap Epigenomics tissues for prefrontal cortex, hippocampus, ventricles and neural progenitor cells59,60. For each specific SUD, the distribution of P-values included all non-pleiotropic SNPs identified by ASSET (that is, SNPs only associated with a single SUD, n SNP CUD-specific = 312,661, n SNP PTU-specific = 560,983, n SNP PAU-specific = 193,647, n SNP OUD-specific = 425,665).

Fine-mapping with susieR

We used susieR61 to fine-map the association statistics of four phenotypes (the addiction-rf, PAU-specific, PTU-specific and CUD-specific) at loci with more than one GWS SNP in a 1 Mb region around the lead SNP. OUD-specific was not fine-mapped because it had only one significant locus, and that locus has a known mechanism of effect. We determined the 95% credible set at each locus, allowing at most 10 causal variants; this analysis reduces the total number of SNPs at a lead genome-wide signal to those that can credibly be considered causal. The credible set reports include, for each SNP, the marginal posterior inclusion probability (PIP) of being a causal variant, which ranges from 0 to 1; values closer to 1 indicate SNPs that are more likely to be causal.
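
To make the idea of a 95% credible set concrete, the sketch below builds one for a single association signal from per-SNP posterior probabilities. This is a simplification under a one-signal assumption and is not the susieR algorithm, which models several signals per locus; the function name, SNP identifiers and probabilities are hypothetical.

```python
def credible_set(posterior, coverage=0.95):
    """Smallest set of SNPs whose posterior probabilities sum to >= coverage.

    posterior: dict mapping SNP id -> posterior probability of being the
    causal variant for one signal (values are assumed to sum to about 1).
    """
    ranked = sorted(posterior.items(), key=lambda kv: kv[1], reverse=True)
    chosen, total = [], 0.0
    for snp, prob in ranked:
        chosen.append(snp)
        total += prob
        if total >= coverage:  # stop as soon as the target coverage is reached
            break
    return chosen, total

# Hypothetical locus with four candidate SNPs
snps, covered = credible_set({"rs1": 0.70, "rs2": 0.20, "rs3": 0.06, "rs4": 0.04})
```

SNPs are added in decreasing order of posterior probability, so the set stays as small as possible for the requested coverage.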

Transcriptome-wide association analysis

We conducted two transcriptome-wide analyses. First, we used MetaXcan/S-MultiXcan38 to conduct a cross-tissue analysis of all brain tissues in the GTEx v8 data37. S-MultiXcan returns a broad z-score across all tissues in the model, along with the highest and lowest single-tissue scores. S-MultiXcan combines information across individual tissues, which improves the power for discovery by reducing the multiple-testing correction burden. It also produces z-scores and P-values for the top-associated tissues. Second, we used S-PrediXcan62 to predict transcription using weights trained on transcriptional differences between psychiatric cases and controls in the frontal and temporal cortex from the PsychENCODE63 dataset. As these data were very densely sampled for psychiatrically relevant traits, they serve to complement the relatively healthy GTEx sample.

Drug repurposing

Our technique for drug signature matching used data from the LINCS L1000 database64. The LINCS L1000 database catalogues in vitro gene expression profiles (signatures) from thousands of compounds in over 80 human cell lines (level 5 data from phase I: GSE92742 and phase II: GSE70138)26. We selected compounds that were currently FDA approved or in clinical trials (via https://clue.io/repurposing#download-data; updated 24 March 2020). Our analyses included signatures of 829 chemical compounds (590 FDA approved, 239 in clinical trials) in five neuronal cell lines (NEU, NPC, MNEU.E, NPC.CAS9 and NPC.TAK). A total of 3,897 signatures were present, as not all compounds were tested in all cell lines in the LINCS dataset.

In vitro medication signatures were matched with the addiction-rf signatures from the transcriptome-wide association analyses (conducted using S-MultiXcan)25,62 via multi-level meta-regression. We computed weighted Pearson correlations (each gene weighted by its proportion of heritability explained, h²_MULTI-XCAN) between transcriptome-wide brain associations and in vitro L1000 compound signatures using the metafor package in R65. We treated each L1000 compound as a fixed effect incorporating the effect size (r_weighted) and sampling variability (se²_r_weighted) from all signatures of a compound (for example, across all time points, cell lines and doses). Analyses included time since perturbagen exposure as a random effect. Only genes that were Bonferroni significant in the S-PrediXcan analysis (transcriptome-wide correction = 0.05/14,389 = 3.47 × 10–6) were entered into the model. We only report those perturbagens that were associated after Bonferroni correction (perturbagen correction = 0.05/3,897 = 1.28 × 10–5).
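
The weighted correlation at the core of the signature matching can be sketched as follows. This is a generic weighted Pearson correlation written for illustration, not the metafor-based pipeline itself; the function name and the toy vectors are hypothetical, with the weights standing in for per-gene heritability.

```python
import math

def weighted_pearson(x, y, w):
    """Pearson correlation between x and y with non-negative weights w."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw  # weighted mean of x
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw  # weighted mean of y
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y)) / sw
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x)) / sw
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y)) / sw
    return cov / math.sqrt(vx * vy)

# Hypothetical gene-level z-scores (x), compound signature values (y)
# and per-gene weights (w)
r = weighted_pearson([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0], [1.0, 1.0, 1.0, 1.0])
```

With equal weights this reduces to the ordinary Pearson correlation; unequal weights let genes with more heritable predicted expression contribute more to the match.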

PRS analyses in Yale–Penn

Yale–Penn 3

The Yale–Penn16,66 sample includes 11,332 genotyped and phenotyped individuals recruited across three phases (that is, Yale–Penn 1, Yale–Penn 2 and Yale–Penn 3) based on the time of recruitment and genotyping array used. All cohorts were ascertained via recruitment at substance use treatment centres or targeted advertisements for genetic studies of cocaine, opioid and alcohol dependence, resulting in a sample highly enriched for problematic substance use, as well as control subjects and relatives. All participants were assessed using the Semi-Structured Assessment for Drug Dependence and Alcoholism (SSADDA)67. Analyses based on Yale–Penn 1 and 2 have been published previously66, and were used in the discovery sample of the present study. Here, we used data from Yale–Penn 316 for replication analyses and as a target sample for PRS analyses; the Yale–Penn 3 sample is independent from our discovery GWASs. Yale–Penn 3 comprises 3,026 genotyped and phenotyped Americans of European (EUR; N = 1,986) and African (AFR; N = 1,040) ancestry passing standard QC. Genotyping was performed at the Gelernter lab at Yale University using the Illumina Multi-ethnic Global Array containing 1,779,819 markers, followed by genotype imputation using Minimac368 and the Haplotype Reference Consortium reference panel69 as implemented on the Michigan imputation server (https://imputationserver.sph.umich.edu).

For the present analysis, only Yale–Penn 3 EUR subjects (N = 1,986) were included. DSM-IV29 substance abuse and dependence diagnoses (combined as abuse or dependence to represent use disorder) based on SSADDA assessments were used to determine case and control status for AUD, CUD, CoUD, TD and OUD. Of the 1,986 EUR subjects, 42.4% met criteria for AUD (N = 843), 25.9% met criteria for CUD (N = 515), 25.3% met criteria for CoUD (N = 503), 31% met criteria for TD (N = 615) and 22.6% met criteria for OUD (N = 448). The mean age of Yale–Penn 3 EUR subjects is 41.5 years (s.d. = 15.1) and 51.5% are female (N = 1,023).

We calculated the addiction-rf PRS using the PRS-CS auto approach70. This method assumes a general distribution of effect sizes across the genome, and then reweights SNPs based on this assumption, their effect size in the original GWAS, and their LD; weights for every SNP were then summed to create a final score. PRS were tested for association with phenotypes (OUD, TD, CUD, AUD, CoUD) in Yale–Penn 3 via logistic regression controlling for the first 10 ancestral principal components, age, sex and age by sex. PRS were scaled to unit variance. These logistic regression analyses were also run for the following contrasts: (1) those with any SUD (n = 985) versus those with no SUD (n = 1,001), to represent ‘any SUD’; (2) those with at least two SUDs (n = 729) versus those with fewer than two (including zero) SUDs (n = 1,257), to represent ‘polysubstance use disorder’; and (3) those with at least two SUDs (n = 729) versus those with one SUD (n = 256), to represent polysubstance use disorder within those with SUD. The association between the addiction-rf PRS and the SUD common factor was estimated with lavaan52, where the common factor loaded on the five SUDs.
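
The scoring and scaling steps described above can be illustrated with a short sketch. It assumes the per-SNP weights have already been produced (for example, by PRS-CS) and only shows the weighted sum and the standardization; the function names and toy values are hypothetical, and mean-centering is an added assumption, since the text specifies only scaling to unit variance.

```python
import math

def polygenic_scores(dosages, weights):
    """Weighted sum of allele dosages for each person.

    dosages: one list per person with the effect-allele dosage (0-2) per SNP
    weights: per-SNP effect-size weights, in the same SNP order
    """
    return [sum(d * w for d, w in zip(person, weights)) for person in dosages]

def scale_to_unit_variance(scores):
    """Center the scores and divide by their sample standard deviation."""
    n = len(scores)
    mean = sum(scores) / n
    sd = math.sqrt(sum((s - mean) ** 2 for s in scores) / (n - 1))
    return [(s - mean) / sd for s in scores]

# Hypothetical example: three people, three SNPs
raw = polygenic_scores([[0, 1, 2], [2, 1, 0], [1, 1, 1]], [0.1, -0.2, 0.3])
prs = scale_to_unit_variance(raw)
```

After scaling, a regression coefficient for the PRS is interpreted per standard deviation of the score.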

Genetic correlations and latent causal variable modelling

To examine phenotypes that were genetically correlated with the addiction-rf, we calculated genetic correlations using LD score regression54,71 through the MASSIVE pipeline72, which conducts LD score regression13,46 and Latent Causal Variable Analysis28 on 1,547 summary statistics for various phenotypic traits, including a mixture of ICD codes and self-reported traits from the UK Biobank and publicly available meta-analyses from GWAS consortia.

Phenome-wide association studies

PheWASs in adult samples

As MASSIVE includes a fairly sparse set of diagnoses (not all ICD codes are available) for genetic correlation analyses, we conducted additional and theoretically relevant PheWASs using the addiction-rf PRS. We used EHR data for 66,914 genotyped individuals of European ancestry from the Vanderbilt University Medical Center biobank (BioVU)30. BioVU is a repository of leftover blood samples (~240,000 samples) from clinical testing, that are sequenced, de-identified and linked to clinical and demographic data. Genotyping and QC of this sample have been described elsewhere30. The addiction-rf PRS was used to predict 1,335 diseases in a logistic regression model, controlling for median age on record, reported gender and first 10 genetic ancestral principal components. For an individual to be considered a case, they were required to have two separate ICD codes for the index phenotype, and each phenotype needed at least 100 cases to be included in the analysis. A Bonferroni-corrected phenome-wide significance threshold of 0.05/1,335 = 3.7 × 10–5 was used73.
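
The case definition and the minimum case count described above can be sketched as a simple filter over coded records. The record layout is hypothetical, and reading 'two separate ICD codes' as code occurrences on two distinct dates is an assumption made for illustration.

```python
from collections import defaultdict

def phewas_cases(records, min_codes=2, min_cases=100):
    """Return {phenotype: set of case ids} for phenotypes with enough cases.

    records: iterable of (person_id, phenotype, date) tuples, one per coded
    occurrence. A person counts as a case for a phenotype when it was coded
    on at least min_codes distinct dates (an assumed reading of the text).
    """
    dates = defaultdict(set)
    for person, phenotype, date in records:
        dates[(person, phenotype)].add(date)
    cases = defaultdict(set)
    for (person, phenotype), seen in dates.items():
        if len(seen) >= min_codes:
            cases[phenotype].add(person)
    # Keep only phenotypes with enough cases to be analysed
    return {ph: people for ph, people in cases.items() if len(people) >= min_cases}

# Hypothetical records; min_cases is lowered so the toy example returns something
toy = [
    ("p1", "T2D", "2020-01-01"), ("p1", "T2D", "2020-02-01"),
    ("p2", "T2D", "2020-01-05"), ("p2", "T2D", "2020-01-05"),
    ("p3", "T2D", "2019-01-01"), ("p3", "T2D", "2021-01-01"),
    ("p1", "gout", "2020-01-01"), ("p1", "gout", "2020-03-01"),
]
included = phewas_cases(toy, min_cases=2)
```

In the toy data, p2 is not a case because both codes fall on the same date, and gout is dropped because it has a single case.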

ABCD PheWAS of phenotypes collected in childhood

To identify phenotypes that were associated with the addiction-rf before the onset of regular substance use, we used data from the ABCD Study (release 2.0 for genomic data and 3.0 for phenotypes) to conduct a phenome-wide association analysis of behavioural, social and environmental phenotypes in childhood. The ABCD Study is an ongoing multi-site longitudinal study of child health and development (Methods)31,74. Children (N = 11,875; including twins and siblings) aged 8.9–11 years were recruited from 22 sites across the United States to complete the ABCD Study baseline assessment. We restricted our sample to participants of genomically confirmed European ancestry (based on principal components) who were not missing any covariate measures (N = 4,490).

PRS were generated using the PRS-CS software package70, consistent with our other (that is, Yale–Penn 3, BioVU) PRS analyses described above. Associations between the addiction-rf PRS and phenotypes were estimated using mixed-effects models in the lme475 package in R. PRS were scaled to unit variance. Family ID and site were included as random effects to account for non-independence of measurement associated with relatedness and scanner/site. We controlled for the first 10 ancestral principal components, age, sex and age by sex. We used a Bonferroni-corrected phenome-wide significance threshold of 0.05/1,480 = 3.38 × 10–5; all results are presented in Supplementary Table 21.
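
The phenome-wide thresholds quoted in this section follow from dividing 0.05 by the number of tests. The short sketch below reproduces that arithmetic and applies it to a set of P-values; the function names and the example P-values are hypothetical.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

def significant_phenotypes(p_values, alpha=0.05):
    """Names of phenotypes whose P-value passes the Bonferroni threshold."""
    threshold = bonferroni_threshold(len(p_values), alpha)
    return sorted(name for name, p in p_values.items() if p < threshold)

# Thresholds for the 1,335 adult and 1,480 childhood phenotypes in the text
adult_threshold = bonferroni_threshold(1335)
child_threshold = bonferroni_threshold(1480)

# Hypothetical P-values for four phenotypes (threshold = 0.05 / 4 = 0.0125)
hits = significant_phenotypes({"a": 0.001, "b": 0.03, "c": 0.2, "d": 0.012})
```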

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.