Genome-wide association study of alcohol consumption and use disorder in 274,424 individuals from multiple populations

Alcohol consumption level and alcohol use disorder (AUD) diagnosis are moderately heritable traits. We conduct genome-wide association studies of these traits using longitudinal Alcohol Use Disorder Identification Test-Consumption (AUDIT-C) scores and AUD diagnoses in a multi-ancestry Million Veteran Program sample (N = 274,424). We identify 18 genome-wide significant loci: 5 associated with both traits, 8 associated with AUDIT-C only, and 5 associated with AUD diagnosis only. Polygenic Risk Scores (PRS) for both traits are associated with alcohol-related disorders in two independent samples. Although a significant genetic correlation reflects the overlap between the traits, genetic correlations for 188 non-alcohol-related traits differ significantly for the two traits, as do the phenotypes associated with the traits’ PRS. Cell type group partitioning heritability enrichment analyses also differentiate the two traits. We conclude that, although heavy drinking is a key risk factor for AUD, it is not a sufficient cause of the disorder.

xcessive alcohol consumption is associated with a host of adverse medical, psychiatric, and social consequences. Globally, in 2012, about 3.3 million or 5.9% of all deaths, 139 million disability-adjusted life years, and 5.1% of the burden of disease and injury were attributable to alcohol consumption, with the magnitude of harm determined by the volume of alcohol consumed and the drinking pattern 1 . Regular heavy drinking is the major risk factor for the development of an alcohol use disorder (AUD), a chronic, relapsing condition characterized by impaired control over drinking 2 . Independent of AUD, heavy drinking has a multitude of adverse medical consequences. Identifying factors that contribute to drinking level and AUD risk could advance efforts to prevent, identify, and treat both medical and psychiatric problems related to alcohol.
Many different alcohol-related phenotypes have been used to investigate genetic risk, including formal diagnoses, such as alcohol dependence [e.g., based on the Diagnostic and Statistical Manual of Mental Disorders, 4th edition (DSM-IV) 3 ] and screening tests that measure alcohol consumption and alcoholrelated problems [e.g., the Alcohol Use Disorders Identification Test (AUDIT)]. The AUDIT, a 10-item, self-reported test developed by the World Health Organization as a screen for hazardous and harmful drinking 4,5 has been used for genomewide association studies (GWASs) both as a total score [6][7][8] and as the AUDIT-Consumption (AUDIT-C) and AUDIT-Problems (AUDIT-P) sub-scores 8 . The three-item AUDIT-C measures the frequency and quantity of usual drinking and the frequency of binge drinking, while the 7-item AUDIT-P measures alcoholrelated problems.
Twin and adoption studies have shown that half of the risk of alcohol dependence, a subtype of AUD, is heritable 9 . The singlenucleotide polymorphism (SNP) heritability of alcohol dependence in a family-based, European-American (EA) sample was 16% 10 and 22% in an unrelated African-American (AA) sample 11 . In the meta-analysis of data from the UK Biobank (UKBB) and 23andMe, the SNP heritability of the total AUDIT was estimated to be 12%, while for the AUDIT-C and AUDIT-P it was 11% and 9%, respectively).
In 12 GWASs of alcohol dependence (most of which used a binary DSM-IV diagnosis 3 ) published between 2009 and 2014 (ref. 12 ), the only consistent genome-wide significant (GWS) findings were for SNPs in genes encoding the alcohol metabolizing enzymes. Similarly, in a recent meta-analysis of 14,904 individuals with alcohol dependence and 37,944 controls, which was stratified by genetic ancestry (European, N = 46,568; African; N = 6280), the only GWS findings were two independent ADH1B variants. In addition, there were significant genetic correlations seen with 17 phenotypes, including psychiatric (e.g., schizophrenia, depression), substance use (e.g., smoking and cannabis use), social (e.g., socio-economic deprivation), and behavioral (e.g., educational attainment) traits 13 .
A GWAS of the AUDIT in nearly 8000 individuals failed to identify any GWS loci 6 . A GWAS of the AUDIT from 23andMe in 20,328 European ancestry participants also failed to yield GWS results 7 , although meta-analysis of the AUDIT in the UKBB and 23andMe samples identified 10 associated risk loci, including associations to JCAD and SLC39A13 (ref. 8 ). In addition to the total AUDIT-C score, the meta-analysis included GWASs for the AUDIT-C and AUDIT-P, which showed significantly different patterns of association across a number of traits, including psychiatric disorders. Specifically, the direction of genetic correlations between schizophrenia, major depressive disorder, and obesity (among others) was negative for AUDIT-C and positive for AUDIT-P.
In the present study, we evaluate the independent and overlapping genetic contributions to AUDIT-C and AUD in a single large multi-ancestry sample from the Million Veteran Program (MVP) 22 . Large-scale biobanks such as the MVP offer the potential to link genes to health-related traits documented in the electronic health record (EHR) with greater statistical power than can ordinarily be achieved in prospective studies 23 . Such discoveries improve our understanding of the etiology and pathophysiology of complex diseases and their prevention and treatment. To that end, we use a common data source-longitudinal repeated measures of alcohol-related traits from the national Veterans Health Administration (VHA) EHR-to obtain the mean, age-adjusted AUDIT-C score and International Classification of Diseases (ICD) alcohol-related diagnosis codes over more than 11 years of care 24 . We then conduct a GWAS of each trait followed by downstream analysis of the findings in which we construct Polygenic Risk Scores (PRS) for both traits and show that they are associated with alcohol-related disorders in two independent samples. The availability of data on alcohol consumption from the AUDIT-C and a formal diagnosis of AUD from the EHR enables us to examine the relationship between these key alcohol-related traits in a single, well-phenotyped sample and to compare the findings for these traits more systematically than has previously been possible.

Results
Principal components analysis. We differentiated participants genetically into five populations (see Methods, Supplementary  Fig. 1) and removed outliers. There was a high degree of concordance ( Supplementary Fig. 2) between the genetically defined populations and the self-reported groups for European Americans (EAs, 95.6% were self-reported Non-Hispanic white) and African Americans (AAs, 94.5% were self-reported Non-Hispanic black). Concordance ranged from 53.1% to 81.6% in the other three population groups.
The GWAS findings largely reflect male-specific signals due to the predominantly male sample (Supplementary Table 1). However, sex-stratified GWAS also identified two female-specific signals for AUDIT-C (Supplementary Data 3, Supplementary Figs. 9, 10) and one for AUD (Supplementary Data 4,Supplementary Figs. 11,12).
For AUDIT-C, when associations for the seven LD-pruned GWS SNPs on chromosome 4q23-q24 in EAs are conditioned on rs1229984, the most significant functional SNP in the region in that population, the only independent signal (using a Bonferroni-corrected p value < 0.05) is for rs1229978, near ADH1C (Supplementary Data 5). For AUD, when associations for the four LD-pruned GWS SNPs in EAs are conditioned on   rs1229984, the only independent signal in that region is for rs1154433 near ADH1C. In the trans-population meta-analysis, rs5860563 is independent when conditioned on rs1229984 in EAs, and on rs2066702 in AAs, the most significant functional SNP in the region in AAs (Supplementary Data 6).
To elucidate further the genetic differences between AUDIT-C and AUD, we conducted a GWAS of each phenotype with the other phenotype as a covariate. A GWAS of AUDIT-C with AUD as a covariate identified 10 GWS loci in EAs and 2 GWS loci in AAs (Supplementary Data 7). In both EAs and AAs, all loci overlapped with the GWS findings for AUDIT-C alone. A GWAS of AUD that included AUDIT-C as a covariate identified five GWS loci in EAs and one in AAs (Supplementary Data 8). Among EAs, four of the loci were the same as for AUD, the only non-overlapping finding being DIO1 (Iodothyronine Deiodinase 1). In AAs, ADH1B remained significant for AUD when accounting for AUDIT-C, but TSPAN5 did not.
Using a sign test, most SNPs have the same direction of effect for AUDIT-C and AUD, consistent with the high genetic correlation between the traits. For SNPs with p value <1 × 10 −6 the sign concordance between the two traits is 98.7% in EAs and 100% in the other four, smaller population groups.
Body mass index-adjusted GWAS. Because FTO was GWS for both AUDIT-C and AUD, we repeated the two GWASs correcting for body mass index (BMI). Among the top SNPs associated with AUDIT-C and AUD, most remain GWS after correction for BMI, though the significance level of some change (Supplementary Data 9, 10). FTO SNPs become only nominally significant for both alcoholrelated traits: the p value for the lead SNP for AUDIT-C, rs9937709, decreases in significance from 5.53 × 10 −14 to 1.42 × 10 −5 and for the lead SNP for AUD, rs11075992, it decreases in significance from 3.22 × 10 −10 to 3.02 × 10 −5 . In contrast, with correction for BMI, some signals increase, e.g., for rs1260326 in GCKR the p value increases in significance from p = 2.04 × 10 −16 to p = 2.91 × 10 −19 for AUDIT-C and from p = 2.27 × 10 −13 to p = 1.71 × 10 −14 for AUD. Similarly, rs1229984 in ADH1B increases in significance from p = 3.62 × 10 −133 to p = 9.81 × 10 −145 for AUDIT-C and from p = 4.68 × 10 −85 to p = 3.85 × 10 −89 for AUD.
Gene-based analyses. For AUDIT-C score, gene-based association analyses identify 31 genes in EAs that are GWS (p < 2.69 × 10 −6 ), 3 in AAs, 1 in LAs, and 2 in EAAs ( Supplementary Fig. 13), including many of the loci in the SNP-based analyses for that trait. The unique genes in EAs include C4orf17, ZNF512, MTTP, TBCK, and MCC. For AUDIT-C, the loci that were not GWS in the SNPbased analyses included EIF4E in AAs, MAP2 in LAs, and LOX and MYL2 in EAAs.
For AUD, we identify 23 GWS genes in EAs, 5 in AAs, and 1 in LAs ( Supplementary Fig. 14), many of which are GWS loci in the SNP-based analyses for that trait. For AUD, the loci in EAs that are not GWS in the SNP-based analyses are KRTCAP3, TRMT10A, ZNF512, DCLK2, MTTP, and MCC. In AAs, EIF4E, ADH4, and METAP1 are GWS for AUD, while ADGRB2 is the only GWS locus in LAs.  Pathway and biological enrichment analyses. Using Functional Mapping and Annotation (FUMA) 26 software to investigate the pathway or biological process enrichment with summary statistics as input and false discovery rate (FDR) correction for multiple testing, we find multiple reactome and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways that are significantly enriched for AUDIT-C (   In the analysis of stratified heritability enrichment using LDSC 28 (see Methods), several cell line functional enrichments were significant (FDR < 0.05) for AUDIT-C (Supplementary Data 19) and AUD (Supplementary Data 20). Cell type group partitioning heritability enrichment analyses indicated that central nervous system (CNS) was the most significant cell type for AUDIT-C (Fig. 2b, upper panel; Supplementary Data 21) and the only significant cell type for AUD (Fig. 2b, bottom panel; Supplementary Data 22). Enrichments for AUDIT-C were also detected for cardiovascular, adrenal or pancreatic, skeletal muscle, other, and liver cell types in descending order of significance. We also tested the heritability enrichments using data from gene expression and chromatin to identity disease-related tissues or cell types 29 (Supplementary Data 23-32). We found a few epigenetic features in brain tissues-e.g., H3K4me1, H3K4me3, and DNase-that were significantly enriched for each trait.
Genetic correlations. We estimated the genetic correlation (r g ) between different datasets or populations using LDSC 30  We tested the difference between genetic correlations for AUDIT-C and AUD using a two-tailed z-test. After correction for 714 tested traits, the genetic correlations for 188 traits showed significant differences between the two alcohol-related traits (Supplementary Data 36). We explored trait and disease associations for AUDIT-C-adjusted for AUD and AUDadjusted for AUDIT-C, and found that the genetic correlations between the alcohol-related traits and other phenotypes did not differ substantially from the unadjusted ones (Supplementary Data 37,38). Additionally, we explored genetic correlations for AUDIT-C-adjusted for BMI (Supplementary Data 39) and AUD-adjusted for BMI (Supplementary Data 40). Most of the genetic correlations for AUDIT-C-adjusted for BMI did not differ substantially from the unadjusted ones, except for anthropometric traits, where the negative correlation was attenuated (although still significant). Significant genetic correlations for AUD-adjusted for BMI did not differ substantially from those for AUD alone. We also explored prior GWAS associations for the GWS SNPs from AUDIT-C and AUD analyses and found associations with other phenotypes for five of them (Supplementary Data 41).
Polygenic Risk Scores. We examined PRS generated from the AUDIT-C and AUD GWASs in three samples ( Supplementary  Figs. 23-26). First, in a hold-out MVP sample of EAs and AAs (described in Methods), AUDIT-C and AUD PRS were significantly associated with both AUDIT-C and AUD phenotypes (Supplementary Data 42, 43). Lower p value thresholds of AUDIT-C PRS were associated with AUDIT-C score and AUD diagnosis codes, with the most significant being 1 × 10 −7 (EA AUDIT-C: β = 0.088, p = 1. Second, in an independent sample from the Penn Medicine BioBank, AUDIT-C and AUD PRS were significantly associated with alcohol-related disorders and alcoholism phecodes (see Methods and Supplementary Data 44 and 45). In EAs, higher AUDIT-C risk scores significantly increased the likelihood of alcohol-related disorders and alcoholism at multiple p value thresholds, with the most significant being 1 × 10 −7 (β = 0.278, p = 0.0013) and 1 × 10 −6 (β = 0.245, p = 0.0074), respectively. In AAs, at a p value threshold of 1 × 10 −7 , AUDIT-C risk scores were non-significantly associated with risk of alcohol-related disorders (β = 0.210, p = 0.064) but significantly associated with alcoholism (β = 0.400, p = 0.0051). In both populations, AUD risk scores were significantly associated with both alcohol-related disorders and alcoholism. In EAs, the most significant p value threshold was 1 × 10 −4 (alcohol-related disorders: β = 0.254, p = 0.0006; alcoholism: β = 0.229, p = 0.0062), while in AAs, the most significant p value threshold was 1 × 10 −6 (alcoholrelated disorders: β = 0.306, p = 0.006; alcoholism: β = 0.440, p = 0.0007).
Third, in the Yale-Penn study sample 25 , an independent sample ascertained for substance use disorders, the PRS of AUDIT-C and AUD were significantly associated with DSM-IV alcohol dependence criterion counts (see Methods and Supplementary Data 46,47). In EAs, all AUDIT-C risk scores were significantly associated with the criterion count, with the most significant p value threshold being 1 × 10 −7 (β = 1.029, p = 6.67 × 10 −13 ). Similarly, all AUD risk scores were significantly associated with the criterion count, the most significant p value threshold being 1 × 10 −6 (β = 1.144, p = 1.86 × 10 −16 ). In AAs, all but one AUDIT-C risk score and all AUD risk scores were significantly associated with the alcohol dependence criterion count, the most significant p value threshold being 1 × 10 −7 (AUDIT-C: β = 0.829, p = 1.18 × 10 −11 ; AUD: β = 0.502, p = 4.62 × 10 −8 ).
Secondary phenotypic associations. To identify secondary phenotypes associated with AUDIT-C or AUD, we performed a phenome-wide association analysis (PheWAS) of the AUDIT-C and AUD PRS (p value threshold = 1 × 10 −7 and all SNPs) in the MVP hold-out sample (Supplementary Data 48, 49). In EAs, the AUDIT-C PRS was significantly associated with an increased risk of alcoholic liver damage, and nominally associated with a decreased risk of hyperglyceridemia. No significant associations were found for AAs. The AUD PRS was significantly associated with an increased risk of tobacco use disorder in both EAs and AAs, and in EAs with multiple psychiatric disorders, including major depression, bipolar disorder, anxiety, and schizophrenia.

Discussion
We report here a GWAS of two alcohol-related traits in a sample of 274,424 MVP participants from five population groups-EA, AA, LA, EAA, and SAA-using two EHR-derived phenotypes: age-adjusted AUDIT-C score and AUD diagnostic codes. In addition to the large number of EAs, the study included large numbers of African-American and Latino-American participants. Trans-population meta-analyses identified 13 independent GWS loci for AUDIT-C and 10 independent GWS loci for AUD. For AUDIT-C, in addition to the loci identified in the SNP-based analyses, there were 31 GWS genes in EAs, 3 in AAs, 1 in LAs, and 2 in EAAs. For AUD, in addition to the loci identified in the SNP analyses, there were 23 GWS genes identified in EAs, 5 in AAs, and 1 in LAs.
Using both AUDIT-C scores and AUD diagnoses enabled us to examine the relations between these key alcohol-related traits. The findings underscore the utility of using an intermediate trait, such as alcohol consumption, for genetic discovery. Five of the 13 loci associated with AUDIT-C score, a measure of alcohol consumption, including the two most commonly identified alcohol metabolism genes (ADH1B and ADH1C) and three highly pleiotropic genes (GCKR, SLC39A8, and FTO), contributed to AUD risk. Of the 10 loci that were GWS for AUD, half also were associated with AUDIT-C score, while half were uniquely associated with the AUD diagnosis: ADH4, SIX3, a variant on chr10q25.1 and 2 variants in DRD2.
In addition to multiple overlapping variants for AUDIT-C and AUD, we found a moderate-to-high genetic correlation between the traits: 0.522 in EAs and 0.930 in AAs. There are two potential explanations for the population difference in genetic correlation. First, it may reflect a bias in the assignment of AUD diagnoses by clinicians (e.g., in the context of a high AUDIT-C score, clinicians could be less likely to assign an AUD diagnosis to EAs than AAs, reducing the genetic correlation). Second, because LD structure in admixed populations is complex, LD score regression could have inflated the genetic correlation among AAs, an admixed population. Another factor relevant to this difference is the smaller number of AAs, which despite a higher r g , yielded a larger standard error. The genetic similarity between these alcoholrelated traits is consistent with twin studies of alcohol dependence and alcohol consumption 31,32 . These findings are also consistent with the PRS analyses in the MVP sample, where both AUDIT-C and AUD PRS were associated with AUDIT-C and AUD phenotypes. Both traits also predicted multiple alcohol-related phenotypes in independent datasets, including alcohol dependence criteria in the Yale-Penn sample. However, there was a smaller effect of AUDIT-C PRS scores than AUD PRS scores on alcoholrelated disorders and alcohol dependence. This is in line with findings from the meta-analysis of UKBB and 23andMe data, where the genetic correlation with alcohol dependence was nominally greater for AUDIT-P scores (r g = 0.63) than AUDIT-C scores (r g = 0.33) 8 .
Despite the significant genetic overlap between the AUDIT-C and AUD diagnosis, downstream analyses revealed biologically meaningful points of divergence. The AUDIT-C yielded some GWS findings that did not overlap with those for AUD, which reflects genetic independence of the traits. This broadens our previous observations using SNPs in ADH1B, in which we validated the AUDIT-C score as an alcohol-related phenotype 33 . In that study, after accounting for the effects of AUDIT-C score, AUD diagnoses accounted for unique variance in the frequency of ADH1B minor alleles.
Evidence of genetic independence between the two traits was most striking in the differences between the genetic correlation analyses. After correction, genetic correlations for 188 traits differed significantly (some in opposite directions) between AUDIT-C and AUD. Notably, these included a negative association of AUDIT-C with anthropometric traits, including BMI; coronary artery disease; and glycemic traits, including Type 2 diabetes. The negative genetic correlation with coronary artery disease is consistent with some epidemiological findings that alcohol consumption protects against some forms of cardiovascular disease 34 . AUDIT-C was positively genetically correlated with overall health rating, HDL cholesterol concentration, and years of education, findings that are consistent with prior literature showing genetic correlation of these traits with alcohol consumption 7,8,21 . AUD was significantly genetically correlated with 111 traits or diseases, including negative genetic correlations with intelligence, years of education and quitting smoking, and positive genetic correlations with insomnia, ever having smoked and most psychiatric disorders, findings that are consistent with phenotypic associations in the epidemiological literature [35][36][37] and genetic correlations reported from the UKBB and 23andMe GWASs and their metaanalysis 7,8,21 . The opposite genetic correlations seen for some traits may be driven by low-effect variants, as we find close to 100% consistency in the direction of effect for the most significantly associated SNPs for both AUDIT-C and AUD. Further, in the MVP sample, the AUD PRS was significantly positively associated with tobacco use and multiple psychiatric disorders, whereas the AUDIT-C PRS was not. Taken together, these findings suggest that AUD and alcohol consumption, measured by AUDIT-C, are related but distinct phenotypes, with AUD being more closely related to other psychiatric disorders, and AUDIT-C to some positive health outcomes.
Although the protective effects of moderate drinking are controversial, we found that alcohol consumption in the absence of genetic risk for AUD may protect from cardiovascular disease, diabetes mellitus, and major depressive disorder. In contrast, individuals with genetic risk for AUD are at elevated risk for some adverse secondary phenotypes, including insomnia, smoking, and other psychiatric disorders. However, individuals who have had health problems resulting from drinking are more likely to reduce or stop drinking by middle age or under-report their alcohol consumption. This offers an alternative explanation for the opposite genetic associations 38 , particularly in an older clinical sample in which a large proportion report current abstinence (reflected in an AUDIT-C score of 0). For this complex set of genetic associations to be useful in informing clinical recommendations on safe levels of alcohol consumption, it will be necessary to elucidate the mechanisms underlying these findings.
Both phenotypes showed cell type-specific enrichments for CNS. Other relevant cell types for AUDIT-C, but not for AUD, included cardiovascular, adrenal or pancreas, liver, and musculoskeletal. Thus, although heavy drinking is prerequisite to the development of AUD, the latter is a polygenic disorder and variation in genes expressed in the CNS (e.g., DRD2) may be necessary for individuals who drink heavily to develop AUD. As a binary trait, AUD provided less statistical power to identify genetic variation than the ordinal AUDIT-C score, but the multiple GWS findings unique to AUD argue against that as an explanation for the non-overlapping GWS findings for the two traits.
The VHA EHR provided a rich source of phenotypic data. These included mean age-adjusted AUDIT-C scores, which are more stable than measures at a single point in time (more likely reflecting traits rather than states) and contrast with metaanalytic studies that may use phenotypes reflecting the lowestcommon denominator among the studies comprising the sample. However, our analyses were limited by our reliance on the AUDIT-C, which includes only the first 3 of the 10 AUDIT items. We also obtained cumulative AUD diagnoses, which are also more informative than assessments obtained at a single time point. Because the diagnosis of AUD is based on features other than alcohol consumption per se 2,5 , use of the AUD diagnosis from the EHR augmented the information provided by the AUDIT-C phenotype. Although EHR diagnostic data are heterogeneous, large-scale biobanks such as the MVP yield greater statistical power to link genes to health-related traits repeatedly documented over time in the EHR than can ordinarily be achieved in prospective studies 23 , justifying the lower resolution of EHR data. However, because the MVP sample is predominantly comprised of EA males, statistical power was limited in both the GWAS and the post-GWAS analyses of other populations and some female samples. Future studies with larger sample sizes are needed to identify additional variation contributing to these alcohol-related traits and to elucidate their interrelationship.
The SNP heritability of our GWASs was lower than that seen in the meta-analysis of the UKBB and 23andMe data 8 . For the AUDIT-C, the estimated SNP heritability was 0.068 in EAs (0.068 in males and 0.099 in females) and 0.062 in AAs. For AUD, the estimated SNP heritability was 0.056 in EAs (0.054 in males and 0.110 in females) and 0.100 in AAs. These estimates may reflect the lower number of SNPs tested in our sample compared with the meta-analysis of UKBB and 23andMe data. The nominally higher SNP heritability in females than males could be due to the substantially smaller size of the female subsample. Alternatively, women could have a higher liability-threshold and therefore a higher burden of risk variants. Because our study sample was predominantly male, we do not have adequate statistical power to evaluate these hypotheses. Although we found no significant difference in PRS between males and females, because of the substantially smaller number of women in MVP, there is much less power for the PRS in this subgroup and for comparing the PRS by sex.
Despite these limitations, the large, diverse, and similarly ascertained sample enabled us to identify multiple GWS findings for both AUDIT-C score and AUD diagnosis, and thereby to help elucidate the relationship between drinking level and AUD risk. The large sample provided high power for PRS analyses in other samples, as demonstrated here in the Penn Medicine Biobank and Yale-Penn samples. The genetic differences between the two alcohol-related traits and the observed opposite genetic correlations between them point to potentially important differences in comorbidity and prognosis. Our findings underscore the need to identify the functional effects of the risk variants, especially where they diverge by trait, to elucidate the nature of the trait-related differences. Focusing on variants linked to AUD, but not AUDIT-C, could identify targets for the development of medications to treat the disorder, while variation in AUDIT-C could help in developing interventions to reduce drinking and thereby prevent the morbidity associated with it. The findings reported here could also help to identify individuals at high risk of AUD through the use of PRS. This effort could be augmented using knowledge of the full set of phenotypes that associate with AUD through the use of genetic correlations and PheWASs.

Methods
Data collection. The MVP is an observational cohort study and biobank supported by the U.S. Department of Veterans Affairs (VA). Phenotypic data were collected from MVP participants using questionnaires and the VA EHR and a blood sample was obtained for genetic analysis.
Ethics statement: The Central Veterans Affairs Institutional Review Board (IRB) and site-specific IRBs approved the MVP study. All relevant ethical regulations for work with human subjects were followed in the conduct of the study and informed consent was obtained from all participants.
Phenotypes. AUDIT-C scores and AUD diagnostic codes were obtained from the VA EHR. The AUDIT-C comprises the first three items of the AUDIT and measures typical quantity (item 1) and frequency (item 2) of drinking and frequency of heavy or binge drinking (item 3). The AUDIT-C is a mandatory annual assessment for all veterans seen in primary care. Our analyses used AUDIT-C data collected from 1 October 2007 to 23 February 2017. We validated the phenotype in a sample of 1851 participants from the Veterans Aging Cohort Study 33 , in which we found a highly significant association of AUDIT-C scores with the plasma concentration of phosphatidylethanol, a direct, quantitative biomarker that is correlated with the level of alcohol consumption. In the AA part of this sample (n = 1503), the AUDIT-C score was highly significantly associated with rs2066702, a missense (Arg369Cys) polymorphism of ADH1B, the minor allele of which is common in that population and has been associated with alcohol dependence 25 . We also examined AUDIT-C scores in 167,721 MVP participants (57,677 AAs and 110,044 EAs) 24 , comparing the association of AUDIT-C scores and AUD diagnoses with the frequency of the minor allele of rs2066702 in AAs and rs1229984 (Arg48His) in EAs. Both polymorphisms exert large effects on alcohol metabolism 39 and are among the genetic variants associated most consistently with alcohol-related traits in both AAs and EAs 8,12,18 . In both populations, we found a stronger association between age-adjusted mean AUDIT-C score and ADH1B minor allele frequency than between AUD diagnostic codes and the frequency of the minor alleles 24 . However, because AUD diagnoses accounted for unique variance in the frequency of the minor alleles in both populations, we concluded that the two phenotypes, although correlated, are distinct traits. Thus, in the present study, we used GWAS to examine these traits separately and to adjust for the effects of AUD in the AUDIT-C GWAS and the effects of AUDIT-C in the GWAS of AUD.
We calculated the age-adjusted mean AUDIT-C value 24 for each participant using age 50 as the reference point and down-weighting scores for individuals younger than 50 and up-weighting scores for individuals older than 50. The ageadjusted mean AUDIT-C was computed using a sample of 495,178 participants with data on age and AUDIT-C, of whom 272,842 had genetic data and were included in the AUDIT-C genetic analyses.
The principal classes of alcohol-related disorders in the ICD are alcohol abuse and alcohol dependence. We used ICD-9 codes 303.X (dependence) and 305-305.03 (abuse) and ICD-10 codes F10.1 (abuse) and F10.2 (dependence) to identify subjects diagnosed with either of these disorders, as suggested previously 40 (see Supplementary Table 2). Participants with at least one inpatient or two outpatient alcohol-related ICD-9/10 codes (N = 274,391) from 2000 to 2018 were assigned a diagnosis of AUD, an approach that has been shown to yield greater specificity of ICD codes than chart review 41 .
Genotyping and imputation. MVP GWAS genotyping was performed using an Affymetrix Axiom Biobank Array with 686,693 markers. Subjects or SNPs with genotype call rate <0.9 or high heterozygosity were removed, leaving 353,948 subjects and 657,459 SNPs for imputation 22 .
Imputation was performed with EAGLE2 (ref. 42 ) to pre-phase each chromosome and Minimac3 (ref. 43 ) to impute genotypes with 1000 Genomes Project phase 3 data 44 as the reference panel. Subjects with no demographic information or whose genotypic and phenotypic sex did not match were removed. We also removed one subject randomly from each pair of related individuals (kinship coefficient threshold = 0.0884). A greedy algorithm was implemented for network-like relationships among three or more individuals, leaving 331,736 subjects for subsequent analyses. Imputed genotypes with posterior probability ≥0.9 were transferred to best guess. We removed both genotyped and imputed SNPs with genotype call rates or best guess rates ≤0.95 and HWE p value ≤ 1 × 10 −6 in each population, using different MAF thresholds to filter SNPs: EA (0.0005), AA (0.001), LA (0.01), EAA (0.05), and SAA (0.05). The approximate number of SNPs remaining in each population was EA: 6.8 million, AA: 12.5 million, LA: 5.6 million, EAA: 2.6 million, and SA: 2.6 million.  ; 144 controls) in the AUD GWAS. We used linear regression for the GWAS of age-adjusted mean AUDIT-C score and logistic regression for AUD diagnosis; in both cases age, sex, and the first 10 PCs were covariates. To evaluate the impact on AUD findings of controlling for AUDIT-C and the impact on AUDIT-C findings of controlling for AUD, we repeated the GWAS for AUD with AUDIT-C as a covariate and AUDIT-C with AUD as a covariate. For both phenotypes, following GWAS in each of the five populations, the summary statistics were combined within phenotype in trans-population metaanalyses. SNPs in EAs or those present in at least two populations were metaanalyzed. Sex-stratified GWAS for both phenotypes were then performed in groups large enough to permit it-EA, AA, LA, and EAA men and EA, AA, and LA women-and the data were meta-analyzed within sex and phenotype. All metaanalyses were performed using a sample-size-weighted scheme that was implemented in METAL 47 .
To identify independent signals in each population, we performed LD clumping using PLINK v1.90b4.4 (ref. 48 ). We identified an index SNP (p < 5 × 10 −8 ) with the smallest p value in a 500-kb genomic window and r 2 < 0.1 with other index SNPs. Because in EAAs there is extended linkage disequilibrium at the ALDH2 locus, we used a 2500-kb window in that population. In the chr4q23-q24 region, where we identified multiple apparently independent signals for both AUDIT-C and AUD, we used conditional associations to differentiate independent signals from partially overlapping ones.
Gene-based association analysis. Gene-based association analysis was performed using Multi-marker Analysis of GenoMic Annotation (MAGMA) 49 , which uses a multiple regression approach to detect multi-marker effects that account for SNP p values and LD between markers. We used the default setting (no window around genes) to consider 18,575 autosomal genes for the analysis, with p < 2.69 × 10 −6 (0.05/18,575) considered GWS. For each population, we used the respective population from the 1000 Genomes Project phase 3 as the LD reference.
Enrichment analyses. Pathway and biological enrichment analyses were performed for each population using the FUMA platform 26 , with independent significant SNPs identified using the default settings. Positional gene mapping identified genes up to 10 kb from each independent significant SNP. Hypergeometric tests were used to examine the enrichment of prioritized chemical and genetic perturbation gene sets, canonical pathways, and GO biological processes (obtained from MsigDB c2), and GWAS-catalog enrichment (obtained from reported genes from the GWAS-catalog). We report all significantly enriched gene sets based on an FDR-adjusted p value <0.05.
Heritability and partitioning of heritability. LDSC 27 was used to calculate population-specific LD scores based on 1000 Genomes phase 3 datasets according to the LDSC tutorial, using SNPs selected from HapMap 3 (ref. 50 ) after excluding the major histocompatibility complex (MHC) region (chr6: 26-34Mb); only ancestry groups with large sample size (N > 10,000) were analyzed using LDSC. Of note, LDSC could be biased in admixed populations because reference panels are not provided for AAs and LAs in that application 27 . We calculated LD scores for 1,215,001 SNPs in EAs; 1,322,841 SNPs in AAs; and 1,243,726 SNPs in LAs. The LDSC analyses used SNPs with imputation INFO ≥ 0.9 in each population and that were LD scored in 1000 Genomes. LD score regression intercepts for available datasets were estimated to distinguish polygenic heritability from inflation. SNPbased heritability (h 2 SNP ) was estimated from GWAS summary statistics for both AUDIT-C and AUD. The sex-specific h 2 SNP was also estimated in EA males and females, AA males, and LA males.
We estimated partitioned h 2 SNP using genomic features or functional categories 28 for both AUDIT-C and AUD in the largest dataset, EAs, and then tested for enrichment of the partitioned h 2 SNP in different annotations. First, we used a baseline model consisting of 53 functional categories, including UCSC gene models [exons, introns, promotors, untranslated regions (UTRs)], ENCODE functional annotations 51 , Roadmap epigenomic annotations 52 , and FANTOM5 enhancers 53 . We then analyzed cell type-specific annotations and identified enrichments of h 2 SNP in 10 cell types, including adrenal and pancreas, CNS, cardiovascular, connective tissue and bone, gastrointestinal, immune and hematopoietic, kidney, liver, skeletal muscle and other. Gene expression and chromatin data were also analyzed to identify disease-relevant tissues, cell types, and tissue-specific epigenetic annotations. We used LDSC to test for enriched heritability in regions surrounding genes with the highest tissue-specific expression or with epigenetic marks 29 . Sources of data that were analyzed included 53 human tissue or cell type RNA-seq data from the Genotype-Tissue Expression Project (GTEx) 54 ; 152 human, mouse, or rat tissue or cell type array data from the Franke lab; 55 3 sets of mouse brain cell type array data from Cahoy et al. 56 ; 292 mouse immune cell type array data from ImmGen 57 ; and 396 human epigenetic annotations (6 features in 88 primary cell types or tissues) from the Roadmap Epigenomics Consortium 52 . In the analysis of each trait in each dataset, we used FDR < 0.05 to indicate significant enrichment for the h 2 SNP .
Genetic correlations. We estimated the genetic correlation (r g ) between AUDIT-C and AUD (from MVP), and with other traits in LD Hub 58 or from published studies using LDSC, which is robust to sample overlap 30 . First, we estimated the r g between AUDIT-C and AUD using the summary data generated in this study. AAs, EAs, and LAs were analyzed separately using the corresponding 1000 Genome phase 3 population as reference. Genetic correlations in EAA and SAA were not analyzed. The r g between AUDIT-C and AUD was out of bounds because the h 2 SNP did not differ significantly from zero. Then we tested the r g between males and females within each trait. We estimated the r g for AUDIT-C and AUD with 216 published traits in LD Hub and 493 unpublished traits from the UK Biobank. We resolved redundancy in phenotypes by manually selecting the published version of the phenotype or using the largest sample size. We also calculated the genetic correlations of both AUDIT-C and AUD with five traits for which GWAS were recently published or posted, including anorexia nervosa 59 , alcohol dependence 13 , attention deficit hyperactivity disorder 60 , autism spectrum disorder 61 , and major depressive disorder (summary data without the 23andMe sample) 62 , bringing the total number of tested traits to 714. A Bonferroni correction was applied separately for AUDIT-C and AUD, and traits with a corrected p value <0.05 were considered significantly correlated. Because the results were similar whether the intercept was constrained or not, we present here the original results without constraint.
Polygenic Risk Scores. To generate PRS from GWAS summary statistics in the MVP sample, we first conducted a GWAS for AUDIT-C and AUD as above, but restricted our analysis to two-thirds of the total sample by splitting the total sample randomly, keeping the number of AUD cases/controls balanced in each part (EA: N = 139,346, AA: N = 38,226). GWS loci identified from this analysis were the same as those in the larger sample, but were slightly decreased in significance. PRS were generated for the remaining EAs (N = 69,674) and AAs (N = 19,114) as the sum of all variants carried, weighted by the effect size of the variant in the GWAS. PRS were generated using PLINK2 (ref. 63 ). We performed p value informed clumping with a distance threshold of 250 kb and r 2 = 0.1. Risk scores were calculated for a range of p value thresholds (P ≤ 1 × 10 −7 , 1 × 10 −6 , 1 × 10 −5 , 1 × 10 −4 , 1 × 10 −3 , 0.01, 0.05, 0.5, 1.0). PRS were standardized with mean = 0 and SD = 1. Logistic regression was used to test for association with AUDIT-C and AUD phenotypes, with PRS as the independent variable and AUDIT-C or AUD as the dependent variable, with age, sex, and the first five PCs as covariates.
Population-specific summary statistics from the AUDIT-C and AUD GWAS in MVP were used to generate PRS in the PMBB, an independent sample. PRS were generated for EAs (N = 8524) and AAs (N = 2031) as above using the PRSice2 package 64 with imputed allele dosage as the target dataset. As recommended in the software, we performed p value informed clumping with a distance threshold of 250 kb and r 2 = 0.1. We excluded the MHC region. Risk scores were calculated for a range of p value thresholds (p ≤ 1 × 10 −7 , 1 × 10 −6 , 1 × 10 −5 , 1 × 10 −4 , 1 × 10 −3 , 0.01, 0.05, 0.5, 1.0) and standardized with mean = 0 and SD = 1. To identify individuals with alcohol-related disorders, we utilized phecodes, a method to aggregate ICD codes 65 . First, we extracted ICD-9 and ICD-10 data for 48,610 individuals from the EHR. To facilitate mapping to phecodes, ICD-10 codes were back converted to ICD-9 using 2017 general equivalency mapping (GEM). The ICD-10 conversions were combined with the ICD-9 codes to create a dataset with 10,682 unique ICD-9 codes. ICD-9 codes were aggregated to phecodes using the PheWAS R package 65 to create 1812 phecodes. Individuals are considered cases for the phenotype if they had at least two instances of the phecode, controls if they had no instance of the phecode, and other/missing if they had one instance or a related phecode. Logistic regression was used to test the association of the PRS with the alcohol-related disorders phecode (phecode number 317) and its sub-phenotype, alcoholism (phecode number 317.1). The analysis was performed in R with PRS as the independent variable and diagnosis as the dependent variable and age, sex, and the first 10 PCs as covariates.
We also tested the PRS of AUDIT-C and AUD for DSM-IV alcohol dependence criterion counts in the Yale-Penn cohort 25 . There are three phases of the Yale-Penn sample: phase 1 contains 3110 AAs and 1718 EAs exposed to alcohol; phase 2 contains 1667 AAs and 1689 EAs exposed to alcohol; phase 3 contains 556 AAs and 999 EAs exposed to alcohol. PRS were generated for EAs and AAs in each phase as described above and risk scores were calculated for a range of p value thresholds (p ≤ 1 × 10 −7 , 1 × 10 −6 , 1 × 10 −5 , 1 × 10 −4 , 1 × 10 −3 , 0.01, 0.05, 0.5, 1.0). Different from PRS in MVP and PMBB, to correct for the relatedness in the Yale-Penn subjects, a linear mixed model implemented in GEMMA 66 was used to test the association between PRS score and DSM-IV alcohol dependence criterion counts, with age, sex, and the first 10 PCs as covariates. Meta-analyses of data from the three phases were performed in AAs (N = 5333) and EAs (N = 4406) separately.
Phenome-wide association analysis. To conduct PheWAS, we extracted ICD-9 data from the EHR for 353,323 genotyped veterans. Of these, 277,531 individuals had two or more separate encounters in the VA Healthcare System in each of the 2 years prior to enrollment in MVP, consisting of 21,209,658 records. ICD-9 codes were aggregated to phecodes using the PheWAS R package to create 1812 phecodes. To improve the specificity of these codes, individuals with at least two instances of the phecode were considered cases, those with no instance of the phecode controls, and those with one instance of a phecode or a related phecode as other. A PheWAS using logistic regression models with either AUDIT-C or AUD PRS as the independent variable, phecodes as the dependent variables, and age, sex and the first five PCs as covariates were used to identify secondary phenotypic associations. A phenome-wide significance threshold of 2.96 × 10 −5 was applied to account for multiple testing.
Secondary GWAS Adjusted for BMI: As described below, for both alcoholrelated traits, we identified a GWS SNP in FTO, variation in which has been associated with BMI and risk of obesity 67 . To examine whether BMI confounded the association with this and other loci and the genetic correlations with other traits, we repeated the GWAS for AUDIT-C and AUD using BMI as an additional covariate. Data on BMI were from the MVP baseline survey and the EHR. For AUDIT-C, 200,092 EAs; 56,239 AAs; 14,029 LAs; 1352 EAAs; and 185 SAAs had BMI data available. For AUD, 201,320 EAs; 56,347 AAs; 14,075 LAs; 1360 EAAs; and 186 SAAs had BMI data available. After GWAS, we analyzed the genetic correlations between BMI-adjusted traits and other publicly available traits (N = 714), with Bonferroni correction for multiple testing.
Reporting summary. Further information on experimental design is available in the Nature Research Reporting Summary linked to this article.