Genomewide association study of cocaine dependence and related traits: FAM53B identified as a risk gene

We report a GWAS for cocaine dependence (CD) in three sets of African- and European-American subjects (AAs and EAs, respectively), to identify pathways, genes, and alleles important in CD risk. The discovery GWAS dataset (n=5,697 subjects) was genotyped using the Illumina OmniQuad microarray (890,000 analyzed SNPs). Additional genotypes were imputed based on the 1000 Genomes reference panel. Top-ranked findings were evaluated by incorporating information from publicly available GWAS data from 4,063 subjects. Then, the most significant GWAS SNPs were genotyped in 2,549 independent subjects. We observed one genomewide-significant (GWS) result: rs7086629 at the FAM53B (“family with sequence similarity 53, member B”) locus. This was supported in both AAs and EAs; p-value (meta-analysis of all samples) =4.28×10−8. The gene maps to the same chromosomal region as the maximum peak we observed in a previous linkage study. NCOR2 (nuclear receptor corepressor 1) SNP rs150954431 was associated with p=1.19×10−9 in the EA discovery sample. SNP rs2456778, which maps to CDK1 (“cyclin-dependent kinase 1”), was associated with cocaine-induced paranoia in AAs in the discovery sample only (p=4.68×10−8). This is the first study to identify risk variants for CD using GWAS. Our results implicate novel risk loci and provide insights into potential therapeutic and prevention strategies.


INTRODUCTION
Cocaine dependence (CD) is a serious form of substance dependence, with lifetime prevalence in the United States of 1.0%. 1 Cocaine use is costly to society, directly contributing to morbidity and medical costs, lost workdays, and other adverse individual, interpersonal, and societal effects.
CD is understudied, particularly in relation to the extent of the individual and societal problems it causes. Animal studies have begun to elucidate the biological substrates of CD (e.g., ref. 2), but this has not been accompanied by comparable elucidation of the sources of the genetic contribution to this trait. CD has a heritability of about 0.65 in females 3 and 0.79 in males 4 , so the potential exists to identify specific genetic variants that underlie disease risk. There have been numerous candidate gene association studies of CD and related traits, and several genomewide linkage studies, the latter identifying chromosomal regions likely to harbor risk-influencing genes. [5][6] Genomewide association studies (GWAS), when adequately powered, have generally been successful at identifying genes responsible for some of the risk for most complex traits for which they have been employed. However, no GWAS for CD has been published to date. To our knowledge, the only other GWAS for an illegal substance dependence (SD) diagnosis with genomewide-significant (GWS) results is our investigation of opioid dependence (OD). 7 A previous GWAS of cannabis dependence 8 did not report GWS results.
In the present study, we sought to identify genes that modify risk for CD by means of a GWAS in family-based and case-control samples of 2,379 European Americans (EAs), including 1,809 subjects with CD, and 3,318 African Americans (AAs), including 2,482 subjects with CD. Multiple independent samples of EAs and AAs (2,549 identically ascertained subjects that we collected and 4,063 subjects from the Study of Addiction: Genetics and Environment (SAGE) dataset) were used to replicate and extend our findings. We identified one novel CD risk locus at genomewide significance (GWS) and numerous others, relevant to CD and the related trait of cocaine-induced paranoia (CIP), worthy of future investigation.

Subjects and Diagnostic Procedures
The GWAS discovery sample included a total of 5,697 subjects. A replication dataset (identically evaluated) was comprised of 2,549 subjects and was genotyped for individual markers. (Public domain GWASed samples were included in some analyses as well, as described below.) All subjects were recruited for studies of the genetics of cocaine, opioid, or alcohol dependence. The sample consisted of small nuclear families (SNFs) originally collected for linkage studies, and unrelated individuals. Subjects were recruited at five US clinical sites: Yale University School of Medicine (APT Foundation; New Haven, CT), the University of Connecticut Health Center (Farmington, CT), the University of Pennsylvania Perelman School of Medicine (Philadelphia, PA), the Medical University of South Carolina (Charleston, SC), and McLean Hospital (Belmont, MA). Details regarding the sample can be found in Table 1 and Supplementary Tables 1 and 2. Our previous CD linkage study 5 included a subset of the SNFs included in this study.
All subjects were interviewed using the Semi-Structured Assessment for Drug Dependence and Alcoholism (SSADDA) 5,9 to derive diagnoses for lifetime CD and other major psychiatric traits according to DSM-IV 10 criteria. CIP was assessed with a specific item on the SSADDA, which we have previously shown to be valid. 11 The inter-rater reliability of the SSADDA for the diagnosis of CD was excellent (κ=0.83), 9 as was the reliability for the CIP trait assessment (κ =0.86).
The distribution of symptom count observations is given in Supplementary Figure S1.
Subjects gave written informed consent as approved by the institutional review board at each site and certificates of confidentiality were obtained from the National Institute on Drug Abuse and the National Institute on Alcohol Abuse and Alcoholism.

Genotyping and Quality Control
Samples from individuals in the discovery sample were genotyped on the Illumina HumanOmni1-Quad v1.0 microarray (988,306 autosomal SNPs). GWAS genotyping was conducted at the Yale Center for Genome Analysis (YCGA) and the Center for Inherited Disease Research (CIDR). Genotypes were called using GenomeStudio software V2011.1 and genotyping module version 1.8.4 (Illumina, San Diego, CA, USA). A total of 44,644 SNPs on the microarray and 135 individuals with call rates < 98% were excluded, and 62,076 additional SNPs were removed due to minor allele frequencies (MAF) <1%. Additional quality control details are described in Supplementary Materials. SAGE samples (see below) were genotyped on the Illumina Human1M array.
Follow-up genotyping in the replication sample was performed using a custom Illumina GoldenGate Genotyping Universal-32, 1536-plex microarray assay. Most SNPs included in the custom array were selected for studies of other phenotypes. Additional SNPs were genotyped individually using the TaqMan method. 12 To verify and correct misclassification of self-reported race, we compared the GWAS data from all subjects with the genotypes from the HapMap 3 reference CEU, YRI, and CHB populations. Principal components (PCs) analysis was conducted in the discovery GWAS sample using Eigensoft [13][14] and 145,472 SNPs that were common to the GWAS dataset and HapMap panel (after pruning the GWAS SNPs for linkage disequilibrium (r 2 )>80%) in each sample to characterize the underlying genetic architecture by deriving 10 PCs for each individual. The PCs were used to distinguish EAs from AAs by a K-means (K=2) clustering algorithm 15 and the two groups were analyzed separately. Because many subjects selfidentified as EA Hispanic or AA Hispanic, PC analyses were repeated within the AA and EA groups, and the first three PCs in each were used in all subsequent analyses to correct for residual population stratification within the group. 7 The same procedures to address population classification and substructure within groups were applied to the SAGE dataset.

Additional GWAS Sample: SAGE
In Phase 2 analyses described below, we included publicly available GWAS data (obtained via an application process) from the SAGE dataset, including individuals from the Collaborative Study on the Genetics of Alcoholism (COGA), 16 the Family Study of Cocaine Dependence (FSCD), 17 and the Collaborative Genetic Study of Nicotine Dependence (COGEND) (http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi? study_id=phs000092.v1.p1). 18 Information on these samples is provided in Supplementary Materials. The combined SAGE analysis set contained 1,311 AA and 2,752 EA individuals (Supplementary Table 1).

Analysis Overview
There were three independent sets of subjects employed in these CD-phenotype analyses, in three phases. Phases 1 and 2 evaluated GWAS data from two different sets of subjects using two different, but similarly dense, microarrays: Phase 1 included our GWAS discovery dataset, consisting of 5,697 subjects. Phase 2 also incorporated information from the SAGE dataset, with GWAS data from 4,063 subjects. The assessments used for our study (SSADDA) and the SAGE study are sufficiently similar for most phenotype data to be combinable directly. Phase 3 included our replication dataset of 2,549 subjects who were (directly) genotyped for selected SNPs rather than for GWAS. Thus, analyses included up to 8,246 of our own subjects and 12,309 subjects overall. The overall analytic design is similar to that of our OD GWAS study. 7 In our own subjects, we also analyzed the CIP trait (which is included in the SSADDA but not the SAGE assessment), limited to subjects with cocaine exposure.

Genotype Imputation
Genotypes for 37,426,733 SNPs were imputed with IMPUTE2 19 using the genotyped SNPs and the 1000 Genomes reference panel released in June of 2011 (http://www. 1000genomes.org/), which contains phased haplotypes for 1,094 individuals of various ancestries. 20 EA and AA samples were imputed separately. We considered for analysis imputed SNPs with r 2 greater than 0.8.

Statistical Methods for Association Analyses
Association tests in the GWAS datasets (our own dataset individually in Phase 1, then combined with the SAGE dataset in Phase 2) used linear or logistic association models with generalized estimating equations (GEE) to correct for the correlations among related individuals. We evaluated the replication sample of unrelated individuals-the part of the sample that was genotyped individually only for replication SNPs-using linear and logistic models. All models were adjusted for age, sex, and the first three PCs of ancestry.
Three primary analyses to identify genetic factors contributing to risk for CD 1. A model (Sympcount adj ) used imputed minor allele dosage as the dependent variable and DSM-IV symptom count for CD and each of three other major SD diagnoses (opioid, alcohol, and nicotine dependence-OD, AD, and ND, respectively) as ordinal predictors of genotype. This allowed us to remove the effect attributable to substances other than CD, thereby facilitating the identification of genetic risk factors unique to that trait by limiting confounding due to comorbid dependence symptoms. All individuals contributed to this analysis, including those meeting DSM-IV criteria for CD and individuals with 0-2 CD symptoms (who did not receive a diagnosis of CD). The ordinal trait model has greater power to detect genetic associations than a univariate model based on disease status because it contains more information and is more specific. The β coefficient and p-value for the CD symptom count (adjusted for the symptom counts for OD, AD, and ND) were used to assess the magnitude and significance of the association, respectively. To ensure that modeling minor allele dose as the dependent variable did not produce unreliable results and to assess the effects of comorbid dependence, we tested post hoc the top SNPs identified from this model in a model (Sympcount) using CD symptom count as the dependent variable and SNP (not adjusted for OD, AD, or ND) as the independent variable.

2.
We used case-control status as the outcome in logistic regression models but included as controls only individuals who had used cocaine at least once in their lives without becoming dependent. This excludes subjects who have genetic liability but were never exposed to cocaine (i.e., "false negative" cases).

3.
We used logistic regression to examine association with cocaine-induced paranoia (CIP). A majority of chronic cocaine users experience transient paranoid symptoms that typically resolve with abstinence. 21-23 CIP represents a genetically distinct phenotype reflecting inter-individual differences in cocaine response. 23,24 Subjects who answered the question, "Have you ever had a paranoid experience when you were using cocaine?" affirmatively were diagnosed as being affected with CIP. Subjects with CIP must be cocaine-exposed and most meet CD criteria.
In each model, the data were analyzed separately by population group, and the results from the two groups were combined by meta-analysis using the inverse variance method implemented in the computer program METAL. 25

Replication of Top Findings
In Phase 2, SNPs with p <1.0 × 10 −3 in either EAs or AAs, or in the EA-AA meta-analysis, were tested for association in the SAGE dataset using identical statistical models (3,381 SNPs from the Sympcount adj model, 3,323 from the case-control model). Results from Phases 1 and 2 were combined within population groups by meta-analysis. The threshold for evaluating a SNP in Phase 2 was chosen to minimize false negatives, assuming that an equally strong effect observed in the Phase 2 sample would result in a GWS meta-analysis result. Based on the combined Phase 1+2 meta-analysis, we selected 153 SNPs (Sympcount adj , N=34; case-control, N=84; CIP, N=39) for further replication in Phase 3 based on a cutoff of p < 1.0×10 −4 (four SNPs met this criterion for more than one trait).

Pathway Analysis
We used the association results from the discovery + SAGE meta-analysis (i.e. Phase 1 ± Phase 2) for each of the primary traits (except CIP) to conduct a pathway analysis with the Ingenuity Pathway Analysis (IPA) software suite (http://www.ingenuity.com). First, the number of independent SNP association tests for each gene in the genome (only including SNPs within the transcribed portion of each gene) was computed according to the method of Li and Ji. 26 We multiplied the smallest P-value within a gene by the number of independent SNPs in that gene and created a list of genes containing a SNP with gene-based multipletest-corrected significance (P adj <0.05). The genes in the list were evaluated by pathway analysis to identify an overrepresentation of genes within defined canonical pathways based on information culled from multiple sources. A Fisher's exact P-value was computed for each pathway indicating whether, after accounting for the total number of pathways and the number of genes in a given pathway, there were more significantly associated SNPs than would be expected by chance. Separate gene lists were created for the Sympcount adj and case-control results within each population.

RESULTS
We observed GWS association of CD with rs2629540 at the FAM53B ("family with sequence similarity 53, member B") locus in the Sympcount model after removing OD, AD, and ND symptom counts as covariates ( Figure 1). This was supported by evidence in both AAs and EAs. The p-value for all samples combined was 4.28×10 −8 ( Table 2). Under the same model, NCOR2 (nuclear receptor corepressor 1) SNP rs150954431 was associated in the EA discovery sample at the GWS level (p=1.19×10 −9 ), but there were no consistent observations of this association in any other sample. Numerous additional SNPs were associated at the 10 −7 level. We also observed GWS association of CIP with rs2456778, which maps to CDK1 ("cyclin-dependent kinase 1"), in the AA discovery sample (p=4.68×10 −8 ). This association nearly reached nominal significance in the EA discovery sample (p=0.0502) and was slightly improved in the two populations combined (p=4.26×10 −8 ), but was not well supported in the Phase 3 replication sample. Phenotypic data for CIP were not available for SAGE. Additional associations (p<1×10 −6 ) were observed with numerous other SNPs in AAs, EAs or in both groups combined (Table 2).
(Manhattan plot and associated Q/Q plot, Figures 2 and 3; Complete results, Supplementary Table 3).
The pathway analyses identified several pathways significantly associated with CD. The most significant canonical pathway was calcium transport in the AA case-control analysis (p = 0.002) (Supplementary Figure S2). The pathway was identified via associations in two genes encoding Ca 2+ -transporting ATPases, which are important for Ca 2+ homeostasis: ATPase, Ca 2+ -transporting, plasma membrane (ATP2B2) and ATPase, Ca 2+ -transporting, type 2C, member 2 (ATP2C2). The highest ranked networks from both the EA Sympcount adj analysis and the AA case control analysis, and the second highest ranked network from the AA Sympcount adj model, showed associated genes (SNAP25, KCNQ4, KCNN2, and ATP2B2) with direct interactions with CALM1, which encodes calmodulin, a key calcium binding protein (Supplementary Figures S3, S4).

DISCUSSION
This is, to our knowledge, the first GWAS reported for CD. To obtain these findings, we made use of our own SSADDA-assessed GWAS sample, an additional SSADDA-assessed replication sample, and publicly available data from the SAGE project. Our strongest finding statistically, and the only one that meets genomewide significance in the entire sample (i.e., p<5×10 −8 ), is at FAM53B (Figure 2, Table 2). Although both the AA and EA parts of the sample contributed to this association signal, it was stronger in AAs, where the MAF was 0.07, vs. 0.25 in the EAs. FAM53B falls within the 1-lod support interval of the most significant genome-wide linkage peak for CD (lod score 2.7) that we identified previously. 5 As in the present association result, both the EA and AA parts of the sample contributed to the linkage finding. It is relatively uncommon for association and linkage findings to coincide in this way. The previous positional information from linkage increases the probability that the association finding is valid.
FAM53B seems to play a role in regulating cell proliferation, 27 but additional work is needed to determine the relationship of this function to CD risk or whether the gene has additional biological functions. The effect was strongest for the Sympcount measure unadjusted for comorbid dependence, indicating that the gene may influence susceptibility to CD with a co-occurring SD disorder, or SD more generally.
Several other SNPs attained genomewide significance in some analysis phases or in specific subgroups or were nearly GWS. We observed association of the DSM-IV diagnosis of CD with two SNPs near RANP6 (rs4129566 and rs11944332) in the AA discovery sample (4.26×10 −7 and 4.11×10 −7 , respectively) and the Phase 3 EA sample (p=1.92 ×10 −4 and 2.17×10 −4 , respectively). In the EA discovery sample, CD was also associated with rs6677435 (2.18×10 −7 ), located approximately 400 kb from its nearest gene, KCNT2, which encodes a potassium voltage-gated channel that we previously reported to be associated (p=2.1×10 −7 ) with OD in AAs. 7 There was also evidence of association with rs1757939 (p=7.88×10 −7 ) in the AA discovery sample. This SNP is approximately 132kb 5′ of SCLT1, which encodes a protein that links the voltage-gated sodium channel Na(v)1.8 with clathrin.
Although these may be false positive findings, which is more likely when the MAF is <5% (as for rs72840936 and NCOR1), our replication samples were smaller than the discovery sample, so that lack of replication in later study phases could reflect inadequate statistical power.
Similar to FAM53B, the meta-analysis result for rs2005290-a SNP located in a cluster of olfactory receptor genes, between OR3A1 and OR3A2 and about 8 kb from each-is supported by evidence in both populations (p=4.47×10 −7 ). It may be relevant to this finding implicating olfaction that variation in a taste receptor gene was previously associated with alcohol dependence. 28 Further, these genes have a structure similar to that of neurotransmitter and hormone receptors.
The pathway analysis results are noteworthy primarily because they implicate variation in systems regulating calcium signaling. Although there were no individually GWS findings in calcium-system genes, calcium signaling was one of two primary domains implicated in our recent opioid dependence GWAS (the other was potassium signaling). There is prior association evidence of calcium system genes in cocaine dependence (e.g., neuronal calcium sensor 1, NCS1). 29 Our results approaching GWS in the discovery Phase 1 sample obtained with SNPs near KCNT2 and SCLT1 suggest a possible link between CD and potassium signaling. Although the evidence of overlap in risk loci for opioid and cocaine dependence is limited, it is consistent with the high rate of comorbidity of these two disorders.
This study generated numerous remarkable findings, including those at or near GWS. Several design factors may have contributed to the results. First, we studied two distinct populations of reasonable sample size. Second, one of the analytic models that we employed defined cocaine-related effects as an ordinal trait. This approach increased the average phenotypic information for subjects and increased power; similar approaches have been used successfully in previous SD GWAS, e.g., for alcohol dependence. 30 This was especially important in light of our case-control design, which used exposed controls, reducing sample size for those analyses (but excluding from the control group individuals who were unexposed to cocaine and who therefore can reasonably be considered diagnosis-unknown in this context).
Our findings should be viewed in the context of several limitations. In Phases 1 and 2 (but not Phase 3), many associated loci were imputed, albeit with excellent quality (Table 2). Although in absolute terms our sample was reasonably large (over 12,000 subjects considered overall), in the context of complex trait GWAS, it is still modest. This factor may have led to false negative findings at all phases of the study. In addition, 11 of the 27 topranked associations in Phase 1 were observed with infrequent alleles (<5%); as none of these results replicated in subsequent phases, they may be false positives. Finally, our findings are not adjusted for testing association in two populations and with three (albeit highly correlated) traits. However, a Bonferroni correction is too conservative given the high correlation among the traits and distinct hypotheses for EAs and AAs (different populations frequently have different common risk alleles). Future studies in large independent samples are necessary to address these concerns.
In summary, we identified one locus with GWS support for association to CD, and others with more limited support. Although there have been prior GWAS for related traits, such as methamphetamine response 31 and opioid sensitivity, 32 to our knowledge, there are no studies published for stimulant dependence per se. The risk loci we identified did not conform to what might have been regarded as the most likely candidate gene predictions, and therefore will lead to novel directions in research that aims to increase our understanding of the genetics and pathophysiology of cocaine dependence.

Supplementary Material
Refer to Web version on PubMed Central for supplementary material. Regional Manhattan plot for FAM53B, showing the meta-analysis P-value from EAs and AAs in the discovery and SAGE, as well as a single point for the phase 1-3 meta-analysis result (the highest purple dot on the graph). Since the result is driven primarily by AAs, the LD heat map is based on AAs also. Imputed SNPs are shown as circles, and genotyped SNPs as squares.    Table 1 Sample characteristics.    Table 2 Results from each GWAS phase where at least one result generated p<10 −6 . AA=African American; EA=European American; RAF=reference allele frequency; RSQ=imputation quality; genome wide significant results shown in bold, italicized font and underlined; p-values < 1×10 −6 . underlined; SNPs in bold were genotyped in phase 1; Meta=meta-analysis results.