Smoking is a major risk factor for several somatic diseases and is also emerging as a causal factor for neuropsychiatric disorders. Genome-wide association (GWA) and candidate gene studies for smoking behavior and nicotine dependence (ND) have disclosed too few predisposing variants to account for the high estimated heritability. Previous large-scale GWA studies have had very limited phenotypic definitions of relevance to smoking-related behavior, which has likely impeded the discovery of genetic effects. We performed GWA analyses on 1114 adult twins ascertained for ever smoking from the population-based Finnish Twin Cohort study. The availability of 17 smoking-related phenotypes allowed us to comprehensively portray the dimensions of smoking behavior, clustered into the domains of smoking initiation, amount smoked and ND. Our results highlight a locus on 16p12.3, with several single-nucleotide polymorphisms (SNPs) in the vicinity of CLEC19A showing association (P<1 × 10−6) with smoking quantity. Interestingly, CLEC19A is located close to a previously reported attention-deficit hyperactivity disorder (ADHD) linkage locus and an evident link between ADHD and smoking has been established. Intriguing preliminary association (P<1 × 10−5) was detected between DSM-IV (Diagnostic and Statistical Manual of Mental Disorders, 4th edition) ND diagnosis and several SNPs in ERBB4, coding for a Neuregulin receptor, on 2q33. The association between ERBB4 and DSM-IV ND diagnosis was replicated in an independent Australian sample. Recently, a significant increase in ErbB4 and Neuregulin 3 (Nrg3) expression was revealed following chronic nicotine exposure and withdrawal in mice and an association between NRG3 SNPs and smoking cessation success was detected in a clinical trial. ERBB4 has previously been associated with schizophrenia; further, it is located within an established schizophrenia linkage locus and within a linkage locus for a smoker phenotype identified in this sample. 
In conclusion, we disclose novel tentative evidence for the involvement of ERBB4 in ND, suggesting the involvement of the Neuregulin/ErbB signalling pathway in addictions and providing a plausible link between the high co-morbidity of schizophrenia and ND.
Smoking has an established impact on several somatic conditions, such as chronic obstructive pulmonary disease, peripheral arterial disease and various cancers.1 Further, smoking may not merely be a consequence but also a causal factor in the etiology of several common mental disorders, with growing evidence supporting the causal effect of cigarette smoking on risk of depression.2, 3, 4 However, the epidemiology of the association and underlying mechanisms are less understood than the established impact of smoking on somatic conditions.5 Persistent smoking is principally sustained by nicotine dependence (ND), which is a complex phenotype with physiological, pharmacological, social and psychological dimensions.6 ND can be measured in various distinct ways, ranging from interview assessments based on DSM-IV (Diagnostic and Statistical Manual of Mental Disorders, 4th edition)7 for a ND diagnosis to simple questionnaires, such as the Fagerström Test for Nicotine Dependence (FTND).8 Furthermore, the number of cigarettes smoked per day (CPD) has been widely used in genetic association studies, with heavy smoking commonly considered as a proxy for ND.
Although many aspects of the biology of ND are known,6 the underlying genetic architecture is still largely uncharted. ND has a notable heritability (estimates ranging from 40% to 75%),9 yet candidate gene and genome-wide association (GWA) studies have pinpointed only a handful of genes. A robust smoking behavior locus was established in 2008, with three GWA studies reporting association between the CHRNA5-CHRNA3-CHRNB4 nicotinic acetylcholine receptor (nAChR) gene cluster on 15q24-25 and lung cancer risk as well as CPD and ND measured by FTND,10, 11, 12 though <1% of the variance in amount smoked was explained by alleles of these genes.12 The proportion of variance explained increases almost fivefold when a biomarker of nicotine intake is used instead of CPD,13 suggesting that simple self-reported phenotypes measuring smoking behavior may not adequately reflect nicotine intake. Consideration of phenotype quality and precision may be more beneficial than recruitment of increasing numbers of subjects with crude phenotypes.14 By utilizing detailed phenotype profiles, we have detected novel associations between the CHRNA5-CHRNA3-CHRNB4 gene cluster and various measures of ND, such as DSM-IV ND symptoms and the Nicotine Dependence Syndrome Scale (NDSS)15 tolerance subscale.16 The evidence supporting the involvement of nAChRs in the etiology of ND is indisputable and is supported by their central role in mediating the rewarding effects of nicotine.6 However, variants in nAChR genes likely account for a minor fraction of the phenotypic variance; thus, other predisposing genes are bound to exist.
Evidence for predisposing loci outside the 15q24-25 locus has clearly been weaker. In 2007, the first two modestly powered GWA studies suggested several potential genes, but with negligible overlap between the findings.17, 18 In 2010, three meta-analyses assessed GWA studies with data on smoking-related phenotypes; however, all these consortia had limited smoking-related phenotypes (ever/never smoked, age at initiation, amount smoked and cessation).19, 20, 21 Despite a combined sample size of over 140 000 subjects, only a handful of loci achieved genome-wide significance. Various approaches have been utilized for mining the GWA data. A two-stage approach, with a preliminary set of single-nucleotide polymorphisms (SNPs) identified in a discovery set followed by replication in an independent sample, has been commonly employed.18, 22, 23, 24, 25 Alternatively, convergent evidence for the relevance of detected signals has been sought through pathway analyses and visualization of functional networks22, 24 as well as by scrutiny for pleiotropic effects.17 Some studies have clustered nominally significant SNPs located within a confined distance,26 while others have focused on a priori candidate genes.27 Finally, meta-analyses, either genome-wide19, 20, 21, 28 or among selected variants,24, 29 have been used to gain statistical power and to demonstrate the analogous impact of the identified variants across various cohorts and populations.
Here, we utilized a Finnish twin sample (N=1114) ascertained for smoking with exceptionally detailed phenotype profiles and a genetically homogenous background. In our GWA analyses, we included a total of 17 phenotypes, clustered into the domains of smoking initiation, amount smoked and ND, in order to comprehensively portray the dimensions of smoking behavior. We listed all preliminary associating SNPs (P<1 × 10−5) and identified all the genes with at least one such SNP within ±50 kb flanking of the gene. In order to nominate genes likely to be involved in the etiology of smoking behavior, we collected convergent data, that is, supporting evidence for the involvement of the genes by utilizing several sources.
Materials and methods
The sample collection has been previously described in detail.30, 31, 32 Briefly, the study sample was ascertained from the Finnish Twin Cohort study consisting of altogether 35 834 adult twins born in 1938–1957. Based on earlier data, the twin pairs concordant for ever-smoking were identified and recruited along with their family members (mainly siblings) for the Nicotine Addiction Genetics (NAG) Finland study (N=2265), as part of the consortium, including Finland, Australia and USA. Twin pairs concordant for heavy smoking were primarily targeted in order to increase the genetic load. Data collection took place in 2001–2005. The GWA study sample consisted of 1114 individuals (62% males; mean age 55 years), including 914 dizygotic (DZ) twin individuals (both co-twins per twin pair were included), 138 monozygotic (MZ) twin individuals (one co-twin per twin pair was included) and 62 other family members. Ninety-eight percent had smoked 100 cigarettes over their lifetime and the average number of CPD was 19.8 (s.d. 9.6). The study was approved by the Ethics committee of the Hospital District of Helsinki and Uusimaa, Finland and by the IRB of Washington University, St Louis, Missouri, USA. Altogether 207 of the 1114 subjects have been previously used in a chromosome 15q25 meta-analysis29 and altogether 733 subjects were used in a meta-analysis scrutinizing the rs16969968 variant on 15q25.33
For replication of the most interesting signals, we utilized a longitudinal Finnish twin study of adolescents and young adults (FT12, N=869; sample demographics previously described in Knaapila et al.34) and an Australian twin family sample (NAG-OZALC, N=4425; sample demographics previously described in Heath et al.35).
Participants were interviewed using the diagnostic Semi-Structured Assessment for the Genetics of Alcoholism36 protocol, including an additional section on smoking behavior and ND adapted from the Composite International Diagnostic Interview.37 The customized computer-assisted telephone interviews included >100 questions on smoking behavior. All participants provided written informed consent. All phenotypes used in analyses are based on the interview data (except for questionnaire survey for NDSS). The examined binary, continuous and categorical smoking-related phenotypes are divided into three groups: (i) smoking initiation (age at first puff, age at first cigarette, second cigarette, age of onset of weekly smoking, age of onset of daily smoking, first time sensation), (ii) amount smoked (CPD, maximum CPD), and (iii) ND (DSM-IV ND diagnosis, DSM-IV ND symptoms, FTND (4), FTND score, FTND time to first cigarette (TTF), NDSS drive/priority factor, NDSS stereotypy/continuity factor, NDSS tolerance factor, NDSS sum score). Phenotype definitions are presented in Supplementary Table S1, and their inter-correlations are in Supplementary Table S2. For the majority of the traits, modest-to-high heritability estimates have been previously reported (Supplementary Table S3). When calculating MZ and DZ correlations among 116 MZ pairs and 429 DZ pairs identified from the Finnish NAG study sample, MZ correlations were greater than DZ correlations for all of the traits (Supplementary Table S3), providing evidence for the involvement of genetic factors. As our study sample has been ascertained for heavy smoking, the pattern and point estimates of MZ and DZ correlations are likely to be somewhat different from an unselected population sample. Based on an analysis of the phenotype correlation matrix,38 the number of independent traits was 11. We conducted post hoc analyses for those genes highlighted in our study that were previously associated with smoking cessation. 
In these analyses, we included only ever smokers (N=1095, 98.3% of the sample) and coded former smokers (N=549), that is, successful quitters, as ‘affected’, and utilized all SNPs with ±50 kb flanking of the genes.
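The number of independent traits quoted above is derived from the phenotype correlation matrix. As a minimal sketch, assuming the eigenvalue-variance estimator M_eff = 1 + (M − 1)(1 − Var(λ)/M), the calculation could look as follows. Whether this exact variant was applied to the 17 traits is an assumption, and the function name is hypothetical.

```python
def effective_number_of_traits(corr):
    """Estimate the number of independent traits from a correlation matrix.

    Assumed estimator (not confirmed by the text):
        M_eff = 1 + (M - 1) * (1 - Var(lambda) / M)
    where lambda are the eigenvalues of the M x M correlation matrix and
    Var is the sample variance (denominator M - 1).

    For a symmetric matrix with unit diagonal, sum(lambda) = M and
    sum(lambda^2) equals the sum of all squared entries, so Var(lambda)
    can be computed without an explicit eigendecomposition.
    """
    m = len(corr)
    if m == 1:
        return 1.0
    sum_sq = sum(r * r for row in corr for r in row)
    var_lambda = (sum_sq - m) / (m - 1)
    return 1 + (m - 1) * (1 - var_lambda / m)
```

Under this estimator, fully uncorrelated traits give M_eff = M and perfectly correlated traits give M_eff = 1; partially correlated traits fall in between.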
In an attempt to replicate the most interesting findings in the NAG-OZALC sample, we utilized CPD, maximum CPD, age of onset of weekly smoking, TTF, DSM-IV ND diagnosis, FTND (4) and NDSS drive/priority factor. In the FT12 replication sample, we utilized CPD, maximum CPD, FTND (4), TTF, schizotypy (assessed by the Schizotypal Personality Questionnaire-Brief, SPQ-B,39 with three dimensions: cognitive-perceptual, interpersonal, and disorganization),40 DSM-IV attention-deficit hyperactivity disorder (ADHD) symptoms and three cognitive functions previously showing association in a Finnish schizophrenia sample (Wedenoja et al., unpublished data) (verbal attention: ‘Digit span forward’ from Wechsler Memory Scale-Revised, verbal ability: ‘Vocabulary’ from Wechsler Adult Intelligence Scale-Revised, and executive functioning: ‘Trail Making B’ from Trail Making Test).
Genotyping was performed at the Wellcome Trust Sanger Institute (Hinxton, UK) on the Human670-QuadCustom Illumina BeadChip (Illumina, Inc., San Diego, CA, USA), as previously described.16 Imputation was performed by using IMPUTE v2.1.0 41 with the reference panel HapMap rel#24 CEU (NCBI Build 36, dbSNP b126). The posterior probability threshold for ‘best-guess’ imputed genotype was 0.9. Genotypes below the threshold were set to missing. Genotypes for altogether 2 614 137 polymorphic markers were available for analysis.
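The best-guess rule described above can be illustrated with a short sketch. It assumes that each imputed genotype is given as three posterior probabilities (for 0, 1 and 2 copies of the coded allele); the function name and the input layout are hypothetical and do not reflect the IMPUTE output format.

```python
def best_guess_genotype(posteriors, threshold=0.9):
    """Return the allele count (0, 1 or 2) of the most probable genotype,
    or None (missing) when its posterior probability is below the threshold.

    `posteriors` is an assumed layout: (P(0 copies), P(1 copy), P(2 copies)).
    """
    best = max(range(3), key=lambda g: posteriors[g])
    return best if posteriors[best] >= threshold else None
```

For example, posteriors of (0.95, 0.04, 0.01) would be called as genotype 0, whereas (0.50, 0.45, 0.05) would be set to missing.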
For the replication sample sets, genotype data were derived from previously conducted genome-wide genotyping studies with either HapMap or 1000 Genomes (http://www.1000genomes.org/) imputation data available. The FT12 samples were genotyped on the Human670-QuadCustom Illumina BeadChip (Illumina, Inc.) at the Wellcome Trust Sanger Institute (Hinxton, UK). The NAG-OZALC samples were genotyped on Illumina platforms, including the Illumina CNV370-Quadv3 platform (Illumina, Inc.) by the Center for Inherited Disease Research (Baltimore, MD, USA) and by deCODE (Reykjavik, Iceland), the Illumina 317K platform by the University of Helsinki Genome Center (Helsinki, Finland) and the Illumina 610 Quad platform by deCODE.
Statistical analyses summary
Details of the statistical analyses are presented in Supplementary Note. Briefly, the GWA analyses were performed with Plink 1.07 42 (http://pngu.mgh.harvard.edu/purcell/plink/). The QFAM (family-based test of association for quantitative traits) in Plink was used for quantitative and categorical traits. QFAM performs a simple linear regression of phenotype on genotype. Adaptive permutation (up to 1 × 10^9 permutations) was used to correct for family structures. The DFAM (family-based test of association for disease traits) in Plink was used for the analysis of binary traits. DFAM implements the sib-TDT (transmission disequilibrium test) and also allows for unrelated individuals (that is, singletons) to be included. Furthermore, the ‘non-founders’ option was used, as our sample contains no parents.
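The general idea of regressing phenotype on genotype and obtaining an empirical P-value by permutation can be sketched as below. This is a simplified illustration only: it permutes phenotypes freely across individuals and therefore ignores family structure, which the family-aware permutation in Plink is designed to handle. The early-stopping rule (stop after a fixed number of permuted statistics reach the observed one) is an assumed stand-in for adaptive permutation, and all function and parameter names are hypothetical.

```python
import random


def slope(genotypes, phenotypes):
    """Least-squares slope of phenotype on allele count (0/1/2)."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, phenotypes))
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    if sxx == 0:
        raise ValueError("genotypes are monomorphic; slope is undefined")
    return sxy / sxx


def adaptive_permutation_p(genotypes, phenotypes, max_perms=10_000,
                           stop_hits=20, seed=0):
    """Empirical two-sided P-value for the regression slope.

    Phenotypes are shuffled across individuals (family structure ignored).
    Permutation stops early once `stop_hits` permuted slopes are at least
    as extreme as the observed slope, so clearly non-significant SNPs use
    few permutations. `max_perms` must be at least 1.
    """
    rng = random.Random(seed)
    observed = abs(slope(genotypes, phenotypes))
    shuffled = list(phenotypes)
    hits = 0
    done = 0
    for done in range(1, max_perms + 1):
        rng.shuffle(shuffled)
        if abs(slope(genotypes, shuffled)) >= observed:
            hits += 1
            if hits >= stop_hits:
                break
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (done + 1)
```

With this design, the smallest attainable P-value is 1/(max_perms + 1), which is why a very large permutation ceiling is needed to resolve P-values in the 10^-6 to 10^-7 range.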
The linkage disequilibrium (LD) between SNPs was estimated among nonrelated individuals (one per family) in the study sample and HapMap2 release 24 CEU individuals by using Haploview 4.2.43 All genotyped and imputed SNPs within the region were considered when estimating the LD structures. The number of independent SNPs in the top loci was estimated with SNPSpD.38 Gene-based analyses were performed for all the genes with at least one SNP with P<1 × 10−5 within ±50 kb of the gene. For quantitative traits, we utilized VEGAS (Versatile Gene-based Association Study; http://gump.qimr.edu.au/VEGAS/),44 which performs gene-based tests for association using the results from genetic association studies. VEGAS reads in SNP association P-values, annotates SNPs according to their position in genes, produces a gene-based test statistic and then uses simulation to calculate an empirical gene-based P-value. As VEGAS failed to report gene-based P-values for several of the genes, we utilized the set-based test in Plink 1.07 for binary traits. This model takes into account the inter-marker LD and uses permutation to correct for multiple SNPs in the defined sets of independent SNPs. Family structures were ignored as the set-based test only works in the case-control setting.
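The simulation-based gene-level test described above can be illustrated with a simplified sketch. It assumes that the gene-based statistic is the sum of 1-df chi-square values obtained from the SNP P-values, and that the null distribution is simulated from correlated standard normal variables whose correlation matrix is the signed LD (r) matrix. This is not the VEGAS code; the function names, the fixed number of simulations and the requirement of a positive-definite LD matrix are assumptions of the sketch.

```python
import math
import random
from statistics import NormalDist


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def gene_based_p(snp_pvalues, ld_matrix, n_sims=10_000, seed=0):
    """Empirical gene-based P-value from SNP P-values and an LD (r) matrix.

    SNP P-values must lie in (0, 1]; `ld_matrix` must be positive definite
    (perfectly correlated SNPs should be pruned beforehand).
    """
    std_normal = NormalDist()
    # Two-sided P-value -> squared z-score, i.e. a 1-df chi-square value.
    observed = sum(std_normal.inv_cdf(1 - p / 2) ** 2 for p in snp_pvalues)
    lower = cholesky(ld_matrix)
    rng = random.Random(seed)
    n = len(snp_pvalues)
    exceed = 0
    for _ in range(n_sims):
        z = [rng.gauss(0, 1) for _ in range(n)]
        # Multiply by the Cholesky factor to impose the LD correlation.
        correlated = [sum(lower[i][k] * z[k] for k in range(i + 1))
                      for i in range(n)]
        if sum(v * v for v in correlated) >= observed:
            exceed += 1
    return (exceed + 1) / (n_sims + 1)
```

Because the null statistics are simulated with the same correlation structure as the SNPs, a cluster of highly correlated significant SNPs contributes less evidence than the same number of independent SNPs would.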
To estimate effect sizes for the five loci highlighted in the GWA analyses, we conducted linear and logistic regression analyses with the additive model in Stata statistical software release 11.1 (StataCorp).
As our sample size is limited, we did not anticipate genome-wide significant findings but rather decided to use a more liberal P-value threshold as a starting point for the gene discovery process. First we identified SNPs with P<1 × 10−5 (considered as ‘preliminary association’) and then identified all genes with at least one such SNP within ±50 kb flanking of the gene. This was primarily done based on feasibility, as a more stringent threshold (for example, P<1 × 10−6) would have resulted in the inclusion of only a handful of SNPs in the quest for convergent data. On the other hand, a less stringent threshold (for example, P<1 × 10−4) would have resulted in an overwhelming number of signals to be followed up. In order to limit the false-positive discovery rate, we gathered supporting evidence for the involvement of the genes by utilizing (a) gene-based analyses, (b) in silico replication utilizing previously published GWA and linkage loci for smoking-related traits as well as reported associations for other substance use or dependence, as the high rates of co-morbid dependence to different substances suggest shared underlying architecture, (c) pleiotropic signals, that is, association signals emerging also for other studied traits, and (d) relevance of known function. Finally, we focused on signals with P<1 × 10−6 (P-values an order of magnitude lower than those identified as ‘preliminary association’ were considered as ‘approaching genome-wide significance’) and the functionally highly relevant ERBB4 and attempted replication in two independent data sets. Genes with supporting evidence from at least one additional source were nominated as likely to be involved in the etiology of smoking behavior.
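The first step of this discovery process, selecting SNPs below the preliminary threshold and mapping them to genes with a ±50 kb flank, can be sketched as follows. The tuple layouts, the function name and any identifiers or coordinates passed to it are hypothetical; only the threshold and the flank size are taken from the text.

```python
FLANK_BP = 50_000     # +/- 50 kb flanking region around each gene
P_THRESHOLD = 1e-5    # 'preliminary association' threshold


def genes_with_preliminary_snps(snps, genes, flank=FLANK_BP,
                                threshold=P_THRESHOLD):
    """Map each gene to the SNPs with P < threshold lying within the gene
    or its flanking region.

    snps:  iterable of (snp_id, chromosome, position_bp, p_value)
    genes: iterable of (gene_name, chromosome, start_bp, end_bp)
    Returns a dict {gene_name: [snp_id, ...]} containing only genes with
    at least one qualifying SNP.
    """
    hits = {}
    for snp_id, chrom, pos, p_value in snps:
        if p_value >= threshold:
            continue
        for name, gene_chrom, start, end in genes:
            if chrom == gene_chrom and start - flank <= pos <= end + flank:
                hits.setdefault(name, []).append(snp_id)
    return hits
```

A SNP located between two nearby genes can fall within the flank of both, in which case it is counted for each of them; the supporting-evidence steps (a) to (d) then operate on the resulting gene list.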
Genome-wide plots of P-values for all 17 traits are presented in Supplementary Figure S1. Regional plots for the five highlighted loci are presented in Figure 1 and Supplementary Figure S2. We detected a total of 327 SNPs with P<1 × 10−5 (Supplementary Table S4) and 55 genes with at least one such SNP within ±50 kb flanking of the gene (Supplementary Table S5). Altogether four loci (16p12.3, 10p11.21, 15q22.2 and 2q21.2) approached genome-wide significance (P<1 × 10−6) (Table 1).
16p12.3 (CLEC19A) smoking quantity (CPD) locus
Altogether 17 SNPs on 16p12.3 located close to CLEC19A (C-type lectin domain family 19, member A) showed association with CPD (best rs762762, P=1.02 × 10−7) (Table 1). Eighteen additional nearby SNPs showed preliminary association (P<1 × 10−5) with CPD. These 35 SNPs cluster within a 46-kb region, fall into four distinct LD blocks (Figure 1a) and are correlated (r2 range 0.55–1.00), representing an estimated number of 1.6 independent SNPs. Significant effect sizes were obtained for SNPs in each of the blocks (beta range 4.27–5.68), roughly corresponding to an increment of five cigarettes per day for each allele of the locus (Table 1). Gene-based analysis yielded a P-value of 2.60 × 10−7 (Table 2). Altogether 16 out of the 35 SNPs showed preliminary association (P<1 × 10−5) with maximum CPD (Supplementary Table S4). In the NAG-OZALC replication sample, a single SNP showed association with CPD (P=8.38 × 10−4), while all other CLEC19A SNPs yielded P-values in the range of 10−1–10−2 (Supplementary Table S6). In the smaller FT12 replication sample, no association was seen.
10p11.21 (PARD3) NDSS drive/priority locus
An intronic SNP in PARD3 (par-3 partitioning defective 3 homolog (C. elegans)) on 10p11.21 showed an association with NDSS drive/priority factor (rs1946931, P=7.61 × 10−7) (Table 1). Four additional SNPs showed preliminary association (P<1 × 10−5). These five SNPs cluster within an 11-kb region, fall into three distinct LD blocks (Supplementary Figure S2A) and are highly correlated (r2 range 0.93–1.00), representing only one independent signal. Modest effect sizes were obtained for the SNPs (beta range 0.68–0.71), implying that minor allele carriers score higher on the drive/priority factor (Table 1). Gene-based analysis yielded a P-value of 2.18 × 10−4 (Table 2). This finding did not replicate in the NAG-OZALC sample.
15q22.2 FTND TTF locus
An intergenic SNP on 15q22.2 located 9 kb from LACTB (lactamase, beta) and 71 kb from TPM1 (tropomyosin 1) revealed association with TTF (rs2652813, P=2.54 × 10−7) (Table 1). Three additional nearby SNPs showed preliminary association (P<1 × 10−5). These four SNPs cluster within a 9-kb region, fall into a single LD block (Supplementary Figure S2B) and are highly correlated (r2 range 0.97–1.00), representing only one independent signal. Modest effect size was obtained (beta −0.35), with the minor allele decreasing the TTF in the morning (shorter TTF indicates higher ND; Table 1). A gene-based P-value for LACTB was 9.00 × 10−6 (Table 2). This finding did not replicate in the FT12 or NAG-OZALC sample.
2q21.2 age of onset of weekly smoking locus
Three intergenic SNPs on 2q21.2 located between NCKAP5 (NCK-associated protein 5) and MGAT5 (mannosyl (alpha-1,6-)-glycoprotein beta-1,6-N-acetyl-glucosaminyl-transferase) (264–277 kb and 408–422 kb from the genes, respectively) showed association with age of onset of weekly smoking (best rs4954080, P=5.35 × 10−7) (Table 1). Two additional nearby SNPs showed preliminary association (P<1 × 10−5). These five SNPs cluster within a 23-kb region, fall into three distinct LD blocks (Supplementary Figure S2C) and are correlated (r2 range 0.62–1.00), representing two independent signals. Substantial effect sizes were obtained for SNPs in each of the blocks (beta range 0.88–0.93), roughly corresponding to a decrease of nearly a year in the age of onset of weekly smoking for each allele of the locus (Table 1). This finding did not replicate in the NAG-OZALC sample.
2q33 (ERBB4) DSM-IV ND locus
Intriguing preliminary association was detected between DSM-IV ND diagnosis and a total of 17 SNPs in ERBB4 (v-erb-a erythroblastic leukemia viral oncogene homolog 4 (avian)) on 2q33 (eight SNPs located in 3′ flanking, five SNPs in 3′UTR and four SNPs intronic) (best rs7562566, P=1.68 × 10−6) (Table 1). These 17 SNPs cluster within a 53-kb region, fall into a single LD block (Figure 1b) and are highly correlated (r2 range 0.83–1.00), representing an estimated number of 1.5 independent SNPs. Significant effect sizes were obtained for the SNPs (odds ratio=1.42; Table 1). Gene-based analysis yielded a P-value of 9.94 × 10−3 (Table 2). The association between ERBB4 and DSM-IV ND diagnosis was replicated in the NAG-OZALC sample, with several SNPs showing P-values in the range of 10−4 (best rs7589512, P=2.14 × 10−4), some 739 kb from the region highlighted in the study sample (Supplementary Table S6). FTND (4) showed no association in the FT12 replication sample. Due to previously reported ERBB4 associations, we utilized a variety of traits when attempting to replicate the association in the FT12 sample. We detected association between ERBB4 and verbal ability (P-values in the magnitude of 10−4), emerging some 568 kb from the highlighted region (Supplementary Table S6). Schizotypy (SPQ-B) dimensions showed no significant association (Supplementary Table S6).
A total of 55 genes harbored at least one SNP with P<1 × 10−5 (the threshold used as a starting point for the gene discovery process) within ±50 kb flanking of the gene (Supplementary Table S5). After collecting supporting evidence from gene-based analyses, in silico replication, pleiotropic signals across the studied traits, relevance of known function as well as replication in independent data sets, we disclose altogether 33 genes whose involvement in the etiology of smoking behavior is substantiated by at least one additional source of evidence (Table 2). Altogether 11 of the highlighted genes have previously been associated with smoking cessation. In our post hoc analyses, only UNC13C showed P-values of the magnitude of 10−4 for the former smoker phenotype (data not shown).
The identification of the functional variant (rs16969968) in CHRNA512 has provided key insights into the mechanisms of nicotine addiction in men and mice;45, 46 however, we have only begun to comprehend the genetic underpinnings of ND. Patients with psychiatric disorders, especially depression, schizophrenia, and attention-deficit disorders are clearly more frequently nicotine dependent.47 The identification of specific predisposing genes for smoking behavior will likely provide insights into the co-morbidity.
The identification of susceptibility genes for smoking behavior has suffered from small sample sizes and lack of replication, and due to the complexity of the phenotype, inadequate phenotypic definitions likely have substantially contributed to the scarcity of findings. Of the previous GWA studies of smoking behavior or ND (http://www.genome.gov/gwastudies), only four with sample sizes >10 000 achieved associations considered to be genome-wide significant at the standard definition of P<5 × 10−8.48, 49 The remaining studies disclose between a few hundred and several thousands of SNPs with P-values in the 10−6–10−7 range. More signals can be expected as sample sizes increase50, 51 and genetic information content is increased by imputation, haplotype construction52 and sequencing. Scrutinizing a large number of inter-related and carefully characterized traits is another approach to better capture the effects of the variants on the underlying shared architecture. Shared risk loci can be detected in GWA analyses even for diseases with distinct clinical features,51 suggesting that unforeseen shared mechanisms are involved.
Here, we utilized a Finnish twin sample of adults (N=1114) with exceptionally detailed phenotype profiles and a homogenous genetic background. We scrutinized 17 phenotypes in order to comprehensively portray the complex dimensions of smoking behavior, clustered as smoking initiation, amount smoked and ND, while looking for associations in a genome-wide analysis. In contrast to many previous GWA studies focusing on smoking quantity as a proxy for ND, we have included two smoking-quantity phenotypes as well as direct validated measures of ND, which are also correlated with amount smoked. Although a person can be substance dependent even with low consumption levels, in the population overall, dependence is associated with substantially higher levels of consumption, as documented in a recent very large (N>43 000) US survey of substance use, abuse and dependence.53 The paper also demonstrates that of the studied licit and illicit substances, the liability to dependence is greatest for nicotine.53 Although our study is underpowered in a conventional assessment, the sample was highly enriched for smoking by inviting all available heavy smoking concordant pairs (both MZ and DZ) from among the >14 000 twin pairs with smoking information in the cohort.54 Further, our main findings are supported by convergent data from multiple sources. To the best of our knowledge, none of our highlighted loci have yielded significant results in GWA meta-analyses for smoking-related traits.
Compelling association with CPD was detected in the vicinity of CLEC19A on 16p12.3, supported by signals emerging from other traits encompassing smoking quantity (maximum CPD and FTND score) as well as TTF. In line with this, the 16p12.3 locus overlaps with nominally significant linkage loci for maximum CPD and FTND highlighted in a linkage meta-analysis, which included subjects also from the current sample.55 Substantial effect sizes, roughly corresponding to an increment of five cigarettes per day for each allele of the locus, were detected. However, the associating SNPs are relatively rare (minor allele frequency 0.04–0.06), and thus the population-level impact is less prominent than that of the established CHRNA5-CHRNA3-CHRNB4 smoking quantity locus, whose effect sizes correspond merely to an increment of one CPD.12 The function of CLEC19A is unknown, but interestingly, it is located merely 44 kb from an ADHD linkage locus.56 The locus at 16p12.3-12.2 is in close proximity to previously reported ADHD linkage loci.57, 58 ADHD and smoking are associated both in adolescents and adults.59, 60 In the Finnish twin sample of adolescents (FT12), ADHD-related symptoms of inattentiveness, hyperactivity and impulsivity rated by parents and teachers consistently predicted daily smoking at ages 14 and 17.5 years.61 In the FT12 sample, no association was seen between CLEC19A SNPs and DSM-IV ADHD symptoms. However, this sample is not enriched for ADHD, the symptoms were assessed from the adolescents at age 14 years and the distribution of symptoms is skewed. Together, these factors are likely to have reduced the power to detect an association. Further studies are warranted to clarify the role of CLEC19A or nearby genes on 16p12 in the etiology of ND and ADHD.
Association was detected between NDSS drive/priority factor and PARD3, coding for an adapter protein involved in neuronal polarity and axon formation,62 however, with relatively rare SNPs (minor allele frequency 0.02). PARD3 has previously been associated with smoking cessation.63 In line with this, NDSS drive reflects craving, withdrawal and smoking compulsions, while priority reflects preference for smoking over other reinforcers.15 Interestingly, another member of the gene family, PARD3B, located on the 2q33.3 linkage region previously detected in the current sample,31 has been associated with ND defined by the FTND.26
Among the preliminary associations (P<1 × 10−5), the most notable is the association between DSM-IV ND diagnosis and ERBB4, coding for an ErbB4 receptor tyrosine kinase that acts as receptor for Neuregulins, with diverse functions in the development of the central nervous system.64 Convergent data supporting the involvement of ERBB4 in smoking behavior is provided by its location within the 2q33 linkage locus previously identified for a smoker phenotype (‘smoked 100 cigarettes in lifetime’) in the current sample.31 Further, the 2q33 locus overlaps with a linkage locus for maximum CPD highlighted in a linkage meta-analysis.55 No association was detected in the FT12 replication sample with ND defined by the FTND (4). In the study sample, FTND showed non-significant P-values, suggesting that the association signal may emerge from ND dimensions not adequately addressed by FTND. This is in line with previous studies suggesting that DSM-IV ND and FTND extract somewhat different aspects of ND.65, 66 The association between ERBB4 and DSM-IV ND diagnosis was replicated in the Australian NAG-OZALC sample with SNPs located ∼739 kb from the association signal detected in the study sample. It is plausible that both regions harbor rare, functional variants, one specific for Finland and the other found in the mixed European population. Such rare, functional variants specific to Finns exist for behavioral traits.67 ERBB4 spans 1.1 Mb in the genomic sequence, with >1000 SNPs included in the current study; thus, some association signal can be expected to emerge by chance. However, further support comes from the study by Turner et al. (Molecular Psychiatry, in press) showing significant induction of ErbB4 and Neuregulin 3 (Nrg3) during nicotine withdrawal in a mouse model. In addition, Turner et al. report novel association of SNPs in NRG3 with smoking cessation success in a clinical trial. 
This paper together with the current study strongly implicates the Neuregulin/ErbB pathway in the molecular mechanisms underlying ND.
Evidence from genetic,68, 69, 70, 71, 72 transgenic,73 and post-mortem74 studies strongly supports the critical role of NRG1 and its ErbB4 receptor in the pathophysiology of schizophrenia. In healthy individuals, genetic variants in ERBB4 associate with reduced white matter integrity75 and may influence cognitive functioning, as seen for verbal working memory.70 ERBB4 is located within the linkage locus for schizophrenia and visual working memory in a Finnish family sample76, 77 and the 2q33 locus has also been highlighted in a schizophrenia-linkage meta-analysis.72 An association between ERBB4 and schizophrenia symptoms and impairment in executive functioning and verbal ability/attention has been detected in a Finnish schizophrenia sample (Wedenoja et al., unpublished data). Interestingly, we detected association between ERBB4 and verbal ability, although some 89 kb from the region highlighted for verbal ability in the Finnish schizophrenia sample (Wedenoja et al., unpublished data). However, schizotypy, which is a psychological concept encompassing a set of behavioral traits and cognitions thought to represent the subclinical manifestation of schizophrenia in the general population, showed no significant association with ERBB4. The scrutiny of other members within the Neuregulin/ErbB pathway may further uncover shared genetic predisposition for ND and schizophrenia.
Our study sample comes from one of the best-characterized founder populations, the Finns. Unique LD patterns are observed in founder populations;78 thus, the lack of replication for findings other than ERBB4 may, at least partly, be due to genetic heterogeneity between the Finnish and Australian populations. It has been shown that population isolates, especially those founded recently, such as Finland, have longer stretches of LD than outbred populations and may thus achieve better genome-wide coverage with equivalent numbers of markers.78, 79 Furthermore, the substantial age difference between the study sample (mean age 55 years) and the FT12 replication sample (mean age 21.9 years) may partly explain the negative replication results, as many of the included phenotypes may become expressed only after extended exposure to smoking.
Due to the evident differences in genetic background between the CEPH subjects and the Finnish population, imputation based on HapMap data may not be optimal. It has been shown that even a relatively small population-specific reference set yields considerable benefits in SNP imputation and increases the power to detect associations in founder populations and population isolates in particular.80 However, at least for the top loci, the LD blocks in the study sample were very similar to those in the HapMap CEPH data, and the somewhat stronger intermarker LD is in agreement with previous findings from the Finnish population.78
It has been shown that the ability to achieve genome-wide significant P-values depends on sample size, with an almost linear relationship between sample size and the number of detected loci.51 In studies with relatively small sample sizes, such as ours, genome-wide significant P-values are unlikely to emerge. We have focused on collecting detailed phenotypic profiles, which may well turn out to be more beneficial than recruitment of increasing numbers of subjects with crude phenotypes.14 Support for the involvement of a particular locus must therefore be collected from several sources in order to diminish the false-positive discovery rate; the individual P-values merely serve as a starting point for the discovery process. We set a somewhat arbitrary P-value threshold at P<1 × 10−5 and looked for convergent, supportive evidence for all such findings. Genes with supporting evidence from at least one additional source were nominated as likely to be involved in the etiology of smoking behavior.
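As a rough illustration of why genome-wide significant P-values are unlikely to emerge at this sample size, the following Python sketch approximates the power of a 1-df association test for a quantitative trait using a normal approximation. It is not the analysis performed in this study. The conventional genome-wide threshold of 5 × 10−8, the example effect sizes (fraction of trait variance explained by a SNP) and the treatment of the 1114 twins as unrelated individuals are all assumptions made for illustration, and the function name approx_power is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(n, var_explained, alpha):
    """Approximate power of a two-sided 1-df test for a quantitative trait.

    n: number of individuals, treated as unrelated (an assumption; twins are not).
    var_explained: fraction of trait variance explained by the SNP (hypothetical).
    alpha: P-value threshold.
    """
    std_normal = NormalDist()
    # Non-centrality parameter of the 1-df chi-square test statistic.
    ncp = n * var_explained / (1.0 - var_explained)
    # Two-sided critical value on the z scale.
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    shift = sqrt(ncp)
    # P(|Z + shift| > z_crit) for a standard normal Z.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Assumed thresholds: 5e-8 (conventional genome-wide level, not taken from this
# paper) and 1e-5 (the preliminary threshold used in this study).
for var_explained in (0.005, 0.01, 0.02):
    for alpha in (5e-8, 1e-5):
        power = approx_power(1114, var_explained, alpha)
        print(f"variance explained={var_explained:g}, alpha={alpha:g}, power={power:.3f}")
```

Under these assumptions, power at the stricter threshold stays low for SNPs explaining around 1% of trait variance or less, which is consistent with the decision to treat individual P-values only as a starting point and to seek convergent evidence.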
In conclusion, by utilizing a comprehensive set of smoking behavior and ND traits, we detected novel intriguing associations. Some of the detected associations were further supported by replication in independent data sets, pleiotropic signals across the traits, previously reported association or location within previously identified linkage loci. Our results suggest that genetic variation in the 16p12.3 locus harboring CLEC19A may, in part, underlie the co-occurrence of smoking and ADHD. We disclose novel tentative evidence for the involvement of ERBB4 in ND, suggesting the involvement of the Neuregulin/ErbB signalling pathway in addictions and providing a plausible link between the high co-morbidity of schizophrenia and ND.
We warmly thank the participating twin pairs and their family members for their contribution. We would like to express our appreciation to the skilled study interviewers A-M Iivonen, K Karhu, H-M Kuha, U Kulmala-Gråhn, M Mantere, K Saanakorpi, M Saarinen, R Sipilä, L Viljanen and E Voipio. E Hämäläinen and M Sauramo are acknowledged for their skilful technical assistance. Dr E Vuoksimaa and Dr A Latvala are thanked for collaboration in FT12 traits related to cognitive functions and schizotypy. Professor A Palotie is acknowledged for his advice and expertise in whole-genome genotyping. We are ever grateful to the late Academician Leena Peltonen-Palotie for her indispensable contribution throughout the years of the study. This work was supported for data collection by Academy of Finland grants (JK) and an NIH Grant DA12854 (PAFM). Genome-wide genotyping in the Finnish sample was funded by the Global Research Award for Nicotine Dependence/Pfizer Inc. (JK), and the Wellcome Trust Sanger Institute, UK. Genome-wide genotyping in the Australian sample was funded by NIH Grants AA013320, AA013321, AA013326, AA011998 and AA017688. This work was further supported by the Sigrid Juselius Foundation (JK), Doctoral Programs of Public Health (UB), the Yrjö Jahnsson Foundation (UB), the Jenny and Antti Wihuri Foundation (JK), the Juho Vainio Foundation for Post-Doctoral research (UB), the Finnish Cultural Foundation (TK), an NIH Grant DA019951 (MLP) and by the Academy of Finland Center of Excellence in Complex Disease Genetics (Grant numbers: 213506, 129680 to JK).
Supplementary Information accompanies the paper on the Molecular Psychiatry website (http://www.nature.com/mp)