Introduction

Smoking is a well-established primary risk factor for several types of cancer1, cardiac disease2 and many other chronic illnesses3. It is responsible for nearly 8 million premature deaths each year worldwide (including 1.2 million deaths from second-hand smoke)4, and is the cause of substantial loss in productivity and increased healthcare expenditures in the US5. Several studies6,7,8,9,10,11,12 have shown that multiple aspects of smoking behavior are moderately heritable (~ 50%), including smoking cessation (~ 54%)11, and that this relationship may have increased over time13.

The overwhelming majority of studies relating genetic factors to smoking behavior have utilized large-scale epidemiological cross-sectional and cohort samples and have concentrated on behavioral phenotypes that can be readily assessed through questionnaires and single-item surveys, such as nicotine dependence, cigarettes per day, heaviness of smoking index, age of initiation and quitting status (current/former smoker). Multiple studies of this type have shown that non-overlapping SNPs from chr15q25.1, within the CHRNA3-CHRNA5-CHRNB4 (nicotinic acetylcholine receptor) gene cluster, are consistently related to nicotine dependence14,15,16, with rs16969968 (within CHRNA5) having substantial influence, and a second signal tagged by rs680244 (CHRNA3)17. CHRNA3 SNP rs1051730 also shows some of the strongest associations with nicotine dependence (cigarettes per day)18, as do the intronic SNPs rs588765 and rs578776, all of which are highly correlated with rs1696996814,19,20. The rs578776 SNP has demonstrated a protective effect in relation to nicotine dependence (minor allele more frequent in controls than dependent smokers). Joint analyses of rs16969968 and rs3743078 (highly correlated with rs578776), representing the risk and protective haplotypes at the cluster, resulted in a 2.4-fold increase in risk of heavy versus light smoking21. Other SNPs in this region have demonstrated susceptibility and protective haplotypes for nicotine dependence, and the relationship of these loci with heaviness of smoking is supported by a meta-analysis of 34 datasets20. Another meta-analysis, involving 38,602 smokers with European and African origins across 15 studies, re-confirmed the association between smoking and SNPs in this gene cluster but also found that the SNP rs910083 C allele in the DNA methyltransferase 3 beta gene DNMT3B was associated with increased risk of nicotine dependence22.
Other cohort studies have identified genetic markers of: 1) tobacco use and nicotine dependence23,24,25,26,27,28 including several from a cross-ancestry analysis of smokers of European and African descent (rs16969968 at CHRNA5, rs13284520 at DBH, rs151176846 at CHRNA4, rs2714700b between MAGI2 and GNAI1, rs1862416 at TENM2)29 and of European and Asian descent (rs6474414 at CHRNB3 and rs1072003 at CHRNA6); 2) number of self-reported quit attempts, including SNPs rs6298, rs834829 and rs8192729 from HTR1B, NR4A2, and CYP2A6 respectively30; 3) other addictive behaviors, such as alcohol use10,29,31; and 4) psychiatric disorders29,32. Most recently, in a large sample of both current and former smokers (~ 110K) and never smokers (~ 375K), an exome-wide association study (ExWAS) showed that rare predicted loss-of-function and likely deleterious missense variants in CHRNB2 in aggregate were associated with 35% decreased odds (protective) of smoking more than 10 cigarettes per day. An independent common variant of CHRNB2, rs2072659, also showed a protective effect for heavy smoking33.

GWAS of nicotine metabolism and clearance in European ancestry cohorts have shown significant associations between SNPs on chromosome 19 (including CYP2A6, MAP3K10, ADCK4, and CYP2B6) and on chromosome 4 (including TMPRSS11E) and the nicotine metabolism ratio (trans-3’-hydroxycotinine/cotinine or NMR) and between SNPs on chromosomes 9, 4 and 15 (including CHRNB4, CHRNA3, and CHRNA5) and measures of nicotine clearance (cotinine and the sum of cotinine and trans-3’-hydroxycotinine)34,35. GWAS analysis in smokers of African American ancestry identified multiple independent SNPs at the CYP2A6 chromosome 19 locus and two SNPs on chromosome 2 associated with the NMR; most of these SNPs were not previously identified in European ancestry cohorts.

While important for understanding population-level associations between genetics and smoking, the types of studies noted above do not directly assess genetic factors that may drive quitting success during an actual quit attempt, as a smoker makes the transition from smoking to abstinence. The genetic mechanisms underlying the process of smoking cessation and relapse are poorly understood36. National surveys indicate nearly 70% of smokers want to quit smoking37, but despite over 50% of smokers attempting to quit each year, only about 7.5% achieve success annually38. Genetic studies that use a prospective sample of smokers trying to quit may more directly address biological substrates associated with cessation success and help improve treatment outcomes by providing new targets for pharmacotherapy and/or informing efforts at precision medicine that attempt to assign treatments to smokers based on a genetic profile39.

Although fewer in number, prospective studies of smokers attempting to quit have shown some promise in realizing these goals. For example, haplotypes of rs16969968 (CHRNA5) and rs680244 (CHRNA3) have demonstrated an association with abstinence among smokers receiving a placebo vs active pharmacotherapy for smoking cessation40. Moreover, an analysis of eight clinical trials41 found that minor alleles of CHRNA3 rs1051730 and CHRNA5 rs588765 were associated with increased abstinence among smokers receiving nicotine replacement therapy (NRT) but reduced abstinence among those receiving placebo, though these findings were not replicated in later studies42,43. The CHRNB2 SNP rs2072661 has been associated with reduced cessation and the tryptophan 2,3-dioxygenase (TDO2) SNP rs10517626 with enhanced cessation in a trial including NRT, placebo and bupropion, with the most pronounced effect of rs2072661 on the bupropion-treated smokers44. Similarly, in a clinical trial involving varenicline, bupropion or placebo, King and colleagues45 found that CHRNB2 SNPs, most notably rs3811450, and SNPs in the CHRNA5-CHRNA3-CHRNB4 region, e.g., rs7164594, were associated with increased abstinence among varenicline-treated smokers; several SNPs from CYP2B6, including rs8109525, were associated with an enhanced response to bupropion specifically, and to overall cessation among all treated smokers.

In a first-of-its-kind genetically informed treatment trial, Chen and colleagues46 examined the treatment response to combined NRT (patch plus lozenge) vs varenicline among smokers stratified by the CHRNA5 SNP rs16969968 (GG vs. AA/GA genotypes) at treatment onset. Results showed that among African American smokers, compared with placebo, those with the GG genotype quit significantly more often with NRT but not varenicline, while those with the AA/GA genotype quit significantly more often with varenicline and not NRT. No treatment by genotype interactions were observed for European descent smokers. This group also observed that polygenic risk scores for age of smoking initiation (older) and smoking persistence (past failed attempts) were predictive of abstinence across two prospective treatment trials, though treatment-specific interactions were not reported.

In another pioneering trial Lerman and colleagues47 randomly assigned smokers to patch NRT, varenicline or placebo, stratifying by the NMR, and showed that smokers classified as normal metabolizers were more likely to quit using varenicline vs. NRT, while slow metabolizers quit equally often on both medications. NMR is a genetically informed biomarker that encompasses multiple SNPs, particularly within CYP2A6.

While a highly desirable approach for addressing questions related to precision and development of pharmacotherapies, the limitation of studying genetic predictors of smoking cessation in prospective clinical trials is that such studies typically involve much smaller samples than the large-scale epidemiological studies noted above. Moreover, when multiple trials are combined to increase sample size, harmonization of measurements and time points across trials can be problematic. In the current study, we address these potential limitations by combining smoking cessation outcomes from a cohort of 2231 smokers across 8 smoking cessation studies, which shared several common instruments and measurement points and the baseline collection of DNA. To our knowledge, this is the largest prospective sample of its kind. In this paper, we present novel findings relating common and rare variants to cessation success at 6-month post-treatment follow-up, a commonly used standard for measuring long term treatment outcome48. This contrasts with several of the trials reviewed above that focused on abstinence at the end of treatment, typically 12 weeks. The availability of cessation data of all studies at the 6-month time-point and the measurement of abstinence using both self-report and biochemical verification enables us to examine the relationship between key smoking-related genes in an integrated, well-phenotyped, ethnically diverse sample and to evaluate the findings for these traits more systematically than has previously been possible. This study can significantly improve our understanding of the etiology and pathophysiology of this complex phenotype, and aid in prevention and treatment efforts.

Methods

Subjects

The study included smokers who participated in 7 NIH-funded and 1 CPRIT-funded (Cancer Prevention and Research Institute of Texas) studies of smoking cessation awarded to Drs. Paul Cinciripini and David Wetter, conducted at the University of Texas MD Anderson Cancer Center. Details of the design, recruitment, and inclusion/exclusion criteria for each of the studies have been described elsewhere and include: Breakfree49, CARE50, CASSI51, MIND52, PNS53, QuitRx54, STEPS55 and Two2Quit56 (Grant numbers and ClinicalTrials.gov registration numbers, where required, are provided in the acknowledgements). Participants were recruited from the Houston metropolitan area from a wide variety of sources including local print media, flyers, and collaborations with local healthcare institutions. All studies were prospective smoking cessation clinical trials that shared, at a minimum, recruitment of current smokers wanting to quit, exposure to smoking cessation guideline-based treatment57 involving behavioral counseling for smoking cessation and pharmacotherapy, common measures of abstinence and 6-month outcome information. Regardless of treatment type, 6-month post-treatment outcome was the primary outcome variable for this analysis. All participants signed an informed consent form that permitted us to collect buccal swab samples and demographic and phenotypic data shared across studies. The study protocols were approved by the Institutional Review Board of MD Anderson in accordance with the tenets of the Declaration of Helsinki.

Outcome of interest

Abstinence status was measured at the end of six months using the self-reported 7-day point prevalence (no smoking, not even a puff, in the last 7 days) verified by exhaled carbon monoxide (CO) at or below a cutoff of 4 parts per million (ppm). This cutoff has been recommended for verifying smoking abstinence and has been shown to be more accurate than cutoffs of 8 or 10 ppm58,59,60. Based on the affirmative self-report of no smoking in the last 7 days plus CO ≤ 4 ppm, participants were classified as abstainers (i.e., successfully quit smoking). Individuals who reported smoking or had a CO > 4 ppm were classified as nonabstainers.
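The classification rule above can be sketched as a small function. This is only an illustration: the function and argument names are hypothetical, and counting a missing self-report or CO reading as non-abstinent is an assumption made here (it mirrors the missing-equals-smoking convention described under Statistical analyses).

```python
from typing import Optional

CO_CUTOFF_PPM = 4.0  # cutoff stated in the text


def is_abstinent(smoked_last_7_days: Optional[bool],
                 co_ppm: Optional[float]) -> bool:
    """Abstainer = self-report of no smoking in the last 7 days AND CO <= 4 ppm."""
    if smoked_last_7_days is None or co_ppm is None:
        # Assumption for this sketch: missing information counts as smoking.
        return False
    return (not smoked_last_7_days) and co_ppm <= CO_CUTOFF_PPM
```

Requiring both criteria means that a participant who reports abstinence but exhales more than 4 ppm CO is still counted as a nonabstainer.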

Covariates

Demographic information (age, gender, race/ethnicity), type of cessation treatment received, and baseline smoking information (e.g., number of cigarettes smoked per day) were used as covariates in the analyses. The participants were treated using different medications for smoking cessation, including bupropion, nortriptyline, varenicline, NRT, a combination of varenicline and bupropion, and placebo. Because the smoking cessation counseling duration and the pharmacotherapies differed across some of the trials, a study ID and a designator for medication type were included in all analyses as covariates. In addition, covariates for population structure, as described below, were included in the model.

Sequencing and genotyping

The final sample consisted of 2231 participants prospectively enrolled in smoking cessation trials across the 8 studies shown in Table 1, for which both genetic and phenotypic (abstinence) information was available. For each of these individuals, we collected DNA samples using buccal swabs and a 30-second mouth rinse with a standard travel-size bottle of Scope mouthwash (~ 2.5 oz); samples were processed using genomic DNA purification kits from Qiagen. The sample included a total of 10,020 genetic markers that were derived from both sequencing and genotyping procedures as described below.

Table 1 Distributions of population characteristics in the two-phase study (N = 2231).

Sequencing was carried out using the Illumina HiSeq 2000 sequencing system at the Human Genome Sequencing Center, Baylor College of Medicine. We sequenced 55 candidate genes (Supplementary Material Table S1) primarily covering exon regions. Candidate genes were selected based upon a literature survey of markers previously associated with smoking phenotypes (including dependence and cessation), other substance abuse and psychiatric disorders. The short sequence reads were filtered by Illumina CASAVA analysis software (v 13.10.01) and mapped to the reference genome using Burrows-Wheeler Aligner (BWA)61 to create a .bam file. Variants were called by Atlas-SNP2 (v1.4.3 r171, includes Atlas-Indel)62 to create a VCF file and these in turn were annotated using the Cassandra pipeline63. The average coverage of the target bases for the samples was 221x. Standard quality assurance and quality control procedures were conducted to detect problems with initial DNA quality, library construction methods, emulsion and bead quality, instrument chemistry and performance during the run and final sequence metrics after completion of the run. The initial result of the sequencing identified a total of 45,365 variants. Variants that did not have a quality score of “PASS”, had a genotyping rate < 0.95, or failed the Hardy–Weinberg proportion (HWP) test (P < 10⁻⁶) were removed, resulting in a total of 5138 common variants with a minor allele frequency (MAF) ≥ 0.05 and 24,197 rare variants (analyzed separately) with a MAF < 0.01 but ≥ 0.0001.
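The variant-level filters described above can be summarized in a short sketch. The `Variant` record and `classify_variant` function are hypothetical names, not part of the study's pipeline, and labeling variants with MAF between 0.01 and 0.05 (or below 0.0001) as "not_analyzed" is an assumption, since the text does not say how those were handled.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    vid: str
    filter_status: str  # VCF FILTER field, e.g. "PASS"
    call_rate: float    # genotyping rate across samples
    hwp_p: float        # P value of the Hardy-Weinberg proportion test
    maf: float          # minor allele frequency


def classify_variant(v: Variant) -> str:
    """Return 'removed', 'common', 'rare', or 'not_analyzed'."""
    # Quality filters stated in the text: PASS, call rate >= 0.95, HWP P >= 1e-6.
    if v.filter_status != "PASS" or v.call_rate < 0.95 or v.hwp_p < 1e-6:
        return "removed"
    if v.maf >= 0.05:
        return "common"
    if 0.0001 <= v.maf < 0.01:
        return "rare"
    # Assumption: variants outside both MAF windows are set aside.
    return "not_analyzed"
```

In practice, filters of this kind would typically be applied with dedicated tools rather than hand-written code; the sketch only makes the thresholds explicit.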

For the genotyping, we used Illumina’s Infinium iSelect Custom Genotyping Chip that included 6839 tagging SNPs. The tagging SNPs selected for this analysis were derived from the literature identifying genetic markers associated with tobacco use, substance abuse and psychiatric disorders, plus those derived from our sequenced SNPs, and included 169 ancestry informative markers. A set of 75 duplicate samples were genotyped to ensure genotyping quality. Illumina GenomeStudio was used for genotype calling based on the GenTrain clustering algorithm64. Cluster boundaries were determined using samples from the study. SNPs were filtered according to GenCall Score (GC score) > 0.15 and a median score of 0.88 using GenomeStudio (v1.9.4). Furthermore, we removed SNPs with a MAF < 0.05, and those that failed the HWP test (P value < 10⁻⁶). The Genome Reference Consortium Human Build 37 (GRCh37) was used to map the genetic variants.

PLINK65 software (v1.90b3) was used to convert sequencing VCF and GenomeStudio files, and to process basic quality controls. The final analytic sample following quality control and availability of 6-month smoking cessation outcome data (the phenotype of interest here) was 2231, which included 5138 variants from sequencing and 4882 additional markers from genotyping for a total of 10,020 markers used in the analyses (see Supplementary Information 2). The final composition of genetic information for the 2231 participants used in this analysis included 1169 that had both sequencing and genotyping data, and 439 and 623 with sequencing or genotyping alone, respectively. The mix of subjects in the datasets of the two phases was proportional across these subsets.

Population structure

Population structure was assessed using the Structure (v.2.3.4) program66. To confirm the results, analysis using Admixture (v.1.3.0)67,68 was done on the same data sets. Population reference data from the 1000 Genomes Project (Phase 3 release, 2504 individuals)69 were used. The reference data contain 2478 unrelated individuals and over 84 million SNPs from 5 major racial groups from 26 geographic locations: African (7 locations), Latin American (4 locations), European (5 locations), East Asian (5 locations), and South Asian (5 locations)70. We first extracted 5317 markers from the reference data set that had a minor allele frequency (MAF) ≥ 0.05, overlapped with our data, and had a P value ≥ 1 × 10⁻⁶ for the HWP test. A general measurement of informativeness for assignment71 was calculated for each of the 5317 markers using the reference data. A set of 935 SNPs (including 169 ancestry informative markers) with a measurement of informativeness > 0.05 were selected for assessing population structure in our study.
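To illustrate the marker-selection step, the sketch below computes an informativeness-for-assignment statistic for a single biallelic SNP from its allele frequency in each reference population. It assumes the cited measure is the I_n statistic of Rosenberg and colleagues, I_n = Σ_j ( −p_j log p_j + Σ_i (p_ij / K) log p_ij ), where p_ij is the frequency of allele j in population i and p_j is its mean across the K populations; the function name and example frequencies are hypothetical.

```python
import math
from typing import Sequence


def informativeness_for_assignment(ref_allele_freqs: Sequence[float]) -> float:
    """I_n for one biallelic SNP, given one allele's frequency in each of K populations."""
    k = len(ref_allele_freqs)
    total = 0.0
    for allele in (0, 1):
        # Frequencies of this allele in every population.
        freqs = [f if allele == 0 else 1.0 - f for f in ref_allele_freqs]
        p_mean = sum(freqs) / k
        if p_mean > 0:
            total -= p_mean * math.log(p_mean)
        # Terms with p = 0 contribute nothing (p log p -> 0).
        total += sum(p * math.log(p) for p in freqs if p > 0) / k
    return total


# Hypothetical frequencies in five reference groups; keep the SNP if I_n > 0.05.
keep = informativeness_for_assignment([0.10, 0.45, 0.80, 0.30, 0.55]) > 0.05
```

With natural logarithms, I_n is 0 when all populations share the same allele frequencies and reaches log K when each population is fixed for a distinct allele, so the 0.05 cutoff retains markers whose frequencies differ appreciably across groups.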

The number of ancestries (K) was estimated with the use of the admixture model, based on the set of 935 SNPs. We considered a range of 5 to 15 for the number of ancestries K. For each value of K, the Markov chain Monte Carlo (MCMC) process ran for 15,000 iterations, of which the first 5000 iterations were used as burn-in. The likelihoods of the data given different K values (i.e., posterior probabilities) were calculated and the K value that maximized the posterior probability was selected as the number of ancestries in the study population. We thus obtained K = 11 ancestries for our study.

Given 11 ancestries, for each individual, STRUCTURE provided the probability of this individual belonging to each of the ancestry groups (i.e., 11 probabilities per individual). One can assign the individual to one of the ancestries based on the highest probability. In our study, we created the population cluster score for each individual based on his/her ancestries corresponding to the three highest probabilities, which provided a higher resolution to classify individuals into different ancestries. The population structure scores created in this way were included in the statistical analysis as a covariate.

Statistical analyses

Statistical analyses were conducted using PLINK65 (v1.90), SKAT-O72,73, R74, SAS 9.3 (SAS Institute, Cary NC) and KING75 (v1.4) software. We used the genotypic data to identify individuals with discordant gender information, duplicates, and closely related individuals. We identified genetically related individuals by estimating the pairwise kinship coefficients using KING (v1.4) software. For any pair of individuals which were duplicates or related (i.e., with allele sharing of > 80%), we excluded the individual with lower call rate. Deviation from HWP for each genetic variant was assessed by 1 degree-of-freedom χ2 test or Fisher’s exact test where an expected cell count was less than five76.
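As an illustration of the 1 degree-of-freedom χ2 test for HWP mentioned above, the following sketch compares observed genotype counts with those expected from the estimated allele frequency. The function name is hypothetical, and the Fisher's exact alternative used for small expected counts is not implemented here.

```python
import math
from typing import Tuple


def hwp_chi2_test(n_aa: int, n_ab: int, n_bb: int) -> Tuple[float, float]:
    """Return (chi-square statistic, P value) for a 1-df test of HWP."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 df: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

A variant would be flagged under the filter used in this study when the returned P value falls below 10⁻⁶.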

Association analyses for common variants (SNPs; MAF ≥ 0.05) were conducted using multivariable unconditional logistic regression based on a two-sided Wald test implemented in the software PLINK65. We tested each common variant assuming an additive genetic model. Age, gender, study ID, medication type and population cluster were included in the analyses as covariates. For the smoking cessation (abstinence) phenotype, participants with missing information (14% of the sample) were imputed as “smoking”, as is common practice in smoking cessation studies.
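The additive-model Wald test can be illustrated with a stripped-down logistic regression. The study used PLINK with covariates; the sketch below is not that implementation. It fits only an intercept and the allele dosage (0, 1 or 2 copies) by Newton-Raphson, omits all covariates for brevity, and uses hypothetical function names and toy data.

```python
import math
from typing import Sequence, Tuple


def _score_and_information(b0, b1, dosage, y):
    """Gradient and (2x2) information matrix entries of the log-likelihood."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, yi in zip(dosage, y):
        mu = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        w = mu * (1.0 - mu)
        g0 += yi - mu
        g1 += (yi - mu) * x
        h00 += w
        h01 += w * x
        h11 += w * x * x
    return g0, g1, h00, h01, h11


def additive_logistic_wald(dosage: Sequence[int], y: Sequence[int],
                           n_iter: int = 25) -> Tuple[float, float, float]:
    """Return (odds ratio per allele, Wald z, two-sided P) for the dosage term."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0, g1, h00, h01, h11 = _score_and_information(b0, b1, dosage, y)
        det = h00 * h11 - h01 * h01
        # Newton step with the 2x2 inverse written out explicitly.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    _, _, h00, h01, h11 = _score_and_information(b0, b1, dosage, y)
    se_b1 = math.sqrt(h00 / (h00 * h11 - h01 * h01))
    z = b1 / se_b1
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return math.exp(b1), z, p


# Toy data: abstinence rates of 25%, 50% and 75% for 0, 1 and 2 minor alleles.
toy_dosage = [0] * 40 + [1] * 40 + [2] * 20
toy_y = [1] * 10 + [0] * 30 + [1] * 20 + [0] * 20 + [1] * 15 + [0] * 5
odds_ratio, z_stat, p_val = additive_logistic_wald(toy_dosage, toy_y)
```

An odds ratio above 1 for the dosage term corresponds to higher odds of abstinence per additional copy of the counted allele, matching the interpretation of OR used in the Results.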

The study data were randomly divided into phase 1 data (70% of the participants) and phase 2 data (30% of the participants). For the joint analysis with pooled data from both phases, we included a fixed indicator for phase as a covariate to control for possible confounding effects of phase. We used the established genome-wide significance threshold of P < 5 × 10⁻⁸ to declare statistical significance.

For the association analyses of rare variants (MAF ≤ 0.01), we conducted the gene-based analysis using the optimal Sequence Kernel Association Test (SKAT-O)72,73, which uses the collapse method to test the joint effect of multiple rare variants within a gene region on a phenotype. The same covariates (age, sex, study ID, medication type, population cluster, and the indicator for phase [phase 1/phase 2]) were included in the analyses. To account for multiple testing, we used a significance level of 9.1 × 10⁻⁴ (i.e., 0.05/55) for the gene-based rare-variant association analysis.

Ingenuity pathway analysis

Ingenuity Pathway Analysis (IPA; Ingenuity® Systems, www.ingenuity.com)77 is a software program employed to connect molecules based on the scientific data in the Ingenuity Knowledge Base, including information on biological interactions and functional annotations78. In this study, we used IPA to further explore the biological mechanisms of the genes that harbor the genetic variants identified to be significantly associated with the abstinence phenotype in the association analysis. These genes of interest are denoted as focus genes in IPA. The IPA core analysis function was employed to determine biological functions, search for signaling and metabolic canonical pathways, and generate relevant molecular networks on the basis of the focus genes79,80. IPA creates the biological functions and canonical pathways from the literature, independent of focus genes. Specifically, IPA core analysis compares the focus genes with all built-in canonical pathways and biological functions in the IPA database and identifies the canonical pathways/biological functions that include genes overlapping with the focus genes. The molecular network is constructed based on the focus genes and the connections in which they function, based on the main assumption that the biological function involves locally dense interactions. The details regarding the network generation algorithm have been described (summarized in the Supplementary Materials)81,82. Importantly, when generating a network, the iterative algorithm attempts to connect additional non-focus genes from its entire database to any of the genes that are already involved in the gene network (focus or non-focus genes) if such genes are more likely to have connections (i.e., biological relationships) with the network.
As these non-focus genes are from a background consisting of all genes in the database, the resulting relevant networks may potentially identify additional genes that interact with the focus genes associated with abstinence. These additional genes emerge as potential candidate genes of interest for future investigations of abstinence. The resulting network also presents a bigger view of the genes likely to be interacting and directly or indirectly associated with abstinence. To evaluate the resulting functions, pathways, or networks, P values are calculated using a right-tailed Fisher's exact test, which measures the likelihood that the association between the set of focus genes and a given function/pathway/network is due to random chance81,82.
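The right-tailed Fisher's exact test used for such over-representation analyses is equivalent to an upper-tail hypergeometric probability, which can be sketched as follows. The function name and the counts in the example are hypothetical; the size of IPA's background gene set is not given in the text.

```python
from math import comb


def right_tailed_fisher(k: int, n_focus: int, n_pathway: int,
                        n_background: int) -> float:
    """P(overlap >= k) when n_focus genes are drawn at random from n_background
    genes, of which n_pathway belong to the pathway (hypergeometric upper tail)."""
    upper = min(n_focus, n_pathway)
    numerator = sum(
        comb(n_pathway, i) * comb(n_background - n_pathway, n_focus - i)
        for i in range(k, upper + 1)
    )
    return numerator / comb(n_background, n_focus)


# Hypothetical example: 2 of 10 focus genes fall in a 50-gene pathway,
# against an assumed background of 20,000 genes.
p_overlap = right_tailed_fisher(2, 10, 50, 20000)
```

A small P value indicates that the observed overlap between focus genes and a pathway is larger than expected if the focus genes were a random draw from the background.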

Results

Characteristics of study populations

The study included 1571 participants (275 abstainers and 1296 non-abstainers) in the phase 1 data; and 660 participants (120 abstainers and 540 non-abstainers) in the phase 2 data (Table 1). In the phase 1 dataset, the distributions of age and cigarettes smoked per day were very similar in the abstainers and non-abstainers: mean age 45.7 (standard deviation [SD] = 11.4) for the abstainers and 43.3 (SD = 10.9) for the non-abstainers; mean cigarettes smoked per day, 19.0 (SD = 9.1) for the abstainers and 20.8 (SD = 10.2) for the non-abstainers. Approximately half of the participants were male (56% for abstainers and 51.8% for non-abstainers). There were more White (both Hispanic and non-Hispanic) participants who were abstainers (70.9%) than non-abstainers (53.5%), while more Black (Hispanic and non-Hispanic) participants were non-abstainers (41.2%) than abstainers (24.8%). More participants were employed in the abstainer group (79.3%) compared with the non-abstainer group (63.6%). More participants had a high school/GED or less education in the non-abstainers (37%); while more participants had a bachelor’s degree or some post-graduate work or above in the abstainers (29.1%). The majority of the participants were treated using NRT for smoking cessation (57.1% for abstainers and 72.7% for non-abstainers).

The population characteristics in the phase 2 dataset were very similar to those in the phase 1 dataset. The abstainer group had similar distributions of age (mean 44.4 [SD = 10.2]), cigarettes smoked per day (mean 19.9 [SD = 10.1]), and sex (53.3% male), compared with the non-abstainer group (mean age 43.3 [SD = 10.9], mean cigarettes per day 20.7 [SD = 10.2], 53.1% male; Table 1). Similarly, in the abstainers, more participants were Hispanic and non-Hispanic White (65%), employed (77.1%) and had a bachelor’s degree or some post-graduate work or above (30.3%). The majority of the participants were treated with NRT for smoking cessation (62.5% for abstainers and 72.2% for non-abstainers).

Analyses of common variants

We found 14 genetic variants associated with the abstinence phenotype that met the genome-wide statistical significance threshold (that is, P < 5 × 10⁻⁸; Table 2). A Manhattan plot for the joint analysis using data merged from both phases is shown in Supplementary Fig. S1.

Table 2 Summary for genetic variants associated with abstinence phenotype in the combined analysis.

Based on the P values from the meta-analysis of data from both phases, the variant rs1175607105 was the strongest statistically significant signal protective for smoking cessation behavior (i.e., OR > 1; more likely to quit smoking) (OR = 2.34, 95% CI: 1.83–2.98; P = 9.06 × 10⁻¹²). The variant rs1175607105 localizes to 19q13.2 (41,520,210 bp; Fig. 1A and Table 2) and maps to the gene CYP2B6 (cytochrome P450 family 2 subfamily B member 6).
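The reported odds ratio, confidence interval and P value are linked through the Wald statistic. As a rough consistency check, the sketch below back-calculates the log odds ratio, its standard error and a two-sided P value from the OR and 95% CI quoted above for rs1175607105. It assumes a symmetric normal-theory interval on the log scale (critical value 1.96); the function name is hypothetical.

```python
import math


def wald_from_or_ci(odds_ratio: float, ci_low: float, ci_high: float):
    """Recover (beta, SE, z, two-sided P) from an OR and its 95% CI."""
    beta = math.log(odds_ratio)
    # A 95% CI on the log scale spans beta +/- 1.96 * SE.
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * 1.96)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, z, p


beta, se, z, p = wald_from_or_ci(2.34, 1.83, 2.98)
```

Because the published OR and interval are rounded to two decimals, the recovered P value should be close to, but not identical to, the reported one.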

Figure 1

The genetic regions harboring the significant signals associated with smoking abstinence phenotype. y-axis: -log10(p-values) based on logistic regression. x-axis: base pair positions based on NCBI human annotation release 105. Grey dot: SNPs analyzed in the studies. Red dot: significant SNPs in the combined analysis.

There were three additional genetic variants identified to be protective for smoking cessation behavior, including rs1413172952 (OR = 1.98, 95% CI: 1.57–2.49; P = 6.1 × 10⁻⁹), rs1204720503 (OR = 2.17, 95% CI: 1.65–2.85; P = 2.72 × 10⁻⁸) and rs80210037 (OR = 1.91, 95% CI: 1.52–2.40; P = 3.49 × 10⁻⁸). The variants rs1413172952 (113,792,339 bp; Fig. 1B and Table 2) and rs1204720503 (113,781,550 bp) localize to 11q23.2 and map to the gene HTR3B (5-hydroxytryptamine receptor 3B).

In addition, ten risk genetic variants were identified as being significantly associated with the abstinence phenotype (i.e., OR < 1; less likely to quit smoking), including rs2173763 (OR = 0.64, 95% CI: 0.55–0.74; P = 5.88 × 10⁻¹⁰), rs6749438 (OR = 0.55, 95% CI: 0.45–0.67; P = 9.91 × 10⁻¹⁰), rs6718083 (OR = 0.58, 95% CI: 0.49–0.69; P = 1.82 × 10⁻⁹), rs7349 (OR = 0.69, 95% CI: 0.61–0.78; P = 2.28 × 10⁻⁹), rs6869603 (OR = 0.67, 95% CI: 0.58–0.76; P = 2.35 × 10⁻⁹), rs363222 (OR = 0.67, 95% CI: 0.59–0.77; P = 3.73 × 10⁻⁹), rs1288980 (OR = 0.65, 95% CI: 0.57–0.76; P = 6.71 × 10⁻⁹), rs992528 (OR = 0.70, 95% CI: 0.61–0.79; P = 3.22 × 10⁻⁸), rs11064432 (OR = 0.71, 95% CI: 0.63–0.80; P = 4.14 × 10⁻⁸) and rs1333758 (OR = 0.61, 95% CI: 0.51–0.73; P = 4.4 × 10⁻⁸).

Among the ten significant signals, two variants (rs2173763 and rs1288980) were located on chromosome 3. In particular, the variant rs2173763 localizes to 3q21.1 (122,329,160 bp; Fig. 1C and Table 2) and maps to the intron of the gene PARP15 (poly(ADP-ribose) polymerase family member 15). The variant rs1288980 localizes to 3p13 (71,105,863 bp; Fig. 1D) and maps to the gene FOXP1 (forkhead box P1).

Three variants (rs7349, rs363222, rs992528) were located on chromosome 10. The variant rs7349 localizes to 10p11.22 (31,817,905 bp; Fig. 1E) and maps to the 3’ untranslated region of the gene ZEB1 (zinc finger E-box binding homeobox 1), which has been associated with lung cancer83,84,85. The variant rs363222 localizes to 10q25.3 (119,019,448 bp; Fig. 1F) and maps to the gene SLC18A2 (solute carrier family 18 member A2).

Two variants (rs6749438 and rs6718083) were located on chromosome 2 and are close to each other. The variant rs6749438 localizes to 2p23.3 (25,190,127 bp; Fig. 1G) and maps to gene DNAJC27 (DnaJ heat shock protein family (Hsp40) member C27). The variant rs6718083 localizes to 2p23.3 (25,362,194 bp; Fig. 1G) and maps to the gene EFR3B (EFR3 homolog B). The variant rs11064432 localizes to 12p13.31 (6,968,741 bp; Fig. 1H) and maps to the intron of the gene USP5 (ubiquitin specific peptidase 5). The variant rs1333758 localizes to 13q32.3-q33.1 (101,897,883 bp; Fig. 1I) and maps to the gene NALCN, a gene that belongs to a family of voltage-gated sodium and calcium channels expressed throughout the nervous system86.

The linkage disequilibrium (LD) was assessed for the significant genetic variants (Table 2) in close proximity. No strong LD was observed between pairs of significant variants in proximity: rs6749438 and rs6718083 (\({r}^{2}=0.07\)), rs363222 and rs992528 (\({r}^{2}=0.104\)), and rs1413172952 and rs1204720503 (\({r}^{2}=0.015\)).
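For readers unfamiliar with the r² statistic, the sketch below computes it as the squared Pearson correlation between allele dosages (0, 1 or 2 copies) at two variants, a common approximation for unphased genotype data. The text does not state which estimator was used for the study variants, so this is an illustrative assumption, and the function name is hypothetical.

```python
from typing import Sequence


def ld_r2(dosage_a: Sequence[float], dosage_b: Sequence[float]) -> float:
    """Squared Pearson correlation between allele dosages at two variants."""
    if len(dosage_a) != len(dosage_b) or len(dosage_a) == 0:
        raise ValueError("dosage vectors must be non-empty and of equal length")
    n = len(dosage_a)
    mean_a = sum(dosage_a) / n
    mean_b = sum(dosage_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(dosage_a, dosage_b))
    var_a = sum((a - mean_a) ** 2 for a in dosage_a)
    var_b = sum((b - mean_b) ** 2 for b in dosage_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("r2 is undefined for a monomorphic variant")
    return cov * cov / (var_a * var_b)
```

Values near 0, such as those reported above, indicate that the paired variants carry largely independent information, whereas values near 1 would suggest they tag the same underlying signal.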

We further investigated the LD between the 14 significant genetic variants (Table 2) and the variants associated with smoking cessation from the literature. Specifically, we extracted SNPs associated with smoking cessation from the EBI/NHGRI GWAS Catalog87 as of January 30, 2024, selecting entries referencing "smoking cessation" as the disease/trait. We omitted SNPs from studies that compared current smokers to former smokers because such comparisons do not align with the methodology of our current study. Consequently, we identified three SNPs associated with smoking cessation89,90. We employed LDlink, a comprehensive web-based platform, to investigate the LD92. LDlink utilizes the genomic data from the 1000 Genomes Project, offering a rich repository of human genetic variation across diverse populations. The resulting \({r}^{2}\) values are reported in Supplementary Material Table S2. No strong LD was noted, with \({r}^{2}\) values ranging from < 0.001 to 0.027.

Ingenuity pathway analysis (IPA)

From the analysis of common variants, we identified 10 genes that harbor the genetic variants significantly associated with smoking cessation (Table 2). The 10 identified genes were employed as the focus genes in the IPA core analysis. Significant canonical pathways and biological functions were identified based on the focus genes. As described in the Methods section, the core analysis provides a measure of the association of focus genes of interest with the built-in canonical pathways and biological functions. In particular, the most significant canonical pathway identified was the Serotonin Receptor signaling pathway (P = 1.78 × 10⁻⁴), which is relevant in the etiology of neuropsychiatric and mood disorders88. The focus genes were also shown to be potentially related to Bupropion Degradation and Nicotine Degradation pathways (P = 1.15 × 10⁻² and 2.33 × 10⁻², respectively), which are, in general, related to smoking cessation. The significant P values imply over-representation of focus genes in these pathways, and that the association between focus genes and pathways is non-random.

Furthermore, the IPA core analysis generated a network showing the additional molecules that directly or indirectly relate to or interact with the genes identified through the association analyses (Supplementary Figure S2). The molecules with the most interconnections are of interest, since highly connected molecules are considered most likely to be associated with diseases or biological functions79,81. Fourteen molecules with 15 or more interactions, as indicated by the number of edges connecting them to other molecules in the network, were identified and are highlighted in the figure. The most highly connected molecules include several related to cancer etiology (see Discussion), such as TP53 (tumor protein p53)91, RB1 (RB transcriptional corepressor)93, CDKN2A (cyclin dependent kinase inhibitor 2A)94 and EGFR (epidermal growth factor receptor)95, plus RELA, a subunit of the heterodimeric transcription factor NF-kappa-B, which has been related to substance abuse96.
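
The selection of highly connected molecules amounts to counting edges per node and applying the cutoff of 15. A minimal sketch follows, assuming the network is available as an undirected edge list; the function name and the input format are hypothetical and do not reflect an IPA export format.

```python
from collections import Counter

def highly_connected(edges, min_degree=15):
    """Return {molecule: degree} for molecules with at least `min_degree` edges.

    `edges` is an iterable of (molecule_a, molecule_b) pairs; each pair adds
    one to the degree of both endpoints, so duplicate pairs are counted twice.
    """
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return {node: d for node, d in degree.items() if d >= min_degree}
```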

Analyses of rare variants

Based on the gene-based association analyses for rare variants (MAF ≤ 0.01), we observed marginal associations of ADCY5 (\(P = 1.16 \times {10}^{-2}\)) and SLC6A2 (\(P = 1.72 \times {10}^{-2}\)) with the smoking abstinence phenotype using data combined from both phases. ADCY5 localizes to 3q21.1 and has been associated with low birth weight and type 2 diabetes97,98. SLC6A2 localizes to 16q12.2 and is associated with norepinephrine transport and with bipolar disorder, depression and ADHD99,100,101. Note that these signals were not statistically significant after adjusting for multiple comparisons (a significance threshold of \(P \le 9.1 \times {10}^{-4}\), corresponding to 0.05/55 candidate genes).
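
The multiple-comparison threshold can be checked with a short calculation. The sketch below applies a Bonferroni correction to the two gene-level P values reported above; the function and variable names are illustrative only.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests

# 55 candidate genes tested at a family-wise alpha of 0.05
threshold = bonferroni_threshold(0.05, 55)

# Gene-level P values as reported in the text
gene_p_values = {"ADCY5": 1.16e-2, "SLC6A2": 1.72e-2}

# True only for genes that pass the corrected threshold
significant = {gene: p <= threshold for gene, p in gene_p_values.items()}
```

Both P values exceed the corrected threshold by more than an order of magnitude, which is why neither gene is reported as significant.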

Discussion

This study examined genetic predictors of long-term (6-month) treatment success for smoking cessation in a prospective sample of 2331 smokers undergoing standard smoking cessation therapy, including behavioral counseling and pharmacotherapy. Genotyping involved sequencing of 55 candidate genes previously associated with smoking behavior, other substance abuse and psychiatric disorders, and the 6606 tagging SNPs and 233 AIMs, yielding 10,020 common and 24,147 rare variants of sufficient quality for analysis. We conducted the analysis in two phases using 70% and 30% of the sample, respectively, and present the combined results for all markers exceeding GWAS-defined significance levels, associating common and rare variants with the cessation phenotype while controlling for multiple factors including genomic ancestry, study-related factors, and demographics.

Our analysis revealed 14 novel markers not previously associated with the smoking cessation phenotype defined in this manuscript (\(P < 5 \times {10}^{-8}\)). When mapping these SNPs to specific genes and regions, two major themes emerged. The first theme highlights shared genetic substrates between abstinence from smoking and selected psychiatric and substance use disorders among 6 of these markers, four of which were protective (OR > 1 favoring smoking cessation). Among them, the variant rs1175607105 produced the strongest signal and maps to the gene CYP2B6, which encodes the primary enzyme responsible for metabolism of the smoking cessation and antidepressant drug bupropion102,103 and has also been implicated in nicotine metabolism34,35,104,105. While a modest inhibitor of norepinephrine and dopamine reuptake, which may account for its antidepressant effects, bupropion also acts as an antagonist of several nicotinic cholinergic receptor subtypes106. The other 3 protective variants include rs1413172952 and rs1204720503, which map to the HTR3B (5-hydroxytryptamine-serotonin receptor 3B) gene, and variant rs80210037 on chromosome 15. The HTR3B serotonergic receptor gene has been implicated in longer time to relapse following treatment in a combined analysis of bupropion, varenicline and placebo treated smokers45 and in nicotine dependence in a mixed ancestry sample of African and European descent107. This suggests that loci on this gene may be predictive of smoking cessation treatment outcome regardless of the type of pharmacotherapy given, and of dependence on nicotine in a mixed ancestry sample. Interestingly, other polymorphisms on this gene have been related to a protective effect for obsessive compulsive disorder108 and major depression109, which, like other psychiatric disorders, have been associated with increased prevalence of smoking110 and a shared causal genetic basis111,112.

Two other novel variants, associated with reduced likelihood of quitting, also mapped to genes with previously noted markers for psychiatric and substance use disorders: rs2173763 maps to PARP15, on which several locations have been associated with a broad mood disorder phenotype (major depression and bipolar disorder)113; and rs363222 maps to SLC18A2, which is associated with monoamine neurotransmitter transport (dopamine, norepinephrine, serotonin). Varenicline is one of our most effective smoking cessation medications114 and acts as a dopamine partial agonist115. Several other loci on SLC18A2 have been related to alcohol116, opioid117 and nicotine dependence118,119, and PTSD120. Moreover, while our analyses of rare variants did not yield significant associations that survived correction for multiple comparisons, a strong signal was present for SLC6A2 (norepinephrine transporter), which has been implicated in mood disorders and ADHD99,100,101, both of which are more prevalent among smokers110,121.

Consistent with the relationship between smoking cessation and psychiatric disorders described above for individual markers mapped to specific genes, our IPA of the 10 significant common variants that mapped to specific genes (see Table 2) showed significant canonical pathways for the serotonin receptor signaling pathway and for nicotine and bupropion metabolism. Serotonin reuptake inhibitors are used routinely in the treatment of depression122, while, as noted previously, differences in nicotine metabolism47 have been associated with a differential response to NRT, varenicline47, and bupropion123. Interestingly, the drug venlafaxine, a norepinephrine and serotonin reuptake inhibitor, has been associated with increased smoking cessation when combined with NRT124, which is commensurate with the findings noted above for SLC18A2 and SLC6A2. Our IPA network analysis of related molecules also revealed a relationship between these genes and RELA, a transcription factor involved in NFkB heterodimer formation, nuclear translocation and activation, and previously implicated in drug addiction96.

The second theme among our results points to genetic regions associated with both smoking cessation and cancer pathophysiology (note that smokers in this sample did not have a current cancer diagnosis, though past history is unknown). For example, we found significant associations between abstinence and the variant rs1288980, which maps to the gene FOXP1, regions of which have been reported to act as a tumor suppressor125,126. The variant rs7349 on chromosome 10 maps to the gene ZEB1, which has been associated with invasiveness, metastasis and poor prognosis of lung cancer83,84,85. While previous studies have found associations between lung cancer127,128,129,130 and COPD131,132 and the CHRNA5 SNP rs16969968, noted for its relationship to nicotine dependence14, our findings suggest that additional markers associated with poor cessation outcome may also be related to lung cancer pathophysiology.

These findings were extended by our IPA network analysis of molecules related to the genes identified above in our association analyses. The results further highlight the connection between genetic predictors of smoking cessation and cancer, most likely attributable to tobacco smoke exposure. Relations with several tumor suppressors were noted, including: TP5391, which is associated with tobacco-related mutations133,134 and several cancer types including breast, leukemia, cervical135,136,137 and lung133,138,139; RB1, which is related to several cancers including childhood retinoblastoma, osteogenic sarcoma, bladder93 and lung, specifically with regard to smoking behavior140,141; and CDKN2A, which has been associated with a wide variety of cancers94, including those that are tobacco-related, such as head and neck squamous cell carcinoma, oral and lung cancer142,143,144,145. Numerous interactions were also noted for MYC, a proto-oncogene146, and EGFR. Downregulation of c-Myc is associated with the invasion/migration capacity of bronchial epithelial cells exposed to cigarette smoke extract147. EGFR95 mutations act as an oncogenic driver of lung cancer in non-smokers and light smokers148,149,150.

Other findings include two variants on chromosome 2, rs6749438 and rs6718083, which map to the genes DNAJC27 and EFR3B, respectively. While there is no specific information for rs6749438, that region is flanked by EFR3B and ADCY3. Multiple studies have linked SNPs in this area to regulation of body weight, obesity and BMI151,152. Interactions between body weight and smoking have been reported for rs16969968-rs1051730 in the CHRNA5-A3-B4 cluster, which are associated with reduced body weight in smokers but increased body weight in nonsmokers153. Of the remaining variants, no relevant information is available for rs11064432, which maps to the intron of the gene USP, or for rs1333758, which maps to NALCN.

Finally, we did not find associations with rare variants that survived correction for multiple comparisons, although one of the two strongest signals was noted for SLC6A2, which is associated with norepinephrine transport. This finding is consistent with our other observations associating smoking cessation with regulation of monoamine neurotransmitters, especially serotonergic ones, as noted above.

Conclusions

In this study of over 2000 smokers from a multi-ancestry sample who were attempting to quit smoking, we found 14 novel markers not previously associated with smoking cessation. When mapped to specific genes and regions, shared genetic substrates between abstinence from smoking and selected psychiatric and substance use disorders were noted among 6 of these markers, four of which were protective. Strong signals were observed for CYP2B6, HTR3B, PARP15, SLC18A2 and SLC6A2. Loci within the HTR3B gene may be of particular interest, as they may be predictive of smoking cessation regardless of the type of pharmacotherapy administered. Our network analysis also showed significant canonical pathways for the serotonin receptor signaling pathway and for nicotine and bupropion metabolism. We also found several markers of smoking cessation among genes previously implicated in the development of cancer. These included FOXP1 and ZEB1 and, through our network analysis, TP53, RB1, CDKN2A, MYC and EGFR. Two novel markers (rs6749438 and rs6718083) on chromosome 2 are flanked by genes associated with regulation of body weight. Overall, our results identified several novel genetic markers of smoking cessation, both protective and at-risk, both individually and in combination. Larger studies are needed to identify future targets for smoking cessation pharmacotherapy and personalized treatment based on genetic profiles.