Cigarette smoking behaviors and the importance of ethnicity and genetic ancestry

Cigarette smoking contributes to numerous diseases and is one of the leading causes of death in the United States. Smoking behaviors vary widely across race/ethnicity, but it is not clear why. Here, we examine the contribution of genetic ancestry to variation in two smoking-related traits in 43,485 individuals from four race/ethnicity groups (non-Hispanic white, Hispanic/Latino, East Asian, and African American) from a single U.S. healthcare plan. Smoking prevalence was the lowest among East Asians (22.7%) and the highest among non-Hispanic whites (38.5%). We observed significant associations between genetic ancestry and smoking-related traits. Within East Asians, we observed higher smoking prevalence with greater European (versus Asian) ancestry (P = 9.95 × 10−12). Within Hispanic/Latinos, a higher number of cigarettes per day (CPD) was associated with greater European ancestry (P = 3.34 × 10−25). Within non-Hispanic whites, the lowest number of CPD was observed for individuals of southeastern European ancestry (P = 9.06 × 10−5). These associations remained after considering known smoking-associated loci, education, socioeconomic factors, and marital status. Our findings support the role of genetic ancestry and socioeconomic factors in cigarette smoking behaviors in non-Hispanic whites, Hispanic/Latinos, and East Asians.


Introduction
Cigarette smoking contributes to numerous common diseases, including cancers, chronic obstructive pulmonary disease, and cardiovascular diseases, and it is one of the leading causes of death in the United States 1-6 . Despite the substantial decrease in cigarette smoking prevalence over the last half century, ~40 million people are still smokers in the United States, and disparities among smokers remain 7,8 . Higher smoking prevalence has been observed in populations that are socially and economically disadvantaged 7,9 . Further, among smokers, socioeconomic status is a major determinant of the degree of nicotine dependence 10 , which can be approximated by the number of cigarettes smoked per day (CPD) 11 .
In the United States, smoking behaviors vary widely across race/ethnicity, with individuals of Asian and Hispanic/Latino ancestry having lower smoking prevalence than individuals of other ancestry 7,8 . The reasons for these disparities may include variation in genetic ancestry, which has the potential to explain variation in smoking behaviors between Asian and Hispanic/Latino ancestry populations and other populations. However, to date, no study has investigated the relationship between genetic ancestry and smoking behavior-related traits.
Twin and family studies suggest that genetic factors account for approximately half of the variance in smoking initiation and smoking quantity, and heritable variation in cigarette use seems comparable across ethnic groups 12-14 . Recently, the GWAS and Sequencing Consortium of Alcohol and Nicotine Use (GSCAN) study 15 , conducted in European ancestry individuals, reported 467 genetic variants associated with cigarette smoking-related traits, including age at smoking initiation, smoking initiation, smoking cessation, and CPD.
Here, we hypothesize that genetic ancestry may explain some of the wide variability in cigarette smoking behaviors across ethnic groups. To test this hypothesis, we conduct genetic ancestry analyses of cigarette smoking behaviors within each of the four ethnic groups (non-Hispanic whites, Hispanic/Latinos, East Asians, and African Americans) from the Genetic Epidemiology Research on Adult Health and Aging (GERA) cohort 16 . Two smoking-related traits were used: smoking initiation (15,862 'ever' smokers vs. 27,623 'never' smokers) and CPD for all smokers (i.e., 2271 'current' + 13,591 'former' smokers). We then investigate whether genetic ancestry associations are: (1) due to genetically determined smoking-related traits based on known smoking genetic variants 15 ; and (2) modified by education, socioeconomic factors (such as employment/work status and household income), and marital status.

Study population
Individuals were selected from the Kaiser Permanente Research Program on Genes, Environment, and Health (RPGEH) Genetic Epidemiology Research on Adult Health and Aging (GERA) cohort. The cohort consists of over 110,000 adult members of Kaiser Permanente Northern California (KPNC), ranging in age from 18 to 100 years at enrollment 16 . The RPGEH was established as a resource for research on genetic and environmental influences on health and disease, and participants were asked to complete a mailed survey. On this survey, participants were asked: 'What best describes your race/ethnicity?' Briefly, and as previously described 16 , self-reported race/ethnicity for each individual was derived from responses to this question, and, for individuals who reported more than one category, the selections were collapsed into race/ethnicity categories. In particular, all East Asian nationalities (i.e., Chinese, Japanese, Korean, Filipino, Vietnamese, or other Southeast Asian) were collapsed into a single East Asian group; all Latino nationalities (i.e., Mexican, Central/South American, Puerto Rican, or other Latino/Hispanic) were collapsed into a single Hispanic/Latino category; all African descent populations (i.e., African-American, African, or Africo-Caribbean) were collapsed into a single group; all white-European ethnicities (i.e., White or European-American, Middle Eastern, or Ashkenazi Jewish) were collapsed into a single non-Hispanic white group. In addition to self-reported race/ethnicity, individuals included in the current study provided self-reported information regarding their cigarette use, education, employment/work status, household income, and marital status (N = 43,485, Table 1). All study procedures were approved by the Institutional Review Board of the Kaiser Foundation Research Institute.

Smoking-related traits
Two smoking-related traits (i.e., smoking initiation and the number of CPD) were assessed based on the RPGEH survey, via the following questions: 'Have you ever smoked one or more cigarettes per day for six months or longer?' (yes or no); 'Do you currently smoke, or have you stopped smoking?' (current smoker or former smoker); and 'On average how many packs of cigarettes do you (or did you) smoke per day?' (< ½ pack, ½-1 pack, 1-1½ packs, or more than 1½ packs). For smoking initiation, ever (former/current) and never smokers were assigned as cases and controls, respectively. For smokers ('former' and 'current' smokers), the number of CPD, as a quantitative trait, was assessed by assuming ~20 cigarettes per pack. The RPGEH survey has been shown to be successful in assessing other substance use, such as alcohol consumption: in our recent study 17 , we confirmed previous findings implicating the ADH1B, AUTS2, SGOL1, SERPINC1, KLB, and GCKR loci in alcohol consumption 18-21 .
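The coding of the two traits can be sketched as follows. This is a minimal illustration, not the study's actual procedure: the text states only that ~20 cigarettes per pack were assumed, so the representative number of packs assigned to each answer category (category midpoints, and 1.75 packs for the open-ended top category) and the function names are hypothetical choices made here.

```python
# Cigarettes per pack, as stated in the text (~20).
CIGARETTES_PER_PACK = 20

# Hypothetical representative packs/day for each survey answer category.
# These midpoints are illustrative assumptions, not values from the study.
REPRESENTATIVE_PACKS = {
    "<1/2 pack": 0.25,
    "1/2-1 pack": 0.75,
    "1-1 1/2 packs": 1.25,
    ">1 1/2 packs": 1.75,  # open-ended category; value is an assumption
}


def packs_to_cpd(category: str) -> float:
    """Convert a packs-per-day answer category to cigarettes per day (CPD)."""
    return REPRESENTATIVE_PACKS[category] * CIGARETTES_PER_PACK


def smoking_initiation_status(ever_smoked: bool) -> int:
    """Code 'ever' smokers as cases (1) and 'never' smokers as controls (0)."""
    return 1 if ever_smoked else 0
```

Under these assumed midpoints, an answer of '½-1 pack' would be coded as 15 CPD; a different choice of representative values would shift the CPD scale but not the ordering of the categories.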

Socioeconomic covariates
The RPGEH survey was also used to assess education, socioeconomic factors (i.e., employment/work status and household income), and marital status, via the following questions: 'What is the highest level of school that you have completed?'; 'What is your employment or work status?'; 'What best describes your household income (before taxes)?'; and 'What is your current marital status?'. Answers to these questions were combined into: (1) 4 categories for education: 'less than high school' which corresponds to "grade school (grades 1-8)", 'high school' which combines "some high school (grades 9-11)" with "high school or GED", 'some college', and 'college degree or more' which combines "college", "graduate school", and "technical/trade school"; (2) 4 categories for employment or work status: 'full-time employed', 'part-time employed', 'unemployed' and 'disabled for work'; (3) 3 categories for household income: '<$20,000' which corresponds to an annual household income (before taxes) <$19,999 per year, '$20,000 to $59,999/year', and '$60,000/year or more'; and (4) 3 categories for marital status: 'never married', 'married or living as married', and 'separated or divorced'. 'Female' sex, 'college or more' education, '$60,000 or more' income, 'full-time employed' employment, and 'married or living as married' marital status served as the reference groups for Model 3.
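As an illustration of the collapsing step, the education recoding described above can be written as a lookup table. This is a sketch only: the raw answer strings are taken from the text where they are quoted, the raw label for 'some college' is assumed to be identical to the category name, and the function name is hypothetical.

```python
# Map raw survey answers to the four analysis categories for education.
EDUCATION_CATEGORY = {
    "grade school (grades 1-8)": "less than high school",
    "some high school (grades 9-11)": "high school",
    "high school or GED": "high school",
    "some college": "some college",  # raw label assumed, not quoted in the text
    "college": "college degree or more",
    "graduate school": "college degree or more",
    "technical/trade school": "college degree or more",
}


def collapse_education(raw_answer: str) -> str:
    """Return the collapsed education category for a raw survey answer."""
    return EDUCATION_CATEGORY[raw_answer]
```

The same lookup pattern would apply to employment, income, and marital status, whose raw answer options are not listed in the text.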

Genotyping and imputation
GERA DNA samples were genotyped on four custom Affymetrix Axiom arrays that were designed for individuals of non-Hispanic white, East Asian, African American, and Latino race/ethnicity, as previously described 22,23 . We applied genotype quality control (QC) procedures for the GERA samples on an array-wise basis 23 . Briefly, we included genetic markers with an initial genotyping call rate ≥97%, genotype concordance rate >0.75 across duplicate samples, and allele frequency difference ≤0.15 between females and males for autosomal markers.
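The three marker-level inclusion criteria above can be expressed as a simple filter. The sketch below is illustrative and is not the pipeline used in the study; the `MarkerQC` record and its field names are hypothetical, and only the three thresholds come from the text.

```python
from dataclasses import dataclass


@dataclass
class MarkerQC:
    """Hypothetical per-marker QC summary (field names are illustrative)."""
    marker_id: str
    call_rate: float        # fraction of samples with a genotype call
    dup_concordance: float  # genotype concordance across duplicate samples
    sex_freq_diff: float    # |allele frequency in females - in males|
    autosomal: bool = True


def passes_qc(m: MarkerQC) -> bool:
    """Apply the three inclusion criteria stated in the text."""
    if m.call_rate < 0.97:          # initial call rate must be >= 97%
        return False
    if m.dup_concordance <= 0.75:   # concordance must be > 0.75
        return False
    # The female/male allele frequency criterion applies to autosomal markers.
    if m.autosomal and m.sex_freq_diff > 0.15:
        return False
    return True
```

In keeping with the array-wise QC described above, such a filter would be applied separately to the markers of each of the four arrays.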

Principal component (PC) and genetic ancestry
Banda et al. 16 conducted an analysis of ancestry in GERA using PC analysis (Eigenstrat v4.2), and identified 10 and 6 ancestry PCs reflecting genetic ancestry among non-Hispanic whites, and the other ethnic groups, respectively. To adjust for genetic ancestry, we also included the percentage of Ashkenazi (ASHK) Jewish ancestry as a covariate for the non-Hispanic white ethnic group analysis. For genetic ancestry analyses, for each ethnic group, we examined the effect of the first 2 PCs, which are the only ones geographically interpretable and represent geographic clines, on smoking-related traits prevalence/distribution. Each model was adjusted for additional PCs (i.e., up to 10 for non-Hispanic whites and up to 6 for the other ethnic groups). To visualize the smoking-related traits prevalence/distribution by the ancestry PCs, we created a smoothed distribution of each individual's smoking phenotype using a radial kernel density estimate, as previously described 25 .
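To give a sense of what smoothing a phenotype over the first two PCs involves, the sketch below computes a kernel-weighted mean of a phenotype at a chosen point in PC1/PC2 space. It is an assumption-laden illustration: a Gaussian radial kernel and a user-supplied bandwidth are used here, whereas the exact kernel and bandwidth of the previously described method 25 are not given in this text, and the function name is hypothetical.

```python
import math


def smoothed_phenotype(pc1, pc2, phenotype, x, y, bandwidth):
    """Kernel-weighted mean phenotype at point (x, y) in PC1/PC2 space.

    pc1, pc2, phenotype: equal-length sequences, one entry per individual.
    Each weight depends only on the Euclidean distance to (x, y), i.e. the
    kernel is radial; a Gaussian kernel is assumed for illustration.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for a, b, value in zip(pc1, pc2, phenotype):
        dist_sq = (a - x) ** 2 + (b - y) ** 2
        w = math.exp(-dist_sq / (2.0 * bandwidth ** 2))
        weighted_sum += w * value
        weight_total += w
    if weight_total == 0.0:
        return float("nan")  # no individual close enough to contribute
    return weighted_sum / weight_total
```

With a 0/1 smoking initiation phenotype, the returned value is a locally smoothed prevalence; with CPD, it is a locally smoothed mean. Evaluating the function over a grid of (x, y) points would give a surface of the kind used for visualization.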

Genetic risk score (GRS)
To determine if known smoking-associated SNPs could explain the ancestry effect, we repeated the ancestry analyses including a GRS for each smoking-related trait based on the findings of the largest genetic study conducted to date, including up to 1.2 million individuals with information on multiple stages of tobacco use 15 . To derive the GRS, we used a 'classic' method 26 , which consists of computing the GRS based on a subset of SNPs exceeding a specific GWAS association P-value threshold (i.e., P ≤ 5.0 × 10−8 in Liu et al. 15 ). The first GRS was based on 365 genome-wide significant SNPs associated with smoking initiation, and the second was based on 53 SNPs previously reported to be associated at a genome-wide level of significance with CPD 15 . Out of the 365 SNPs, 133 (36.4%) were confirmed to be associated with smoking initiation in GERA, including 14 at a Bonferroni-corrected alpha level of 1.37 × 10−4 (0.05/365) (Supplementary Data 1). Out of the 53 SNPs, 34 (64.1%) were confirmed to be associated with CPD in GERA, including 15 at a Bonferroni-corrected alpha level of 9.43 × 10−4 (0.05/53) (Supplementary Data 2). The GRSs were built on these known smoking-associated SNPs by summing the additive coding of each SNP weighted by the effect size ascertained from the original study 15 . As the original study 15 was conducted in cohorts of European ancestry, we also generated unweighted GRSs and included those in the models for each ethnic group. Results were similar using unweighted or weighted GRS in all ethnic groups (Supplementary Data 3).
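The GRS construction and the Bonferroni thresholds described above reduce to simple arithmetic, sketched below. The function names are hypothetical, and the allele counts and weights passed in are placeholders supplied by the caller rather than values from the study.

```python
def genetic_risk_score(allele_counts, weights=None):
    """Sum of additively coded SNPs (0, 1, or 2 effect alleles per SNP).

    With `weights` (per-SNP effect sizes from the original GWAS) this is the
    weighted GRS; with weights=None every SNP counts equally, which gives
    the unweighted GRS.
    """
    if weights is None:
        weights = [1.0] * len(allele_counts)
    if len(weights) != len(allele_counts):
        raise ValueError("one weight is required per SNP")
    return sum(count * w for count, w in zip(allele_counts, weights))


def bonferroni_alpha(alpha, n_tests):
    """Bonferroni-corrected significance level for n_tests comparisons."""
    return alpha / n_tests
```

For example, `bonferroni_alpha(0.05, 365)` and `bonferroni_alpha(0.05, 53)` reproduce the two thresholds quoted above (about 1.37 × 10−4 and 9.43 × 10−4).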

Statistical analyses
For smoking initiation, we used a logistic regression model to examine the impact of ancestry on this smoking-related trait using R version 3.4.1 with the following covariates: age, sex, and ancestry PCs (first 10 PCs for the non-Hispanic white analyses and first 6 PCs for the other ethnic groups) (Model 1). For the number of CPD, we used a linear regression model. In Model 2, in addition to all covariates included in Model 1, we added one of the two GRS described above. In Model 3, in addition to all covariates included in Model 2, we added education, socioeconomic factors, and marital status as covariates.
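The nesting of the three models can be summarized by listing the covariates each one contains. The sketch below is written in Python purely for illustration (the analyses themselves were run in R); the function names and covariate labels are hypothetical, and the Ashkenazi ancestry covariate for non-Hispanic whites follows the description of the ancestry adjustment given earlier.

```python
from typing import List


def regression_type(trait: str) -> str:
    """Logistic regression for smoking initiation, linear regression for CPD."""
    return "logistic" if trait == "initiation" else "linear"


def model_covariates(model: int, group: str, trait: str) -> List[str]:
    """Covariates for Model 1, 2, or 3 in a given ethnic group.

    Labels such as 'ashk_pct' and 'grs_cpd' are illustrative names only.
    """
    n_pcs = 10 if group == "non-Hispanic white" else 6
    covariates = ["age", "sex"] + [f"PC{i}" for i in range(1, n_pcs + 1)]
    if group == "non-Hispanic white":
        covariates.append("ashk_pct")  # percentage of Ashkenazi ancestry
    if model >= 2:
        covariates.append(f"grs_{trait}")  # trait-specific GRS
    if model >= 3:
        covariates += ["education", "employment", "income", "marital_status"]
    return covariates
```

Because each model adds terms to the previous one, any change in the ancestry PC coefficients from Model 1 to Model 3 can be read as the part of the ancestry association accounted for by the added covariates.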

GERA cohort and smoking behavior
The study sample consisted of 43,485 GERA participants from four ethnic groups (non-Hispanic whites, Hispanic/Latinos, East Asians, and African Americans) (Table 1). In our study, the prevalence of 'ever' smokers varied by ethnicity with the lowest prevalence (22.7%) for East Asians and the highest (38.5%) for non-Hispanic whites. On average, the number of cigarettes per day (CPD) smoked by non-Hispanic whites was higher (21.2 CPD) compared to the number of CPD smoked by individuals from other ethnic groups (range of 16.4-17.1 CPD). 'Ever' smokers were more likely to be 'former' smokers than 'current' smokers in all ethnic groups.
In our study, the prevalence of 'ever' smokers also varied by education level, employment, income level, and marital status (Supplementary Table 1). Individuals with high school education levels were more likely to have smoked compared to individuals with a college degree or higher education level (51.3% vs. 31.7%). Individuals who were disabled were more likely to have smoked compared to individuals who were part- or full-time employed (53.3% vs. 34.8-36.1%), and individuals having an annual income of $60,000 or more were less likely to have smoked compared to individuals who had an annual income below $60,000 (34.5% vs. 43.6%). Finally, individuals who were separated/divorced were more likely to have ever smoked compared to individuals who were never married (45.7% vs. 28.9%). Similar trends were observed across the four ethnic groups (Supplementary Table 2).

Genetic ancestry and smoking behaviors
We first investigated genome-wide genetic ancestry using principal components (PCs) that were assessed within each ethnic group separately 16 . Genetic ancestry associations with smoking initiation and CPD were then assessed, and visual representations are provided in Figs. 1 and 2. Within non-Hispanic whites, the first two PCs represented geographically interpretable genetic ancestry, with PC1 characterizing a northwestern vs. southeastern European cline and PC2 a northeastern vs. southwestern European cline. The first two PCs were both associated with CPD (Model 1: PC1, β = 27.95, P = 0.017; PC2, β = −50.32, P = 9.06 × 10−5) (Table 2), with the lowest number of CPD observed for individuals of southeastern European ancestry (Fig. 2a). In contrast, neither PC1 nor PC2 was associated with smoking initiation within non-Hispanic whites.
In African Americans, neither PC1 (representing African vs. European ancestry) nor PC2 (representing East Asian ancestry) was associated with smoking initiation or CPD (Table 3; Figs. 1d and 2d).

Genetic ancestry and known smoking-associated loci
To determine whether the genetic ancestry associations with smoking-related traits were due to known smoking-associated loci, we repeated the ancestry analyses, including one of the two following GRS: the first GRS was based on 365 SNPs associated with smoking initiation, and the second GRS was based on 53 SNPs previously reported to be associated with CPD 15 . While the GRS for smoking initiation was significantly associated with smoking initiation in all four ethnic groups, the GRS for CPD was a predictor of CPD in all ethnic groups except Hispanic/Latinos (Table 2).

Genetic ancestry associations and socioeconomic factors
To determine whether education, socioeconomic factors, and marital status explain the remaining genetic ancestry associations (after considering genetically determined smoking-related traits), we repeated the ancestry analyses, including education, employment, income level, and marital status. In non-Hispanic whites, only the genetic ancestry association
Note: In Model 3, sex (female), education (college or more), income ($60,000 or more), marital status (married or living as married), and employment (full-time employed) served as the reference groups. Each model was adjusted for age, sex, and additional PCs. We also included the percentage of Ashkenazi (ASHK) ancestry as a covariate for the non-Hispanic white analyses. CPD, number of cigarettes smoked per day; PC, principal component; β, beta; SE, standard error; GRS, genetic risk score (based on 365 SNPs previously reported to be associated with smoking initiation, or 53 SNPs previously reported to be associated with CPD).

Discussion
In this study, we observed substantial differences in cigarette smoking behaviors across race/ethnicity groups, and we found that smoking initiation and/or CPD were associated with genetic ancestry within non-Hispanic whites, Hispanic/Latinos, and East Asians. Specifically, a higher smoking initiation prevalence and higher number of CPD were associated with greater European (versus Native American) ancestry among Hispanic/Latinos and were associated with greater European (versus Asian) ancestry among East Asians. Furthermore, individuals of northwestern European ancestry had a higher number of CPD compared to individuals of southeastern European ancestry among non-Hispanic whites. No significant associations between genetic ancestry and cigarette smoking behaviors were detected in African Americans, the group with the smallest sample size. After considering genetic variants known to contribute to cigarette smoking behaviors and accounting for education, socioeconomic factors such as employment/work status and household income, and marital status, these genetic ancestry associations remained, but were attenuated. Study findings suggest that genetically determined smoking traits and socioeconomic factors can explain some of the ancestry effects in Hispanic/Latinos, East Asians, and non-Hispanic whites, and that additional factors correlated with genetic ancestry remain to be discovered.
Our results are consistent with previous studies showing disparities in adult cigarette smoking prevalence among specific sub-populations defined by ethnicity, education level, and socioeconomic status. Indeed, we found that East Asian and Hispanic/Latino individuals had a lower prevalence of smoking initiation than non-Hispanic white and African American individuals, consistent with previous studies 7,28 . Similarly, in our study, the prevalence of 'ever' smokers was much lower for college-educated individuals compared to those with high school education, and for individuals who earned $60,000 or more compared to those with lower income, consistent with previous studies 7,28-30 . Furthermore, in our study, married individuals had a higher prevalence of smoking cessation than those who were single or divorced, consistent with previous findings 31 .
We recognize several potential limitations of our study. First, the cigarette smoking-related traits were based on self-reported information, and no information regarding other forms of tobacco use, such as pipes, cigars, or e-cigarettes, was collected in our survey. Further, GERA cohort members are older on average compared to the general population. As older adults may consume tobacco in a different form than younger adults, who may prefer e-cigarettes 32,33 , this may limit the generalizability of the findings to the groups represented in this study. Second, no information regarding the previous U.S. addresses of the participants included in the current study was collected. All the GERA members were living in the Northern California region at the time of survey completion; however, as smoking prevalence has been shown to vary considerably across states 7,34 , considering the previous U.S. addresses of the participants could identify an additional potential source of variation in smoking behavior. Third, because of the limited number of 'current' smokers in our sample (N = 2271), we did not consider the smoking cessation phenotype (i.e., 'current' vs. 'former' smokers) for the subsequent genetic ancestry association analyses. Lastly, for the calculation of GRS for smoking-related traits, we used a 'classic' GRS method 26 that is restricted to genetic variants reaching genome-wide significance in the original GWAS 15 . This 'classic' approach has been commonly applied 35-39 and has key advantages 26 , including that it is relatively fast to apply and is more interpretable compared to more sophisticated methods, such as Bayesian regression models that perform shrinkage 39-41 . Further, this 'classic' approach has been shown to have relatively similar performance compared to alternative methods 39-41 .
Future studies applying those alternative methods to derive GRS for smoking-related traits may provide a further refinement to the effects that we observed in the current study. Despite these limitations, our study is based on a unique and very large cohort of individuals, who were all members of the KPNC health plan, a single integrated healthcare delivery system. Participants were recruited in a similar manner and were assessed for their cigarette smoking behaviors using a single questionnaire providing greater consistency, in contrast to consortia which often include different questions across studies.
In conclusion, this study is the first investigation of genetic ancestry and cigarette smoking-related trait associations. We observed significant associations between genetic ancestry and smoking-related traits within each race/ethnicity, except for African Americans. Known smoking-associated genetic variants identified in populations of European ancestry 15 explained only a small proportion of these associations, and the observed ancestry effects may be due to population-specific genetic variants. Future studies including additional genetic variants associated with smoking behavior-related traits in non-European populations, such as those recently identified in a Japanese population 42 but not validated yet, may better explain these genetic ancestry associations.