Mendelian randomisation study of smoking exposure in relation to breast cancer risk

Background Despite a modest association between tobacco smoking and breast cancer risk reported by recent epidemiological studies, it remains uncertain whether smoking is causally related to breast cancer risk. Methods We applied Mendelian randomisation (MR) to evaluate a potential causal effect of cigarette smoking on breast cancer risk. Both individual-level data and summary statistics for 164 single-nucleotide polymorphisms (SNPs) reported in genome-wide association studies of lifetime smoking index (LSI) or cigarettes per day (CPD) were used to obtain MR effect estimates. Data from 108,420 invasive breast cancer cases and 87,681 controls were used for the LSI analysis; the CPD analysis was conducted among ever-smokers, comprising 26,147 cases and 26,072 controls. Sensitivity analyses were conducted to address pleiotropy. Results Genetically predicted LSI was associated with increased breast cancer risk (OR 1.18 per SD, 95% CI: 1.07–1.30, P = 0.11 × 10⁻²), but there was no evidence of association for genetically predicted CPD (OR 1.02, 95% CI: 0.78–1.19, P = 0.85). The sensitivity analyses yielded similar results and showed no strong evidence of pleiotropic effects. Conclusion Our MR study provides supportive evidence for a potential causal association with breast cancer risk for lifetime smoking exposure but not for cigarettes per day among smokers.


BACKGROUND
Breast cancer is the most common cancer in women, representing approximately one-quarter of all cancers diagnosed in women worldwide. 1 Besides well-established risk factors for breast cancer, tobacco smoking has been widely studied as a potential risk factor for breast cancer since it is also a leading modifiable risk factor for cancers at sites not directly reached by tobacco smoke. 2,3 Carcinogens associated with tobacco smoke include polycyclic aromatic hydrocarbons, aromatic amines and N-nitrosamines. 4 It is biologically plausible that tobacco smoking may affect risk of breast cancer since metabolites of lipophilic tobacco-associated carcinogens have been detected in breast adipose tissue, 5,6 and specific DNA adducts as well as p53 gene mutations are found in the breast cancer tissue of smokers. [7][8][9] Based on the experimental and epidemiologic findings, evidence is insufficient to establish a causal relationship between tobacco smoking and breast cancer risk. 2,10 Some of the inconsistencies in findings could be attributed to a potential dual effect of smoking on breast cancer. 11 The anti-oestrogenic effect of smoking may attenuate or mask the carcinogenic effects. 12 In a recent pooled analysis of 14 prospective cohort studies, adjusting for potential confounding by alcohol intake as well as body mass index (BMI), education, reproductive factors and other risk factors, a modest association was found between smoking and breast cancer risk. 13 There was also an increased risk with longer duration of smoking prior to first birth, particularly for oestrogen receptor-positive tumours. It is however difficult to establish causation based on these observational studies.
Mendelian randomisation (MR) has been increasingly used to strengthen causal inference in observational studies and, under certain assumptions, is less vulnerable to residual confounding, reverse causation and selection bias. 14 MR uses genetic variants such as single-nucleotide polymorphisms (SNPs) associated with an exposure of interest as an instrumental variable (IV), which is valid if three assumptions are met: (1) the genetic variants are causally associated with the exposure, (2) the variants are not associated with known or potential confounders of the exposure-outcome relationship and (3) the variants are associated with the outcome only via the exposure of interest and not through other pathways. 15,16 Given the unclear causal nature of the findings in observational studies, we conducted an MR analysis to investigate the association between smoking traits and breast cancer risk.

Study population
We used data from 81 studies participating in the Breast Cancer Association Consortium (BCAC), including 108,420 cases and 87,681 controls of European ancestry. Genotyping was performed using two custom-made genotyping arrays: OncoArray in 68,242 invasive breast cancer cases and 52,367 controls (https://epi.grants.cancer.gov/oncoarray/) 17 and iCOGS arrays in 40,178 cases and 35,314 controls (http://ccge.medschl.cam.ac.uk/research/consortia/icogs/). 18 Genotype data were imputed with the programme IMPUTE2 using the 1000 Genomes Project Phase 3 as the reference panel. 19 SNPs with high imputation quality (imputation r² > 0.5) were included. Participants present in both datasets were excluded from the iCOGS dataset because the OncoArray provides better genomic coverage than the iCOGS array. Demographic and epidemiologic data were harmonised across BCAC sites based on a standardised protocol and derived with respect to a reference date, which was the date at diagnosis for cases and the date at interview for controls. For controls and cases from the nested case-control studies, data from the baseline interview or, if available, follow-up information were considered. Selected characteristics of the two datasets, including self-reported smoking behaviours (e.g., smoking status, smoking heaviness and smoking duration, including age at smoking initiation and lifetime smoking exposure), are shown in Supplementary Table 1. Ethics approval was obtained from the relevant institutional review boards for all BCAC studies, and all participants provided written informed consent.
Selection of SNPs associated with smoking exposure
Exposure variables were selected based on the availability of associated genetic variants to reflect smoking exposure. We included two quantitative smoking behaviour-related traits, cigarettes per day (CPD; average number of cigarettes smoked per day by ever-smokers) and lifetime smoking index (LSI; composite score that captures lifetime smoking exposure by taking into account smoking status as well as smoking duration, heaviness and cessation in ever-smokers). 20 The recent GWAS of cigarettes per day from the GSCAN consortium identified 55 conditionally independent genome-wide significant SNPs, explaining 1.09% of the variance in a sample of 337,334 ever-smokers of European ancestry. 21 For LSI, 126 significantly associated SNPs, explaining 0.36% of the variance in LSI, were identified based on a sample of 426,690 individuals (never-smokers and ever-smokers) of European ancestry from the UK Biobank. 20 The genetic scores of the two smoking behaviour-related traits were reported to be associated with significantly higher risks of lung cancer. 20,22 For our analysis, SNPs were selected if they were reported to be associated at the genome-wide significance level (P ≤ 5 × 10⁻⁸) and had a minor allele frequency (MAF) of at least 1%. For each behaviour phenotype, we pruned the list of behaviour-associated SNPs so that the remaining SNPs were not in linkage disequilibrium (LD) with one another (SNP pairs with r² > 0.1 were treated as correlated). For CPD, the SNP with the lowest P value (rs10519203) among the correlated SNPs (rs12438181, rs28438420, rs72740955, rs146009840, rs28681284, rs8040868, r² > 0.01) was retained. One SNP (rs4886550) was not available in our data and had no proxy SNP (LD, r² > 0.8).
Given that alcohol consumption is a recognised confounder of the association between smoking and breast cancer risk, [23][24][25] we excluded nine SNPs (three for CPD and six for LSI) that were correlated (r² > 0.1) with any alcohol consumption-associated SNP from a recent large-scale GWAS (drinks per week, P ≤ 5 × 10⁻⁸). 21 After exclusions, we included a total of 164 variants associated with one of the smoking traits: 44 variants for CPD and 120 for LSI (Supplementary Table 2). No variants overlapped between the two smoking traits.
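The selection steps described above (genome-wide significance, MAF filter and LD pruning that retains the SNP with the lowest P value) can be sketched as follows. This is a minimal illustration in Python and not the pipeline used in the study: the SNP identifiers, P values, allele frequencies and r² values are invented placeholders, and greedy pruning in order of increasing P value is an assumed strategy.

```python
# Illustrative sketch only: toy data, hypothetical SNP names, assumed greedy pruning.
P_THRESHOLD = 5e-8    # genome-wide significance, as stated in the text
MAF_THRESHOLD = 0.01  # minor allele frequency of at least 1%
LD_THRESHOLD = 0.1    # SNP pairs with r^2 above this are treated as correlated


def select_instruments(snps, r2):
    """Return ids of SNPs that pass the P/MAF filters and are mutually uncorrelated.

    snps: list of dicts with keys 'id', 'p', 'maf'
    r2:   dict mapping frozenset({id_a, id_b}) -> pairwise r^2 (missing pairs = 0)
    """
    candidates = [
        s for s in snps if s["p"] <= P_THRESHOLD and s["maf"] >= MAF_THRESHOLD
    ]
    candidates.sort(key=lambda s: s["p"])  # strongest association first
    kept = []
    for snp in candidates:
        correlated = any(
            r2.get(frozenset({snp["id"], k}), 0.0) > LD_THRESHOLD for k in kept
        )
        if not correlated:
            kept.append(snp["id"])
    return kept


toy_snps = [
    {"id": "snpA", "p": 1e-20, "maf": 0.30},
    {"id": "snpB", "p": 3e-12, "maf": 0.25},   # in LD with snpA, so pruned
    {"id": "snpC", "p": 4e-9, "maf": 0.005},   # fails the MAF filter
    {"id": "snpD", "p": 2e-7, "maf": 0.40},    # not genome-wide significant
    {"id": "snpE", "p": 1e-9, "maf": 0.10},
]
toy_r2 = {frozenset({"snpA", "snpB"}): 0.6}

selected = select_instruments(toy_snps, toy_r2)
print(selected)
```

Sorting by P value before pruning ensures that, within a correlated group, the most strongly associated SNP is the one retained, which mirrors the rule stated for the CPD locus.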
Statistical methods
Statistical power. Power calculations were conducted to estimate the magnitude of effects detectable with our study size assuming a 5% α level and an R² of 0.0109 for CPD and an R² of 0.0036 for LSI, which correspond to the variance in each smoking behaviour explained by the SNPs used for this analysis. Power calculations were performed using an online tool available at http://cnsgenomics.com/shiny/mRnd/. 26 Detailed power calculations for all outcomes (invasive breast cancer, ER-positive and ER-negative) to detect different odds ratios are shown in Supplementary Table 3.
wPGS-based analyses. For our primary analysis, we generated weighted polygenic scores (wPGSs) using individual-level data of BCAC participants as follows: $\mathrm{wPGS} = \sum_{i=1}^{n} \beta_{gx} \times \alpha_i$, which is the sum over the n SNPs of the effect allele dosage ($\alpha_i$, ranging from 0 to 2) for each SNP weighted by the β-coefficient $\beta_{gx}$ for the effect of the genetic variant (g) on the smoking quantity and/or duration trait (CPD and LSI) (x) (Supplementary Table 2). 20,21 The wPGS for LSI was weakly correlated with the wPGS for CPD (0.17 in iCOGS and 0.03 in OncoArray) in ever-smokers. Analysis of LSI was conducted using all participants (including ever- and never-smokers), whereas that of CPD was performed solely in ever-smokers and was based on 26,147 cases (7342 for iCOGS, 18,805 for OncoArray) and 26,072 controls (8489 for iCOGS, 17,583 for OncoArray). These inclusion criteria correspond to those used in the GWAS that identified the SNPs for the smoking traits. 20,21 Associations of the wPGSs with breast cancer risk were estimated by logistic regression separately in iCOGS and OncoArray and combined by fixed-effect meta-analysis, with heterogeneity evaluated by Cochran's Q statistic. The basic model (Model 1) was adjusted for age (continuous), principal components (PCs) of genetic ancestry (first ten PCs for iCOGS and OncoArray, separately) and study site, as previously described. 27
In order to assess whether the genetic instrument is independent of established risk factors for breast cancer, associations of wPGSs with selected breast cancer risk factors were assessed by linear regression for continuous variables and logistic regression for categorical variables (Supplementary Table 4). In Model 2, we adjusted for the risk factors that were associated with the wPGS of at least one of the smoking traits. We additionally adjusted for alcohol consumption, a well-known confounder of the association between breast cancer risk and smoking, separately in Model 3 because of the large amount of missing data. Participants with missing covariables were excluded from all of the analyses.
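The weighted polygenic score defined above is a dosage-weighted sum, which the following Python sketch makes concrete. The weights and dosages are invented for illustration and are not taken from Supplementary Table 2 or from BCAC data; the function name is hypothetical.

```python
# Illustrative sketch only: invented weights and dosages, not BCAC data.

def weighted_pgs(dosages, betas):
    """Sum of effect-allele dosages (0 to 2) weighted by SNP-exposure coefficients."""
    if len(dosages) != len(betas):
        raise ValueError("one dosage per SNP weight is required")
    if any(d < 0 or d > 2 for d in dosages):
        raise ValueError("allele dosages must lie between 0 and 2")
    return sum(b * d for b, d in zip(betas, dosages))


# Hypothetical per-allele effects on the smoking trait for three SNPs
betas = [0.05, 0.02, 0.10]
# Imputed dosages for two hypothetical participants (non-integer values are allowed)
participant_1 = [2.0, 0.0, 1.0]
participant_2 = [0.0, 1.5, 0.0]

score_1 = weighted_pgs(participant_1, betas)  # 0.05*2 + 0.02*0 + 0.10*1
score_2 = weighted_pgs(participant_2, betas)  # 0.02*1.5
print(round(score_1, 3), round(score_2, 3))
```

Because imputed genotypes are expressed as expected allele counts, the dosage may take any value between 0 and 2 rather than only 0, 1 or 2.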
Stratified analysis was performed to assess potential differences in associations with breast cancer risk by menopausal status (pre-, postmenopausal women) adjusting for age, study site and top ten PCs. Heterogeneity was tested employing the likelihood ratio test (LRT) for evaluating the multiplicative interaction terms in nested models. Polytomous regression was used to estimate the association according to oestrogen receptor (ER) status.
Scaling was applied to convert the wPGSs for CPD into meaningful units by dividing them by the linear regression coefficient of self-reported CPD (0.35 per pack of cigarettes per day). The regression coefficient of CPD was derived from a meta-analysis of iCOGS and OncoArray data on smoking behaviours among 26,072 ever-smoker controls (8490 and 17,703 controls of iCOGS and OncoArray, respectively).
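Both the wPGS associations and the regression coefficient used for scaling were obtained by combining iCOGS and OncoArray estimates. A minimal Python sketch of inverse-variance fixed-effect pooling with Cochran's Q is given below; the two log odds ratios and standard errors are hypothetical, and the P value helper is valid only for two studies (one degree of freedom).

```python
import math


def fixed_effect_meta(estimates, std_errors):
    """Inverse-variance fixed-effect pooling with Cochran's Q.

    Returns (pooled estimate, pooled SE, Q statistic, degrees of freedom).
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    total_w = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total_w
    pooled_se = math.sqrt(1.0 / total_w)
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, estimates))
    return pooled, pooled_se, q, len(estimates) - 1


def q_p_value_two_studies(q):
    """Upper-tail chi-square probability for Q with exactly 1 degree of freedom."""
    return math.erfc(math.sqrt(q / 2.0))


# Hypothetical log odds ratios and standard errors from the two arrays
log_or = [0.20, 0.10]
se = [0.10, 0.05]

pooled, pooled_se, q, df = fixed_effect_meta(log_or, se)
p_het = q_p_value_two_studies(q)
print(round(math.exp(pooled), 3), round(q, 3), df, round(p_het, 3))
```

The more precise study dominates the pooled estimate, and a small Q relative to its degrees of freedom indicates that the two array-specific estimates are compatible with a common effect.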
Two-sample MR analyses. Five different two-sample MR methods using summary association data were applied: inverse-variance weighted (IVW), 28 MR-Egger, 29 weighted median, 30 weighted mode, 31 and robust adjusted profile score (RAPS). 32 Each of these methods makes slightly different assumptions about the nature of pleiotropy, and therefore a roughly consistent point estimate across the methods provides stronger evidence for causal inference than any single method. 28 The IVW method was implemented since the instruments consisted of multiple SNPs. 33 Multivariable MR methods 34 were also applied using summary association data from GWASs of alcohol consumption (drinks per week), 21 body mass index (BMI) among females 35 and education attainment. 36 To produce valid results, the IVW method requires that all instruments are associated with the exposure of interest (relevance assumption), that they affect the outcome of interest only via the exposure and not directly (exclusion restriction), and that they are not associated with any confounders of the relationship between the exposure and the outcome (independence assumption). 37 The intercept from MR-Egger regression provides a statistical test for horizontal pleiotropy, whereas the slope can be interpreted as the effect of smoking behaviour on breast cancer adjusted for horizontal pleiotropy. 29 This method assumes, however, that the pleiotropic effects are independent of the instrument strength (InSIDE assumption). The weighted median estimator provides a valid causal estimate when at least half of the instruments are valid. 30 The estimate from the weighted-mode analysis is valid when the largest group of instruments with consistent MR estimates is valid. 31 MR-RAPS extends the basic IVW random-effects approach by making the weight each variant receives in the analysis a function of the causal effect and the precision of the SNP-exposure association. 32 The MR pleiotropy residual sum and outlier test (MR-PRESSO) was also implemented to identify outlying genetic variants, and analyses were re-run after excluding these variants. 38
All two-sample MR analyses using summary association data were performed with respect to three cancer susceptibility phenotypes: overall breast cancer (108,067 cases/88,386 controls) as well as oestrogen receptor (ER)-positive (70,435 cases) and ER-negative tumours (17,365 cases). Due to the nature of summary-level data, the analyses of both LSI and CPD were conducted using all samples regardless of smoking status. R version 3.4.3 was used to conduct analyses. The R packages "MendelianRandomization", "mr_raps" and "MR-PRESSO" were used for two-sample MR analysis. All tests were considered at the 0.05 level of significance.
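To make the differences between the estimators described above more tangible, the following Python sketch implements textbook versions of the IVW estimate (with a multiplicative random-effects standard error), the MR-Egger intercept and slope, and the weighted median. It is not the code used in the study, which relied on R packages; the function names are hypothetical, first-order weights are assumed, and the summary statistics are invented so that four SNPs share a ratio estimate of 0.5 and one SNP is a pleiotropic outlier.

```python
import math


def ivw(bx, by, se_y):
    """IVW estimate (weighted regression through the origin).

    Returns (estimate, SE under multiplicative random effects).
    """
    w = [1.0 / s ** 2 for s in se_y]
    denom = sum(wi * x * x for wi, x in zip(w, bx))
    est = sum(wi * x * y for wi, x, y in zip(w, bx, by)) / denom
    q = sum(wi * (y - est * x) ** 2 for wi, x, y in zip(w, bx, by))
    phi = max(1.0, q / (len(bx) - 1))  # over-dispersion never shrinks the SE
    return est, math.sqrt(phi / denom)


def mr_egger(bx, by, se_y):
    """Weighted regression with an intercept; returns (intercept, slope)."""
    # Orient every SNP so that its association with the exposure is positive
    sign = [1.0 if x >= 0 else -1.0 for x in bx]
    x = [s * v for s, v in zip(sign, bx)]
    y = [s * v for s, v in zip(sign, by)]
    w = [1.0 / s ** 2 for s in se_y]
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def weighted_median(bx, by, se_y):
    """Weighted median of ratio estimates (assumes at least two SNPs)."""
    ratios = [y / x for x, y in zip(bx, by)]
    weights = [(x / s) ** 2 for x, s in zip(bx, se_y)]  # first-order weights
    order = sorted(range(len(ratios)), key=lambda i: ratios[i])
    r = [ratios[i] for i in order]
    w = [weights[i] for i in order]
    total = sum(w)
    cum, running = [], 0.0
    for wi in w:
        running += wi
        cum.append((running - 0.5 * wi) / total)
    below = max(i for i, c in enumerate(cum) if c < 0.5)
    frac = (0.5 - cum[below]) / (cum[below + 1] - cum[below])
    return r[below] + (r[below + 1] - r[below]) * frac


# Invented SNP-exposure (bx) and SNP-outcome (by) associations
bx = [0.10, 0.08, 0.06, 0.05, 0.04]
by = [0.5 * b for b in bx[:4]] + [2.0 * bx[4]]  # last SNP is an outlier
se_y = [0.01] * 5

ivw_est, ivw_se = ivw(bx, by, se_y)
egger_intercept, egger_slope = mr_egger(bx, by, se_y)
median_est = weighted_median(bx, by, se_y)
print(ivw_est, ivw_se, egger_intercept, egger_slope, median_est)
```

In this toy example the single outlier pulls the IVW estimate away from 0.5 and produces a non-zero MR-Egger intercept, whereas the weighted median remains at 0.5 because more than half of the weight comes from SNPs with a consistent ratio estimate.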

RESULTS
There was an association of wPGS for LSI with increased invasive breast cancer risk (OR per SD 1.18, 95% CI: 1.07-1.30, P = 0.11 × 10⁻²), whereas little evidence was found for an association between wPGS for CPD and invasive breast cancer (OR 1.02 per pack of cigarettes per day, 95% CI: 0.78-1.19, P = 0.85) after adjustment for age and study (Model 1) (Table 1). Several breast cancer risk factors were associated with the wPGS of at least one of the smoking traits (CPD and/or LSI), including ever breastfeeding, menopausal status, age at menopause, BMI, age at first live birth, parity and education level (Supplementary Table 4). Adjustment for all of the identified risk factors did not change the association substantially (Model 2) (wPGS for LSI, OR per SD 1.24, 95% CI: 1.06-1.45, P = 0.60 × 10⁻²) (Table 1). The point estimate of the association between wPGS for LSI and invasive breast cancer risk remained largely unchanged after additional adjustment for alcohol consumption, although it was imprecisely estimated (i.e. wide confidence intervals) (Model 3) (OR per SD 1.13, 95% CI: 0.86-1.49, P = 0.39).
The association between wPGSs for CPD and invasive breast cancer did not change after adjustments.
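The relation between a reported odds ratio, its confidence interval and the corresponding P value can be checked with a short calculation. The Python sketch below assumes a Wald-type interval that is symmetric on the log odds scale, which is an assumption about how the intervals were derived; because the published values are rounded, only approximate agreement with the reported P value should be expected.

```python
import math
from statistics import NormalDist


def p_from_or_ci(odds_ratio, lower, upper, level=0.95):
    """Back-calculate SE, z statistic and two-sided P from an OR and its CI.

    Assumes the interval is exp(log(OR) +/- z_crit * SE).
    """
    norm = NormalDist()
    z_crit = norm.inv_cdf(0.5 + level / 2.0)
    se = (math.log(upper) - math.log(lower)) / (2.0 * z_crit)
    z = math.log(odds_ratio) / se
    p = 2.0 * (1.0 - norm.cdf(abs(z)))
    return se, z, p


# Model 1 estimate for the LSI score as reported in the text
se, z, p = p_from_or_ci(1.18, 1.07, 1.30)
print(round(se, 4), round(z, 2), p)
```

The back-calculated P value falls in the same order of magnitude as the reported P = 0.11 × 10⁻², as expected given rounding of the odds ratio and interval limits.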
There was no evidence for effect heterogeneity of the associations of LSI and CPD with breast cancer risk according to ER status or menopausal status (Supplementary Fig. 1).
Using IVW random-effects analysis, positive associations of genetically predicted LSI were found for overall breast cancer risk (OR per SD 1.14, 95% CI: 1.02-1.28, P = 0.02) and breast cancers according to ER status (OR per SD 1.14, 95% CI: 1.00-1.30, P = 0.04, for ER-positive and OR per SD 1.14, 95% CI: 0.95-1.37, P = 0.17, for ER-negative tumours) (Table 2, Supplementary Table 5 and Supplementary Figs. 2-4). There was no indication of horizontal pleiotropy based on the MR-Egger intercept test for any outcomes. The point estimates of associations were consistent across the different methods, although the results based on MR-Egger regression and the weighted mode method were imprecisely estimated (i.e. wide confidence intervals). They remained substantially unchanged after multivariable adjustment for alcohol consumption, BMI and education. The MR-PRESSO analysis revealed three outliers for LSI. Removal of an outlier (rs2867112) with respect to risk for overall breast cancer and ER-negative cancer did not change the associations. No outlier was observed with ER-positive tumour. For genetically predicted CPD, we found little evidence for an association with overall breast cancer and ER subtypes using the IVW random-effects method.

DISCUSSION
This MR study supports an association between genetically predicted lifetime smoking exposure and increased invasive breast cancer risk but shows no clear association with cigarettes per day among smokers. The estimates based on the wPGS for the two smoking traits and several two-sample MR methods were consistent. The LSI has not been assessed in studies of breast cancer risk, but the modest association found in this analysis is in line with the modest associations of current and former smoking with invasive breast cancer risk reported in the recent large pooled analysis. 13 We did not find support for an association between cigarettes per day in ever-smokers and invasive breast cancer risk, whereas modest associations were reported for cigarettes per day in current smokers compared with never-smokers in the pooled analysis. With the restriction to ever-smokers, the MR analysis of CPD had low statistical power for the very modest dose-response association estimated by our data (Table 1) as well as that reported in the pooled analysis. 13 In addition to smoking exposure traits, we addressed two dichotomous smoking status traits, namely smoking initiation and smoking cessation, which account for the lifetime smoking exposure. Despite a pooled analysis of epidemiological studies reporting an increased risk of breast cancer associated with current smokers compared to non-smokers (OR of 1.02), 13 a recent MR study found inconclusive evidence of association between genetically predicted smoking initiation and breast cancer risk using summary-level data (OR: 1.05, 95% CI: 0.99-1.12, P = 0.12 in BCAC; OR: 0.97, 95% CI: 0.90-1.06, P = 0.51 in UK Biobank). 39
We conducted association analyses of wPGS of smoking initiation and smoking cessation with breast cancer risk (see Supplementary Note) and found no statistically significant association for either smoking initiation (OR 1.05, 95% CI 0.99-1.11, P = 0.08) or smoking cessation (OR 1.06, 95% CI: 0.88-1.27, P = 0.52) (Supplementary Table 6). The results remained unchanged after adjusting for breast cancer risk factors (Supplementary Table 6). We cannot rule out that the MR analysis of smoking status had low statistical power to detect the very modest association estimated by our data (Table 1) and that reported in the pooled analysis. 13 It is also possible that the MR analysis of smoking status, especially smoking initiation alone, does not capture the association between cigarette smoking and breast cancer risk as well as the LSI, which accounts for other smoking traits.
Several issues concerning the association between smoking and breast cancer risk are not entirely resolved, such as potential effect modification by timing of smoking exposure, menopausal status and oestrogen receptor (ER) status, and potential confounding by alcohol consumption. We conducted an association analysis between wPGS of age at smoking initiation and invasive breast cancer (see Supplementary Note). The result showed an inverse association that was not statistically significant and was imprecisely estimated (OR 0.88, 95% CI: 0.25-3.07, P = 0.84) (Supplementary Table 6). This is in line with the result of a pooled analysis of epidemiological studies showing that women who started smoking later than 24 years old were at lower breast cancer risk than those who started smoking earlier when compared to non-smokers. 13 Smoking initiation in relation to first birth has been considered an essential factor in the association with breast cancer risk since the undifferentiated breast epithelium is particularly susceptible to carcinogens before the first birth. 40 Indeed, this appears to be supported by findings of a stronger association of smoking with breast cancer risk if smoking was initiated before first birth, and of a stronger influence of smoking on breast cancer among women who started smoking more than 10 years before the first full-term pregnancy. 13,[41][42][43][44][45][46] Since the relevant information was only available for a subset of study participants, we did not have sufficient power to address potential differential associations according to timing of smoking exposure in relation to first birth.
Stronger associations between smoking and breast cancer among premenopausal women have been hypothesised since the morphology of the breast and the endogenous hormone levels change substantially during the menopausal transition, and menopausal status alters other breast cancer risk factors. 10 The MR results confirmed the lack of effect modification by menopausal status also reported by previous epidemiological studies. 13,47,48 Despite early evidence against a differential association by ER status, 2,3 recent epidemiologic studies reported a stronger association for risk of ER-positive breast cancer. 10,13,41 We did not find clear evidence for heterogeneity by ER status, although power to detect effect heterogeneity was limited, particularly due to the small sample size for ER-negative disease.
OR odds ratio, MR Mendelian randomisation, IVW inverse-variance weighted, CI confidence interval, CPD cigarettes per day, LSI lifetime smoking index, RAPS robust adjusted profile score. All two-sample MR analyses using summary-level data were performed in all samples regardless of smoking status (108,067 overall breast cancer cases/88,386 controls); a estimate derived using summary statistics (28); b multivariable analysis after adjusting for genetically predicted alcohol consumption (drinks per week), body mass index and education attainment by using summary-level data from GWAS of alcohol consumption (21), body mass index (BMI) among females (35) and education attainment (36); c the MR pleiotropy residual sum and outlier test (MR-PRESSO) was implemented to identify outlying genetic variants (rs11940255, rs1737894 and rs73229090 for CPD; rs2867112 for LSI) and analyses were re-run after excluding these variants (38).
Previous epidemiologic studies have addressed the confounding effect of alcohol consumption and found an association of smoking with breast cancer risk after stratifying on alcohol consumption. 13,[41][42][43] We addressed this issue by excluding SNPs associated with alcohol intake from the wPGS. However, the results did not change significantly in sensitivity analyses with wPGS that included the overlapping alcohol consumption-associated SNPs (Supplementary Table 7). Also, our multivariable analyses adjusting for alcohol consumption yielded association estimates that remained unchanged. The lower precision can be attributed to the reduced dataset with available information on alcohol intake and the ensuing diminished power.
Despite epidemiological evidence indicating alcohol consumption as an established risk factor for breast cancer risk, 49 a recent MR study reported no significant association of breast cancer risk with genetically predicted alcohol intake using summary association data. 39 We conducted an analysis between wPGS of alcohol consumption and invasive breast cancer using individual-level data (see Supplementary Note). We found no clear evidence of association between genetically predicted alcohol consumption and invasive breast cancer (OR: 0.98, 95% CI: 0.86-1.11, P = 0.74). The association remained unchanged after adjusting for LSI (OR: 0.88, 95% CI: 0.72-1.09, P = 0.25).

Strengths and limitations
Despite the relatively large sample size used for our analyses, weak statistical power to detect modest associations should be considered when interpreting the results. The power calculation shows that our study had 68% power to detect an OR of 1.20 per SD change in LSI but only around 9% power for an OR of 1.05 in combined dataset (iCOGS and OncoArray) (Supplementary Table 3). Since the reported order of magnitude for associations with CPD is around 1.05 per ten cigarettes, 13 we cannot exclude that the lack of association observed in the MR analysis particularly regarding CPD may be due to limited power. The unit for LSI is not scaled due to the nature of the phenotype. The reported order of magnitude for associations with LSI can therefore not be directly compared with the relative risk estimates for smoking from observational studies.
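The order of magnitude of these power figures can be approximated with a short calculation. The Python sketch below assumes a non-centrality approximation for a binary outcome of the kind commonly associated with the mRnd tool; this formula is an assumption made for illustration and is not confirmed by the text, so the output should be expected to match Supplementary Table 3 only approximately. The sample size and case fraction are taken from the overall BCAC numbers stated in the Methods.

```python
import math
from statistics import NormalDist


def mr_power_binary(n, case_fraction, r2, odds_ratio, alpha=0.05):
    """Approximate power of an MR analysis with a binary outcome.

    n:             total sample size (cases plus controls)
    case_fraction: proportion of cases in the sample
    r2:            variance in the exposure explained by the instrument
    odds_ratio:    true OR per SD of the exposure
    Assumed formula: chi-square test with 1 df and non-centrality b^2 / v.
    """
    k = case_fraction
    b = k * (odds_ratio / (1.0 + k * (odds_ratio - 1.0)) - 1.0)
    v = (k * (1.0 - k) - b ** 2) / (n * r2)
    delta = math.sqrt(b ** 2 / v)  # square root of the non-centrality parameter
    norm = NormalDist()
    z_crit = norm.inv_cdf(1.0 - alpha / 2.0)
    # A non-central chi-square with 1 df is the square of a shifted normal
    return (1.0 - norm.cdf(z_crit - delta)) + norm.cdf(-z_crit - delta)


n_cases, n_controls = 108420, 87681
n_total = n_cases + n_controls
r2_lsi = 0.0036  # variance in LSI explained by the selected SNPs

power_or_120 = mr_power_binary(n_total, n_cases / n_total, r2_lsi, 1.20)
power_or_105 = mr_power_binary(n_total, n_cases / n_total, r2_lsi, 1.05)
print(round(power_or_120, 2), round(power_or_105, 2))
```

Under these assumptions the calculation illustrates why a small explained variance limits power: even with close to 200,000 participants, an odds ratio of 1.05 per SD is detected only rarely.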
MR estimates have a causal interpretation only if the assumptions of the instrumental variable approach hold. Even though we performed extensive sensitivity analyses to detect potential violations, it is difficult to prove the validity of the assumptions. The LSI captures multiple aspects of smoking behaviours, which could have introduced more potential for horizontal pleiotropy. The more diffuse the definition of smoking, the more lifestyle factors might be correlated, making it especially important to test for horizontal pleiotropy. No evidence of pleiotropic effects was found by conducting various sensitivity analyses; however, residual pleiotropy is difficult to exclude and should be considered.
The genetic instrument for LSI allows the entire large sample to be used for MR analysis without stratifying on smoking status. The analyses for CPD were restricted to ever-smokers because the CPD-associated SNPs were identified among ever-smokers in the GWAS, 21 which reduces statistical power to detect an association. Moreover, we should note that restricting to ever-smokers may induce a selection bias and invalidate the MR assumptions. By restricting to smokers, conditioning on smoking initiation can open up a path from the instrument (wPGS for CPD) to the outcome (breast cancer risk). It can also make the association between smoking and breast cancer risk appear weaker by removing the part of the association that is attributable to smoking initiation.
Another limitation is that our analysis was restricted to participants of European ancestry; therefore, our results may not apply to populations of other ethnicities. However, this restriction reduces the potential bias caused by population stratification.

Conclusion
In conclusion, this Mendelian randomisation analysis using both individual-level data and summary statistics supports a causal association between lifetime smoking exposure and breast cancer risk. Larger studies for MR analysis are warranted to address additional aspects of smoking behaviour.

ADDITIONAL INFORMATION
Ethics approval and consent to participate Collection of blood samples, urine samples and questionnaire information was undertaken with written informed consent and relevant ethical review board approval in accordance with the tenets of the Declaration of Helsinki (Supplementary Table 8).

Consent to publish Not applicable.
Data availability Availability of the lifetime smoking index GWAS is described by Wootton et al. 20 Availability of the self-reported cigarettes per day GWAS is described by Liu et al. 21 Individual genotyping data from BCAC will not be made publicly available due to restraints imposed by the ethics committees of individual studies; requests for data can be made to the Data Access Coordination Committee of BCAC
Nationale contre le Cancer, Agence Nationale de Sécurité Sanitaire, de l'Alimentation, de l'Environnement et du Travail (ANSES), Agence Nationale de la Recherche (ANR). The CGPS was supported by the Chief Physician Johan Boserup and Lise Boserup Fund, the Danish Medical Research Council, and Herlev and Gentofte Hospital. The CNIO-BCS was supported by the Instituto de Salud Carlos III, the Red Temática de Investigación Cooperativa en Cáncer and grants from the Asociación Española Contra el Cáncer and the Fondo de Investigación Sanitario (PI11/00923 and PI12/00070). The American Cancer Society funds the creation, maintenance, and updating of the CPS-II cohort. The CTS was initially supported by the California Breast Cancer Act of 1993 and the California Breast Cancer Research Fund (contract 97-10500) and is currently funded through the National Institutes of Health (R01 CA77398, UM1 CA164917 and U01 CA199277). Collection of cancer incidence data was supported by the California Department of Public Health as part of the statewide cancer reporting program mandated by California Health and Safety Code Section 103885. The University of Westminster curates the DietCompLyf database funded by Against Breast Cancer Registered Charity No. 1121258 and the NCRN. The coordination of EPIC is financially supported by the European Commission (DG-SANCO) and the International Agency for Research on Cancer. The national cohorts are supported by: Ligue Contre le 00061 to Granada, PI13/01162 to EPIC-Murcia, Regional Governments of Andalucía, and MRC core funding (MR_UU_12023).
The USRT Study was funded by Intramural Research Funds of the National Cancer Institute, Department of Health and Human Services, USA. Open Access funding enabled and organized by Projekt DEAL.