Chronic inflammation may play an important role in the pathogenesis of non-inflammatory diseases, such as breast cancer, from tumor initiation through progression1,2. Activation of innate immunity creates a tissue microenvironment high in reactive oxygen and nitrogen species, leading to potential DNA damage and alterations in nearby cells3,4,5. The inflammatory response also elevates the circulating levels of cancer-promoting inflammatory cytokines such as C-reactive protein (CRP) and interleukin-6 (IL-6)2. These key pro-inflammatory biomarkers reflect different molecular pathways in the immune cascade in acute and chronic immune responses but may be interrelated in carcinogenesis, yielding a congruent association with breast cancer risk. For example, IL-6, upregulated by macrophages and adipose tissue, promotes breast tumor initiation and progression6,7. CRP, a major acute-phase reactant and a biomarker of chronic low-grade inflammation, partially induced by IL-6, has been associated with increased risk of breast cancer8,9. The carcinogenic mechanisms of these markers are partially understood. IL-6 regulates aromatase activity responsible for estrogen production in adipose tissue, which is important in developing postmenopausal breast cancer10,11. CRP levels are attenuated by prolonged inhibition of cyclooxygenase-2 action (promoting estrogen formation in adipose tissue)11,12. Thus, IL-6 and CRP may be involved in inflammatory pathways connected to breast cancer tumorigenesis.

Given the relationships between those inflammatory markers and breast cancer risk, genetic variants involved in the biomarkers’ functional and structural regulation may be implicated in the causal pathway, affecting the risk of breast cancer. Previous genomic epidemiology studies of the associations between CRP/IL-6-related genome-wide genetic variants and breast cancer risk are limited and mostly showed null results13,14,15,16,17, while only a few reported a marginal effect on breast cancer risk6. The gene–phenotype pathway may not be connected to CRP and IL-6 alone, but may also be modulated by lifestyle pathways linked to obesity (overall and visceral)15,18,19,20,21,22,23,24,25, lipid metabolism25,26, high-fat diet, exercise, smoking, and alcohol18,27,28,29,30,31,32,33,34. Further, the inflammatory cytokines and the genetic markers have demonstrated different associations with breast cancer according to obesity16,35 and related lifestyle factors such as physical activity and dyslipidemia36,37,38. Thus, studying how those lifestyle factors modify and interact with gene and phenotype, leading to increased breast cancer susceptibility, may contribute to the understanding of the complex genotype–phenotype pathway and is important for developing a genetically targeted intervention tool for use in primary breast cancer prevention efforts.

In addition, immune-related etiologic pathways in breast cancer development may differ by menopausal status, probably due to the role of sex hormones in mediating the innate and adaptive immune systems. Our current study has focused on postmenopausal women who are vulnerable to a high incidence of inflammation39, obesity, and breast cancer (e.g., 80% of new cases occur in women age 50 years and older40,41). Using a large-scale postmenopausal women cohort from the Women’s Health Initiative Database for Genotypes and Phenotypes (WHI dbGaP) Study, we previously performed a genome-wide association (GWA) gene–environment (G × E) interaction study for CRP and IL-6 by addressing the pleiotropic effect of those biomarkers on the gene–phenotype relationship; we identified 88 top GWA single-nucleotide polymorphisms (SNPs)42. We have now extended the scope of modeled genetic factors by including 68 additional SNPs in relation to CRP and IL-6 from previous GWA studies that focused on European ancestry with independent replications20,21,43,44. We examined the association of those top GWA-based SNPs with primary invasive breast cancer risk overall and in obesity-related strata in which the SNPs were associated with CRP and IL-6 at genome-wide significance in our earlier GWA study42. This approach may allow us to elucidate an empirical pathway through which a substantial proportion of the susceptibility of GWA SNPs in CRP and IL-6 influences breast cancer risk through interactions with specific lifestyles (Figure S1).

In this study, we hoped to improve the predictability of breast cancer by better characterizing the genetic architecture of the inflammatory biomarkers that interact with lifestyle factors. We evaluated the GWA SNPs and 48 selected lifestyle factors together by conducting a two-stage multimodal random survival forest (RSF) analysis and ranked them according to their predictive value and accuracy for breast cancer. In addition, we applied a generalized multifactor dimensionality reduction (GMDR) model to characterize high-order gene–gene interactions and selected the best genetic prediction model45,46,47,48. Finally, with the most predictive SNPs and lifestyle factors selected via the RSF and GMDR, we constructed prediction models for breast cancer risk and estimated the combined and joint interaction effects of genotypes and lifestyles on the development of breast cancer. Ultimately, we tested the empirical hypothesis that the most-predictive genetic and lifestyle factors in combination increase the predictability of breast cancer risk in a synergistic manner.

Material and methods

Study population

Our study included healthy postmenopausal women enrolled in the WHI Harmonized and Imputed GWA Studies (GWASs), which were coordinated by dbGaP to contribute to a joint imputation and harmonization effort for GWASs within the 2 representative study arms, Clinical Trials and Observational Studies. The detailed study designs and rationale are described elsewhere49,50. Briefly, healthy women were enrolled in the WHI study between 1993 and 1998 at 40 clinical centers across the United States if they were 50–79 years old, postmenopausal, expected to stay near the clinical centers for at least 3 years after enrollment, and able to provide written informed consent. Participants were eligible for the WHI dbGaP study if they had met eligibility requirements for submission to dbGaP and provided DNA samples. The Harmonization and Imputation GWASs under the dbGaP study accession (phs000200.v12.p3) consist of 6 sub-studies (Table S1). Of the 16,088 women who reported their race or ethnicity as non-Hispanic white (Figure S2), in our earlier GWA G × E study, we applied the exclusion criteria (diabetes history; genetic data duplications; first- and second-degree relatives; and genetic quality control [QC] based on principal components), leaving 10,798 women. In the current study, we additionally excluded 619 women with < 1 year of follow-up and/or a diagnosis of any type of cancer at enrollment, leaving a total of 10,179 women (94% of the eligible 10,798 GWA participants). These women had been followed up through August 29, 2014, with a mean of 16 years of follow-up, and 537 of them had developed primary invasive breast cancer. The Institutional Review Boards of each WHI participating clinical center and the University of California, Los Angeles, approved this study. All methods were performed in accordance with the relevant guidelines and regulations.

Data collection and breast cancer outcome

The coordinating clinical centers conducted data quality assurance periodically and collected participant information through self-administered questionnaires. In this study, we initially selected 48 variables measured at screening for our analysis on the basis of (1) their association with inflammation and breast cancer through the literature review36,51,52,53,54 and (2) preliminary analyses including univariate and stepwise multiple regression analyses and a multicollinearity test. Those variables include demographic and socioeconomic factors (age, education, marital status, family income, and employment); family histories of breast and colorectal cancers and diabetes; medical histories (depressive symptoms, hypertension, high cholesterol, and cardiovascular disease); lifestyles (cigarette smoking and exercise); dietary factors (dietary energy, alcohol intake, total sugar, fiber, fruit, and vegetable consumption; % calories from protein, carbohydrates, saturated fatty acids [SFA], monounsaturated FA [MFA], and polyunsaturated FA [PFA]); and reproductive histories (history of hysterectomy, removal of one or both ovaries, ages at menarche and menopause, pregnancy, breast feeding, oral contraceptive (OC) use, and use of exogenous estrogen [E] only and E plus progestin [E + P]). We also included anthropometric variables, including height, weight, and waist and hip circumferences, which had been measured by trained staff.

The breast cancer outcomes were determined via a centralized review of medical charts by a committee of physicians on the basis of pathology or cytology reports. The time from enrollment to breast cancer development, censoring, or study end point was calculated and represented in years. Cancer cases were coded using the National Cancer Institute’s Surveillance, Epidemiology, and End-Results guidelines55.

Genotyping


We extracted genotyped data from the WHI dbGaP Harmonized and Imputed GWASs. Details of the data-cleaning process have been previously discussed42,56. Briefly, the genotypes were normalized to the reference panel GRCh37, and imputation was conducted via 1000 Genomes reference panels57. SNPs for harmonization were checked for pairwise concordance among all samples across the GWASs. The initial data QC included SNP filtering with a missing-call rate of < 2% and a Hardy–Weinberg equilibrium of p ≥ 1E–04. The second QC step included SNPs with \({\widehat{R}}^{2}\ge 0.6\) imputation quality58 but excluded individuals with a KING kinship estimate > 0.08859.
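The SNP-level QC thresholds stated above can be written as a single predicate. This is a minimal sketch for illustration only, not the pipeline used in the study: the `Snp` record, its field names, and the function name are hypothetical, and the sample-level kinship exclusion is not shown.

```python
from dataclasses import dataclass


@dataclass
class Snp:
    """Hypothetical per-SNP QC summary (field names are assumptions)."""
    rsid: str
    missing_call_rate: float  # fraction of samples without a genotype call
    hwe_p: float              # Hardy-Weinberg equilibrium p value
    imputation_r2: float      # estimated imputation quality (R-hat squared)


def passes_snp_qc(snp, max_missing=0.02, min_hwe_p=1e-4, min_r2=0.6):
    """Apply the three SNP filters quoted in the text:
    missing-call rate < 2%, HWE p >= 1E-04, imputation R^2 >= 0.6."""
    return (
        snp.missing_call_rate < max_missing
        and snp.hwe_p >= min_hwe_p
        and snp.imputation_r2 >= min_r2
    )


def filter_snps(snps):
    """Return the rsids of the SNPs that pass all filters."""
    return [s.rsid for s in snps if passes_snp_qc(s)]
```

The defaults mirror the values in the paragraph, so a caller only needs to override them when exploring a stricter or looser filter.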

Statistical analysis

Differences in participants’ baseline characteristics and allele frequencies by breast cancer development were examined with unpaired 2-sample t tests (for continuous variables) and chi-squared tests (for categorical variables). If continuous variables were skewed or had outliers, Wilcoxon’s rank-sum test was conducted. Our previous GWA analysis evaluated the gene–lifestyle interactions via stratifications defined by body mass index (BMI; cutoff, 30 kg/m²), waist circumference (WST; cutoff, 88 cm), waist-to-hip ratio (WHR; cutoff, 0.85), metabolic equivalents (METs; cutoff, 10 h/week), and % calories from SFA (cutoff, 9%). The results (G × E formal test and stratified analysis) from the sub-GWASs were combined in a meta-analysis assuming a fixed-effect model. In this study, we performed an association study of the 88 SNPs identified in subgroups by obesity and obesity-related lifestyle variables with breast cancer risk in the identical subgroups. The additional 68 SNPs from other GWA studies were pulled together overall and in subgroups for the purpose of analysis.
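As an illustration of the stratification and pooling steps, the sketch below encodes the five cutoffs and a fixed-effect combination of sub-study estimates. It rests on two assumptions: the inclusive or exclusive side of each cutoff follows the strata reported in the Results (for example BMI < 30 vs. ≥ 30, WHR ≤ 0.85 vs. > 0.85), and the fixed-effect model uses inverse-variance weights, which the text does not state. The function names are hypothetical.

```python
import math

# variable -> (cutoff, True if the cutoff value itself falls in the upper stratum)
STRATA = {
    "BMI": (30.0, True),   # < 30 vs >= 30 kg/m^2
    "WHR": (0.85, False),  # <= 0.85 vs > 0.85
    "WST": (88.0, False),  # <= 88 vs > 88 cm
    "MET": (10.0, True),   # < 10 vs >= 10 h/week
    "SFA": (9.0, True),    # < 9 vs >= 9 % of calories
}


def in_upper_stratum(variable, value):
    """Return True when the value falls in the upper stratum of the variable."""
    cutoff, inclusive = STRATA[variable]
    return value >= cutoff if inclusive else value > cutoff


def fixed_effect_pool(estimates, std_errors):
    """Inverse-variance fixed-effect pooling (assumed weighting scheme).

    Returns the pooled estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total
    return pooled, math.sqrt(1.0 / total)
```

The estimates passed to `fixed_effect_pool` would be on the scale on which they are approximately normal, such as regression coefficients or log hazard ratios.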

In the current study, we conducted the RSF analysis. The RSF initially generates bootstrap samples using approximately 63% of the original data and grows a tree from each sample via a splitting rule to maximize survival differences across daughter nodes. This tree-building process is repeated numerous times (n = 5000 in this study), creating a forest of trees60,61. An ensemble cumulative hazard estimate was calculated from each tree and averaged over all trees for each individual and used to compute a predicted cumulative breast cancer incidence rate. Also, using this ensemble estimate and creating the out-of-bag (OOB) data (about 37% of the original data not used for bootstrapping), the OOB concordance index (c-index) was estimated, which is a measure of prediction performance conceptually similar to the area under the receiver operating characteristic (AUROC) curve60,62. The rank of each variable was determined on the basis of its predictability for breast cancer according to 2 predictive parameters: (1) minimal depth (MD), in which variables that have a small MD and split the tree close to the root are considered highly predictive and (2) variable importance (VIMP), computed as the difference between the OOB c-indexes from the original OOB data and from the permuted OOB data, in which variables that have greater VIMP values are the more predictive63. Because they use different prediction algorithms, we expect the variables’ ranking to differ to some degree. The RSF, a machine-learning and nonparametric tree-based ensemble method, accounts for nonlinearity and high-order interactions among variables, which may not be handled by a traditional regression method63,64. The RSF may thus provide a more accurate risk estimation.
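Two quantities in this paragraph can be made concrete with a short sketch. The first function checks by simulation that a bootstrap sample of size n leaves out roughly 1 − 1/e ≈ 37% of the observations, which is the OOB share quoted above. The second is a simplified Harrell-type concordance index for right-censored data; it is an assumption-laden illustration and does not reproduce the tie handling of the randomForestSRC implementation. Function names are hypothetical.

```python
import random


def oob_fraction(n, seed=0):
    """Fraction of n observations never drawn in one bootstrap sample of size n."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}
    return 1.0 - len(drawn) / n


def harrell_c_index(times, events, risk_scores):
    """Simplified concordance index.

    A pair (i, j) is comparable when subject i has an observed event
    strictly before the follow-up time of subject j. The pair is concordant
    when the subject with the earlier event has the higher predicted risk;
    tied risk scores count as half."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking; the prediction error reported for a forest is commonly taken as 1 minus this index.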

We performed a 2-stage RSF analysis (Figure S3). In the first stage, we implemented an RSF on SNPs and lifestyle factors separately. Only those SNPs and lifestyle factors with distinctly low MD and high VIMP values were carried over into the second stage. In that second stage, we took a multimodal approach overall and in subgroups (by BMI, WHR, WST, MET, and SFA) by (1) comparing MD and VIMP measures in the plot, (2) computing the OOB c-index from the nested RSF model, and (3) estimating the incremental error rate of each variable in the nested sequence of RSF models from the top variable and calculating a dropping error rate. This RSF multimodal approach enabled us to exclude from the outset the SNPs and lifestyle factors that were not significantly associated with breast cancer, leading to increased statistical power and a corrected type I error rate compared with the original RSF model61.
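Step (3) can be sketched as follows, under the assumption that the dropping error rate of a variable is the decrease in prediction error when that variable is added to the nested model containing all higher-ranked variables. The baseline error for a model with no predictors is passed in by the caller, since the text does not specify it; the function name is hypothetical.

```python
def dropping_error_rates(nested_errors, baseline_error):
    """Error-rate drop attributable to each variable in a nested sequence.

    nested_errors[k] is the prediction error of the model built from the
    top k + 1 ranked variables. baseline_error is the error before any
    variable is added (an assumption supplied by the caller)."""
    drops = []
    previous = baseline_error
    for error in nested_errors:
        drops.append(previous - error)
        previous = error
    return drops
```

Variables with a drop near zero add little beyond the higher-ranked ones, which matches the way the top variables are separated from the rest in the second stage.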

Further, we applied a GMDR model that is described in detail elsewhere45,46,47. The GMDR reduces high-dimensional multifactor prediction to a single dimension by the ratio of high vs. low risk, and thus detects the best gene–gene interaction model. It produces key predictability performance measures, including testing balance accuracy (TBA), cross-validation consistency (CVC), and sign p value. The model with the highest TBA, CVC 10/10, and p < 0.05 based on 1000-times permutation testing was considered the best model.
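The dimensionality-reduction idea can be illustrated with a simplified sketch. It follows the classic case/control-ratio form of multifactor dimensionality reduction rather than the generalized score statistic of the GMDR, so it is an assumption-based stand-in and not the software used here. Each multi-locus genotype combination (a cell) is labelled high or low risk, and balanced accuracy is the mean of sensitivity and specificity. Function names are hypothetical.

```python
from collections import defaultdict


def label_high_risk_cells(genotypes, status):
    """Label a genotype cell high risk when its case:control ratio exceeds
    the overall case:control ratio of the data.

    genotypes: list of tuples, one multi-locus genotype per subject.
    status: list of 1 (case) or 0 (control)."""
    n_cases = sum(status)
    n_controls = len(status) - n_cases
    if n_cases == 0 or n_controls == 0:
        raise ValueError("need at least one case and one control")
    threshold = n_cases / n_controls
    counts = defaultdict(lambda: [0, 0])  # cell -> [cases, controls]
    for cell, is_case in zip(genotypes, status):
        counts[cell][0 if is_case else 1] += 1
    return {
        cell
        for cell, (cases, controls) in counts.items()
        if controls == 0 or cases / controls > threshold
    }


def balanced_accuracy(genotypes, status, high_cells):
    """Mean of sensitivity and specificity for the high/low risk labelling."""
    tp = fn = tn = fp = 0
    for cell, is_case in zip(genotypes, status):
        predicted_high = cell in high_cells
        if is_case:
            tp += predicted_high
            fn += not predicted_high
        else:
            fp += predicted_high
            tn += not predicted_high
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one case and one control")
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))
```

In a cross-validation setting the cells would be labelled on the training folds and scored on the held-out fold, which is why the labelling and the scoring are kept as separate functions.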

Multiple Cox proportional hazards regressions, with a test of proportional hazards via a Schoenfeld residual plot and ρ evaluation, were conducted to obtain hazard ratios (HRs) and 95% confidence intervals (CIs) for the single and combined effects of SNPs and lifestyle factors on breast cancer, with adjustment for covariates (Table 1). A 2-tailed p value < 0.05 was considered statistically significant, and multiple comparisons were adjusted by the Benjamini–Hochberg method65. GMDR v.1.0 and R v.3.5.2 (survival, survivalROC, randomForestSRC, ggRandomForests, gamlss, ggsurvplot, and forestplot packages) were used.
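For readers unfamiliar with the Benjamini–Hochberg adjustment, a minimal sketch of the step-up procedure follows. It returns adjusted p values that can be compared directly with the 0.05 threshold; the function name is hypothetical, and the analysis itself was run in R.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p values, returned in the input order.

    Each p value is scaled by m / rank (rank 1 = smallest), and a running
    minimum taken from the largest rank downward keeps the adjusted values
    monotone in the original p values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted
```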

Table 1 Characteristics of participants, stratified by breast cancer.

Results


The allele frequencies of 156 GWA CRP/IL-6-related SNPs and baseline characteristics of participants are displayed in Tables S1 and 1, respectively. Breast cancer patients had relatively higher education, greater family income, and family history of diabetes and breast cancer, smoked more cigarettes/day, consumed more dietary alcohol/day, and were more depressed, obese both overall and viscerally, and taller. They also tended to experience early menarche and late menopause and had less history of hysterectomy and shorter duration of OC and E-only use, but longer duration of E + P use.

Two-stage multimodal RSF and GMDR approach

With the 156 GWA SNPs and 48 lifestyle factors, we performed the two-stage RSF and GMDR (Figure S3) to determine the most predictive variables with the highest predictability and lowest prediction error for breast cancer risk. In the first stage, we estimated 2 predictability performance measures, MD and VIMP. For lifestyles and SNPs separately, we created a plot to compare those 2 measures and identified the strongest predictive lifestyle and genetic factors that were in agreement with high ranks (Figure S4) in overall analysis: 12 of 48 lifestyles and 13 of 156 SNPs. We further conducted the first stage of RSF for SNPs in the subgroups, which yielded the following results: 8 and 13 of 117 SNPs (BMI < 30 and ≥ 30, respectively); 14 and 7 of 70 SNPs (WHR ≤ 0.85 and > 0.85, respectively); 10 and 6 of 81 SNPs (WST ≤ 88 and > 88, respectively); 7 and 12 of 82 SNPs (METs ≥ 10 and < 10, respectively); and 19 and 12 of 116 SNPs (SFA < 9 and ≥ 9, respectively). All of the SNPs identified in this first stage of RSF were associated with CRP.

Next, with the 12 lifestyles and selected SNPs together, overall and in subgroups, we conducted the second multimodal RSF to construct risk profiles with the most predictive variables. In the overall group, we first computed the 2 measures MD and VIMP (Table 2) and compared them in a plot (Fig. 1A), in which a dashed red line represents agreement of the 2 measures. High ranks on both measures indicated 5 SNPs (SALL1 rs10521222; HLA-DQA1 rs9271608; DUSP1 rs17658229; APOC1 rs4420638; and TRAIP rs2352975) and 3 lifestyles (duration of OC and E + P use and BMI) as the most influential variables for breast cancer. Second, we estimated the c-index (analogous to the AUROC) from the nested RSF model (Table 2) and plotted it (Fig. 1B) with variables ranked by MD, identifying the same set of top variables (5 SNPs and 3 lifestyles). Those top variables substantially improved the c-index prediction accuracy, whereas others did not, suggesting that the c-index has complementary prediction ability. Last, we computed a dropping error rate for each variable in the nested sequence of RSF models (Table 2) and once again identified the same top 8 variables as the strongest contributors to reducing the error rate, thus improving the prediction accuracy. Further, using the GMDR method, we determined the best gene-by-gene interaction models up to 5 orders of interactions (Table 3), of which the one-factor model including TRAIP rs2352975 was the most predictive, with the highest TBA of 0.5382 and CVC of 10/10 (p < 0.001).

Table 2 The second stage of random survival forest analysis: predictive value of variable for breast cancer in overall analysis.
Figure 1

Overall analysis: the second stage of random survival forest (RSF) with 13 single-nucleotide polymorphisms and 12 behavioral factors selected from the first stage of RSF analysis. (A) Comparing minimal depth and VIMP rankings. The 8 variables within the gold ellipse were identified as the most influential predictors. (B) Out-of-bag concordance index (c-index). Improvement in the out-of-bag c-index was observed when the top 8 variables (filled black circles) were added to the model, whereas the other variables (open circles) did not further improve the accuracy of prediction. BMI, body mass index; E + P, exogenous estrogen + progestin; VIMP, variable importance.

Table 3 GMDR-based model for high-order gene–gene interactions in relation to breast cancer risk.

For each of the obesity strata (BMI, WHR, WST, MET, and SFA), we likewise applied those multimodal (Tables S2.1–10 and Figures S5–9) and GMDR (Table 3) approaches and determined the strongest predictive markers, the most common being 6 SNPs (TRAIP rs2352975, DUSP1 rs17658229, HLA-DQA1 rs9271608, SALL1 rs10521222, HNF1A-AS1 rs2243616, and APOC1 rs4420638) and 5 lifestyle factors (dietary alcohol intake, E + P and OC use, BMI, and hip circumference).

Combined and joint effects of the most influential SNPs and lifestyles on breast cancer risk

By accounting for confounding factors and the nonlinearity of each variable via the RSF method, we estimated the predicted cumulative incidence rate of breast cancer (Fig. 2). The genotypes of each SNP were originally continuous variables and were then categorized for further analysis with the following risk genotypes (Fig. 2A–E): TRAIP rs2352975 CT + TT, DUSP1 rs17658229 CC, HLA-DQA1 rs9271608 GG, SALL1 rs10521222 TT, and APOC1 rs4420638 GG. Also, by using a cutoff value bisecting each variable (Fig. 2F–I), high-risk lifestyle groups were defined as ≥ 18 g/day of alcohol consumption, ≥ 10 years of E + P use, < 5 years of past OC use, or BMI ≥ 30 and were further analyzed as binary variables. With the best predictive GMDR-modeled SNPs and risk lifestyles overall and in subgroups, we developed multivariate models for breast cancer risk (Table S3). These results suggested a stronger individual effect of some SNPs than of the remaining SNPs and lifestyles on breast cancer risk, even after accounting for confounding factors.
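The risk coding described above can be sketched as a pair of lookup functions. The risk genotypes and lifestyle cutoffs are taken from the paragraph; the input format (two-letter genotype strings keyed by rsID), the dictionary keys, and the function names are assumptions made for illustration.

```python
# Risk genotypes as listed in the text, keyed by rsID.
RISK_GENOTYPES = {
    "rs2352975": {"CT", "TT"},  # TRAIP
    "rs17658229": {"CC"},       # DUSP1
    "rs9271608": {"GG"},        # HLA-DQA1
    "rs10521222": {"TT"},       # SALL1
    "rs4420638": {"GG"},        # APOC1
}


def count_risk_genotypes(genotypes):
    """Count the SNPs at which a subject carries the risk genotype.

    genotypes maps rsID -> two-letter genotype; allele order is ignored,
    so "TC" and "CT" are treated alike. Unknown rsIDs are skipped."""
    count = 0
    for rsid, genotype in genotypes.items():
        normalized = "".join(sorted(genotype.upper()))
        if normalized in RISK_GENOTYPES.get(rsid, set()):
            count += 1
    return count


def risk_lifestyles(alcohol_g_day, ep_years, oc_years, bmi):
    """Binary high-risk flags using the cutoffs quoted in the text."""
    return {
        "alcohol": alcohol_g_day >= 18,  # >= 18 g/day
        "ep_use": ep_years >= 10,        # >= 10 years of E + P use
        "oc_use": oc_years < 5,          # < 5 years of past OC use
        "bmi": bmi >= 30,                # BMI >= 30
    }
```

For example, `sum(risk_lifestyles(20, 0, 6, 31).values())` gives the number of risk lifestyles for one subject, which is the quantity the combined analyses build on.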

Figure 2

Cumulative breast cancer incidence rate for the 9 most influential variables (5 SNPs and 4 behavioral factors) based on random survival forest analyses. (E + P, exogenous estrogen + progestin; SNPs, single-nucleotide polymorphisms. Dashed red lines indicate 95% confidence intervals).

The SNPs and lifestyles, when combined or jointly associated, displayed different patterns of breast cancer risk. In particular, in the overall non-obese (BMI < 30) group (Table 4), the best predictive SNPs and lifestyles were first combined separately. When stratified by alcohol intake, high alcohol consumers (≥ 18 g/day) who had the maximum number of risk genotypes had 4 times the risk of breast cancer compared with low alcohol consumers (< 18 g/day) who had fewer or no risk genotypes. Consistently, high alcohol consumers with one or more risk lifestyles had 3 times the risk of low alcohol consumers with a null-risk lifestyle. When SNPs and lifestyles were combined, compared with the lowest-risk group (null risk for genotypes and lifestyles), the moderate-risk (high risk of either genotypes or lifestyles) and the highest-risk groups (high risk of both genotypes and lifestyles) had about 3 times and 6 times greater risk, respectively, suggesting a gene–lifestyle dose–response relationship. Further, when stratified by alcohol consumption, high alcohol consumers with high risk of both genotypes and lifestyles had 10 times the risk of low alcohol consumers with low risk of both genotypes and lifestyles. This indicates a significant joint effect of alcohol intake with the SNPs and lifestyles on breast cancer risk in an additive model (G × E: HR = 1.15, p = 0.547). Multiple testing was corrected to control the false-discovery rate. The analyses of the non-viscerally obese (WHR ≤ 0.85) group (Table 4) yielded similar results but with stronger combined and joint effects of risk genotypes and lifestyles with alcohol intake on breast cancer risk in both additive and multiplicative models (G × E: HR = 1.37, p = 0.253).
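The three-level combined risk grouping used in these comparisons can be written out explicitly. The sketch assumes, as described for the joint analyses, that the genotype score is high only when a subject carries every risk genotype in the model and that the lifestyle score is high when at least one risk lifestyle is present. Function names are hypothetical.

```python
def genotype_risk(n_risk_genotypes, n_snps_in_model):
    """1 (high) when all risk genotypes in the model are carried, else 0."""
    return 1 if n_risk_genotypes == n_snps_in_model else 0


def lifestyle_risk(n_risk_lifestyles):
    """1 (high) when one or more risk lifestyles are present, else 0."""
    return 1 if n_risk_lifestyles >= 1 else 0


def combined_risk_group(n_risk_genotypes, n_snps_in_model, n_risk_lifestyles):
    """0 = low on both, 1 = high on either, 2 = high on both."""
    return (
        genotype_risk(n_risk_genotypes, n_snps_in_model)
        + lifestyle_risk(n_risk_lifestyles)
    )


def lifestyle_count_category(n_risk_lifestyles):
    """Number of risk lifestyles collapsed to 0, 1, or 2 (2 or more)."""
    return min(n_risk_lifestyles, 2)
```

Group 0 serves as the reference in the comparisons above, with groups 1 and 2 corresponding to the moderate-risk and highest-risk groups.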

Table 4 Stratification analysis by BMI and WHR: joint effect of dietary alcohol intake with combined risk genotypes and behavioral factors on breast cancer risk.

We further evaluated the combined effect of SNPs and lifestyle factors and their joint effect with E + P use on breast cancer risk (Table S4) and determined that the risk genotypes and lifestyles, both separately and in combination, had a synergistic effect with longer use of E + P (≥ 10 years) on cancer risk. This pattern appeared more strongly in obesity strata (BMI, WHR, MET, and SFA) than in the overall group (Fig. 3).

Figure 3

Forest plot of the joint effect of E + P use with risk behavioral factors and genotypes on breast cancer risk overall and in subgroups (A BMI < 30 and WHR ≤ 0.85; B MET ≥ 10 and SFA ≥ 9). The plot shows the independent and combined effect of risk behaviors and genotypes on breast cancer risk, jointly testing with E + P use, presented as the 95% CIs (indicated with red lines) and the estimates (proportional to the size of the blue squares). BMI, body mass index; CI, confidence interval; E + P, exogenous estrogen + progestin; HR, hazard ratio; MET, metabolic equivalent; SFA, saturated fatty acids; WHR, waist-to-hip ratio. * The combined number of risk genotypes and behavioral factors was based on risk genotypes defined as 0 (low risk: none or < total number of risk alleles) and 1 (high risk: all risk alleles combined) and on behavioral factors defined as 0 (low risk: null risk behavior) and 1 (high risk: 1 or more risk behaviors). The ultimate number of risk genotypes combined with behavioral factors was defined as 0 (low risk for genotypes and behaviors), 1 (high risk for either genotypes or behaviors), and 2 (high risk for both genotypes and behaviors). ** The number of behavioral factors was defined as 0 (null risk behavior) vs. 1 (1 risk behavior) vs. 2 (2 or more risk behaviors).

Discussion


An increasing number of population-based cancer genomic studies have incorporated environmental factors in the molecular causal pathway. Comprehending how lifestyle factors interact with genes and phenotypes, influencing risk for breast cancer, is important for constructing improved risk profiles, leading to the development of a gene–lifestyle combination intervention for primary cancer prevention efforts. Our 2-stage multimodal RSF and GMDR analyses identified the strongest predictive genetic and lifestyle variables overall and in obesity strata. The genetic effects in this study were associated with the SNPs involved in inflammatory cytokine pathways. The most common markers for breast cancer risk across the strata are 2 SNPs related to CRP (SALL1 rs10521222 and HLA-DQA1 rs9271608) and, consistent with previous studies66,67,68, 5 lifestyle factors: alcohol intake, lifetime cumulative exposure to estrogen (past OC and E + P use), and overall and visceral obesity. The risk profiles that combined those influential variables presented a synergistic effect on the increased risk for breast cancer in a gene–lifestyle dose-dependent manner.

One SNP near SALL1, in relation to CRP, both overall and in the obesity strata, is associated with breast cancer risk. SALL1 is a member of the SALL gene family, encoding a multiple zinc-finger transcription repressor that regulates organogenesis and development of embryonic stem cells69,70,71. The role of the SALL genes (particularly SALL2 and SALL4) in tumorigenesis has recently been investigated as a tumor suppressor for ovarian and Wilms’ tumors72,73, hepatoblastoma, and gastric carcinoma74,75. However, the function of SALL1 in cancer development has not been determined. A few recent studies using an in vivo RNAi screen and in vivo/in vitro breast cancer cells have implicated SALL1 as a tumor suppressor in breast cancer, acting by inhibiting cancer cell growth and proliferation and inducing cell-cycle arrest through the Nucleosome Remodeling and Deacetylase network76 or by regulating CDH1, a contributor to epithelial-to-mesenchymal transition77. Our finding of the SALL1 SNP’s association with CRP at the GWA level and with breast cancer risk is supported by these previous biologic studies and further suggests the involvement of SALL1 in immune mechanisms of breast cancer tumorigenesis.

HLA-DQA1 belongs to the human leukocyte antigen (HLA) class II alpha chain paralogues, which increase immune system sensitivity by distinguishing its own proteins from foreign invaders78,79. HLA class II, the human version of the major histocompatibility complex (MHC) class II, regulates the antitumoral cellular immune response by presenting MHC antigen in tumor cells to the immune system, stimulating tumor infiltration of CD4 + T cells80,81,82. Several previous studies reported that the SNPs of HLA class II have implications in the carcinogenesis of specific cancers (e.g., ovarian83, squamous cell lung84, gastric85, and esophageal86 cancers), but limited studies in association with breast cancer have been conducted and were restricted to subjects other than Caucasians; further, the results were inconsistent80,81 or null82. Our study is the first to report the association of the HLA-DQA1 SNP with breast cancer risk in non-Hispanic white women, suggesting that HLA class II plays a decisive role in the pathogenesis of breast cancer in this population by diminishing the efficacy of the antitumoral immune response. Also, this association would have been missed without the incorporation of obesity factors, which calls for further study of the biologic mechanism.

A number of epidemiologic studies have revealed that alcohol intake, even of a small amount (e.g., ≤ 1 drink [moderate]/day), can increase breast cancer risk in both pre- and post-menopausal women66,87,88,89,90. Notably, in postmenopausal women, few studies have examined the combined and joint effect of alcohol intake with other lifestyles66,67,68 or relevant genetic variants91,92 on breast cancer risk; in particular, the gene–lifestyle study results did not support a significantly increased risk among women who carried specific risk genotypes and had higher alcohol intake91,92. Molecular biologic mechanisms of alcohol-associated tumorigenesis in breast cancer may involve complicated pathways: an elevated level of estrogen by testosterone conversion; an increased level of insulin-like growth factors from the liver due to alcohol consumption93,94; and disruption of folate metabolism95. Also, acetaldehyde, derived from the metabolism of ethanol, is a carcinogenic metabolite that causes formation of DNA adducts and inhibits DNA repair and methylation patterns90,96. Further, high and regular alcohol intake may lead to a dietary deficiency of essential nutrients, making individuals susceptible to tumorigenesis90. Corresponding to this alcohol-response tumorigenic environment, and supported by previous research66, our study showed that more than moderate alcohol intake, jointly with the risk SNPs, substantially elevated the risk of breast cancer synergistically; and this synergistic effect occurred more strongly in the non-obese subgroups.

Another influential lifestyle factor in our study is the opposed E + P use that contributes to the lifetime cumulative exposure to estrogen. Synthetic progestin is a well-established risk factor for breast cancer97,98,99, with an affinity for androgen and mineralocorticoid receptors, leading to cell proliferation and anti-apoptosis97,100. Further, the joint effect of E + P use with the SNPs was profound in the non-obese subgroups, suggesting complementary pathways of sex hormones and obesity (i.e., the effect of sex hormones maximized in non-obese individuals with relatively lower hormone levels).

The amounts of daily dietary alcohol intake were obtained from self-reported food frequency questionnaires and were then validated as being highly correlated with 1 month of food-diary records (r = 0.9)101. In addition, we confined our study population to non-Hispanic white postmenopausal women, limiting the generalizability of our study findings to other populations. Due to insufficient statistical power, we were unable to investigate the molecular subtypes of breast cancer. Despite several benefits of the 2-stage RSF multimodal and GMDR approaches, they can overfit the model owing to complicated analysis tasks, particularly in relatively small subgroups, so our results need to be replicated in an independent study with a large sample size.

Overall, in this study, the SNPs in proinflammatory cytokines previously identified as genome-wide significant had a synergistic effect on breast cancer risk by combining with lifestyle factors, including alcohol intake, lifetime cumulative exposure to estrogen, and obesity. Our findings warrant molecular biologic studies such as gene signature and aberrant cell signaling in relation to breast cancer in postmenopausal women who have a history of alcohol intake and estrogen use by different levels of obesity and related lifestyles. Our study may contribute to improved prediction accuracy and the ability to assess breast cancer risk, and suggest potential interventions for women who carry the risk genotypes, such as partial or absolute abstinence from alcohol intake, shorter duration of hormone therapy, and better weight control, potentially leading to an improved impact on the epigenetic aberrations and thus reducing the risk of breast cancer.