Introduction

Breast cancer risk has been reported to be consistently lower among celiac disease patients, ranging from being 10-15% lower in Nordic studies1,2,3,4 to as much as 50-80% lower in other European studies with smaller sample sizes5,6,7,8. Celiac disease is a lifelong gastrointestinal disease characterized by villous atrophy and inflammation in the small intestine9. It occurs in about 1% of the Caucasian population and is triggered by gluten exposure10. There is little in literature to clarify why a diagnosis of celiac disease confers protection against breast cancer. Explanations that have been forwarded include a lower body mass index and lower estrogen exposure both as a consequence of celiac disease; for example, characteristics secondary to undernutrition such as later menarche and earlier menopause commonly observed at high frequencies among women with celiac disease are associated with decreased risk of breast cancer2. A third hypothesis involves the role of immunogenic factors in breast cancer development and progression11. Inverse relationships have also been observed between breast cancer and other inflammatory disorders such as ulcerative colitis and rheumatoid arthritis, which is consistent with a possible involvement of the immune system in the etiological pathway of breast cancer2. Others have also shown that the interplay between hormonal and immune-related mechanisms can shape mammary tissue development12.

Both breast cancer and celiac disease have strong genetic components. Heritability is estimated to be between 25 to 31%13, 14 and 68 to 75%15 for breast cancer and celiac disease, respectively. Heritability of complex diseases such as celiac disease and breast cancer is highly polygenic, which means that it is controlled not just by one gene, but rather, by multiple genes16. For example, genome-wide association studies (GWAS) interrogating upwards of ~200,000 single nucleotide polymorphisms (SNPs) have identified ~40 loci associated with celiac disease17, and more than twice as many (107 loci) for breast cancer18. Studies performed on different phenotypes have shown that certain genetic loci can be associated with seemingly distinct traits, otherwise known as pleiotropy19. It is also known that many diseases and traits exhibit significant coheritability20 and have shared genetic components21, and such genetic associations often reveal clues about novel mechanisms and pathways.

With the emergence of large-scale GWAS studies, it is timely to leverage on the data collected from international consortia to investigate possible causal trait relationships22 using clinical and epidemiological data as a guide23,24,25. Motivated by evidence of an association between celiac disease and breast cancer from epidemiologic studies, the aims of this study are to test whether these findings could be due to shared genetic determinants, and to elucidate what mechanisms are potentially responsible for the common connection.

Results

Inverse genetic correlation and genetic overlap between breast cancer and celiac disease

Genetic correlation is indicative of shared genetic etiology. It refers to common genetic variation associated with a pair of phenotypic traits, assuming an additive model. Given the epidemiological association between breast cancer and celiac disease, as well as their strong genetic heritabilities, we used GWAS summary statistics for each disease to estimate genetic correlation. For breast cancer, data was obtained from the Collaborative Oncological Gene-Environment Study (COGS) consortia (http://www.cogseu.org/). The study includes individuals of European ancestry genotyped on a custom Illumina iSelect Array (iCOGS)26, which comprises 211,155 SNPs. iCOGS was designed to understand genetic susceptibility of three hormone related cancers: breast, ovarian, and prostate. As breast cancer is a heterogeneous disease, we also included two major subtypes of breast cancer based on estrogen receptor (ER) status. In the COGS, analyses had been conducted on 46,785 breast cancer cases and 42,892 controls to estimate breast cancer risk overall, 27,078 ER-positive cases and 42,111 controls for ER-positive breast cancer risk, and 7,333 ER-negative cases and 42,468 controls for ER-negative breast cancer risk18. Summary statistics for celiac disease (133,352 SNPs) were downloaded from the ImmunoBase (https://www.immunobase.org/), a web based resource focused on the genetics and genomics of immunologically related human diseases. Celiac disease data have been reported in a GWAS study by Trynka et al.17 on 12,041 celiac disease cases and 12,228 controls of European ancestry using the Illumina Infinitum High-Density array (ImmunoChip), interrogating 195,806 SNPs and designed to target immune associated genome regions27. To improve the comparability between the two datasets, we used summary statistics for 173,301 SNPs in the iCOGS study imputed against the 1000 Genomes Project (1KG) March 2012 release reference panel18, which were also present on the ImmunoChip. The datasets were matched based on chromosome and SNP base pair positions, which resulted in 129,618 celiac disease SNPs used as input.

Genetic correlation between breast cancer and celiac disease was analyzed using LD score regression (LDSC)28 to model effect size estimates for immunogenic SNPs in both diseases. Therefore, rather than studying cross-correlation, we are studying the role of celiac disease risk loci in breast cancer susceptibility. Given the observed reduced risk of breast cancer in celiac disease patients, we would expect an inverse genetic correlation. After LDSC filtering procedure and merge with reference LD scores (Supplementary Table 1), genetic correlation analyses included 45,451, 45,451 and 45,447 matching SNPs between the celiac disease and breast cancer overall, ER-positive and ER-negative datasets, respectively. Significant inverse genetic correlations (r) were found between celiac disease and overall breast cancer (r = −0.17, s.e. 0.05, P = 0.0005) and ER-positive breast cancer (r = −0.15, s.e. 0.06, P = 0.01), but not for ER-negative breast cancer (r = −0.03, s.e. 0.07, P = 0.71) (Fig. 1).

Figure 1
figure 1

Genetic correlation using LD score regression. Plot is based on LDSC regression coefficients for each comparison pair (e.g. BC vs CD, BCER-negative Vs CD, BCER-positive Vs CD, etc.). Subdiagonal cells indicate the respective correlation coefficient and (standard error). Square size and color are scaled according to the correlation coefficient r. Traits are paired using hierarchical cluster analysis. #Number of overlapping SNPs included in the LDSC regression.

Further interrogation of shared common genetic components between the two diseases was carried out using SNP effect concordant analysis (SECA)29, where SNP effect size estimates were tested for concordant or discordant effects, analogous to genetic correlation tested with LDSC. Additionally, SECA also assesses the extent of genetic overlap (enrichment of overlapping SNPs between the two traits with low P-values). For each dataset pair comparison, SECA aligned and selected 15,365, 15,400, 15,428 independent SNPs that are common between the celiac disease and overall, ER-positive, and ER-negative breast cancer datasets, respectively (Supplementary Table 2). In the primary analyses which are summarized in Supplementary Figure 1, genetic overlap (defined as excess of celiac disease SNPs in overlap with breast cancer datasets) was significant between celiac disease and breast cancer overall (P BT-permuted = <0.001), ER-positive (P BT-permuted < 0.001), and ER-negative (P BT-permuted < 0.001). SNP effect discordance (inverse correlation) between celiac disease and breast cancer overall (P FT-permuted = 0.059), ER-positive (P FT-permuted = 1.000), and ER-negative (P FT-permuted = 0.278) did not reach significance. In order to determine size of the association, SECA identified a subset of overlapping SNPs yielding the most significant correlation, namely, minimum concordance or discordance. This analysis was carried out studying the association between celiac disease and breast cancer overall (ORFT-min, 0.60, 95% CI 0.44–0.82, P FT-min = 0.001), ER-positive (ORFT-min, OR, 0.86, 95%CI, 0.74–1.00 P FT-min = 0.05), and ER-negative (ORFT-min, OR, 0.73, 95%CI, 0.57–0.95 P FT-min = 0.02) (Table 1). After adjusting for multiple testing, only the association for overall breast cancer (which showed discordance) remained significant (P FTmin-permuted = 0.022).

Table 1 Inverse genetic correlation for the SNPs subset yielding most significant association (minimum discordance).

Higher celiac disease genetic susceptibility associated with decreased breast cancer risk

In a third approach, we tested whether top SNPs for celiac disease (CD-SNPs) could predict breast cancer status in a group of women. Genetic susceptibility to celiac disease was summarized using polygenic risk scores (celiac-PRS). In essence, celiac-PRS accounts for the genetic susceptibility of an individual based on the risk allele load weighted by the SNP effect sizes reported by the celiac disease GWAS, under an additive model. Celiac-PRS was analyzed as an exposure variable in a case-control study comprised of 5,002 breast cancer cases and 5,433 controls from the pKARMA cohort. Celiac-PRS was found to be inversely associated with overall and ER-positive breast cancer risk in a dose-dependent manner (P-trend < 0.02) (Table 2). Celiac-PRS based on 199 genome-wide significant CD-SNPs was associated with 6% lower risk of overall (ORper-SD, 0.94, 95% CI 0.91 to 0.98, P = 0.002) and ER-positive breast cancer (ORper-SD, 0.94, 95% CI 0.90 to 0.98, P = 0.004), and 2% for ER-negative breast cancer (ORper-SD, 0.98, 95% CI 0.90 to 1.06, P = 0.54). The risk was 13% lower in individuals with the highest genetic susceptibility to celiac disease (4th celiac-PRS quartile compared to 1st quartile) for both overall and ER-positive breast cancer risk (overall: ORQ4, 0.87, 95% CI 0.78 to 0.97, P = 0.016; ER-positive: ORQ4, 0.87, 95% CI 0.77 to 0.98, P = 0.022). The risk was up to 17% lower in women with highest susceptibility when the celiac-PRS included 3,803 SNPs nominally associated with celiac disease (i.e. P CD < 0.05) (overall: ORQ4, 0.83, 95% CI 0.75 to 0.93, P = 0.001; ER-positive: ORQ4, 0.83, 95% CI 0.74 to 0.93, P = 0.002). As expected under a polygenic model, including more CD-SNPs in the profiles improved the strength of the association with breast cancer risk (yielding smaller P-values) (Fig. 2). In a case-only study, no significant difference was observed within tumor subgroups defined by ER-status, lymph node involvement status, HER2 status, tumor grade, and tumor size (Supplementary Table 3).

Table 2 Association of celiac-PRS profiles with breast cancer risk.
Figure 2
figure 2

Significance of celiac-PRS profiles for association with overall breast cancer risk. Profiles are based on CD-SNPs under given P CD thresholds. P-values are transformed into negative logarithmic scale base 10. Red solid line denotes threshold for a nominally significant association (P < 0.05).

Candidate immune response genes and pathways underlying the association between celiac disease and breast cancer

To identify celiac disease genes and molecular pathways involved in breast cancer susceptibility, we performed enrichment analysis using Data-driven Expression Prioritized Integration for Complex Traits (DEPICT)30. The method assumes that the selected loci surpass the genome-wide significant threshold (P < 5.00E-08). However, since genome-wide significant variants did not determine the genetic overlap between breast cancer and celiac disease, we changed the threshold to include variants with moderate signals. From the 15,365 independent SNPs defined in the SECA analysis, we selected SNPs which had suggestive association with celiac disease (118 SNPs under P CD < 1.00E-05). Given that the genetic overlap was limited to immune-related genomic regions genotyped for celiac disease, we ranked overlapping SNPs by their evidence of association (by P-values) relative to the overlap with breast cancer to reduce the possibility that the prioritized genes could be due to chance (sensitivity analysis). SNPs were divided into two SNP subsets, P BC ≤ 0.05 and P BC > 0.05, containing 52 and 66 SNPs respectively. If the loci were truly important for the genetic association between celiac disease and breast cancer, findings based on second SNP subset would be less reliable. Hits derived from SNPs not nominally associated with breast cancer were therefore removed from the initial findings. The SNPs with smallest breast cancer P-value (P BC < 0.0001) were rs114762590 and rs115258774, both of which were significantly associated with celiac disease surpassing the genome-wide significant threshold and had opposite direction to breast cancer. Summary statistics for the top 52 SNPs are shown in Supplementary Table 4.

Genes mapping onto the 52 ‘top’ overlapping SNPs were significantly overrepresented in 21 biological processes (FDR < 0.05). After the sensitivity analysis (yielding 600 significant processes at FDR < 0.01), 14 processes remained as unique hits and were mainly related to induction of programmed cell death, MAP3K7 cytokine-activated transduction pathway, other signaling protein-protein interaction subnetworks, myeloid leukocyte differentiation, as well as decreased cell number of leukocytes and lymphocytes (Supplementary Table 5). To identify genes most likely underlying these biological processes, DEPICT identified 15 prioritized genes that were deemed most relevant based on their probability to enrich for the same biological processes as other candidate genes (i.e. redundant genes were removed, FDR < 0.05) (Table 3). We found 13/15 genes to be exclusively prioritized from the analysis of 52 ‘top’ SNPs associated with celiac disease and overlapping with breast cancer at P BC ≤ 0.05 (Supplementary Figure 2).

Table 3 DEPICT prioritized genes mapping onto 52 ‘top’ overlapping SNPs.

Discussion

In agreement with epidemiological studies showing lower breast cancer risk in celiac disease patients, we found an inverse genetic association between the two diseases using three different methods. There was no evidence that the association between genetic susceptibility to celiac disease and lower risk of breast cancer differed by tumor characteristics. We also prioritized apoptotic and immune-related genes that could represent important etiological factors underlying the reduced risk of breast cancer in celiac disease.

In spite of the strong heritability of breast cancer and celiac disease, the etiological role played by their genetic components remains to be uncovered. Both prospective cohort studies31,32,33 and randomized clinical trials34, 35 have found that breastfeeding duration and age at gluten introduction may be less important than previously thought for the etiology of celiac disease. Instead, it seems that genetic factors15, 36 determine the risk of celiac disease. For breast cancer, immune response factors have been shown to be important factors associated with prognosis37, and potentially to breast cancer susceptibility38. As for genetic markers of breast cancer, studies on immune response candidate genes have identified few single risk alleles for specific populations39, 40.

In our study, inverse genetic correlation between breast cancer and celiac disease were consistent across different methods, indicating that the link between the two diseases is a result of shared immunogenetic components. Through LDSC and SECA methods which use all SNPs with available summary statistics for pair of traits, we found significant genetic correlation and overlap between breast cancer and celiac disease. While LDSC uses all common SNPs between the two diseases to estimate genetic correlation (r = −0.17), SECA identified inverse correlation in the most significant subset of SNPs (OR = 0.60), indicating the presence of allelic effects that increases risk for one disease and decreases risk for the other. The fact that genetic discordance (negative correlation) was not found significant in the SECA primary analyses as compared to the LD score regression, could be due to loss of power and related to the different approaches they use to deal with LD structure (see methods). In a third approach, genetic correlation was estimated by first summarizing the per-individual allelic load using a polygenic risk score, and then regressing the effect in a case control set up. We found 6 up to 17% lower risk to breast cancer to be comparable to the 10–15% decreased risk reported in Nordic epidemiological studies1,2,3,4. Given shared environmental exposures which could mediate the association, we considered GWAS summary statistics data for body mass index (BMI) from the GIANT consortium41. The average BMI has been shown to be lower in celiac disease patients, while low BMI is also associated with lower risk of breast cancer. However, we did not find any indication of a genetic correlation between BMI and breast cancer or celiac disease (P = 0.23 and P = 0.79 respectively, data not shown).

The involvement of the immune system is typically associated with ER-negative disease. Lymphocytic infiltration has been reported as a favorable prognostic factor for ER-negative42 and triple-negative breast cancer43. ER-negative breast cancer is characterized by a stronger immunogenic component, which could be proposed as the underlying link with celiac disease. Still, we did not observe genetic correlation between celiac disease and ER-negative breast cancer, most probably due to the low statistical power as a consequence of a smaller sample size for the ER-negative datasets.

Since complex traits such as breast cancer and celiac disease involve the deregulation of multiple interrelated biological processes, it may be informative to isolate which genes are important in the etilology of both diseases. In our functional enrichment results, we found genes involved in relevant mechanisms to be implicated in developmental and immunoregulatory processes. The most significantly prioritized genes were: Mesoderm Induction Early Response 1 (MIER1), C-src Tyrosine Kinase (CSK), Secretory Carrier Membrane Protein 2 (SCAMP2), and Interleukin 12 Alpha (IL12A). MIER1 codes for proteins with transcriptional repressive function and has been found upregulated in human breast carcinoma cell lines and tumors44. By interaction with transcription factors and chromatin modifiers such as ER-alpha and histone deacetylase inhibitor HDAC1/245, 46, MIER1-alpha inhibits estrogen dependent growth and lack nucleus internalization during breast cancer progression47. Thus, is possible that genetic variants could affect MIEK1 protein interactions and migration mechanisms necessary to exert its function, explaining the high expression level seen in breast cancer cells as a compensatory response. CSK gene codes for the human cytosolic non-receptor tyrosine kinase protein, which regulate different transduction signals implicated in cell growth, differentiation, migration and immune response processes. CSK inactivates the sarcoma (Src) family kinases which otherwise would lead to T-Cell antigen specific response by phosphorylating zeta chain T-cell receptor (TCR), which has been found downregulated for different cancer types, autoimmune disease and chronic inflammation48. IL12A codes for a cytokine with important effects on the regulation of immune and inflammatory responses49 and has been considered for cancer immunotherapy50 . ILI12A loci selected through genetic population factors have been associated with celiac disease and other autoimmune diseases51. A query in a pathway catalog (http://pathcards.genecards.org; accessed on November 15, 2016) showed that other significantly prioritized genes (SEMA7A, ITGA4, IL18RAP, CD24, TAGA, TNFRSF, ZFP36L1) are classified on immune-related signaling pathways with important immunomodulatory52, 53 and apoptotic functions54. Overall, this suggests that a complex network of signaling pathways play an important role in the regulation of the immune response and surveillance. Disruption of this network could lead to autoimmune responses, or to changes in mammary microenvironment elements predisposing to cancer immune evasion, in line with cumulative evidence highlighting the relevance of host immunity and genomic alterations in the disease heterogeneity and for tailoring therapeutic interventions55. It could be hypothesized that while some of the overlapping variants might be involved in heightened immune responses, they could at the same time increase immunosurveillance against carcinogenic processes in breast tissue, thereby reducing breast cancer risk. Our findings might guide future studies that can help to understand the role played by the immune system in breast cancer susceptibility.

The main strength of our study is the use of reliable summary statistics from large multicenter GWAS consortia for both diseases, and the leverage of epidemiological association to estimate genetic correlation and immune-related genetic susceptibility to breast cancer. By using different polygenic approaches and prior biological knowledge, we could detect novel associations. It is notable that the shared genetic component between celiac disease and breast cancer was not driven by strong signals (e.g. SNPs surpassing stringent GWAS threshold), but rather determined by several weaker signals, namely ‘suggestive variants’. Although this type of variants are typically not the ones identified in conventional genome-wide or candidate gene association studies, they may still be indicative of biological importance. A notable limitation is that the custom SNP chips used by the respective consortia target different regions of the genome, which reduces comparability, even when imputation of breast cancer genotypes is used. We also explored the use of methods such as Direct Imputation of Summary Statistics (DIST)56 on celiac disease dataset, which increased the number of common variants with good quality imputation (INFO > 0.9), to approximately 500 K. However, the use of this method did not improve the comparability between the two diseases. After LD-based pruning as performed in SECA analysis, ~14 K independent SNP remained for comparison. Despite its restrictions, the ImmunoChip provides information on the most important genetic components of celiac disease (mainly at both HLA and no-HLA regions) and therefore can be used to highlight an otherwise undermined immunogenic role in breast cancer susceptibility. Our analysis should not be regarded as a full genome assessment of the genetic overlap between the two diseases, but rather as an assessment of the shared genetic variation of immune-related regions. If we had had access to celiac disease individual-level genotype data, genetic correlation analysis using other robust methods such as the GCTA–GREML57 would have been possible. It is however unclear whether imputation based on raw data could allow for a more comprehensive comparison of the genetic variation between the two diseases. It is also possible that deeper genome coverage could improve the assessment of the genetic overlap and facilitate the identification of common causal variants.

In summary, we show evidence of a shared genetic component underlying the link between the two diseases at immune-related regions. The protective effect associated to higher load of celiac disease genetic susceptibility, summarized by the celiac polygenic risk score in a Swedish cohort, suggest that a less responsive immune system is implicated in the predisposition to breast cancer. While considering that our analyses were constrained by the immune-related genomic coverage, we used functional annotation analyses to identify genetic loci known to be involved in the complex regulation of the innate immune response which are likely to underlie common etiological basis between the two diseases. Replication of our findings and refined analysis related to disease subtypes will require larger samples sizes and better genotype data. Functional analyses integrating other layers of Omics data will be helpful to identify and validate specific mechanisms underlying breast cancer development, and possibly shed light on breast cancer prevention and treatment strategies.

Methods

Genetic correlation and overlap tests using GWAS summary statistics

Genetic correlation was estimated using the cross-trait LD score regression (LDSC)28 software (v1.0.0) on matching SNPs surpassing LDSC filter procedure. Given that imputation quality correlates with LD score, HapMap3 SNPs with European MAF > 1% (w_hm3.snplist) were filtered with the -merge-allele flag. LDSC defines genetic correlation between a pair of traits as the genetic covariance normalized by heritabilities on each phenotype accounted by the genotyped variants (SNPs) across the genome. Genetic covariance is estimated under a model where standardized genotype effects sizes are treated as random and is, in practice, estimated by regressing z-scores on sample size weighted linkage disequilibrium (LD) scores. In our analyses, we only included SNPs with reliable LD scores available in the in-software file (w_hm3.snplist), comprised of 1,217,311 SNPs with pre-computed LD scores estimated from European-ancestry samples in the 1 KG reference panel (see online Methods in Bulik-Sullivan et al.58), using the ‘merge-alleles’ flag.

SNP Effect Concordance Analysis (SECA)29 was used to test genetic overlap and SNP effect direction (analogous to genetic correlation tested by LDSC above). For each pair of datasets (celiac disease against overall, ER-positive, and ER-negative breast cancer), matching SNP were selected through SECA filtering and alignment procedures. Independent (index) SNPs were selected by SECA through a two-step ‘P BC-value informed” LD-clumping procedure (first round: pairwise LD r2 > 0.1 within 1 Mb windows; second round: pairwise LD r2 > 0.1 within 10 Mb windows) based on 1 KG v3 CEU (b37 rsIDs; MAF > 1%). Following SECA scripts, pleiotropy tests were performed for each dataset pair on 144 subsets defined by combinations of 12 × 12 P-value thresholds on breast cancer (P BC) and celiac datasets (P CD), that is {P BC, P CD} = {0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}. Genetic overlap was analyzed using binomial tests (BT) to determine whether there is an excess [observed (obs) ≥ expected (exp)] of SNPs with overlapping p-values (obs). Because GWAS are expected to produce an excess of lower P-values, the ‘overlap’ null probability (e.g.’expected proportion’) is defined as the observed proportion of celiac disease SNPs under the given P CD. Genetic concordance was analyzed using Fisher’s tests (FT) performed on 2 × 2 contingency tables for the SNP effect direction (positive or negative) on both datasets. The per-subset SNP effect direction is defined as concordant (positive correlation) when there is a significant larger proportion of SNPs with the same direction in both datasets (FT’s OR > 1; P < 0.05), and discordant (negative correlation) when in the opposite direction (FT’s OR < 1; P < 0.05). Primary tests were performed via permutation by repeating the analyses of the 144 subsets on one thousand uncorrelated datasets generated by randomly shuffling the observed SNP effect (BETA) and corresponding P-value between SNPs in breast cancer datasets. Empirical (permuted) P-values indicate whether the observed number of subsets with significant overlap (P BT < 0.05) or concordance/discordance (P FT < 0.05) are more than expected by chance (P permuted < 0.05). Minimum discordance was identified on SNP subsets yielding the lowest FT’s P-value, and adjusted for multiple testing (P min-permuted).

LDSC and SECA differ in their approaches to evaluating whether pairs of phenotypes have a shared genetic basis. LDSC is model based and treats genotype effects as random, whilst SECA is based on a fixed effects approach. The approaches deal in different ways with the issue that SNPs in high LD with an unknown causal variant will represent an important inflating factor59. LDSC deals with this using an LD-score weighting procedure, whilst SECA applies a strict LD pruning procedure to select a set of unbiased index SNPs. As a result, fewer SNPs were included in SECA analyses (~15,300) than in LDSC regressions (~45,500).

Association of celiac-PRS with breast cancer risk

We constructed celiac disease PRS profiles (celiac-PRS) using individual level genotype data for subjects in the pKARMA study genotyped as part of the iCOGS initiative. pKARMA is made up of 5,002 invasive breast cancer cases (from the Linne-Brost 1 (Libro1) study) and 5,433 controls (from the Karolinska Mammography Project for Risk Prediction of Breast Cancer (KARMA60)). Libro1 consists of female primary breast cancer cases diagnosed in Stockholm between January 2001 and December 2008 identified via the Regional Cancer register61. Tumor characteristics were retrieved from the Stockholm-Gotland Regional Breast Cancer quality registry62. ER status was recorded as negative or positive, determined by radioimmunoassay or immunohistochemistry. Tumor size was categorized as <20, 20–40 and >40 in diameter (mm). Human epidermal growth factor receptor 2 (HER2) status, assessed by IHC/immunocytochemistry and confirmed by fluorescence in situ hybridization analysis if protein levels from IHC/immunocytochemistry showed 2+ or 3+, was recorded in the register as positive or negative. Lymph node involvement status was dichotomized (No/Yes). Registry information was essentially complete (98%) for tumor size and lymph node status, but with more missing data for ER status (80% complete). Grade was available from 2004 onward, with 93% completeness. Controls were breast cancer-free participants recruited between 2010 and 2011 from Helsingborg and Stockholm in Sweden, a subset of the KARMA study. All participants had been genotyped on the iCOGS array in accordance with relevant guidelines as described previously26 and missing genotypes were imputed using 1 KG (phase I integrated variant set release (v3) in the National Center of Biotechnology Information build 37 [hg19] coordinates). Each participant gave informed consent and this study has been approved by the ethical review board at Karolinska Institutet.

Celiac-PRS profiles for each individual were generated by summing the number of celiac disease risk allele copies weighted by effect estimates reported on the GWAS study by Trynka et al. 17, using a scoring routine in the PLINK program (version 1.9b3x)63. We computed four celiac-PRS profiles based on subsets of independent (r2 > 0.2) celiac disease SNPs defined by different P-value thresholds [P CD < 5E-08 (n = 199), P CD < 1E-05 (n = 276), P CD < 0.01 (n = 1,284), and P CD < 0.05 (n = 3,803)].

Statistical analyses were performed in R (version 3.2.4). Unconditional logistic regressions were used to estimate ORs and corresponding 95% CI interval for association of celiac-PRS with overall, ER-negative, and ER positive breast cancer risk. PRS profiles were tested as both a continuous variable per standard deviation (per-SD) and a categorical variable defined by quartiles (based on PRS distribution in breast cancer controls), with the lowest quartile as the reference. We also investigated whether celiac-PRS differentially influences breast cancer tumor characteristics in a case-only study: ER status, lymph node involvement, and HER2-status were tested as binary outcomes using binomial logistic regressions; tumor grade and tumor size were modeled as categorical variables using multinomial logistic regressions (with the “nnet” R package).

Enrichment analysis on top-overlapping SNPs

To aid in the biological interpretation of top (lowest P-values) overlapping SNPs between celiac disease and breast cancer risk overall, we performed SNP enrichment analysis using DEPICT (version1 rel194)30, an integrative tool that systematically prioritizes the most likely causal genes at associated loci and highlights enriched pathways based on a pre-computed probability of gene set membership across 14,461 reconstituted gene sets. Loci within base pairs 25,000,000–35,000,000 on chromosome 6 are excluded due to the heightened LD seen on this major histocompatibility region.