Breast cancer is a multi-factorial, polygenic disease resulting from the interplay of genetic, environmental and lifestyle risk factors. Linkage studies have revealed that breast cancer tends to cluster in families and that disease prevalence is two-fold higher among the first-degree relatives of affected individuals.1 Familial clustering is characterized by early onset of disease, often mediated by high-to-moderate penetrance mutations in genes such as those encoding breast cancer (BRCA1 and BRCA2),2, 3 ataxia telangiectasia mutated (ATM),4 cell cycle checkpoint kinase 2 (CHEK2),5 tumor protein 53 (TP53),6 partner and localizer of BRCA2 (PALB2),7 BRCA1-interacting protein C-terminal helicase 1 (BRIP1)8 and phosphatase and tensin homolog (PTEN).9 Nonetheless, these genes in aggregate account for <25% of the observed familial genetic risk.10 A polygenic model has been proposed to explain the remaining genetic risk in non-BRCA familial and sporadic breast cancer cases.11 Single-nucleotide polymorphism (SNP)-based genome-wide association studies (GWAS) have identified common variants conferring low risk in several complex diseases. In GWAS based on European, Ashkenazi Jewish and Asian populations, more than 40 breast cancer susceptibility loci in several genes and intergenic regions have already been reported, and a subset of these associations has reached the genome-wide significance level.12, 13, 14 These variants account for a small proportion of the overall genetic risk of breast cancer, leaving open the question of hidden or missing heritability. Current debates suggest that this may be further explained by rare variants, epistasis, epigenetics, gene–environment interactions and copy number variations.15, 16

In a typical GWAS, the frequencies for each SNP (single-locus tests for association)17 are compared between cases and controls to catalogue polymorphisms potentially associated with the phenotype of interest. The most promising SNPs, ranked by P-value (highest significance) and/or showing significance in haplotype association analysis,18 are selected and replicated in a larger but independent set of cases and controls. In this process, SNPs that are not top ranked because of their modest P-values are ignored, and as a result potentially informative markers may be missed. It has been proposed by others19, 20 that even modest associations (P-value based), if highly reproducible in independent cohorts, may still be pertinent to the phenotypes under investigation, presumably through epistatic interactions (interactions of alleles or genes), a phenomenon strongly implicated in the etiology of breast cancer and the heritable component of genetic risk. Because the majority of published GWAS concentrate on single-locus strategies to identify novel breast cancer susceptibility loci, we present a candidate gene approach, restricted to polymorphisms in genes of specific pathways, that mines GWAS data more effectively by considering moderately associated SNPs. If reproduced in further independent studies, these SNPs may serve as putative candidates for epistatic effects.

Previously reported studies focused on common variants in the genes involved in DNA repair/metabolism pathways and cell cycle regulation, and the markers were selected based on candidate gene approaches.21, 22 In this study, we extend this premise using SNPs in or flanking the DNA repair, modifications and metabolism pathway-related genes from the Affymetrix 6.0 array (Santa Clara, CA, USA) (stage 1 of the GWAS23) for independent replication (stage 2 of the association study design) to identify additional breast cancer susceptibility loci not previously reported.

Materials and methods

Study population and DNA isolation

We used stage 1 results of our published breast cancer GWAS, described elsewhere.23 Briefly, sporadic breast cancer cases (n=348), characterized by late onset of disease, and controls (n=348) who had no documented history of breast cancer in first- or second-degree relatives were selected for stage 1 of the GWAS.23 All subjects were predominantly of Caucasian origin. Breast cancer cases (median age=51 years; age range=26–90 years, with number of cases <40 years=35; 40–60 years=241; >60 years=72) were from Alberta, Canada, recruited by the PolyomX Program24 and the Canadian Breast Cancer Foundation-Tumor Bank (CBCF-TB)24 during the years 2001–2005 and 2005–2008, respectively. The two projects, the PolyomX Program and CBCF-TB, are funded by different granting agencies; the nomenclature adopted merely indicates this and in no way reflects bias in sampling of the population. All cases had a histologically confirmed diagnosis of invasive ductal breast carcinoma at the time of enrolment in the study. Gender-matched apparently healthy controls (median age=50 years; age range=36–70 years, with number of controls <40 years=50; 40–60 years=226; >60 years=72), also from Alberta, Canada (accessed from the Tomorrow Project25), were frequency matched to cases based on age. The proportions of cases and controls in the three age groups (<40, 40–60 and >60 years) did not differ statistically significantly (two-tailed z-test; data not shown). All control subjects enrolled here were free from cancer at the time of recruitment in the study. Potential population confounders were removed, leaving cases (n=302) and controls (n=321) for association analysis.23 Informed consent was obtained from all study participants, and the study was approved by the Research Ethics Board of Alberta Health Services. Genomic DNA was extracted from the peripheral blood samples of both cases and controls using commercially available Qiagen (Mississauga, ON, Canada) DNA isolation kits.

SNP selection, genotyping and platform-specific genotype concordance

Data filtering and call rate clean-up (Hardy–Weinberg equilibrium (HWE) P>0.001 and SNP call rate >99%) were carried out as described earlier.23 Of the 906 600 SNPs genotyped using Affymetrix SNP 6.0, a total of 782 838 SNPs qualified for the downstream analysis. The associations of SNPs with breast cancer were evaluated using correlation/trend tests with one degree of freedom (df). The correlation/trend test is similar to the χ² test of independence, except that it also acts as a trend test, evaluating the correlation of the minor allele with case status using Pearson's correlation coefficient. The allelic tests with 782 838 SNPs (stage 1) showed that a total of 35 519 SNPs were statistically significantly associated with breast cancer at P<0.05. Of the 35 519 SNPs, we identified 215 polymorphisms (minor allele frequency (MAF)>10%) within or in close proximity to 49 gene regions implicated in pathways or of relevance to DNA repair, modifications and metabolism, based on National Center for Biotechnology Information human genome build 37. In all, six of the 215 SNPs were statistically significantly associated with breast cancer at P<0.001 (correlation/trend tests with one df) and were included in the stage 2 replication study. To reduce the redundancy among the remaining 209 SNPs, we then calculated the pairwise LD (r²) among the markers and found that 73 SNPs were strongly correlated (r²≥0.8). Of these 73 short-listed SNPs, 16 were in strong LD (r²≥0.8) with at least one SNP contained within the 3903 haplotype blocks identified (P<0.05) in haplotype association analysis. All haplotypes at a frequency threshold of 1% or more were tested together against the reference haplotype for their associations with breast cancer. The haplotype association analysis per se was carried out as described elsewhere.23 As our primary objective in this study was to evaluate the moderately associated SNPs from stage 1 GWAS results, we relaxed the significance threshold in haplotype association analysis to P<0.05, as compared with our previous study (P<0.001).23 Overall, we used allelic tests and haplotype association tests to select SNPs for a replication study in an independent set of 1178 invasive breast cancer cases and 1314 apparently healthy individuals serving as controls (stage 2).
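The correlation/trend test described above can be illustrated with a minimal Python sketch. It assumes genotypes coded as minor-allele counts (0, 1 or 2) and takes the test statistic as N·r², a common form of the Armitage trend statistic; the function name `trend_test` and this exact form are illustrative assumptions, not the implementation used by the analysis software.

```python
import math


def trend_test(case_genotypes, control_genotypes):
    """Correlation/trend test with 1 df (illustrative sketch).

    Genotypes are minor-allele counts (0, 1 or 2) per subject.
    Returns (r, chi2, p) where r is Pearson's correlation between
    genotype and case status, and chi2 = N * r**2 (assumed form).
    """
    x = list(case_genotypes) + list(control_genotypes)
    y = [1] * len(case_genotypes) + [0] * len(control_genotypes)
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("monomorphic SNP or only one phenotype class")
    r = sxy / math.sqrt(sxx * syy)
    chi2 = n * r * r
    # Upper tail of a chi-square distribution with 1 df.
    p = math.erfc(math.sqrt(chi2 / 2))
    return r, chi2, p
```

A positive r indicates that the minor allele is more frequent among cases; a negative r indicates the reverse.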

Genotyping assays were performed on Sequenom iPLEX Gold platform (San Diego, CA, USA) (services from the McGill University, Genome Quebec Innovation Center, Montreal, Canada). Within- (Sequenom only) and cross-platform (Affymetrix vs Sequenom) SNP concordances for 22 SNPs were assessed using 205 and 551 duplicate samples, respectively.

Statistical considerations

Allelic associations were evaluated using correlation/trend tests with one df, and their corresponding odds ratios (ORs) and 95% confidence intervals (CIs) were estimated using unconditional logistic regression implemented in the SNP & Variation Suite v7.3.1 (Helix Tree Software).26 Genotypic associations were also examined, to gain insight into the relative contributions of individual genotypes to breast cancer risk, using unconditional logistic regression with two df in the freeware SNPstats;27 the results from codominant models are summarized in this study. A combined analysis with all samples from stages 1 and 2 (a total of 1480 cases and 1635 controls) was performed to increase the statistical power. The associations for the allelic tests in the combined analysis were further examined with permutation tests (1000 permutations) and false discovery rates (FDRs), using Helix Tree software, to identify observations arising by chance alone (type I error). Helix Tree calculates the FDR as the original P-value multiplied by the number of tests and divided by the difference between the number of tests and the rank of the original P-value in descending order.
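The FDR calculation described above can be sketched as follows. The sketch assumes that the descending rank starts at 0 for the largest P-value, so that the denominator (number of tests minus descending rank) equals the ascending rank starting at 1; under that assumption the formula matches the unadjusted Benjamini–Hochberg ratio. The function name is hypothetical and the code is not taken from Helix Tree.

```python
def fdr_from_pvalues(p_values):
    """FDR per test as p * m / (m - descending_rank) (illustrative sketch).

    Assumption: descending_rank is 0 for the largest P-value, so
    m - descending_rank is the 1-based ascending rank of the P-value.
    No monotonicity adjustment or capping at 1 is applied.
    """
    m = len(p_values)
    # Indices of the tests ordered from smallest to largest P-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    fdr = [0.0] * m
    for ascending_rank, i in enumerate(order, start=1):
        fdr[i] = p_values[i] * m / ascending_rank
    return fdr
```

For example, with four tests and P-values 0.03, 0.01, 0.5 and 0.02, the three smallest P-values each give an FDR of about 0.04, and the largest gives 0.5.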

Subgroup analyses were attempted (correlation/trend tests with 1 df) to identify associations with subphenotypes within the combined breast cancer cases using a common reference (combined controls) as described previously.28 The subphenotypes examined were family history of breast cancer, menopausal status and luminal A status. Subgroup analyses help interrogate the potential confounding influence of disease heterogeneity on the observed associations. Tumors were classified as luminal A based on estrogen and progesterone receptor status (ER+/PR+, ER−/PR+ and ER+/PR−) and human epidermal growth factor receptor-2 status (HER2).29 All the remaining cases were classified as non-luminal A tumors.

Our sample size conferred more than 80% power to detect associations using a codominant model for a SNP with 10% MAF, a population prevalence of breast cancer of 1/10, a relative risk of 1.3, a type I error of 0.05 and LD between markers at r² of 0.8.30

The LD patterns for regions showing the strongest and consistent associations across stages 1 and 2 and combined analyses were examined using Haploview v4.2.31 For the three methyl-CpG-binding domain protein 2 (MBD2) SNPs, haplotype frequencies were estimated using SNPstats.27 The software implements the expectation-maximization algorithm coded into haplo.stats package to calculate the estimated relative frequencies for each haplotype.32 Haplotype association analyses for MBD2 SNPs were performed with unconditional logistic regression using the default setting of a log-additive model and expressed in terms of ORs and 95% CIs (feature available in SNPstats).


Results

Initial assessment of the data quality

Of the 22 SNPs selected for replication in stage 2, genotyping for one SNP (rs17519016) was not successful. The cross-platform (Affymetrix vs Sequenom) SNP call concordance for the remaining 21 SNPs using 551 duplicate samples from stage 1 was more than 98%. Within-platform (Sequenom) SNP call concordance among the 205 duplicates used in stage 2 was more than 99.4%. Per sample and per SNP call rates for stage 2 were >98.3 and >98.4%, respectively, and all 21 SNPs were in HWE proportion at P>0.001 in controls (Table 1). Cross-platform and within-platform discordances were very low (<2%) and are in agreement with previously reported GWAS studies.12, 23 Further, the MAFs were consistent among the two stages and also comparable to HapMap Central Europeans (CEU) population (data not shown), indicating that the scope of false-positive associations due to genotyping errors (systematic or random) was effectively minimized.
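The HWE check applied to control genotypes can be illustrated with a one-df goodness-of-fit sketch. The source does not state whether a χ² or an exact test was used; the χ² form and the function name `hwe_chi2_test` are assumptions made for illustration.

```python
import math


def hwe_chi2_test(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test for HWE (illustrative sketch).

    n_aa, n_ab, n_bb are observed genotype counts in controls.
    Returns (chi2, p) using 1 df.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    # Frequency of allele A estimated from the genotype counts.
    p_a = (2 * n_aa + n_ab) / (2 * n)
    q_a = 1 - p_a
    expected = (n * p_a * p_a, 2 * n * p_a * q_a, n * q_a * q_a)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

Under the threshold used in this study, a SNP would be retained when the returned P-value exceeds 0.001.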

Table 1 Characteristics of the SNPs used in the study

Stage 2 analysis

In stage 2, six SNPs showed suggestive associations with breast cancer (Table 2). Three SNPs (rs8094493, rs4041245 and rs7614) were from MBD2 gene regions and were marginally associated with reduced risk of breast cancer (ORs: 0.90, 0.91 and 0.92, respectively; Table 2). The other three SNPs, rs13250873, rs1556459 and rs2297381, were located in or in close proximity to the RAD21 homolog (S. pombe; RAD21), O-6-methylguanine-DNA methyltransferase (MGMT) and RNA polymerase II-associated protein 1 (RPAP1) gene regions, respectively, and showed suggestive associations with increased risk of breast cancer.

Table 2 Six SNPs with the strongest and consistent associations with breast cancer susceptibility across stages 1, 2 and in combined analysis

The association test results for the remaining 15 SNPs are summarized in Supplementary Table 1. Fourteen of these showed no statistical significance, and one SNP (rs7636114) showed a suggestive association trend in stage 2, but in the opposite direction to the stage 1 results, and was therefore not considered for further analysis.

Combined analysis (stages 1 and 2)

We combined the stage 1 and stage 2 data for the six SNPs and found not only a similar direction of risk but also stronger association signals for all six variants (Table 2). The MBD2 SNPs rs8094493 (OR: 0.85, P<0.0021), rs4041245 (OR: 0.86, P<0.0026) and rs7614 (OR: 0.86, P<0.0041) were significantly associated with reduced risk of breast cancer. The observed FDRs of 0.045, 0.027 and 0.029, respectively, for the allelic associations in the combined analysis provided confidence in the study findings. We also subjected the data to permutation testing (1000 permutations) and observed permutation P-values of 0.038, 0.048 and 0.069, respectively, an indication that the reported findings are unlikely to be attributable to chance alone. The heterozygote and variant homozygote genotypes of the MBD2 SNPs from codominant models also conferred similar trends of reduced risk of breast cancer (ORs: 0.76–0.79).
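The idea behind permutation testing can be sketched as follows: case/control labels are shuffled repeatedly and the observed statistic is compared with its permutation distribution. This sketch uses the absolute difference in mean minor-allele count as a simplified stand-in for the correlation/trend statistic; the function name, the seed parameter and the choice of statistic are assumptions for illustration and do not reproduce the Helix Tree procedure.

```python
import random


def permutation_p_value(case_genotypes, control_genotypes,
                        n_permutations=1000, seed=0):
    """Permutation P-value for a single SNP (illustrative sketch).

    Genotypes are minor-allele counts (0, 1 or 2) per subject.
    The statistic is |mean(cases) - mean(controls)| (assumed here).
    """
    rng = random.Random(seed)
    pooled = list(case_genotypes) + list(control_genotypes)
    n_cases = len(case_genotypes)

    def statistic(values):
        cases, controls = values[:n_cases], values[n_cases:]
        return abs(sum(cases) / len(cases) - sum(controls) / len(controls))

    observed = statistic(pooled)
    hits = 0
    for _ in range(n_permutations):
        # Shuffling the pooled genotypes is equivalent to shuffling labels.
        rng.shuffle(pooled)
        if statistic(pooled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)
```

With 1000 permutations, the smallest P-value this sketch can return is 1/1001, which is why permutation P-values are coarser than asymptotic ones.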

The remaining polymorphisms analyzed (rs13250873, rs1556459 and rs2297381; Table 2) also showed significant associations, but in the direction of increased breast cancer risk (allelic ORs: 1.13–1.20), opposite to that observed for the MBD2 SNPs. The association signals for all three SNPs were characterized by low FDR values (0.023–0.054); the permutation tests (1000 permutations) also showed marginal significance for rs13250873. In the codominant genotypic models, variant homozygotes (OR≥1.28) showed stronger associations than heterozygotes (OR: 1.07–1.14) in the combined analysis for rs13250873, rs1556459 and rs2297381.

Subgroup analyses

Owing to the potential for genetic risk determinants to be associated with specific clinical and molecular subtypes of breast cancer, we reviewed the clinicopathological characteristics of the cases in both stages 1 and 2, and conducted stratified analyses (Table 3). We evaluated allelic associations for the six SNPs in the following subgroups: without and with family history of breast cancer, pre- and postmenopausal status, and luminal A and non-luminal A (ie, good and poor prognostic groups, respectively) breast cancer status of the tumors, using correlation/trend tests with one df. We found associations between clinicopathological characteristics and the polymorphisms considered, and the observed ORs were consistent across subgroups (Table 3). None of the observed associations were stronger than the single-locus effects, and hence it is unlikely that these clinicopathological characteristics (potential confounders) have significant effects on the associations initially observed with unstratified cases (Table 2).

Table 3 Subgroup analyses based on family history of breast cancer, menopausal status and luminal A tumors

Pairwise LD profiling between markers

We examined LD profiles for the six identified variants (Table 2) using HapMap CEU genotype data. We found that the three MBD2 SNPs (rs8094493, rs4041245 and rs7614), in intron 3, intron 6 and the 3′-untranslated region, respectively, were in strong LD with D′=1 (Figure 1a), and these profiles were also observed in our study population (Figure 1b). rs7614 and rs4041245 were located in an LD block spanning a 6 kb region, and rs8094493 was located in an LD block spanning a 9 kb region.

Figure 1

Pairwise LD profiles between SNPs from the MBD2 gene region. (a) LD profile of the whole MBD2 isoform 1, spanning 70.58 kbp. The gene is in reverse orientation (3′–5′) on the chromosome 18q arm. Five SNPs (three from our study, shown in black, and two from Zhu et al,35 shown in red) in MBD2 gene regions are shown according to their relative positions in the HapMap CEU data set (phase 1 and 2 full data set). LD blocks were defined using the ‘CI’ method as explained by Gabriel et al.42 D′ values are given for LD between the markers. The darker the cell, the greater the D′ value between the SNPs. (b) LD profile for the three MBD2 SNPs from our study, based on our study population.

We also analyzed the remaining three SNPs (rs13250873, rs2297381 and rs1556459) that were associated with breast cancer in our study population (Table 2) and found that these SNPs belong to different blocks/regions and were not correlated with each other (data not shown). The LD blocks containing rs13250873 and rs1556459 did not contain annotated genes. However, we observed UTP23 (19 kb downstream) and RAD21 (52 kbp downstream) as the nearest genes flanking rs13250873; for rs1556459, the closest gene was MGMT, at 450 kb upstream. On the other hand, the polymorphism rs2297381 was located in intron 5 of the RPAP1 gene.
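The two pairwise LD measures used in this study, D′ and r², can be computed from allele and haplotype frequencies as in the minimal sketch below. The function and parameter names are hypothetical, and the haplotype frequency is assumed to be known (in practice it is estimated, for example by Haploview).

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D', r^2) for two biallelic SNPs (illustrative sketch).

    p_a:  frequency of allele A at the first SNP
    p_b:  frequency of allele B at the second SNP
    p_ab: frequency of the haplotype carrying both A and B
    """
    if not (0 < p_a < 1 and 0 < p_b < 1):
        raise ValueError("both SNPs must be polymorphic")
    d = p_ab - p_a * p_b
    # The maximum attainable |D| depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d_prime, r2
```

With p_a=0.4, p_b=0.2 and p_ab=0.2, this gives D′=1 but r²=0.375, which illustrates why two SNPs can show complete LD by D′ while remaining only moderately correlated by r² when their allele frequencies differ.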

Haplotype analysis for MBD2 gene polymorphisms

We reasoned that the highly correlated SNPs from the MBD2 gene region may form distinct haplotypes that could potentially explain the population diversity. Polymorphisms rs8094493, rs4041245 and rs7614 formed two major haplotypes, one with the common (major) alleles and the other with the variant (minor) alleles. The common haplotype had a population frequency of 0.58 (0.60 for cases and 0.56 for controls), and the variant haplotype had a population frequency of 0.40 (0.38 for cases and 0.42 for controls). The variant haplotype was significantly associated with reduced risk of breast cancer (OR: 0.86, P<0.0029; Table 4). Together, the two major haplotypes identified in this analysis explained 98% of the population diversity.

Table 4 Haplotypes for three MBD2 SNPs and their associations with breast cancer risk


Discussion

In this study, we identified SNPs associated with breast cancer among genes related to DNA repair, modifications and metabolism. A total of six loci were identified using a two-stage association study design, and these were not previously reported in published GWAS for breast cancer12, 13, 14, 23 as putative markers for breast cancer susceptibility. The identified loci were highly reproducible in an independent study (stage 2), and the statistical significance of the findings was consistent across study stages, in the combined analysis and across clinicopathological subtypes of breast cancer. These loci are promising markers and warrant independent validation in Caucasian populations and in ethnically diverse cohorts to evaluate the generalizability of our findings.

The six loci identified were from four chromosomes: 18, 15, 10 and 8. Both single-locus and haplotype association analyses indicated that the MBD2 gene loci (rs8094493, rs4041245 and rs7614) conferred protection against breast cancer. The magnitude and the direction of the association signals in both stages were consistent between allelic and genotypic models (Table 2). The allelic risk effects were stronger in the combined analysis, with P-values of <10⁻³. Low FDR values and permutation testing provided further confidence that the observations are not false positives. A mechanistic relationship to breast carcinogenesis is plausible because MBD2 is a well-characterized gene whose encoded protein binds to methylated promoter regions and mediates transcriptional repression of tumor suppressor genes.33 DNA (cytosine-5)-methyltransferase 1 (DNMT1) is reported to interact with the methyl-CpG-binding protein complex, MBD2 and MBD3, at late S-phase replication foci, and as such these interactions may direct DNMT1 to hemimethylated sequences following DNA replication and silencing of genes in the S phase.34

Earlier, Zhu et al35 reported the associations of two SNPs (rs1259938 and rs609791) in MBD2 gene regions with reduced risk of breast cancer in premenopausal Caucasian women. We evaluated possible LD between the distinct MBD2 SNPs reported here and those reported by Zhu et al.35 The polymorphisms reported by the earlier investigators were not in LD with the SNPs reported here (Figure 1a). The notable differences between our study and that of Zhu et al35 are (i) the SNPs rs1259938 and rs609791 in the previous study did not show association with the breast cancer phenotype in unstratified cases, although they showed statistical significance when cases were stratified by pre- and postmenopausal status; (ii) we identified distinct MBD2 gene SNPs, and these were all statistically significantly associated with breast cancer as a phenotype in both unstratified (Table 2) and stratified cases (Table 3); and (iii) sample sizes were substantially larger in our study (total sample size of 1480 cases and 1635 controls) as opposed to 393 cases and 436 controls from the nested case–control study with a Caucasian population reported by Zhu et al.35 In summary, observations with a larger sample size (this study) showed association with breast cancer even without stratification of cases, and the associated haplotypes were also distinct. However, it is important to note that the magnitude and direction of risk and the gene identified are similar in both studies. We did not genotype the polymorphisms reported by Zhu et al35 at this time, as they were not present on the Affymetrix SNP 6.0 array; they may therefore require independent validation.

Other genes/loci were also identified for breast cancer risk in this study. rs2297381 was located in intron 5 of RPAP1 and was associated with the risk of breast cancer. RPAP1 is a poorly understood gene possibly involved in the interaction of RNA polymerase II and its regulators of protein complex formation.36 To our knowledge, this is the first report of an RPAP1 gene SNP associated with breast cancer risk. rs13250873 and rs1556459, located 52 kbp downstream of RAD21 and 454 kbp upstream of MGMT, respectively, were significantly associated with the risk of breast cancer across both stages and in the combined analysis. Both RAD21 and MGMT are well-studied genes with significant roles in carcinogenesis. The RAD21 protein is involved in the repair of double-strand breaks as well as in chromatid cohesion during mitosis.37, 38 Intronic polymorphisms in the RAD21 gene have been associated with breast cancer in a high-risk population.39 Similarly, MGMT repairs guanine alkylated by carcinogenic alkylating agents.40 Coding SNPs of the MGMT gene are reported to be associated with breast cancer risk.41 Because rs13250873 and rs1556459 are not located within gene regions, further replication of these findings and fine mapping of these loci are required to determine whether the identified polymorphisms exert their action through regulation of the nearby RAD21 and MGMT genes.

None of the associations reached the genome-wide significance level in this two-stage association study with a combined sample size of 1480 cases and 1635 controls. However, confidence in the reported associations stems from the stringent quality control parameters employed (>98% SNP and sample call rates, HWE P>0.001 in controls, >98% SNP concordance in replicates and good call rate concordance across platforms). Furthermore, the low FDR values and the results from permutation testing favor considering the reported polymorphisms for replication in independent studies. In summary, we identified additional breast cancer susceptibility loci in Caucasian women by focusing on genes related to DNA repair, modifications and metabolism. Our study supports the concept of investigating moderate association signals from stage 1 GWAS using a candidate gene approach restricted to specific pathway-related gene polymorphisms. In this study, we did not consider all related DNA repair/modifications/metabolism pathway gene polymorphisms or their potential associations with other subtypes of breast cancer (basal, HER2+ and luminal B) due to limitations in sample size. Other DNA repair/modifications/metabolism gene polymorphisms reported in previously published studies (which did not reach genome-wide significance), if replicated in independent cohorts, should also be considered along with the six variants reported here as putative candidates for epistatic models to gain insights into the missing heritability of sporadic breast cancer.