Breast cancer is a complex disease, in which the characteristics of germline susceptibility loci as well as the spectrum of somatic alterations have begun to emerge, largely due to the capacity to conduct large-scale genome-wide studies. Genetic susceptibility loci implicated in breast cancer include highly penetrant rare variants in genes such as BRCA1 and BRCA2, moderately penetrant low-frequency variants in genes such as ATM, PALB2 and CHEK2, and multiple low-penetrant common variants identified more recently through genome-wide association studies (GWAS; Mavaddat et al, 2010). Specifically, GWAS have reported nearly 100 common susceptibility loci for breast cancer, marked by common single-nucleotide polymorphisms (SNPs; Long et al, 2012; Siddiq et al, 2012; Bojesen et al, 2013; Garcia-Closas et al, 2013; Michailidou et al, 2013; Cai et al, 2014; Milne et al, 2014; Michailidou et al, 2015). Although each SNP individually has a small effect size, in combination they can explain a substantial proportion of the variation in the familial risk for breast cancer, as well as risk in the general population (Easton et al, 2015; Mavaddat et al, 2015; Michailidou et al, 2015).

In parallel, whole-exome and -genome tumour sequencing studies have been conducted to define the landscape of somatic mutations of breast cancer (Banerji et al, 2012; Curtis et al, 2012; Ellis et al, 2012; Koboldt et al, 2012; Nik-Zainal et al, 2012; Shah et al, 2012; Stephens et al, 2012; Nik-Zainal et al, 2016), particularly focusing on identifying candidate driver genes, namely the genes harbouring mutations that confer selective growth advantage. In addition, studies of somatic mutation patterns for breast cancer have identified distinct mutational signatures across the genome. The number of somatic mutations (or mutational burden) in coding exons varies widely between breast tumours, and it has been related to age at diagnosis and to tumour grade (Stephens et al, 2012). Although genetic susceptibility variants show distinct associations with different pathological subtypes of breast cancer, particularly those defined by ER status (Mavaddat et al, 2010; Siddiq et al, 2012; Garcia-Closas et al, 2013), little is known of the relationship between the inherited and somatic genetic components. A recent survey of cancer predisposition genes, defined as genes harbouring high- or moderate-penetrant risk variants, has suggested that a large fraction of these genes could be oncogenic when mutated somatically (Rahman, 2014). In contrast, a study has reported no evidence that cancer susceptibility regions that harbour common low-penetrant susceptibility variants are preferentially selected for altered somatic mutation frequencies in cancer patients (Machiela et al, 2015).

In this report, using the breast cancer data from The Cancer Genome Atlas (TCGA) study (Koboldt et al, 2012), we examined associations between established breast cancer susceptibility loci and exome-wide single-nucleotide substitution counts observed in tumour tissues. As mutations carrying the aetiologic signature of tumours are expected to be present across the whole genome, and not only in specific genes, examining the association between cancer risk factors and the total somatic mutation count (TSMC) of substitutions in the whole exome might be a powerful approach to explore relationships between germline variants and somatic substitutions. For example, biologic age, which is the strongest risk factor for many cancers and is often a surrogate for cumulative carcinogenic events, has been shown to be directly associated with TSMC across many cancers (DePinho, 2000). For other major risk factors, such as smoking for lung cancer and sun exposure for melanoma, specific signature mutations, as hallmarks of exposure, are observed across the whole genome (Pfeifer et al, 2002; 2005).


Data were extracted from germline genotypes generated using the Affymetrix Genome-Wide Human SNP Array 6.0 on circulating leucocyte DNA drawn from 638 breast cancer cases of European ancestry. To remove subjects who may not be Caucasians, cases were selected on the basis of principal component analysis, which combined common SNP genotypes with those from HapMap (Altshuler et al, 2010b) reference samples (Supplementary Figure 1). We performed genotype imputation for the samples restricted to European ancestry using IMPUTE2 (Howie et al, 2009) with haplotypes generated by the 1000 Genomes Project (Phase 3; Altshuler et al, 2010a) as the reference.

Genotyped or imputed dosage data were available for 90 established SNPs representing common susceptibility loci with minor-allele frequencies (MAFs) >0.01. All loci had reported breast cancer risk associations with P-values below the threshold for genome-wide significance (P<5 × 10−8). We initially selected 94 SNPs from the study by Michailidou et al (2015) and removed two SNPs (rs7726159 and rs2380205) that were not genome-wide significant in the study by Michailidou et al (2013), and one SNP that is rare (rs17879961, MAF=0.0049). Of the remaining 91 SNPs, one was not present in the 1000 Genomes Project reference panel; the other 90 breast cancer susceptibility SNPs passed the quality filter of an IMPUTE2 info score >0.8 (Supplementary Table 1).

Somatic mutation data were obtained from whole-exome sequencing of TCGA breast cancer tumour samples. Mutation counts were extracted from the Mutation Annotation File (curated version) generated by the Washington University Genome Institute. Details of the sample preparation, sequencing protocol and mutation-calling pipeline are described elsewhere (Koboldt et al, 2012). For clinical information, we retrieved age at diagnosis, oestrogen-receptor (ER) status, progesterone-receptor (PR) status and tumour stage.

We used the somatic mutation burden, overall as TSMC or by specific mutation type, as the outcome variable in linear regression analyses of association with SNP genotypes, individually or collectively as a polygenic risk score (PRS). Analyses were adjusted for subject and tumour characteristics, including age at diagnosis, ER status, PR status and tumour stage. Subjects with extremely low or high TSMC (bottom 1% and top 3% of subjects, respectively) were excluded as outliers. The mutation counts were log10 transformed, and results are presented after standardising each type of log-transformed mutation count to have unit standard deviation so that effect sizes are comparable across mutation types. At each locus, the genotype was coded as the number of risk alleles (0, 1 or 2). For the current sample size, it was estimated that the study has 80% power to detect an effect size of 0.5 s.d. units of change in mutation count (on the log10 scale) per copy of a SNP allele with a population frequency of 0.33 at 5% type I error. Power curves for additional combinations of effect size and risk allele frequency are illustrated in Supplementary Figure 2.
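As an illustration of the analysis steps described above, the following minimal Python sketch trims outlying subjects, log10-transforms and standardises the counts, and estimates a covariate-adjusted per-allele effect by ordinary least squares. It is not the code used in this study. The function names (prepare_outcome, ols, per_allele_effect) are hypothetical, counts are assumed to be strictly positive, the rounding used to turn the 1% and 3% cut-offs into numbers of subjects is an assumption, and categorical covariates such as ER status or stage are assumed to be supplied already encoded as numbers.

```python
import math
import statistics


def prepare_outcome(counts, low_frac=0.01, high_frac=0.03):
    """Trim outliers, then log10-transform and scale counts to unit s.d.

    Returns (kept_indices, standardised_values). Counts must be positive.
    """
    order = sorted(range(len(counts)), key=lambda i: counts[i])
    n_low = round(len(counts) * low_frac)    # subjects dropped at the bottom
    n_high = round(len(counts) * high_frac)  # subjects dropped at the top
    kept = sorted(order[n_low:len(order) - n_high])
    logged = [math.log10(counts[i]) for i in kept]
    sd = statistics.stdev(logged)
    return kept, [value / sd for value in logged]


def ols(rows, y):
    """Least-squares coefficients; each row must start with the intercept 1.0."""
    p = len(rows[0])
    # Augmented normal equations [X'X | X'y].
    a = [[sum(r[j] * r[k] for r in rows) for k in range(p)]
         + [sum(r[j] * yi for r, yi in zip(rows, y))] for j in range(p)]
    # Gauss-Jordan elimination with partial pivoting (fails if X'X is singular).
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(p):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * c for x, c in zip(a[r], a[col])]
    return [a[j][p] / a[j][j] for j in range(p)]


def per_allele_effect(risk_allele_counts, covariates, outcome):
    """Change in the outcome per risk allele, adjusted for the covariates."""
    rows = [[1.0, g] + list(c) for g, c in zip(risk_allele_counts, covariates)]
    return ols(rows, outcome)[1]
```

Because prepare_outcome returns the indices of the retained subjects, genotypes and covariates would need to be subset to the same indices before calling per_allele_effect. With the outcome scaled to unit standard deviation, the returned coefficient is expressed in s.d. units of the log10 mutation count per risk allele.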

For each subject, the PRS reflected the total genetic susceptibility burden based on the 90 independent SNPs, and was defined as the weighted combination of the SNP genotypes, with the weights defined by the previously reported log-odds-ratios of association of the SNPs with breast cancer (Supplementary Table 1). We used log-odds-ratio estimates for overall breast cancer and subtypes defined by ER status reported in the study by Michailidou et al (2015) for all non-correlated SNPs, and estimates in the study by Mavaddat et al (2015) for three correlated SNPs with conditionally independent signals in 11q13 (rs554219, rs75915166 and rs78540526). In addition, we examined mutation burden for the following specific types: mutations from thymine (or adenine on the other strand) to other nucleotides, from cytosine (or guanine on the other strand) to other nucleotides, transition mutations, transversion mutations and APOBEC-mediated mutations, defined as cytosine to thymine and cytosine to guanine substitutions in the TCW motifs (W is either adenine or thymine; Roberts et al, 2013).
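The two definitions in this paragraph can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this study: the function names and label strings are hypothetical, the odds ratios in the example are invented for demonstration, and each substitution is assumed to be supplied with its reference base, alternate base and the two flanking reference bases on the forward strand.

```python
import math

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def polygenic_risk_score(dosages, log_odds_ratios):
    """Weighted sum of risk-allele dosages (0 to 2) for one subject."""
    if len(dosages) != len(log_odds_ratios):
        raise ValueError("one weight is needed per SNP")
    return sum(w * d for w, d in zip(log_odds_ratios, dosages))


def classify_substitution(ref, alt, five_prime, three_prime):
    """Return the set of mutation-type labels for one substitution."""
    # Express purine changes on the other strand so that ref is C or T;
    # the flanking bases swap sides when the strand is reversed.
    if ref in ("A", "G"):
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
        five_prime, three_prime = COMPLEMENT[three_prime], COMPLEMENT[five_prime]
    labels = {"from_T" if ref == "T" else "from_C"}
    labels.add("transition" if (ref, alt) in TRANSITIONS else "transversion")
    # APOBEC-type change: C to T or C to G within a TCW motif (W = A or T).
    if ref == "C" and alt in ("T", "G") and five_prime == "T" and three_prime in ("A", "T"):
        labels.add("apobec")
    return labels


# Invented per-allele odds ratios for three SNPs, used only for illustration.
weights = [math.log(1.10), math.log(1.05), math.log(1.20)]
score = polygenic_risk_score([0, 1, 2], weights)

# G>A in a forward-strand TGA context is C>T in TCA on the other strand.
labels = classify_substitution("G", "A", "T", "A")
```

Imputed dosages can be passed in place of integer genotypes, since the score is linear in the number of risk alleles.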

We further considered the somatic copy-number burden in the analysis. The processed segments of copy-number variation (CNV) were downloaded (in November 2015). Following previous work (Laddha et al, 2014), we used the magnitude 0.2 as the threshold to identify amplifications and deletions, and required at least 10 markers included in the CNV segment. The total number of CNV segments across the genome was calculated and treated as a covariate in the linear regression model.
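The segment-counting rule above can be sketched as follows. This is a hedged illustration only: the tuple layout (sample identifier, number of markers, segment mean) is hypothetical, the segment mean is assumed to be a log2 copy-number ratio so that values above 0.2 are amplifications and values below −0.2 are deletions, and the use of a strict inequality at the threshold is an assumption, as the text does not specify it.

```python
from collections import Counter


def count_cnv_segments(segments, threshold=0.2, min_markers=10):
    """Count amplified or deleted segments per sample.

    segments: iterable of (sample_id, num_markers, segment_mean) tuples.
    """
    counts = Counter()
    for sample_id, num_markers, segment_mean in segments:
        # Keep segments supported by enough markers whose magnitude
        # exceeds the threshold in either direction.
        if num_markers >= min_markers and abs(segment_mean) > threshold:
            counts[sample_id] += 1
    return counts


# Invented segments for two samples, used only for illustration.
example = [
    ("S1", 25, 0.35),   # amplification, counted
    ("S1", 8, 0.90),    # too few markers, skipped
    ("S1", 40, -0.05),  # near-neutral, skipped
    ("S1", 12, -0.60),  # deletion, counted
    ("S2", 10, 0.21),   # amplification, counted
]
totals = count_cnv_segments(example)
```

A Counter returns 0 for samples with no qualifying segment, so the per-sample totals can be used directly as a regression covariate.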


The age at diagnosis of the 638 breast cancer patients ranged from 26 to 90 years, with a median of 59.5 years. We first examined each characteristic without adjustment for other characteristics (Table 1). Specifically, TSMC was higher for patients with older age at diagnosis (P=4.02 × 10−4 for age groups, Table 1; P=5.04 × 10−4 for the trend of age, Supplementary Figure 3), low PRS (P=0.04 for PRS groups, Table 1; P=0.01 for the trend of PRS, Figure 1), negative vs positive ER status (P=1.74 × 10−10), negative vs positive PR status (P=2.87 × 10−15), and late vs early stages (P=4.08 × 10−3). In addition, TSMC was significantly associated with patient groups defined by both ER and PR status (P=2.95 × 10−14). In an analysis of all characteristics (PRS, age at diagnosis, ER status, PR status and stage) simultaneously fitted in a linear regression model (Supplementary Table 4), we observed that TSMC was associated with age at diagnosis (P=2.3 × 10−6), tumour stage (P=3.05 × 10−3 and 1.68 × 10−3 for stage II vs stage I and for stage III/IV vs stage I, respectively) and PR status (P=3.96 × 10−7), but not with ER status (P=0.18). Further, in stratified analyses by mutation type, we observed that both ER+ and PR+ tumours were significantly associated with a lower mutation count of thymine to other nucleotides, particularly ER+ tumours; whereas PR+, but not ER+, tumours were associated with a lower mutation count of cytosine to other nucleotides (comparisons of positive vs negative ER and PR status; Supplementary Tables 5 and 6).

Table 1 TSMC in 638 breast tumours stratified by ER and PR status, PRS (by tertile), age at diagnosis and tumour stage
Figure 1 Scatterplot for TSMC vs PRS.

In an analysis of association between the mutation count and individual breast cancer susceptibility SNPs, rs2588809 in RAD51B was inversely associated with TSMC (P=8.75 × 10−6, Table 2), with a P-value of 0.001 after adjustment for multiple comparisons using the Benjamini–Hochberg false-discovery rate (FDR; Benjamini and Hochberg, 1995). Statistical significance of the association was evident across all types of mutations and breast cancer subtypes (Table 2). Two other SNPs, rs11814448 in DNAJC1 and rs13387042, which localises to a gene-poor region of chromosome 2q35, also showed possible inverse associations (FDR=0.25) with TSMC (Supplementary Table 1 for all subjects, Supplementary Table 2 for ER+ subjects and Supplementary Table 3 for ER− subjects).
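For readers unfamiliar with the multiple-comparison adjustment cited above, the following minimal Python sketch computes Benjamini–Hochberg adjusted P-values using the standard step-up procedure. It is an illustration, not the code used in this study, and the function name and example P-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value to the smallest so that each adjusted
    # value is the minimum of p * m / rank over all equal or larger ranks.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Invented P-values for four tests, used only for illustration.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Under this procedure the smallest of m P-values is multiplied by at most m. Assuming the correction was applied across the 90 SNPs, the RAD51B P-value of 8.75 × 10−6 would give an adjusted value of about 7.9 × 10−4, in line with the reported 0.001.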

Table 2 Association between somatic mutation phenotypes and SNP rs2588809 at RAD51B

PRS for overall breast cancer was inversely associated with TSMC (P=1.34 × 10−2, Figure 1) as well as with all different types of mutations (Table 3). We observed a significant trend (P=2.28 × 10−3, Figure 1 right panel) in that the strength of association of the individual SNPs with TSMC (measured by the regression coefficient) tended to be larger for those with a larger reported odds ratio of association with breast cancer risk, but this trend was largely influenced by SNP rs11814448 in DNAJC1 (P=0.37 when excluding SNP rs11814448). The association between PRS and TSMC remained significant (P=3.81 × 10−2) after excluding rs2588809 in RAD51B from the calculation of PRS. Analysis of ER+-specific PRS in ER+ tumours and ER−-specific PRS in ER− tumours showed a significant inverse association for ER+ but not ER− tumours (P=0.01 for heterogeneity, Supplementary Figure 4). Further, the association observed for ER+ tumours appears to be present only for ER+PR+ tumours (P=7.24 × 10−3) and not for ER+PR− tumours (P=0.95). The association patterns for TSMC with PRS and the RAD51B SNP did not change when we additionally adjusted for the total number of CNV segments observed for each patient as a covariate in the respective regression models (Supplementary Table 7 for PRS and Supplementary Table 8 for the RAD51B SNP).

Table 3 Association between somatic mutation phenotypes and PRS


We report an inverse association between TSMC in breast tumours and genetic predisposition conferred by common breast cancer susceptibility SNPs. In particular, a highly significant inverse association was observed with respect to the germline risk variant defined by the SNP rs2588809 in the DNA-repair gene RAD51B. Moreover, a significant inverse association was also observed for a PRS for breast cancer that includes the genetic predisposition of 90 breast cancer-associated loci, but this association was only evident among ER+ tumours with respect to the ER+-specific PRS.

The reported inverse association may provide insight into links between germline risk variants and somatic mutations in breast cancer development. There are several possible underlying mechanisms by which the inverse association could arise. It has been previously shown that genetic susceptibility loci differentially influence distinct subtypes of breast cancer (Stephens et al, 2012), which, in turn, can be related to the number of somatic mutations. For example, many SNPs that have been reported to date from GWAS of breast cancer show differential associations with the risk of ER+ and ER− breast cancer. Because the number of ER+ tumours included in GWAS to date has been substantially larger than the number of ER− tumours (Michailidou et al, 2015), GWAS discovery has preferentially identified SNPs related to ER+ tumours. As the number of total mutations tends to be larger in ER− than ER+ tumours (Table 1), an inverse association between the SNPs and mutation counts may be observed if the analysis is not adjusted for ER status (Stephens et al, 2012).

We observed an inverse association between germline risk and TSMC after adjustment for age at diagnosis and tumour characteristics, including ER/PR status and stage. The association with the RAD51B SNP was present for both ER+ and ER− tumours, although this SNP is only associated with the risk of ER+ tumours (Michailidou et al, 2013). In contrast, the association of TSMC with PRS was present only for the ER+-specific PRS in ER+ tumours, and this association appeared to be strongest for the ER+PR+ subtype. Further tumour characteristics, such as grade, which could not be evaluated in this report owing to a lack of available data, could explain the reported inverse associations between PRS and TSMC. However, the distinct patterns of association seen for the RAD51B SNP and PRS are unlikely to both be explained by subtype heterogeneity.

We observed associations of higher TSMC with older age at diagnosis, ER− status, PR− status and higher stage. Older age at diagnosis was previously reported to be associated with cytosine to thymine substitution in ER− tumours, but not with TSMC, across breast cancer patients, whereas the observation of higher TSMC associated with higher stage is consistent with the previous finding (Stephens et al, 2012). To the best of our knowledge, no other studies have reported associations of TSMC, overall and by mutation type, with ER status, PR status, stage and age at diagnosis considered jointly. Our results suggest that although PR status, but not ER status, is strongly predictive of overall TSMC, distinct mutation signatures could be associated with ER status (thymine to other nucleotides) and PR status (cytosine to other nucleotides) when the characteristics are analysed jointly.

It is possible that the inverse association we observe between genetic risk and TSMC is a broader phenomenon that cannot be explained by subtype heterogeneity alone. The best known example of interaction of germline and somatic mutation is the ‘two-hit’ model for carcinogenesis in retinoblastoma (Knudson, 1971). Under this model, the first hit could be either a germline susceptibility variant or a somatic mutation in an important cancer predisposition gene. Thus subjects with elevated genetic predisposition may require fewer stages to develop a malignancy of the breast than subjects at lower genetic risk. Therefore, it is possible that the observed inverse association is the result of an underlying continuous process of cancer development, in which both germline variants and somatic mutations contribute and perhaps overlap with respect to their relative contributions to the development of breast cancer. To further understand the biological basis of our observations, it will be necessary to understand the causal mechanisms that underpin the relationship between common susceptibility alleles and TSMC, as a marker of mutational events critical for the development of distinct subtypes of breast cancer. In addition, the current study only recorded the presence or absence of ER or PR. By measuring ER and PR levels as quantitative traits in future studies, it may be possible to delineate the relationships between TSMC and ER or PR levels more precisely.