Breast cancer is one of the most common epithelial malignancies in women1,2. Ahmed et al.3 carried out a multi-stage genome wide association study (GWAS) for breast cancer susceptibility involving studies from the Cancer Genetic Markers of Susceptibility (CGEMS) and Breast Cancer Association Consortium (BCAC) and reported strong evidence for a susceptibility locus at 17q22 with single nucleotide polymorphism (SNP) rs6504950, OR = 0.95, 95% confidence interval (CI) 0.92–0.97, P = 1.4 × 10−8. Turnbull et al.4 found confirmatory evidence for association with SNPs at the same locus; they reported a breast cancer risk association with SNP rs1156287 (OR = 0.91; 95% CI 0.85–0.97; P = 5.8 × 10−3), which lies 20 kb from originally reported SNP rs6504950 (r2 = 0.91). Using data from the National Cancer Institute's Breast and Prostate Cancer Cohort Consortium (BPC3), Campa et al.5 also confirmed the association with rs6504950 (OR = 0.92; 95% CI 0.88–0.97; P = 5.83 × 10−4). Broeks et al.6 further investigated this association with respect to tumor estrogen receptor (ER) status, and reported that rs6504950 had stronger association with ER positive (ER+; OR = 0.93; 95% CI 0.90–0.95; P = 7.2 × 10−7) than ER negative disease (ER−; OR = 1.00; 95% CI 0.95–1.05; P = 0.94). Tang et al.7 conducted a meta-analysis which further confirmed the association with SNP rs6504950 (OR = 0.93; 95% CI 0.87–0.99). SNP rs6504950 lies in an intron of STXBP4 (Syntax binding protein 4) and two other genes are found within 200 kb including COX11 (cytochrome C assembly protein 11) and TOM1L1 (target of myb1-like1). As part of the Collaborative Oncological Gene-Environment Study (COGS), we conducted a comprehensive fine-scale mapping of this 17q22 breast cancer susceptibility locus using 517 SNPs chosen to give dense coverage across this locus. These were genotyped on a custom-designed Illumina iSelect genotyping array (iCOGS) in 50 studies participating in BCAC. We used these data to define the variants most strongly associated with risk, and combined these data with additional in-silico and functional data in an attempt to determine the most likely causal variants.

Material and Methods

Genetic Mapping

Tagging strategy for the fine-scale mapping

We defined the region for fine-mapping by identifying the flanking SNPs with minor allele frequency (MAF) > 2% and detectable correlation (r2 > 0.1) with rs6504950, based on the 1000 genomes project European population (March 2010 Pilot version 60 CEU project data). From this 468 kb interval we selected all SNPs correlated with rs6504950 at r2 > 0.1, plus a set of SNPs designed to tag all remaining SNPs with r2 > 0.9. We thus aimed to genotype 525 SNPs, between chromosome 17 positions 52,816,899 and 53,284,506 (NCBI build 37 assembly), that had an Illumina designability score (DS) > 0.9. Of these, 517 were successfully genotyped on the array and passed QC filters.

iCOGS genotyping and imputation

Case and control samples were drawn from studies participating in the BCAC, of which 41 (total: 46450 cases/42600 controls) were predominantly of European ancestry and nine (6269 cases/6624 controls) of Asian ancestry. We performed iCOGS genotyping in four centres, as part of the Collaborative Oncological Gene-Environment Study (COGS). All BCAC studies had local human ethical approvals as described previously8. We then used the genotype data from 517 SNPs that passed quality control to impute genotypes, among European subjects, at all additional known variants in the interval, using IMPUTE version 2.0 (IMPUTE2; without pre-phasing) and the 1000 genome project multi-population data (March 2012 version) as a reference panel9,10. IMPUTE2 was run with default parameters and "effective size" of the population Ne = 20,000. Using an imputation–r2 > 0.3 in Europeans, we successfully imputed 3,134 SNPs (MAF ≥ 1%).

Statistical Analysis

For each SNP, we estimated the per-allele log-odds ratio (OR) and standard error using logistic regression, including principal components and per-study fixed-effects to capture study-specific differences as previously described8. For the analyses of European subjects, we included the first six principal components as covariates, together with a seventh component derived specifically for one study (LMBC) for which there was substantial inflation not accounted for by the components derived from the analysis of all studies (this component was set to zero for all other studies). For the analysis of Asian subjects, we included two principal components8. We estimated per allele ORs under the assumption of a log-additive mode of inheritance, i.e. SNPs were coded according to the number of minor alleles 0, 1 or 2. We estimated main effects by subtype specific status (ER +/−) using case-control logistic regression and restricting the case sample to a specific subtype. We evaluated heterogeneity of association across tumour subtypes in a case-only analysis, treating subtype status as a dependent variable. We derived the P values by means of a likelihood-ratio test (one degree of freedom). Tests were two-sided. We carried out analyses separately among women of European and of Asian ancestry, defined by multiple dimensional scaling as previously described8. We performed multiple logistic regression analyses to identify SNPs independently associated with each phenotype. To identify the most parsimonious model, we included all SNPs with a P < 10−4 and MAF ≥ 2% in the single SNP analysis in forward selection regression analyses, utilizing the step function in R11 with penalty term set to 2012; we also used a joint analysis of all SNPs using a Bayesian-inspired penalised maximum likelihood approach (HyperLasso)13. To correctly account for uncertainty in the data resulting from the imputation process, we conducted analysis by regressing on the allele dosage for each genotype. For HyperLasso we utilized the most probable genotype as input, based on the posterior probability from the imputation algorithm (set to missing if all posterior probabilities were <0.9).

Bioinformatic analyses

We combined multiple sources of in silico annotation from public databases to help identify potential functional SNPs. To investigate functional elements enriched across the previously defined fine-mapped region, more specifically in the region encompassing the strongest candidate causal SNPs, we analysed chromatin biofeatures data from the Encyclopedia of DNA Elements (ENCODE) Project14 namely: Chromatin State Segmentation by Hidden Markov Models (chromHMM), DNase I hypersensitivity sites (DHS) and histone modifications of epigenetic markers H3K4, H3K9, and H3K27 in Human Mammary Epithelial Cells (HMEC) and MCF7 breast cancer cells. To identify putative target genes, we examined chromatin interactions between distal and proximal regulatory transcription-factor binding sites and gene promoters, using Chromatin Interaction Analysis by Paired End Tag (ChiA-PET) in MCF7 cells. This detects genome-wide interactions associated with CCCTC-binding factor (CTCF) and DNA polymerase II (Pol2) – both involved in transcriptional regulation15. Putative regulatory elements were determined using data from ENCODE, Roadmap Epigenomics16, the “Predicting Specific Tissue Interactions of Genes and Enhancers” (PreSTIGE)17 algorithm, Hnisz18 and FANTOM. Intersections between candidate causal variants and regulatory elements were identified using Galaxy, and visualised in the UCSC Genome Browser. We used the ENCODE RNAseq data to evaluate the expression of exons across the 17q22 locus in HMEC and MCF7 cell lines. The alignment files for HMEC (4 biological replicates) and MCF7 (19 biological replicates) were downloaded from ENCODE and the read count in the defined region was extracted and normalized in reads per million (RPM).

Allele specific expression (ASE) analysis

ASE analysis was performed using The Cancer Genome Atlas (TCGA) breast cancer data as described previously19. SNP rs2787481, genotyped on the Affymetrix SNP Array 6.0 was used as a representative SNP for the candidate causal variants (r2 = 0.90 with rs2787486). SNP rs2787481 genotype calls and the corresponding confidence scores were retrieved using level 2 TCGA SNP array Birdseed data downloaded from TCGA portal. Genotypes with confidence scores equal to or above 0.1 were excluded.

We utilised RNA-sequencing data from 742 breast cancer samples from women of Caucasian ancestry. The corresponding RNA-sequencing BAM files and metadata are available from the Cancer Genomics Hub (CGHub). Markers used to assess relative allelic expression were exonic SNPs located in KIF2B, TOM1L1, COX11, STXBP4, HLF, MMD, TMEM100, PCTP, and ANKFN1 extracted from dbSNP human Build 142. Homozygote marker SNPs, those with low coverage (less than 15x) and those within overlapping regions of the target genes, were removed. RNA-sequencing read counts on SNP sites for reference and alternative alleles were computed. The major allele fraction (μ), representing allelic imbalance for each marker SNP, was computed and an average of allelic imbalances for each gene was calculated for individual tumour samples. Marker SNPs with extreme μ values (μ > 0.75) were not included in the analysis. Level 3 SNP array data were downloaded from TCGA portal and GISTIC version 2.0.16 was used to identify copy number variations (CNVs) for each sample. Samples with low or high CNV levels, as presented in the gene-based GISTIC module report, were excluded from the analysis of the corresponding gene.

Allelic imbalance for the target transcripts was compared between rs2787481 heterozygote (CT) and homozygote (CC and TT) samples using Levene's Test for equality of variances. KIF2B, TMEM100, and ANKFN1 were excluded from the statistical analyses as they did not have enough informative marker SNPs left after applying the filtering criteria.

Local gene expression by SNP (eQTL) association analysis

We examined the association of all genotyped or imputed SNPs with expression of nine genes (KIF2B, TOM1L1, COX11, STXBP4, HLF, MMD, TMEM100, PCTP, and ANKFN1) in the 1 Mb region on either side of the fine-mapping interval, using data from the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) study. METABRIC comprises normal tissues adjacent to tumours from breast cancer patients genetically confirmed to be of European ancestry20. The samples (n = 135) were assayed for expression with the Illumina HT-12 v3 microarray. Matched germline SNP genotypes were derived using the Affymetrix SNP 6.0 array. Genotyping quality control and imputation for the METABRIC data are described in Guo et al.21. Association between genotype and expression was tested by linear regression with FDR control as implemented in the MatrixEQTL22 package in R11.

Four additional SNP-expression data sets were available and analysed separately: (1) NB116 consists of 116 Caucasian normal breast samples (the majority of Norwegian descent) with n = 10 tumour-adjacent normal biopsies. (2) BC241 consists of 241 Caucasian tumor (all stages) samples (the majority of Norwegian origin). These were both genotyped on the “iCOGS” SNP array, and gene expression levels were measured with Agilent 44 K23. (3) NB93 consists of 93 Caucasian adjacent normal breast samples from TCGA. Birdseed processed germline genotype data from the Affy6 SNP array were obtained from the TCGA dbGaP data portal24. (4) BC765 consists of 765 Caucasian breast tumour samples from TCGA24. Gene expression levels were assayed by RNA sequencing, RSEM (RNAseq by Expectation-Maximization25) normalized per gene or isoform, as obtained from the TCGA consortium24. Unexpressed and minimally expressed genes/isoforms whose sum in expression level was less than ten were excluded, and the data log2 transformed prior to analysis. The influence of SNPs on local gene expression (transcripts within 1 MB from the most strongly associated SNP) was assessed using a linear regression model, as implemented in the R11 library eMAP26. An additive effect was assumed by modelling the patient’s number of copies of the rare allele, i.e. 0, 1 or 2 for a given genotype. Correction for multiple testing was performed using the false discovery rate (FDR) as implemented in the p.adjust function in R.

eQTL data from the Genotype-Tissue Expression (GTEx) project27 were downloaded from the v6 release.


A total of 517 SNPs at chromosome 17 positions 52,816,899 to 53,284,506 (NCI build 37 assembly) were successfully genotyped using the iCOGs chip. Genotypes of other common variants across the region were imputed in the European studies using known genotypes in combination with a reference panel from the 1000 Genomes Project. 3,134 SNPs and insertion/deletion (indel) polymorphisms were reliably imputed (imputation r2 score > 0.3, MAF ≥ 0.01) and included in further analysis together with the 517 genotyped SNPs. In the European studies 139 genotyped or imputed SNPs were associated with overall risk of breast cancer (P values < 10−7) (Fig. 1). This set included SNPs rs6504950 and rs1156287 (r2 = 0.84), both previously reported3,4 to be associated with breast cancer risk among Europeans (Supplementary Table 1).

Figure 1: Association Results for Overall Breast Cancer Risk.
figure 1

Directly genotyped SNPs are shown as filled black circles, and imputed SNPs (r2 > 0.3, MAF > 0.02) are shown as open red circles, plotted as the negative log of the P value against relative position across the locus. A schematic of the gene structures is shown. Signals, encompassing all SNPs with a likelihood ratio of <1:100 compared with the most significant SNP, are labelled and are shown as grey regions. The dashed purple line represents genome-wide significance (P < 5 × 10−08).

Among the European ancestry studies the strongest association detected was with imputed SNP rs2787486 (OR [minor/major allele] = 0.92 [C/A]; 95% CI 0.90–0.94; P = 8.96 × 10−15), located in an intron of STXBP4 and strongly correlated with both previously reported GWAS hits (r2 = 0.83 with rs1156287; r2 = 0.73 with rs6504950). The strongest genotyped SNP association was rs244353 (OR = 0.92; 95% CI 0.90–0.94); P = 5.75 × 10−14) which lies ~15 kb from rs2787486 and is correlated with it (r2 = 0.99). A regression model suggests that both rs2787486 (P = 0.02) and rs244353 (P = 0.11) are detecting the same risk association. To dissect further the observed associations all SNPs displaying evidence for association (P < 10−4 and MAF ≥ 0.02) with overall breast cancer risk (228 SNPs, Supplementary Table 1) in European studies were included in a forward stepwise regression model. This analysis identified a single association signal marked by top imputed variant rs2787486 (that is, no further SNPs were associated after adjustment for rs2787486). We also utilized penalized logistic regression models (based on the normal exponential gamma probability density) implemented in HyperLasso13, including all typed and imputed variants with an specified lambda of 0.05 and a penalty of 491 for overall risk (based on the sample size and a type I error of 0.001)28.

In this analysis, the best fitting model also included just one SNP, rs2787486.

On the assumption of a single causal variant, we calculated the likelihood ratio of each SNP relative to rs2787486 with respect to overall risk and SNPs with a relative likelihood ratio of <1:100 were excluded from further consideration29. After this exclusion process 28 SNPs (17 genotyped and 11 imputed), spanning 52.3 Kb (positions 53,176,211 to 53,228,543), remained as candidate causal variants (Table 1, Supplementary Table 2). These SNPs have very similar allele frequencies and are strongly correlated with SNP rs2787486. The two SNPs first reported to be associated with breast cancer were both excluded from this set of 28 candidate causal variants by likelihood ratio tests relative to SNP rs2787486 (likelihood ratios: 1:177439 for rs6504950 and 1: 3271 for rs1156287). Caswell et al.30 subsequently identified marker rs11658717 as a potential causal candidate, but this variant is ranked 49th and has a likelihood ratio of 1: 3875 relative to lead SNP rs2787486 - hence this has also been ruled out as potential casual candidate by our analysis.

Table 1 Association result of independent signal among European and the set of highly correlated common variants.

Association with breast cancer subtypes

Based on data from European studies, 66 genotyped SNPs and 72 imputed SNPs were associated with risk of ER+ breast cancer (P values 10−7 to 10−14). The most strongly associated SNP for overall breast cancer (rs2787486) was also the most strongly associated for ER+ disease (OR = 0.91 (0.88–0.93), P = 1.39 × 10−14), but was more weakly associated with ER− disease (OR = 0.95 (0.91–0.99), P = 1.77 × 10−02,  = 0.017). The most strongly associated SNP for ER− disease was c17_pos53079506 (OR = 1.19 (1.07–1.33), P = 0.0017,  = 0.015) located ~130 kb from rs2787486.

To determine whether there were additional subtype-specific association signals, we included all SNPs displaying evidence for association with ER+ disease (345 SNPs, P < 10−4 and MAF ≥ 2%) in a separate forward stepwise regression model. The top associated SNP was rs2787486 (OR = 0.91 (0.88–0.93), P = 1.39 × 10−14) - the same SNP best-associated with overall risk and the same signal was localized by the HyperLasso search with penalty term set to 424. We calculated the likelihood ratio of each ER+ associated SNP (r2 > 0.6) relative to rs2787486 and retained a list of 37 markers (17 genotyped, 20 imputed) with a likelihood ratio of >1:100. This list included all 28 candidate causal SNPs for overall risk, except for imputed variant rs187242. No stepwise selection for ER- risk was performed as none of the markers fulfilled the inclusion criteria.

Overall breast cancer and subtype risk association in Asian studies

Among Asian studies, the strongest association with overall breast cancer risk was observed for genotyped SNP rs244353 (OR = 0.91 (0.85–0.93), P = 2.57 × 10−3). This SNP was one of the candidate causal SNPs in Europeans, and conferred a similar OR in both populations (Supplementary Table 3). Of the genotyped markers 299 (among Europeans) and 38 (among Asians) exhibit a marginal P-value ≤ 0.05; of which 27 were significant in both populations and 9 (Supplementary Table 4) were selected as potential candidates by relative likelihood filtering in the European population. No evidence of heterogeneity in tumour subtype OR was observed for this SNP. The strongest association with ER+ disease was with SNP rs7503456 (OR = 0.89 (0.84–0.95), P = 2.73 × 10−4), which showed no association in the European studies (OR = 1.00 (0.97–1.03), P = 0.775). For ER− disease the strongest association was found with c17_pos52831447 (OR = 1.91 (1.25–2.92), P = 3.8 × 10−3), which showed no association among the European studies (OR = 1.04 (0.98–1.09), P = 0.181).

Analyses of overlap between candidate causal variants and regulatory sites

The 28 candidate causal variants (Table 1, Supplementary Table 5) fall in a 53.2 kb region spanning two introns of STXBP4 (Fig. 1). We mapped these to regulatory annotations from ENCODE. Analysis of DNase hypersensitivity clusters indicates that SNP rs244353 overlaps with a DHS in 23 cell lines, while rs2787481 and rs244317 show overlap in one and three cell lines, respectively. However, none of these overlaps were observed in mammary cells. None of the candidate causal SNPs overlapped with histone modification marks (H3K4me1, H3K4me3, H3K9ac, H3K27ac) in the mammary cells line HMEC and MCF7 breast cancer cells (Fig. 2A). We analysed enhancer-promoter interactions using Chromatin Interaction Analysis by Paired End Tag (ChiA-PET) data for CCCTC-binding factor (CTCF) and DNA polymerase II (Pol2) in MCF7 breast tumour derived cells. Although multiple chromosomal interactions were observed across the locus for both Pol2 and CTCF in MCF7 cells there was a notable dearth of such interactions in the region encompassing the strongest candidate causal variants (Fig. 2B). No interactions were observed in Hi-C data from HMEC cells in this region (data not shown).

Figure 2: In silico analysis of the 17q22 locus.
figure 2

(A) Functional annotations using data from the ENCODE project. From top to bottom, epigenetic signals evaluated included DNase clusters, and ENCODE chromatin states (ChromHMM) and histone modifications in HMEC and MCF7 cell lines. The detailed colour scheme in chromatin states is described in the UCSC browser. All tracks were generated by the UCSC genome browser (hg 19). (B) Long-range chromatin interactions. From top to bottom, ChIA-PET data for Pol2 and CTCF in MCF7 cell lines. The ChIA-PET raw data available on GEO under the following accession (GSE63525.K56, GSE33664, GSE39495) were processed with the GenomicRanges package. -RNAseq data from MCF7 and HMEC cell lines. The value of the RNAseq analysis corresponds to the mean RPM value for COX11, TOM1L1 and STXBP4 from 4 HMEC and 19 MCF7 datasets, respectively. The annotation has been obtained through the Bioconductor annotation package TxDb.Hsapiens.UCSC.hg19.knownGene. The tracks have been generated using ggplot2 and ggbio library in R.

Data from Hnisz et al.18 indicates the existence of several enhancers across the region, including a small one predicted to target the STXBP4 gene (observed in both HUVEC and CD4 memory cells) that includes the candidate causal variant SNP rs244353 (Fig. 3). However, PreSTIGE17 indicates that an overlapping enhancer element (also containing rs244353) active in HepG2 cells may target the HLF gene. Another PreSTIGE element containing rs244336 and rs244337 is predicted to target HLF in colon crypt cells (Fig. 3).

Figure 3: Functional annotation of the 17q22 locus.
figure 3

Positions of candidate causal variants are shown as black tick marks in relation to local genes. Epigenomic marks associated with enhancers are shown as described by FANTOM, ENCODE, PreSTIGE and Hnisz et al.18. Enhancers which overlap candidate SNPs are coded in color to match their predicted target gene. The grey shaded region area depicts the region bound by un-excluded variants.

Local Gene Expression analyses

ENCODE RNA-seq data show that COX11 and TOM1L1 are highly expressed in both MCF7 and HMEC cells lines while STXBP4 shows much lower expression levels (Fig. 2B). We performed allele specific expression (ASE) analysis using RNAseq and SNP array genotype data from TCGA19. Allelic imbalance at marginal statistical significance (P =0.032) in COX11 expression was detected with the alleles of candidate causal SNP rs2787481 (r2 = 0.90 with rs2787486 ~1.3 Kb away) but not with any other genes within 1 Mb (TOM1L1, STXBP4, HLF, MMD, PCTP) using the same SNP (Supplementary Figure 3, Supplementary Table 6).

We also examined the associations of SNPs with the expression levels of the same local genes. In the normal tissue samples (n=135) from the METABRIC study the top breast cancer candidate causal variant was also associated with COX11 expression levels (Supplementary Table 7). The most significant breast cancer associated SNP, rs2787486, was associated with differential expression of COX11 (P = 0.00019, FDR corrected P = 0.05) but not significantly associated with expression of any other genes after FDR correction. However, other SNPs across this region were more significantly associated with COX11 expression (strongest association with SNP rs138326143, P = 1.4 × 10−7, FDR corrected P = 0.003,31) suggesting that the observed change in COX11 expression in normal breast tissue is unlikely to be the main driver of breast cancer risk. By contrast, no associations with COX11 expression were observed in the TCGA breast tumour samples with the top breast cancer risk SNPs (Supplementary Figure 5). However, in TCGA multiple SNPs associate with expression of the shortest isoform of STXBP4 (uc010dcc) with the top breast cancer risk SNP, rs2787486 having a FDR corrected P = 4.0 × 10−8 (r2 = 0.06, Supplementary Figure 4). Other SNPs, including rs244317 and rs11658717 displayed more significant associations with expression of this isoform (FDR corrected P = 3.8 × 10−9, r2 = 0.07, Supplementary Figure 4) than the top breast cancer risk SNP. The minor alleles are associated with increased expression of isoform uc010dcc, and explain 7% of the variation in its expression levels. Of note Caswell et al.30 reported that the A-G base change of SNP rs11658717 mediates the use of different splice junction between exons 5 and 6 of the STXBP4 gene and thus generates the shorter uc010dcc isoform. Our expression data thus support this report but our association evidence (likelihood ratio 1:3875 relative to lead SNP rs2787486) indicates that this SNP is unlikely to be a causal variant driving breast cancer risk.

We also interrogated candidate variants in the v6 data release from the Gene-Tissue Expression (GTEx) project32. We found a significant association between the minor allele of SNP rs244353 and decreased expression of their measured STXBP4 (full length) isoform in multiple tissues including breast (n = 183; P = 1.3 × 10−6; Supplementary Figure 6, Supplementary Table 8). These different METABRIC, TCGA and GTEX findings appear contradictory of each other: SNPs rs244353 and rs244317 are highly correlated (r2 = 0.90 and yet their minor alleles are significantly associated with decreased STXBP4 expression in GTEx but increased expression of isoform (uc010dcc) in TCGA. One possible explanation is that the STXBP4 full length transcript (measured in GTEx) and the short transcript (uc010dcc, measured in TCGA) are regulated by different mechanisms33.


In this - study, using more than 100,000 cases and controls of European and Asian ancestry participating in BCAC, we have confirmed previous reports of associations of SNPs in the 17q22 region with risk of breast cancer3,4. Moreover, we identified a set of 28 strong candidate causal variants, of which one or more is the likely driver of these reported associations. Of these, SNP rs2787486, which is correlated with previously reported candidates: rs6504950 (r2 = 0.73), rs1156287 (r2 = 0.83) and rs11658717 (r2 = 0.84)5,30; was the most strongly associated variant with overall risk (OR = 0.92 (95% CI: 0.90–0.94), P = 8.96 × 10−15). A similar magnitude of association was observed in both European and Asian women, consistent with the same causal variant mediating risk in both populations. The association was stronger for ER+ than ER− breast cancer.

All the remaining candidate causal variants lie in a 53 Kb region (positions 52,176,211 to 53,228,543) spanning two introns of the STXBP4 gene. None are predicted to alter the coding sequence of this gene and so it is most likely that the association is mediated through altering the regulation of one or more nearby genes. CHIA-Pet studies in the breast cancer MCF7 cell line reveal many chromatin interactions across the wider region (Fig. 2); however, there is a dearth of such interactions in the region encompassing the strongest candidate causal variants. Furthermore, in MCF7 or HMEC mammary cell lines there was no evidence of histone modification or open chromatin, indicative of the existence of regulatory regions, overlapping the best candidate causal variants, although such regions do exist in the wider region studied (Fig. 3). In this respect, this association signal differs from other breast cancer association signals in which strong evidence of regulatory elements in mammary cell lines has been observed34. An enhancer is predicted by FANTOM in many cell types while data from Hnisz et al.18 (Fig. 3) indicates the existence of a small enhancer region, targeting STXBP4 (observed in both HUVEC and CD4 memory cells) that overlaps with candidate causal variant rs244353. However, PreSTIGE data indicate that nearby enhancer elements may target HLF in HepG2 and colonic crypt cells (Fig. 3).

Of the candidate genes in the region, both COX11 and TOM1L1 are highly expressed in both the HMEC and MCF7 breast cancer cell lines, while STXBP4 shows much lower expression (detected by RNAseq in TCGA). In support of COX11 and TOM1L1 being the targets of this breast cancer susceptibility locus, eQTL analyses in normal breast tissue showed borderline significant associations of the risk alleles of top candidate causal SNP rs2787486 with increased expression levels of both TOM1L1 and COX11; candidate SNP rs2787481 also showed evidence of allelic imbalance in COX11 expression. COX11 encodes a cytochrome c oxidase copper chaperone – a nuclear-encoded protein component of a mitochondrial-membrane-embedded respiratory complex and TOM1L1 encodes a Target of myb1-like1 membrane trafficking protein35. Both genes are expressed in the majority of tissues examined in the Human Protein Atlas32. eQTL analysis in breast tumour tissues in TCGA find the risk allele of top candidate breast cancer risk SNP, rs2787486, to be significantly associated with increased STXBP4 expression, but not with COX11 (Supplementary Figures 4 and 5).

Furthermore, Hnisz et al.18 indicates the presence of an enhancer element that overlaps with candidate causal SNP rs244353 and potentially targets STXBP4 (observed in both HUVEC and CD4 memory cells). Consistent with this, TCGA eQTL studies in breast tumour tissues find the risk allele of top candidate breast cancer risk SNP, rs2787486, to be significantly associated with increased STXBP4 mRNA expression. The STXBP4 gene encodes Syntaxin binding protein 4, a scaffold protein, which has been shown to stabilise and prevent degradation of an isoform of p6336. P63 is, in turn, a member of the p53 tumour suppressor protein family and thus possibly a biologically more plausible candidate cancer gene than COX11 or TOM1L1.

We conclude that one or more of the 28 variants we identified is causally related to breast cancer risk, most likely through regulation of STXBP4, COX11 and TOM1L1, with the balance of the evidence favouring STXBP4 as the most important target. It remains possible, however, that the target gene(s) is more distant (>1 Mb) from the associated variants and so have not yet been considered. Further functional analyses will be required to determine the mechanism underlying this association and the downstream targets.

Additional Information

How to cite this article: Darabi, H. et al. Fine scale mapping of the 17q22 breast cancer locus using dense SNPs, genotyped within the Collaborative Oncological Gene-Environment Study (COGs). Sci. Rep. 6, 32512; doi: 10.1038/srep32512 (2016).