Introduction

Less than 10% of human breast cancers (BCs) show pronounced familial clustering that can be explained by high penetrance germline variants, but sporadic BCs are also co-determined by low penetrance variants.1, 2, 3 Genome-wide association studies comparing BC cohorts with the general population have identified almost 100 single-nucleotide variants (SNVs) associated with BC, with odds ratios (ORs) below 1.2 for most individual SNVs.4, 5 However, unlike for high penetrance variants, how SNVs functionally increase BC risk is difficult to establish and mostly unknown.

In sporadic BC, next-generation sequencing has revealed a mutational landscape characterised by a large number of somatic variants (SM), but few recurrently mutated genes: PIK3CA (25–35%), TP53 (20–30%), and to a lesser extent MAP3K1, GATA3 and CDH1, with a certain preference for specific BC (sub)types.6, 7, 8 Recurrent somatic variants drive tumour progression, but little is known about what leads to the development of tumours carrying these variants. We reasoned that risk-associated genetic variants might modify driver gene penetrance, and investigated the issue by analysing the association of a small series of known BC risk-associated SNVs with the occurrence of SM in the frequently mutated genes PIK3CA, TP53 and MAP3K1.

Methods

Twenty-one SNVs were selected (Supplementary Table S1): 11 from O’Brien et al.;9 rs889312, an expression quantitative trait locus (eQTL) SNV,10, 11 and three further SNVs in high linkage disequilibrium (LD) with it, all located at 5q11.2 (Supplementary Table S2); and seven further SNVs from Rhie et al.12 We preferred SNVs associated with oestrogen receptor α-positive (ER+) BC risk (often higher than the risk for all BC) and close to coding sequences, to maximise the use of genotyped data without imputation of allelic variant status.

Because no previous OR assessment was available, we started by analysing a small pilot data set sufficiently powered to detect strong associations: the Ellis et al.13 data set, comprising whole-genome and exome sequences of samples from 77 ER+ BC patients of CEU ethnicity (genomic data from the dbGaP14 repository http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000472.v1.p1, authorisation #7444; SM data from Ellis et al.13). To confirm the observed associations, we used the data of 754 ER+ BC patients from the BRCA data set of the TCGA consortium collection7 (genomic data http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000178.v9.p8, available from the CGHub15 repository, authorisation #7821; public data at https://tcga-data.nci.nih.gov and at many research portals such as CBIO (http://www.cbioportal.org/)16). Clinical data and protected SRR/BAM files of normal blood samples, restricted to selected sequences that included the 21 SNVs for the Ellis 2012 data set or the four 5q11.2 SNVs for the BRCA data set, were downloaded by using dbGaP’s SRA-Toolkit service for Ellis 2012 and the beta2 version of CGHub’s BAM-Slicer service for TCGA-BRCA. Downloading only selected sequences substantially reduced the multi-terabyte secure file handling required for full data sets; to our knowledge, this is the first report of an analysis performed entirely by using selected downloads. Data were processed with Python and R scripts for realignment, filtering and allele calling, then linked to public (level 3) SM data of BC samples. For the Ellis 2012 pilot data set, ORs were calculated for each allele (dominant model) of the 21 SNVs vs the SM status of each of PIK3CA, MAP3K1 and TP53. Associations were selected when the uncorrected Fisher’s exact test gave P<0.1 and the OR was >3. Allele correlations were assessed by Pearson’s r coefficient. For the BRCA data set, ER+ samples were selected and a similar analysis was performed on the four 5q11.2 SNVs vs PIK3CA SM only.
Fisher’s P values were reported without multiple-testing correction owing to the high correlations among the four SNVs.17 Logistic regression and ethnicity-stratified analyses with forest plots and the Breslow–Day homogeneity test were also performed. Public (level 3) log2-normalised gene expression data for the BRCA BC samples were merged with the SNV/SM/clinical data, and the significance of expression differences was assessed by using Student’s t-test. Statistical analyses were performed by using R(x64) 3.1.3 (http://www.R-project.org). Result data were submitted to the GWAS Central repository (http://www.gwascentral.org/study/HGVST1843).
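As an illustration of the association screen described above, the following minimal Python sketch builds a 2×2 table under a dominant model, computes the sample OR and a two-sided Fisher's exact P value, and applies the pilot selection rule (P<0.1 and OR>3). It is not the authors' script: the function names, the two-letter genotype strings and the toy samples are assumptions made for illustration, and the two-sided P value follows the common convention of summing the probabilities of all tables no more likely than the observed one.

```python
from math import comb

def dominant_carrier(genotype, allele):
    """Dominant model: carrier if the genotype holds at least one copy of the allele."""
    return allele in genotype

def count_table(samples, allele):
    """Return (a, b, c, d): carrier/mutated, carrier/WT, non-carrier/mutated, non-carrier/WT."""
    a = b = c = d = 0
    for genotype, mutated in samples:
        if dominant_carrier(genotype, allele):
            if mutated:
                a += 1
            else:
                b += 1
        elif mutated:
            c += 1
        else:
            d += 1
    return a, b, c, d

def odds_ratio(a, b, c, d):
    """Sample odds ratio (a*d)/(b*c); infinite when b*c is zero but a*d is not."""
    if b * c == 0:
        return float("inf") if a * d > 0 else float("nan")
    return (a * d) / (b * c)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P with all margins fixed (hypergeometric model)."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum tables that are no more likely than the observed one (small tolerance for ties).
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)

def is_selected(a, b, c, d, p_max=0.1, or_min=3.0):
    """Pilot selection rule from the text: uncorrected P < 0.1 and OR > 3."""
    return fisher_exact_two_sided(a, b, c, d) < p_max and odds_ratio(a, b, c, d) > or_min

# Hypothetical toy data: (germline genotype, somatic mutation present in the gene of interest)
toy_samples = [("AG", True), ("AA", True), ("AG", True), ("AG", False),
               ("GG", True), ("GG", False), ("GG", False), ("GG", False)]
table = count_table(toy_samples, "A")
print(table, odds_ratio(*table), fisher_exact_two_sided(*table), is_selected(*table))
```

With so few toy samples the OR is large but the P value is not small, which shows why the rule requires both conditions.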

Results

In the Ellis 2012 pilot data set, we found a strong indication that four SNVs in the 5q11.2 locus were associated with the mutational status of PIK3CA with high OR (Table 1). As expected from their LD, high correlation was found among these variants. Most of the other high OR SNV/SM associations had very low significance levels (full results in Supplementary Table S3). We focused on the associations of the four 5q11.2 SNVs with PIK3CA SM and verified them in the data of 754 ER+ BC patients from the TCGA-BRCA data set. Two variants, rs331499 (hg19.chr5:g.56210923A>G) and rs252913 (hg19.chr5:g.56195846G>A), located at the boundary of the MAP3K1 and SETD9 genes, were confirmed to be correlated with PIK3CA SM with high OR; a third, rs832552 (hg19.chr5:g.56113850T>G), inside MAP3K1, had few valid samples but a similar OR trend (Table 1). High correlation (Pearson's r range: 0.73–0.94) among the variants was confirmed (Supplementary Table S4). In logistic regression, no evidence of significant heterogeneity for ethnicity was found (Supplementary Tables S4 and S5).
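The correlation among variants reported above can be illustrated with a short Python sketch of Pearson's r computed on per-sample allele counts. The coding used in the study is not stated in the text, so the 0/1/2 allele-dosage coding, the function names and the toy genotypes below are assumptions made only for illustration.

```python
from math import sqrt

def allele_dosage(genotype, allele):
    """Number of copies (0, 1 or 2) of the allele in a two-letter genotype string."""
    return genotype.count(allele)

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)

# Hypothetical genotypes of two SNVs in the same five samples
snv1_genotypes = ["AA", "AG", "GG", "AG", "AA"]
snv2_genotypes = ["GG", "GA", "AA", "GA", "GA"]

snv1_dosage = [allele_dosage(g, "A") for g in snv1_genotypes]
snv2_dosage = [allele_dosage(g, "G") for g in snv2_genotypes]
r = pearson_r(snv1_dosage, snv2_dosage)
print(round(r, 3))
```

Note that the sign of r depends on which allele of each SNV is counted, so the alleles being compared should always be stated alongside the coefficient.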

Table 1 Association of SNVs close to the MAP3K1 gene with somatic PIK3CA variants, two cohorts

The three 5q11.2 variants were found to be associated with the overexpression of one or both of their nearest genes, MAP3K1 and SETD9; for MAP3K1, associations were stronger in PIK3CA SM than in wild-type (WT) samples (Table 2 and Supplementary Tables S6–S8). Furthermore, we found a direct association of both MAP3K1 and SETD9 overexpression with PIK3CA SM status: MAP3K1 expression at PIK3CA SM/WT, difference of means: 0.38 (95% CI: 0.19, 0.57), P=4e−4; SETD9 expression at PIK3CA SM/WT, difference of means: 1.44 (95% CI: 1.24, 1.63), P=2e−16.
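The expression comparisons above rest on a difference of group means assessed by Student's t-test. The following Python sketch shows the pooled-variance form of that statistic on hypothetical log2 expression values; the function name and the numbers are illustrative assumptions and do not reproduce the reported results. Converting the statistic into a P value or a confidence interval additionally requires the t distribution, for example through `scipy.stats.ttest_ind` with `equal_var=True`.

```python
from math import sqrt
from statistics import mean, variance

def student_t(group1, group2):
    """Return (difference of means, t statistic, degrees of freedom), pooled-variance form."""
    n1, n2 = len(group1), len(group2)
    if n1 < 2 or n2 < 2:
        raise ValueError("each group needs at least two values")
    diff = mean(group1) - mean(group2)
    df = n1 + n2 - 2
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / df
    std_err = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return diff, diff / std_err, df

# Hypothetical log2 expression values of one gene, split by somatic mutation status
expr_mutated = [9.1, 9.6, 9.4, 9.9, 9.5]
expr_wild_type = [8.9, 9.0, 9.3, 8.8, 9.2]

diff, t_stat, df = student_t(expr_mutated, expr_wild_type)
print(round(diff, 2), round(t_stat, 2), df)
```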

Table 2 5q11.2 SNV association with MAP3K1 and SETD9 gene expression in TCGA-BRCA (ER+) data set

Discussion

In this short report, we show that germline SNVs located near the MAP3K1/SETD9 genes associate with PIK3CA SM in ER+ BC with OR values (1.75 and 2.97 for rs331499 and rs252913, respectively) much higher than their OR of association with BC or BC subtypes (OR about 1.1,18 as for most cancer-risk SNVs4). SNV data are coherent with gene expression data: the SNV associations with MAP3K1/SETD9 overexpression become stronger as the distance from the target gene decreases and, for MAP3K1, are stronger in PIK3CA SM BC samples. The overall picture is compatible with a MAP3K1/SETD9 variant-dependent overexpression affecting PIK3CA SM penetrance. Moreover, we found a clear direct association of PIK3CA SM with MAP3K1 and SETD9 overexpression. Indeed, inter-regulation between the PI3K and MAP-kinase pathways has been described in in vitro experiments and computer simulation,19 and the combination of drugs targeting both pathways is under clinical investigation.20 A possible SETD9 involvement is suggested by the strong SNV associations with SETD9 overexpression; moreover, a 5q11.2 SNV eQTL for SETD9 has also been reported in normal blood.18 However, we found a synergy of PIK3CA SM and SNV only for MAP3K1 overexpression.

Two of our findings indicate that a complex BC risk SNV structure is present in the 5q11.2 region. First, only the SNVs at the boundary of the MAP3K1/SETD9 genes (but not the reference risk SNP rs889312, with which they are in high LD) were found to be associated with PIK3CA SM. Second, phasing data showed that the SNV alleles associated with increased PIK3CA SM (and MAP3K1/SETD9 overexpression) are actually correlated with the reduced BC risk allele (A) of rs889312 (ref. 11; Supplementary Table S10). Hence, their opposite alleles should be associated with BCs in which PIK3CA is not mutated, to account for their overall increased BC risk.18 This ‘reverse’ phase should not be surprising, because the SNV/PIK3CA SM associations found have an allele unbalancing effect one order of magnitude stronger than the reported SNV/BC risk. However, it predicts the presence of multiple classes of SNV BC risk in the 5q11.2 segment that split when probed for PIK3CA somatic variants.

This multiplicity could be a consequence of MAP3K1 ubiquitinase activity in addition to its kinase activity, which can therefore both activate and destabilise MAP-kinases.21, 22 The complex BC risk SNV structure has been confirmed by a recent fine-scale analysis of the 5q11.2 region in a large cohort of patients (not available when we started our investigation) that identified, by logistic regression, four BC risk-associated haplo-blocks.17 By analysing, in the BRCA data set, four genotyped SNVs representative of the haplo-blocks, we found that only one SNV allele correlated with enriched PIK3CA SM, and it was associated with a reduced BC risk (Supplementary Table S9).

In conclusion, the germline 5q11.2 variants, rs331499 allele A and rs252913 allele G, are associated with MAP3K1 and SETD9 overexpression, and correlate with increased PIK3CA SM frequency in ER+ BC. Genome-wide analysis of SNV/SM associations can increase our understanding of tumour biology with relevant information for precision medicine.