Polygenic risk scores and breast and epithelial ovarian cancer risks for carriers of BRCA1 and BRCA2 pathogenic variants

Purpose We assessed the associations between population-based polygenic risk scores (PRS) for breast (BC) or epithelial ovarian cancer (EOC) with cancer risks for BRCA1 and BRCA2 pathogenic variant carriers. Methods Retrospective cohort data on 18,935 BRCA1 and 12,339 BRCA2 female pathogenic variant carriers of European ancestry were available. Three versions of a 313 single-nucleotide polymorphism (SNP) BC PRS were evaluated based on whether they predict overall, estrogen receptor (ER)–negative, or ER-positive BC, and two PRS for overall or high-grade serous EOC. Associations were validated in a prospective cohort. Results The ER-negative PRS showed the strongest association with BC risk for BRCA1 carriers (hazard ratio [HR] per standard deviation = 1.29 [95% CI 1.25–1.33], P = 3×10−72). For BRCA2, the strongest association was with overall BC PRS (HR = 1.31 [95% CI 1.27–1.36], P = 7×10−50). HR estimates decreased significantly with age and there was evidence for differences in associations by predicted variant effects on protein expression. The HR estimates were smaller than general population estimates. The high-grade serous PRS yielded the strongest associations with EOC risk for BRCA1 (HR = 1.32 [95% CI 1.25–1.40], P = 3×10−22) and BRCA2 (HR = 1.44 [95% CI 1.30–1.60], P = 4×10−12) carriers. The associations in the prospective cohort were similar. Conclusion Population-based PRS are strongly associated with BC and EOC risks for BRCA1/2 carriers and predict substantial absolute risk differences for women at PRS distribution extremes.


Genotyping and SNP imputation
Genotyping was performed on one of two bespoke SNP arrays. The majority of the samples were genotyped using the OncoArray (15,679 BRCA1 and 10,981 BRCA2 carriers) [1][2][3]

. The
OncoArray is a custom genotyping array comprising approximately 533,000 SNPs, including a GWAS backbone component tagging common SNPs across the genome which accounted for approximately half of the SNPs on the array. The remaining 3,256 (17.2%) BRCA1 and 1,358 (11.0%) BRCA2 pathogenic variant carriers were genotyped using the iCOGS array 4,5 which included approximately 210,000 SNPs selected primarily on the basis of evidence of association with breast, ovarian and prostate cancers.
A standard quality control (QC) process was applied for samples genotyped on both arrays, which included assessment of the SNP call rate, allele frequency, genotyping intensity clustering metrics, Hardy-Weinberg equilibrium and SNP concordance in duplicate samples 2 . Two-stage imputation was performed using SHAPEIT software 6 for phasing and IMPUTE2 software 7 for imputation using the 1000 Genomes Project (Phase 3) reference 5 panel.
SNPs were included in the PRS if they were adequately imputed in the CIMBA data.

Principal components analysis
To adjust for potential (intra-continental) population stratification in the OncoArray dataset, principal components analysis was performed using data from 33,661 uncorrelated SNPs (which included 2,318 SNPs specifically selected on informativeness for determining continental ancestry) with a MAF of at least 0.05 and maximum correlation of 0.1 in the OncoArray dataset, using purpose-written software (http://ccge.medschl.cam.ac.uk/software/pccalc). A similar approach was used for the iCOGS dataset.

Breast cancer and epithelial ovarian cancer PRS
PRSs were constructed as the weighted sum of alleles for 313 SNPs for breast cancer and 30 SNPs for epithelial ovarian cancer (EOC), thus the PRS for each participant, i, was calculated as: where gij is the genotype or imputed dosage for variant j observed for individual i and βj is 6 weight for the j th SNP.
The weights for the breast cancer PRS were the log Odds Ratio (log-OR) estimates of association used to construct the 313 SNP PRS based on data from the general population and reported by Mavaddat et al 8

Calculating the theoretical PRS
The theoretical PRS distribution under a multiplicative model was used in comparisons against the PRS percentile specific association estimates. For the theoretical PRS, the variance attributable to SNP i was given by: where Ei is the expected value of β, given by: where βi is the per-allele log-OR and pi is the allele frequency for SNP i 11 and were obtained from the population-data used in the PRS construction for breast and ovarian cancer (Tables   S1 and S2) 3,8 . The mean PRS is then given by: and the theoretical PRS variance is given by: The allele frequencies were obtained from the 1000 Genomes Project European ancestry samples. The theoretical HRs at each percentile were calculated assuming the PRS is normally distributed with mean ̅̅̅̅̅̅ and variance V (i.e. the HRs were log-normally distributed).

Weighted cohort analysis
The retrospective cohort association analyses were undertaken using weighted Cox regression models 12 . These analyses accounted for the non-random sampling of BRCA1 and BRCA2 carriers with respect to their disease (breast cancer and ovarian cancer) status.
In such retrospective studies, affected carriers tend to be oversampled because BRCA1 and BRCA2 testing is targeted to affected individuals who may also be diagnosed at an early age. Therefore, the carriers in CIMBA retrospective study do not represent a true cohort of BRCA1 and BRCA2 carriers. We have previously shown that under these conditions, standard Cox regression analysis leads to biased estimates of the rate ratios 12,13 . To correct for this bias, we used the weighted cohort approach 12,13 . Briefly, this method involves assigning different weights to cancer cases and unaffected individuals which are age-and gene-specific, such that the weighted observed incidence rates are consistent with established incidence rates for carriers of pathogenic variants in BRCA1 and BRCA2 14 . This 8 approach has been shown in simulation studies to yield unbiased estimates of association 12,13 . The weighted cohort analysis was carried out in R "survival" library command coxph(model,robust=TRUE,weights=w) where w represents the age specific weights.

Model comparisons
Likelihood ratio tests (LRTs) were undertaken to determine whether the models which include interaction terms (age-varying PRS, PRS interaction with gene variant location and PRS interaction with gene variant class) fitted data better than the nested model that did not include the interaction term. Here we considered two models: (i) a model that includes the PRS interaction term, with a corresponding log-likelihood, LI and the nested model without the interaction term with log-likelihood LN. Hence, the LRT comparing these models has the form: where Δd denotes degrees of freedom, given by the difference in number of parameters estimated between the two models. The BCFR is a family cohort recruited from six sites from Australia, Canada and the USA. The families were followed-up regularly by annual contact of probands and systematic 5-year follow-up of families that collected demographic and epidemiological data from all study participants.
The kConFab recruited pathogenic variant carriers from multi-case families that had been ascertained since 1997 by family cancer clinics in Australia and New Zealand.
kConFab study participants were independent from BCFR participants from Australia. Study participants were systematically followed by the kConFab Follow-Up Study 15 using mailed questionnaires every three years, with self-reported cancers and prophylactic surgeries confirmed from medical records. BBCC follow-up ended in December 2013 16 .

Association analysis in prospective cohorts
To assess associations between the PRS and breast cancer risk, eligibility was restricted to female BRCA1 and BRCA2 carriers who at completion of the baseline questionnaire were free of any cancer diagnosis (excluding non-melanoma skin cancer) and had not undergone risk-reducing bilateral mastectomy. Study participants were followed from baseline until the first of: (i) age 80-years; (ii) death; (iii) completion of last follow-up questionnaire or last record linkage, whichever occurred last; (iv) risk-reducing bilateral mastectomy; or (v) diagnosis of any first cancer (apart from non-melanoma skin cancer). Participants diagnosed with a first breast cancer were considered affected.
To assess associations between the PRS and ovarian cancer risk, eligibility was restricted to women who had not been diagnosed with ovarian cancer and had not had RRSO at the time of baseline questionnaire completion. To maximise statistical power, carriers with a prior breast cancer or non-melanoma skin cancer diagnosis were retained for the prospective ovarian cancer analyses, but carriers with prior diagnoses of other cancers were excluded, in line with previous prospective studies of ovarian cancer risk for BRCA1/2 pathogenic variant carriers 16 . Participants were followed from baseline until the first of: (i) age 80-years; (ii) death; (iii) completion of last follow-up questionnaire or record linkage, whichever happened last; (iv) bilateral RRSO, or bilateral salpingectomy, or removal of both ovaries for any other reason; or (v) any cancer diagnosis (except breast or non-melanoma skin cancer). Carriers diagnosed with invasive ovarian, fallopian tube, or peritoneal cancer during the follow-up were considered affected.
Associations using the harmonised prospective cohorts were analysed using Cox regression, separately for BRCA1 and BRCA2 carriers. Statistical models were stratified by consortia (CIMBA or BBCC), birth cohort, country, and Ashkenazi Jewish ancestry, adjusted for family history of the appropriate cancer in first-and second-degree relatives. Robust variance estimates were calculated considering family membership.

Calculating absolute cancer risks by PRS
Breast cancer absolute risks were calculated by PRS category and also by PRS category in combination with variant locations in BRCA1 and BRCA2, and in combination with family history of breast cancer (absence or presence of family history).
For these calculations we assumed external estimates of overall breast cancer incidence and breast cancer incidence estimates for different pathogenic variant locations and cancer family history status. The external estimates were obtained from previously published prospective penetrance studies in BRCA1 and BRCA2 pathogenic variant carriers 16 . For all analyses, to obtain the breast cancer incidences for each PRS percentile category, we constrained the breast cancer incidences over all PRS categories to agree with the prospectively estimated breast cancer incidence rates using the constraining approach described previously 8,[16][17][18] . In this we assume that the breast cancer incidence for someone in PRS category c is given by 0 ( )exp ( ) where λ0(t) is the baseline incidence (for those in the baseline PRS category) which is unknown, and βc is the corresponding log-HR of association with breast cancer risk for a carrier in category c relative to the baseline category. Given this constraining, it was previously shown 16,17 that λ0(t) is given by: where i(t) is the assumed external incidence (i.e. the average over all PRS effects), τc is the proportion of carriers in PRS category c and Sc(t) is the probability of surviving the disease to age t in PRS category c. λ0(t) can be calculated iteratively assuming Sc(0)=1 over the ages t.
Once λ0(t) was calculated, the incidence for each PRS category is given by: 0 ( )exp( ).
This process was carried out assuming the external incidence estimates for overall breast cancer, incidences by pathogenic variant location or by family history separately 16 .

Calculating 10-year cancer risks
The 10-year risk of developing breast or ovarian cancer at age t was calculated as the risk difference between ages (t+10) and t, conditional on not developing cancer up to age t.
Mathematically this can be written as: 10 is the 10-year risk and P(t) is the cumulative disease risk at age t and is calculated using the PRS specific incidences calculated in the previous section.

Table S1
The 313 SNPs used to construct the breast cancer PRS 8 . The same set of 313 SNPs was used to construct the PRSER-and PRSER+. The ER-specific PRS used different SNP weights (log-ORs for ER-specific breast cancer) if they had a statistically significant different effect on ER-subtype from a population-based breast cancer case-only analysis.

Table S2
The 30 SNPs used to construct the ovarian cancer PRS. The 22 SNPs used to form the high-grade serous ovarian cancer PRS are highlighted in grey. The high-grade serous specific PRS was limited to SNPs that showed genome-wide statistical significance (P<5x10 -8 ) with any of the ten ovarian cancer subtypes, had concordant direction of effects between overall all invasive and high-grade serous disease, and exhibited nominal statistical significance (P<0.05) with high-grade serous ovarian cancer 3 . The "overall" and "high-grade serous" ovarian cancer (per-allele) effect sizes and P-values were taken from 3 and/or 9 .

Table S3
Retrospective cohort characteristics for 18,935 BRCA1 and 12,339 BRCA2 carriers recruited by the CIMBA. Breast cancer and ovarian cancer refer to the first cancer diagnosis.
Censoring ages are reported in years. Pathogenic variant classes: I = unstable or no protein; II = stable mutant protein; III = consequence unknown. Pathogenic variant locations are in base pairs (bp) within the BRCA1 and BRCA2 genes. ER-status is oestrogen receptor status of the breast tumour. Cancer family history is reported for the relevant cancer from first and second degree relatives. "Unknown" family history = reported unknown cancer family history, "missing" family history = family history data not collected. IQR = interquartile range; SD = standard deviation.

Table S4
14 Validation data summary statistics from prospective cohorts (CIMBA and BBCC). Validation data are presented for the breast cancer PRS and ovarian cancer PRS by disease status at censoring. The PRSER-is reported for BRCA1 carriers, whilst the PRSER+ is presented for BRCA2 carriers with respect to the breast cancer data. PRSHGS is shown for both BRCA1 and BRCA2 carriers for the ovarian cancer data. The median (IQR) age at start of follow-up, follow-up time and age at cancer diagnosis (years) are displayed. The mean and SD are shown for the appropriate PRS. IQR = interquartile range; N = sample size; SD = standard deviation.

Table S5
Assumed proportions and hazard ratios used to constrain the breast cancer incidences from the external BBCC prospective cohort study for breast cancer family history and gene variant location 16 . The absolute risks of breast cancer at the 5th, 50th and 95th percentiles of the PRS are shown (absolute risk curves are plotted in Figures S5-S9).

Figure S1
Histograms of imputation accuracy (r 2 statistics) for the 313 breast cancer PRS SNPs.
Imputations were performed separately for genotyping arrays (iCOGS or OncoArray) and separately for BRCA1 and BRCA2 carriers. All SNPs were well imputed (r 2 ≥0.49).

Figure S2
Histograms of imputation accuracy (r 2 statistics) for the 30 ovarian cancer PRS SNPs.
Imputations were performed separately for genotyping arrays (iCOGS or OncoArray) and separately for BRCA1 and BRCA2 carriers. All SNPs were well imputed (r 2 ≥0.64).

Figure S3
Forest plots of country specific PRS hazard ratios estimated using the CIMBA retrospective cohort. These models tested for heterogeneity in PRS effects across countries by fitting a PRS by country interaction term. The baseline country was assumed to be UK/Eire.                      Figure S9