Identifying men at greatest risk of developing prostate cancer remains an important challenge. Risk stratification using common genetic markers, such as single-nucleotide polymorphisms (SNPs), shows promise for more effectively identifying men at greatest risk of developing aggressive or fatal prostate cancer [1, 2]. Analyses of benefit, harm, and cost-effectiveness support use of genomic risk stratification to guide prostate cancer screening [3, 4]. These tools must perform well in diverse populations so that risk stratification is optimal for all men and does not exacerbate existing health disparities [5, 6].

We previously developed a polygenic hazard score (PHS) and demonstrated in an independent European dataset that the PHS was associated with age at diagnosis of clinically significant prostate cancer [1]. Risk stratification with this score also improved the accuracy of PSA testing [7, 8]. We then validated the PHS model (with 46 SNPs) for association with age at diagnosis and with prostate-cancer-specific mortality in a dataset that included men of European, African, and Asian ancestry [2]. We have also improved performance in men of African ancestry by searching for SNPs within that subpopulation [9, 10].

Recent meta-analyses have identified over 200 SNPs associated with prostate cancer, including some identified through subset analyses in men of non-European ancestry [11, 12]. Given the increasing number of SNPs associated with prostate cancer, we evaluated whether including more SNPs in a prostate cancer PHS would improve associations with clinically significant prostate cancer in multi-ancestry datasets.



Genotype and phenotype data, all de-identified, were obtained from the PRACTICAL consortium. Participants had previously been genotyped via the OncoArray [13] or the iCOGs [14] chips; 90,638 men were available for this analysis.

The available data were split into a training dataset and four testing datasets, taking into account prior power analyses and PHS association results [7, 15, 16]. The training dataset for the model included 72,181 men of European genetic ancestry genotyped via OncoArray (24,010 controls and 48,171 cases). The four testing datasets included: (1) men of African ancestry (n = 6253: 3013 controls and 3240 cases), (2) men of Asian ancestry (n = 2378: 1184 controls and 1194 cases), (3) the Cohort of Swedish Men (COSM) population-based cohort with long-term outcomes [17] (n = 3279: 1116 controls, 2163 cases, and 278 prostate cancer deaths), and (4) the ProtecT population-based prospective trial with screening (prostate-specific antigen, PSA) and biopsy outcomes for both cases and controls (n = 6411: 4828 controls and 1583 cases) [18].

Polygenic hazard score model development using LASSO regularization

We sought to develop an optimal, integrated PHS model with candidate SNPs chosen from those used in the prior PHS and from those identified as susceptibility loci for prostate cancer in a genome-wide trans-ancestry meta-analysis [11]. Candidate SNPs included the 46 from the original PHS development and 269 from the meta-analysis. A machine-learning, LASSO-regularized Cox proportional hazards model approach was used to objectively select SNPs and estimate weights, as described previously [8].

There were 299 unique candidate SNPs associated with prostate cancer that were consistently available across the training and testing datasets used in the present work. We first identified SNP pairs (among the 299 candidates) that were highly correlated (r² > 0.95). Each of these paired, correlated SNPs was tested in a univariable Cox proportional hazards model for association with age at prostate cancer diagnosis; the SNP with the larger p-value was eliminated from inclusion in the model. All other (unpaired) SNPs were included as candidates for the present PHS model.
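The pruning step above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the toy genotypes, and the p-values are all hypothetical, and the univariable Cox p-values are assumed to have been computed elsewhere.

```python
from itertools import combinations


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length allele-count vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a SNP with no variation has no defined correlation
    return (sxy * sxy) / (sxx * syy)


def prune_correlated(genotypes, p_values, threshold=0.95):
    """Drop the SNP with the larger p-value from each pair with r^2 > threshold.

    genotypes: dict of SNP id -> list of allele counts (0, 1, or 2) per man
    p_values:  dict of SNP id -> univariable Cox p-value (computed elsewhere)
    """
    dropped = set()
    for snp_a, snp_b in combinations(sorted(genotypes), 2):
        if r_squared(genotypes[snp_a], genotypes[snp_b]) > threshold:
            dropped.add(snp_a if p_values[snp_a] > p_values[snp_b] else snp_b)
    return [snp for snp in sorted(genotypes) if snp not in dropped]


# Hypothetical toy data: snp1 and snp2 are perfectly correlated.
toy_genotypes = {
    "snp1": [0, 1, 2, 1, 0, 2],
    "snp2": [0, 1, 2, 1, 0, 2],
    "snp3": [2, 0, 1, 0, 2, 1],
}
toy_p_values = {"snp1": 1e-8, "snp2": 1e-5, "snp3": 0.01}
kept = prune_correlated(toy_genotypes, toy_p_values)
```

In this toy case snp2 is removed because it is redundant with snp1 and has the larger p-value, while snp3 (r² = 0.25 with snp1) is retained.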

The R (version 4.0.1) “glmnet” package was used to estimate the LASSO-regularized Cox proportional hazards model [19, 20]. Age at prostate cancer diagnosis was the time to event, and the predictor variables included the genotype allele counts of candidate SNPs and the first four European ancestry principal components. Controls were censored at age of last follow-up. The LASSO-regularized model’s hyper-parameter (lambda) was selected using 10-fold cross-validation [19, 20]. The final form of the LASSO model was estimated using the lambda value that minimized the mean cross-validated error.
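For orientation, a LASSO-regularized Cox model of this kind can be written in its generic textbook form as below. The notation is ours and is an assumption, not taken from the source: x_i is the vector of allele counts and principal components for man i, t_i his age at diagnosis or censoring, δ_i = 1 if he was diagnosed, and R(t_i) the set of men still at risk at age t_i. Tied event times are ignored, the scaling constants used by any particular software are omitted, and every coefficient is shown as penalized, which may differ from the fitted model.

```latex
\hat{\beta}(\lambda) \;=\; \arg\max_{\beta}
\left[\;
\sum_{i:\,\delta_i = 1}
\left( x_i^{\top}\beta \;-\; \log \sum_{j \in R(t_i)} \exp\!\left(x_j^{\top}\beta\right) \right)
\;-\; \lambda \sum_{k} \lvert \beta_k \rvert
\;\right]
```

The L1 penalty shrinks some coefficients exactly to zero, which is how the procedure selects SNPs while estimating weights for those that remain.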

Association with prostate cancer

We evaluated the association of the adapted PHS with clinically significant prostate cancer, as well as any prostate cancer, via Cox proportional hazards models in each of the four testing datasets. Clinically significant prostate cancer was defined to be a prostate cancer case with Gleason score ≥7, PSA ≥ 10 ng/mL, T3-T4 stage, nodal metastases, or distant metastases [21].

As the COSM dataset had long-term follow-up data available [17], we additionally evaluated the adapted PHS for association with age at prostate cancer death [16]. There were 278 deaths from prostate cancer in the COSM dataset.

Hazard ratio performance

Hazard ratios between the top 5% and middle 40% (HR95/50), top 20% and middle 40% (HR80/50), bottom 20% and middle 40% (HR20/50), and top and bottom 20% (HR80/20) were estimated for any, for clinically significant, and for fatal prostate cancer. Percentiles of genetic risk were determined within the controls in the training set with age less than 70 years [1, 2, 7, 8].
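A rough sketch of one way to compute such percentile-based hazard ratios is given below. It rests on several assumptions that are not stated in this section: the HR between two groups is taken as the exponential of the Cox coefficient for PHS multiplied by the difference in mean PHS between the groups, the "middle 40%" is taken as the 30th to 70th percentiles, and the scores and coefficient are hypothetical toy values.

```python
from math import exp


def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in 0-100) of an ascending list."""
    position = (len(sorted_values) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


def band_mean(scores, reference_sorted, low_pct, high_pct):
    """Mean score of men whose score falls in a percentile band of the reference."""
    low = float("-inf") if low_pct == 0 else percentile(reference_sorted, low_pct)
    high = float("inf") if high_pct == 100 else percentile(reference_sorted, high_pct)
    band = [s for s in scores if low <= s < high]
    return sum(band) / len(band)


def hazard_ratio(beta, scores, reference_scores, numerator, denominator):
    """HR between two percentile bands, each given as (low_pct, high_pct).

    Band edges come from reference_scores (e.g. training controls under 70);
    the HR is exp(beta * difference in mean score between the two bands).
    """
    ref = sorted(reference_scores)
    difference = band_mean(scores, ref, *numerator) - band_mean(scores, ref, *denominator)
    return exp(beta * difference)


# Hypothetical toy scores: 0.0, 0.1, ..., 10.0 in both reference and testing sets.
reference = [i / 10 for i in range(101)]
testing = list(reference)
beta = 0.2  # hypothetical Cox coefficient for the score

hr_80_20 = hazard_ratio(beta, testing, reference, (80, 100), (0, 20))
hr_95_50 = hazard_ratio(beta, testing, reference, (95, 100), (30, 70))
```

Fixing the band edges in a single reference group means that the same score thresholds are applied to every testing dataset.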

Family history

Given that family history of prostate cancer is currently one of the most useful clinical risk factors for the development of prostate cancer, we used Cox proportional hazards models to assess family history for association with any, with clinically significant, or with fatal prostate cancer. Family history of prostate cancer was defined as presence or absence of a first-degree relative diagnosed with prostate cancer. Multivariable models using both family history and the adapted PHS were compared with models using family history alone via a log-likelihood test with α = 0.01. HRs were calculated for each variable: HRs for PHS in the multivariable models were estimated as the HR80/20 in each testing dataset (i.e., men in the highest vs. lowest quintile of genetic risk by PHS). HRs for family history of prostate cancer were estimated as the exponent of the beta from the multivariable Cox regression. As done previously [1, 7], p-values were truncated at <10⁻¹⁶.
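The model comparison can be illustrated with a short sketch. It assumes that the log-likelihood test is a likelihood-ratio comparison of nested Cox models with one added variable (the PHS), so that the test statistic is referred to a chi-square distribution with 1 degree of freedom. The function names and log-likelihood values are hypothetical.

```python
from math import erfc, exp, sqrt

P_FLOOR = 1e-16  # reporting floor for p-values, as stated in the text


def likelihood_ratio_p(loglik_reduced, loglik_full):
    """p-value for adding ONE variable to a nested model (chi-square, 1 df)."""
    statistic = 2.0 * (loglik_full - loglik_reduced)
    if statistic <= 0:
        return 1.0
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    return max(erfc(sqrt(statistic / 2.0)), P_FLOOR)


def family_history_hr(beta_family_history):
    """HR for a binary covariate is the exponential of its Cox coefficient."""
    return exp(beta_family_history)


# Hypothetical partial log-likelihoods: family history alone vs. family history + PHS.
p_modest = likelihood_ratio_p(-1000.0, -998.0793)  # statistic close to 3.84
p_strong = likelihood_ratio_p(-1000.0, -900.0)     # very large statistic, floored
significant = p_strong < 0.01                      # alpha = 0.01, as in the text
```

With a statistic near 3.84 the p-value is close to 0.05, which would not meet the α = 0.01 threshold used here; the second comparison is reported at the floor of 10⁻¹⁶.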

Positive predictive value performance

Positive predictive value (PPV) performance of PSA testing was calculated using data from the population-based ProtecT screening study [18] (prostate biopsy results were available for both cases and controls with a positive PSA [≥3 ng/mL]). To estimate the PPV and confidence intervals, we generated 1000 bootstrap samples using ProtecT participants with positive PSA, while maintaining the 1:2 case:control ratio in the ProtecT dataset. PPV was calculated as the proportion of PSA-positive participants who were diagnosed with clinically significant prostate cancer on biopsy, looking at those participants in the top 5 (PPV95) or top 20 percentiles (PPV80) of PHS genetic risk.
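The bootstrap procedure can be sketched as below. This is one possible reading, not the authors' implementation: the case:control ratio is maintained by resampling cases and controls separately, the PHS threshold is treated as fixed in advance, confidence limits are taken as bootstrap percentiles, and the participant records are synthetic.

```python
import random


def ppv(participants, threshold):
    """Share of PSA-positive men with PHS >= threshold who have significant cancer."""
    selected = [p for p in participants if p["phs"] >= threshold]
    if not selected:
        return float("nan")
    return sum(p["significant"] for p in selected) / len(selected)


def bootstrap_ppv(participants, threshold, n_boot=1000, seed=0):
    """Mean PPV with a 95% percentile interval from a stratified bootstrap."""
    rng = random.Random(seed)
    cases = [p for p in participants if p["case"]]
    controls = [p for p in participants if not p["case"]]
    estimates = []
    for _ in range(n_boot):
        # Resample within each stratum so the case:control ratio is preserved.
        sample = rng.choices(cases, k=len(cases)) + rng.choices(controls, k=len(controls))
        value = ppv(sample, threshold)
        if value == value:  # skip NaN (no man above the threshold in this sample)
            estimates.append(value)
    estimates.sort()
    lower = estimates[int(0.025 * (len(estimates) - 1))]
    upper = estimates[int(0.975 * (len(estimates) - 1))]
    return sum(estimates) / len(estimates), lower, upper


# Synthetic PSA-positive participants with a 1:2 case:control ratio.
data_rng = random.Random(1)
men = []
for i in range(300):
    is_case = i % 3 == 0
    men.append({
        "phs": data_rng.gauss(0.5 if is_case else 0.0, 1.0),
        "case": is_case,
        "significant": is_case and i % 2 == 0,
    })

mean_ppv, ci_low, ci_high = bootstrap_ppv(men, threshold=0.8, n_boot=200)
```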

Cumulative incidence curves for PHS

Genetic-risk-stratified cumulative incidence curves for prostate cancer were derived using previously described methods [7, 8]. Briefly, age-specific population data from the United Kingdom (Cancer Research UK [7]) were used to estimate prostate cancer incidence for men aged 40–70 years. Data from the population-based Cluster Randomized Trial of PSA Testing for Prostate Cancer (CAP) trial [7] were used to adjust this population incidence curve to reflect the age-specific cumulative incidence of clinically significant and non-clinically-significant prostate cancer. Genetic-risk-stratified cumulative incidence curves were then calculated for men in the upper 5 and 20 percentiles of PHS genetic risk by multiplying the prostate-cancer-specific cumulative incidence by the mean value of HR95/50 and HR80/50 in the testing dataset, respectively.
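The final scaling step can be written out as a brief sketch. The incidence values and hazard ratios below are hypothetical placeholders, not study results, and the function name is ours.

```python
def stratified_incidence(population_incidence, hazard_ratios):
    """Scale a population cumulative incidence curve by the mean of the given HRs.

    population_incidence: dict of age -> cumulative incidence (proportion)
    hazard_ratios: HRs for one risk group (e.g. HR95/50 values from testing data)
    """
    mean_hr = sum(hazard_ratios) / len(hazard_ratios)
    return {age: incidence * mean_hr for age, incidence in population_incidence.items()}


# Hypothetical population curve and hypothetical HR values.
population_curve = {50: 0.002, 60: 0.01, 70: 0.03}
example_hrs = [3.0, 4.0, 5.0]
high_risk_curve = stratified_incidence(population_curve, example_hrs)
```

Multiplying a cumulative incidence by a hazard ratio is an approximation that behaves sensibly only while incidence is small; for large HRs or older ages the product could exceed 1.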


A total of 290 SNPs returned non-zero SNP coefficients using the regularization-weight selection and were included in the final model, called PHS290 (Supplementary Material).

HR performance of PHS290 demonstrates risk stratification across percentiles of genetic risk (Table 1). Comparing the top and bottom quintiles of genetic risk for clinically significant prostate cancer, men with high PHS had HRs of 13.73 [12.43–15.16], 7.07 [6.58–7.60], 10.31 [9.58–11.11], and 11.18 [10.34–12.09] in the ProtecT, African, Asian, and COSM datasets, respectively. Similar risk stratification was seen when evaluating risk of any prostate cancer. Finally, when comparing the top and bottom quintiles of genetic risk in the COSM dataset, men with high PHS had a HR of 7.73 [6.45–9.27] for prostate cancer death.

Table 1 Hazard ratio (HR) performance in the four testing datasets.

Family history and PHS290

Family history data were based on self-report and were available for 89%, 89%, 43%, and 75% of individuals in the ProtecT, African, Asian, and COSM datasets, respectively; 7%, 18%, 9%, and 14% of individuals, respectively, reported having a first-degree relative diagnosed with prostate cancer. The combination of family history and PHS performed better than family history alone for clinically significant prostate cancer (and for any prostate cancer) in each of the four testing datasets (log-likelihood p < 10⁻¹⁶; Table 2). Additionally, family history and PHS together performed better than family history alone for fatal prostate cancer in the COSM dataset (log-likelihood p < 10⁻¹⁶).

Table 2 Multivariable Cox models with both PHS and family history of prostate cancer (defined as ≥1 first-degree relative affected) for association with any prostate cancer, with clinically significant prostate cancer, and with fatal prostate cancer.

Positive predictive value performance of PHS290

The PPV of PSA testing for clinically significant prostate cancer was 0.19 (0.15–0.22) for the top 20% of genetic risk (PPV80) and 0.26 (0.19–0.33) for the top 5% of genetic risk (PPV95; Fig. 1). Both were greater than the overall PPV of PSA alone, which was 0.12 (0.11–0.14).

Fig. 1: PPV performance in the ProtecT dataset for clinically significant prostate cancer, estimated using 3 approaches: standard (not using PHS), top 20% of PHS values (PPV80), and top 5% of PHS values (PPV95).

Error bars are 95% bootstrap confidence intervals.

Cumulative incidence curves for PHS290

Genetic-risk-stratified cumulative incidence curves for clinically significant and non-clinically significant prostate cancer demonstrate greater prostate cancer incidence with higher genetic-adjusted risk (Fig. 2).

Fig. 2: Genetic-risk-adjusted cumulative incidence curves for PHS290.

Curves are shown for the upper 5th (>95th) and upper 20th (>80th) percentile of PHS290 for clinically significant and non-clinically-significant prostate cancer. The reference curves represent the overall population average from the UK.


The improved PHS (PHS290) demonstrates excellent genetic risk stratification, including for clinically significant prostate cancer. This was true in four separate testing sets of varied genetic ancestry (Asian, African, European). Additionally, PHS290 was associated with lifetime prostate-cancer-specific mortality in a population-based cohort [16]. Hazard ratios with PHS290 are larger for each of these associations than those reported for previous versions of PHS [2, 8], demonstrating the value of incorporating SNPs from genome-wide meta-analysis and fine-mapping. The improvements demonstrated here are promising for implementing personalized approaches to prostate cancer screening decisions in diverse populations.

Health disparities are a major problem in prostate cancer. Given the exclusion of non-European data in most genome studies [5, 22, 23, 24], it is important that the pool of candidate SNPs here included those identified from a recent trans-ancestry meta-analysis [11]. Testing and improving performance of genomic risk scores in diverse populations is critical to equitable implementation of these new tools and important for avoiding exacerbation of existing disparities. PHS290 still performs better in men of European genetic ancestry, an expected result given the much greater data availability in that population. Further genomic studies in diverse populations are essential, as diversity in model development improves performance in diverse populations [9, 10].

The intersection of social constructs like race/ethnicity and genomics also raises interesting and entangled challenges. Even availability of genomic data is only part of the problem, as disparities in health outcomes are rooted in systemic racism and inequities in access to healthcare [25, 26]. Genotypic ancestry may be a step toward biology, but the continental groups still represent an oversimplification of genetic diversity and a pre-determined assumption that socially defined categories have biological meaning in all contexts. We have previously shown that agnostic genetic clusters are informative for subgroup analyses [2], and this approach may be a better way forward, provided the genomic diversity of the whole population is represented in the available data. Here, we have used genotypic ancestry to evaluate the potential differential performance in groups historically excluded from large-scale genomic studies. We also note that local ancestry may be a critical consideration in admixed populations. We have found previously that PHS performance can vary by the makeup of a region of the genome, beyond what is explained by global ancestry categories assigned for an individual’s entire genome [10]. Despite these challenges and opportunities for future improvement, the current results demonstrate PHS290 does provide meaningful risk stratification in diverse datasets.

The cumulative incidence of clinically significant prostate cancer is heavily influenced by age-specific genetic risk, as demonstrated by genetic-risk-stratified cumulative incidence curves (Fig. 2). As men with high PHS290 age, the incidence curve for clinically significant prostate cancer increases dramatically, prominently separating from the incidence for non-clinically-significant prostate cancer. This effect is driven by the high HR for clinically significant prostate cancer in these men, combined with increasing incidence specifically of more clinically significant cancers as men age [7, 27]. Furthermore, we found that risk stratification with PHS290 improved accuracy of PSA testing, as assessed by probability of a positive PSA test leading to a diagnosis of clinically significant cancer on biopsy. Consistent with a prior study [8], this improvement in PPV of PSA testing was not better when using PHS290 than when using PHS46 in this dataset [2]. PPV analyses in larger datasets could permit finer granularity for age-specific genetic risk to assess whether the increased HRs of PHS290 might translate to better performance of PSA testing than that achieved already with PHS46.

The HRs reported here suggest clinical relevance for PHS290. Predictive tools in routine clinical use for other diseases (e.g., breast cancer, diabetes, and cardiovascular disease) have reported HRs of ~1–3 for endpoints of interest [28, 29, 30, 31]. Current guidelines recommend earlier and more frequent consideration of prostate cancer screening for men with a family history of prostate cancer or African ancestry, citing an elevated risk 28–80% above that of men without these risk factors [32, 33, 34]. Guidelines more strongly recommend earlier and more frequent screening in men with germline mutations in BRCA2, which are rare but are estimated to confer up to 3-fold increased risk [33, 34, 35]. In the present study, men in the top 20% for PHS290 (compared to men with average risk) had HRs for clinically significant prostate cancer of ~2.8–3.9. For men in the top 5% for PHS290, those HRs increase to 4.3–6.9, depending on ancestry. While individuals with high polygenic risk may also develop low-grade prostate cancer in their lifetime, the time-to-event analysis applied here shows that high genetic risk confers a greater hazard for prostate cancer death. This finding is consistent with prior reports, though the effect size is larger with PHS290 [2, 16, 33].

Family history data were not uniformly available across source studies or testing datasets and were notably less available in the Asian dataset (43%, compared to ≥75% in the other testing datasets). The proportion of individuals with positive family history (among those with available family history data) also varied across datasets, ranging from 7% (ProtecT) to 18% (African). The interplay of family history and polygenic risk warrants further investigation in more complete cohorts. It is also worth noting that family history is not always available (or reliable) in clinical practice. Nonetheless, both family history and polygenic risk appear important for assessing individual risk of prostate cancer. Possibly, family history represents not only inherited genetic risk but also shared environmental exposures.

If this technology is to be implemented in clinical practice, it is important to consider reliability. Genotype arrays are known to have excellent reproducibility, with concordance of 99.40–99.87% on repeat testing with the same array [36]. Concordance across genotyping platforms is also excellent at 98.80% [36].

Our work has some limitations. First, the weights were calculated in men of European genetic ancestry alone, although SNP candidate selection was performed in multi-ancestry analyses. Future studies evaluating PHS for prostate cancer risk stratification will include non-Europeans in SNP weight calculations. The available data did not permit testing of PPV or association with fatal disease in non-European populations. Moreover, the cumulative incidence curves here are specific to the UK, where we had the most robust population age-specific incidence data for clinically significant prostate cancer. The testing sets used in the present study did represent a very small proportion of the data used for candidate SNP identification in the prior genome-wide association meta-analysis (as opposed to the development of PHS46, which was performed in a dataset completely independent of the validation dataset) [2, 11]. However, the training and testing sets were kept separate in the present study, and the use of the LASSO-regularized Cox model reduces over-fitting [37].

The PHS290 described here has the strongest reported association with prostate cancer in men of European, African, and Asian genetic ancestry. The score was also associated with lifetime prostate-cancer-specific mortality in a population-based cohort. A performance gap remains between genetic ancestry groups that might be closed through development using more data from men of Asian or African ancestry. Nonetheless, the results here suggest PHS290 may improve prostate cancer risk-stratification efforts in multi-ancestry populations.