We determined the effect of sample size on performance of polygenic hazard score (PHS) models in prostate cancer. Age and genotypes were obtained for 40,861 men from the PRACTICAL consortium. The dataset included 201,590 SNPs per subject, and was split into training and testing sets. Established-SNP models considered 65 SNPs that had been previously associated with prostate cancer. Discovery-SNP models used stepwise selection to identify new SNPs. The performance of each PHS model was calculated for random samples of the training set of varying size. The performance of a representative Established-SNP model was estimated for random samples of the testing set of varying size. Mean HR98/50 (hazard ratio of top 2% to average in test set) of the Established-SNP model increased from 1.73 [95% CI: 1.69–1.77] to 2.41 [2.40–2.43] when the number of training samples was increased from 1 thousand to 30 thousand. Corresponding HR98/50 of the Discovery-SNP model increased from 1.05 [0.93–1.18] to 2.19 [2.16–2.23]. HR98/50 of a representative Established-SNP model using testing set sample sizes of 0.5 thousand and 6 thousand observations were 1.78 [1.70–1.85] and 1.73 [1.71–1.76], respectively. We estimate that a study population of 20 thousand men is required to develop Discovery-SNP PHS models while 10 thousand men should be sufficient for Established-SNP models.
Polygenic risk models have been studied extensively for several diseases such as prostate cancer, breast cancer, type 2 diabetes, dementia, and atherosclerosis. Polygenic scores in the context of survival models are a more recent advancement in the field, but have been garnering interest in Alzheimer’s disease and prostate cancer. The steady increase in genetic testing [8, 9], both in public and clinical domains, suggests that survival models could be applied to new diseases. The largest obstacle to the development of these models is the large number of study subjects, often in the tens of thousands, that is required for robust training and testing.
Our aim was to quantify the effect of sample size on the performance of a polygenic survival model. This was explored through a specific disease condition that is expected to be representative, namely prostate cancer. We investigated two potential model development strategies. For the ‘Established-SNP’ model, we selected single-nucleotide polymorphisms (SNPs) that had previously been shown to be associated with prostate cancer, and estimated the coefficients for these SNPs in a Cox proportional hazards framework. For the ‘Discovery-SNP’ model, we implemented the SNP selection technique described by Seibert et al. [7] to identify SNPs in our genotyping data for inclusion in the Cox proportional hazards framework. The Established-SNP (EST-SNP) and Discovery-SNP (DIS-SNP) models represent two strategies that researchers could employ to build a polygenic survival model. In order to simulate samples of different sizes, we randomly sampled our training and testing sets. The results of this work will help inform the design of future studies to develop polygenic survival models for other diseases.
Materials and methods
Training and testing set
As previously described, we obtained genotype and age data from 21 studies included in the Prostate Cancer Association Group to Investigate Cancer Associated Alterations in the Genome (PRACTICAL) consortium. We analyzed data from 40,861 men consisting of 20,551 individuals with prostate cancer and 20,310 individuals without. For analysis, the age for each man was recorded as either their age at prostate cancer diagnosis (cases) or at interview (controls). Genotype data were restricted to SNPs with missing value rates <5%, resulting in 201,590 SNPs available for analysis. Missing calls were assigned the mean value for that SNP. The genotype data had been assayed using a custom iCOGS chip (Illumina, San Diego, CA), the details of which are described elsewhere. The sample was split into training (34,444 men, consisting of 18,962 cases and 15,482 controls) and testing (6417 men, consisting of 1589 cases and 4828 controls) sets. The testing set comprised men who were enrolled in the Prostate testing for cancer and Treatment (ProtecT) trial. ProtecT (ClinicalTrials.gov: NCT02044172) is a large, multicenter trial within the United Kingdom, which aims to investigate the effectiveness of treatments for localized prostate cancer. The ProtecT study group was chosen for testing as it represented a well-characterized group of individuals that had been used for measuring testing performance in our earlier work. The Data Availability Statement describing how readers can gain access to the PRACTICAL dataset is provided in the Supplementary information.
The present study used only de-identified data from the PRACTICAL consortium. All studies contributing data have the relevant Institutional Review Board approval in each country in accordance with the Declaration of Helsinki. The details of each study set, including the consent and accrual process, have been published previously.
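The missing-rate filter and mean imputation described above can be illustrated with a minimal Python sketch. It is not the code used in the study; the function name `filter_and_impute`, the dictionary layout, and the use of `None` to mark a missing call are assumptions made purely for illustration.

```python
from statistics import mean


def filter_and_impute(genotypes, max_missing_rate=0.05):
    """Keep SNPs whose missing-call rate is below the threshold and
    replace each missing call (None) with the mean of the observed calls.

    genotypes: dict mapping a SNP identifier to a list of allele counts
    (0, 1, or 2), with None marking a missing call (hypothetical layout).
    """
    cleaned = {}
    for snp_id, calls in genotypes.items():
        observed = [c for c in calls if c is not None]
        missing_rate = 1 - len(observed) / len(calls)
        # Drop SNPs at or above the missing-rate threshold (5% in the text).
        if not observed or missing_rate >= max_missing_rate:
            continue
        snp_mean = mean(observed)
        cleaned[snp_id] = [snp_mean if c is None else c for c in calls]
    return cleaned
```

Imputing the per-SNP mean keeps every subject in the analysis and leaves the average allele count of that SNP unchanged.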
A list of 65 SNPs was chosen to represent those on the iCOGS array that had been published as associated with prostate cancer. The coefficients of the SNPs within the EST-SNP model were then estimated using the “coxphfit” function in MATLAB (MathWorks, Natick, MA). It should be noted that the 65 SNPs used were discovered, in large part, using the data presently defined as the test set. The effect allele for all 65 SNPs was defined as “A” to simplify analysis.
For every SNP, a trend test was used to check for associations between SNP count and the binary classification of individuals with or without prostate cancer. The SNP selection pool was then reduced to those whose trend test p value was less than 1 × 10⁻⁶. In order of increasing p value, each SNP was tested in a multiple logistic regression model for association with the binary classification of men as with or without prostate cancer, after adjusting for age, six principal components based upon genetic ancestry, and previously selected SNPs. If the p value of the coefficient of the tested SNP was <1 × 10⁻⁶, it was selected for the final Cox proportional hazard model estimation. The coefficients of the selected SNP pool within the DIS-SNP model were estimated as previously described.
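The first screening step can be sketched in Python as follows. The text does not name the specific trend test, so this sketch assumes a Cochran-Armitage test with allele counts as scores, computed through its equivalence with N times the squared Pearson correlation between allele count and case status. The function names are hypothetical, and the subsequent stepwise logistic regression (adjusted for age, principal components, and previously selected SNPs) is not shown.

```python
import math


def trend_test_p_value(allele_counts, is_case):
    """Two-sided p value of a Cochran-Armitage trend test (assumed form).

    The statistic is N * r^2, where r is the Pearson correlation between
    allele count and case status; it is referred to a chi-square
    distribution with one degree of freedom.
    """
    n = len(allele_counts)
    mean_g = sum(allele_counts) / n
    mean_y = sum(is_case) / n
    cov = sum((g - mean_g) * (y - mean_y)
              for g, y in zip(allele_counts, is_case))
    var_g = sum((g - mean_g) ** 2 for g in allele_counts)
    var_y = sum((y - mean_y) ** 2 for y in is_case)
    if var_g == 0 or var_y == 0:
        return 1.0  # no variation, so no evidence of a trend
    chi2 = n * cov ** 2 / (var_g * var_y)
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


def prefilter_snps(genotypes, is_case, threshold=1e-6):
    """Return SNP identifiers with p < threshold, in order of increasing p."""
    p_values = {snp: trend_test_p_value(counts, is_case)
                for snp, counts in genotypes.items()}
    kept = [snp for snp, p in p_values.items() if p < threshold]
    return sorted(kept, key=p_values.get)
```

The ordered list returned by `prefilter_snps` corresponds to the pool that would then be tested one SNP at a time in the adjusted logistic model.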
Polygenic hazard score (PHS)
PHS performance metrics
Several performance metrics for PHS models were investigated, and are described in Table 1. In each case, the PHS for each test subject was calculated as the dot product of SNP coefficients, either EST or DIS, and SNP counts. A Cox proportional hazards model was then fit using PHS as the sole predictor of age in the test set. The z-score and beta of this Cox proportional hazards model relate to how well PHS was associated with age within the test set. The hazard ratios were calculated as the exponential of the differences in predicted log-relative hazards of different groups within the test set. The groups were defined using centile cut-points for those controls within the training set whose age was <70 years. This list of performance metrics expands on those (z-score and HR98/50) that were used in our earlier work. In addition, sample-weight performance metrics were estimated using a weighted Cox proportional hazard model [7, 14, 15] with PHS as the sole predictor of age in the test set. The weighting factors for the cases and controls were estimated using published prevalence data from the ProtecT randomized phase 3 trial.
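To make the score and hazard-ratio calculations concrete, the following Python sketch computes a PHS as a dot product and a hazard ratio between two centile-defined groups. It is illustrative only. The Cox coefficient `beta` is taken as given rather than fitted, the function names are hypothetical, and the centile ranges (for example 98 to 100 for the top 2% and 30 to 70 for the "average" group) are assumptions, since Table 1 is not reproduced here.

```python
import math


def polygenic_hazard_score(coefficients, allele_counts):
    """PHS for one subject: dot product of SNP coefficients and SNP counts."""
    return sum(b * g for b, g in zip(coefficients, allele_counts))


def percentile(sorted_values, q):
    """Percentile q (0 to 100) of pre-sorted values, linear interpolation."""
    pos = (len(sorted_values) - 1) * q / 100
    lower, upper = math.floor(pos), math.ceil(pos)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac


def group_mean(test_scores, reference_scores, lower_pct, upper_pct):
    """Mean PHS of test subjects lying between two reference centiles.

    reference_scores would hold the PHS of training-set controls aged
    under 70 years, which define the cut-points in the text.
    """
    ref = sorted(reference_scores)
    low_cut = percentile(ref, lower_pct)
    high_cut = percentile(ref, upper_pct)
    members = [s for s in test_scores if low_cut <= s <= high_cut]
    if not members:
        raise ValueError("no test subjects fall inside the centile range")
    return sum(members) / len(members)


def hazard_ratio(beta, test_scores, reference_scores,
                 numerator_range, denominator_range):
    """exp of the difference in predicted log-relative hazard (beta * PHS)."""
    num = group_mean(test_scores, reference_scores, *numerator_range)
    den = group_mean(test_scores, reference_scores, *denominator_range)
    return math.exp(beta * (num - den))
```

Under these assumptions, an HR98/50-style quantity would be obtained as `hazard_ratio(beta, test_scores, reference_scores, (98, 100), (30, 70))`.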
Random sampling of training set
Random sampling of the training set was performed with replacement while ensuring equal proportions of men with and without prostate cancer. The training set was randomly sampled to include 1, 5, 10, 15, 20, 25, and 30 thousand total observations. Performance of the EST- and DIS-SNP models using random samples of the training data was measured in the entire test set.
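A resampling scheme of this kind can be sketched in Python as below. This is an illustration rather than the study code. It assumes that "equal proportions" means a 50/50 split of cases and controls in every sample, and the function name, identifier lists, and seed argument are hypothetical.

```python
import random


def balanced_sample_with_replacement(case_ids, control_ids, total_size,
                                     seed=None):
    """Draw total_size subjects with replacement, half cases, half controls.

    Assumes a 50/50 case-control split; if total_size is odd, the extra
    subject is drawn from the controls.
    """
    rng = random.Random(seed)
    n_cases = total_size // 2
    n_controls = total_size - n_cases
    sampled_cases = rng.choices(case_ids, k=n_cases)
    sampled_controls = rng.choices(control_ids, k=n_controls)
    return sampled_cases + sampled_controls
```

Repeating the draw with different seeds for each target size (1, 5, 10, 15, 20, 25, and 30 thousand) would yield a distribution of performance values at each size.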
A sub-analysis investigating the effect of the percentage of cases in the training set was conducted using the EST-SNP model with 5000 and 25,000 random samples of the training set. The results are presented in Supplementary Fig. 5.
Random sampling of the testing set
Random sampling of the testing set was performed with replacement while ensuring equal proportions of men with and without prostate cancer. The testing set was randomly sampled to include 0.5, 1, 2, 3, 4, 5, and 6 thousand total observations. Performance in the randomly sampled testing sets was measured using a representative EST-SNP model. The representative model was chosen as that whose parameters were estimated using a training sample size of 30 thousand total observations, and whose performance metrics were the shortest Euclidean distance to the average performance across all EST-SNP models using a training sample size of 30 thousand.
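The choice of a representative model can be sketched in Python as follows. The sketch assumes that the Euclidean distance is taken over the raw, unscaled performance metrics, which the text does not specify, and the function name and dictionary layout are hypothetical.

```python
import math


def representative_model(metrics_by_model):
    """Return the model whose metric vector is closest to the mean vector.

    metrics_by_model: dict mapping a model label to a list of performance
    metrics in a fixed order (for example HR98/50, HR20/50, z-score, beta).
    Metrics are used unscaled, which is an assumption of this sketch.
    """
    labels = list(metrics_by_model)
    n_metrics = len(metrics_by_model[labels[0]])
    centroid = [
        sum(metrics_by_model[label][j] for label in labels) / len(labels)
        for j in range(n_metrics)
    ]
    # Euclidean distance from each model's metrics to the average metrics.
    return min(labels,
               key=lambda label: math.dist(metrics_by_model[label], centroid))
```

Because the metrics are on different scales (z-scores are much larger than hazard ratios), an unscaled distance is dominated by the larger metrics; standardizing each metric first would be an alternative design choice.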
Established- vs. Discovery-SNP model performance
Histogram comparisons of performance metrics of EST- and DIS-SNP models are illustrated in Fig. 1. The performance metrics are shown for 50 random samplings of the training set using a sample size of 30 thousand total observations. Qualitatively, there appears to be more variability in performance metrics associated with the DIS process.
Coefficients of Established-SNP model
The mean coefficients for the 65 SNPs used in the EST-SNP model are plotted in Supplementary Fig. 1.
Effect of training set sample size on performance
Box plots of the performance metrics of the EST-SNP and DIS-SNP models for random samples of the training set are shown in Figs. 2 and 3, respectively. The mean values of HR98/50, HR20/50, HR98/20, HR80/20, z-score, and beta using a random training sample of 1 thousand total observations in the EST-SNP model were 1.73 [95% CI: 1.69–1.76], 0.71 [0.71–0.73], 2.42 [2.35–2.50], 1.96 [1.92–2.01], 9.92 [9.57–10.28], and 0.45 [0.43–0.47], respectively. The corresponding values using a random training sample of 30 thousand total observations were 2.41 [95% CI: 2.40–2.43], 0.60 [0.60–0.60], 4.04 [4.02–4.07], 2.86 [2.84–2.87], 15.1 [15.04–15.16], and 1.18 [1.17–1.18], respectively.
The mean values of HR98/50, HR20/50, HR98/20, HR80/20, z-score, and beta using a random training sample of 1 thousand total observations in the DIS-SNP model were 1.05 [0.93–1.18], 0.98 [0.89–1.07], 1.07 [0.91–1.24], 1.08 [0.91–1.24], 1.06 [−1.20 to 3.31], and 0.17 [−0.23 to 0.65], respectively. The corresponding performance values using a training sample size of 30 thousand observations were 2.20 [2.16–2.23], 1.60 [1.59–1.62], 3.47 [3.39–3.56], 2.53 [2.49–2.58], 13.19 [12.96–13.41], and 0.87 [0.85–0.89], respectively.
Box plots of the sample-weight corrected performance metrics for the EST-SNP and DIS-SNP model are shown in Supplementary Figs. 2 and 3, respectively. The trends observed in the sample-weight corrected performance metrics are identical to those observed in the raw, uncorrected metrics.
Effect of testing set sample size on performance
Box plots of the performance metrics of the representative EST-SNP model for random samples of the testing set are shown in Fig. 4. Box plots of the corresponding sample-weight corrected performance metrics are presented in Supplementary Fig. 4. The mean values of HR98/50, HR20/50, HR98/20, HR80/20, z-score, and beta using a random testing sample of 0.5 thousand total observations in the representative EST-SNP model were 1.78 [1.71–1.85], 0.73 [0.71–0.74], 2.50 [2.33–2.66], 1.99 [1.89–2.09], 3.82 [3.57–4.08], and 0.76 [0.70–0.82], respectively. The corresponding values using a testing sample of 6 thousand observations were: 1.73 [1.72–1.76], 0.73 [0.72–0.73], 2.39 [2.34–2.44], 1.93 [1.90–1.96], 13.07 [12.80–13.32], and 0.74 [0.72–0.76], respectively.
We identified several trends in the effect of training and testing sample size on the performance of PHS models in prostate cancer using SNP genetic variants. When using SNPs that had already been associated with prostate cancer risk, our analysis suggests that very little improvement in performance can be achieved once the training sets become larger than 10–15 thousand observations. When attempting to discover SNPs, a similar plateau in performance was observed for training sets larger than 20–25 thousand observations. Apart from z-scores, the mean performance metrics of the chosen Cox proportional hazards model did not vary with testing sample size. However, we did observe that the distribution of performance metrics narrows until a testing sample size of 3 to 4 thousand observations, after which the distribution remains relatively stable.
Our results may be used to inform researchers on the approximate number of subjects needed to develop PHS models using SNP counts. A dataset of 20 thousand observations may be the minimum needed to accurately estimate the PHS coefficients of SNPs that have been previously discovered in the setting of a logistic model. Such a dataset would allow for the accurate estimation of SNP coefficients as well as the testing of model performance in an independent holdout set. Based on our results, this number would have to be increased to roughly 30 thousand observations if the researchers intend on discovering the SNPs from scratch using the approach described here.
The PHS model developed by Desikan et al. [6] to estimate age-associated risk of Alzheimer’s disease used a training set with roughly 55,000 individuals. A similarly structured model developed by Seibert et al. [7] to guide screening for aggressive prostate cancer was developed with roughly 31,000 men. Studies such as these require large investments in time, money, and resources in order to acquire the genetic data needed for the analysis. The results of our analysis suggest that the minimum sample size needed to translate this technology to other diseases and processes may be lower than what has been used in previous studies. This seems to be particularly true if the researchers use SNPs that have already been discovered and validated as associated with the process of interest.
The results of this study must be considered in the context of its limitations. The list of EST-SNPs was previously selected from a larger dataset that included the same patients used in the test set in the present study. As such, there is leakage of information from the test set to the development of the EST-SNP model. Therefore, the performance metrics of the EST-SNP model should not be directly compared with those of the DIS-SNP model, as the values of the former may be inflated.
In addition, we have chosen to focus on only two of countless possible model development schemes. The role of sample size in other development strategies—such as regularized Cox proportional models, parametric survival functions, or random survival forests—is yet to be explored. Finally, the analysis is limited to prostate cancer and to the SNPs available on the iCOGS array. Future studies to investigate the influence of additional SNPs, such as those on HapMap 3 or 1000 Genomes, on the performance of PHS models are underway at our institution.
In conclusion, we have studied the effect of sample size on the performance of PHS models to study the association between SNPs and the age at diagnosis of prostate cancer. We have determined that models require roughly 20 to 30 thousand samples before their performance would not be improved greatly by expansion of the training set. Using SNPs that have already been established in the literature may help reduce the number of training samples required to reach this performance plateau by almost 10 thousand samples.
References
1. Aly M, Wiklund F, Xu J, Isaacs WB, Eklund M, D’Amato M, et al. Polygenic risk score improves prostate cancer risk prediction: results from the Stockholm-1 cohort study. Eur Urol. 2011;60:21–8.
2. Machiela MJ, Chen C, Chanock SJ, Hunter DJ, Kraft P. Evaluation of polygenic risk scores for predicting breast and prostate cancer risk. Genet Epidemiol. 2011;35:506–14.
3. Vassy JL, Hivert MF, Porneala B, Dauriz M, Florez JC, Dupuis J, et al. Polygenic type 2 diabetes prediction at the limit of common variant detection. Diabetes. 2014;63:2172–82.
4. Marden JR, Walter S, Tchetgen Tchetgen EJ, Kawachi I, Glymour MM. Validation of a polygenic risk score for dementia in black and white individuals. Brain Behav. 2014;4:687–97.
5. Natarajan P, Young R, Stitziel NO, Padmanabhan S, Baber U, Mehran R, et al. Polygenic risk score identifies subgroup with higher burden of atherosclerosis and greater relative benefit from statin therapy in the primary prevention setting. Circulation. 2017;135:2091–101.
6. Desikan RS, Fan CC, Wang Y, Schork AJ, Cabral HJ, Cupples LA, et al. Genetic assessment of age-associated Alzheimer disease risk: development and validation of a polygenic hazard score. PLoS Med. 2017;14:1–17.
7. Seibert TM, Fan CC, Wang Y, Zuber V, Karunamuni R, Parsons JK, et al. Polygenic hazard score to guide screening for aggressive prostate cancer: development and validation in large scale cohorts. BMJ. 2018;360:1–7.
8. Chatterjee N, Shi J, García-Closas M. Developing and evaluating polygenic risk prediction models for stratified disease prevention. Nat Rev Genet. 2016;17:392–406.
9. Torkamani A, Wineinger NE, Topol EJ. The personal and clinical utility of polygenic risk scores. Nat Rev Genet. 2018;19:581–90.
10. Eeles RA, Al Olama AA, Benlloch S, Saunders EJ, Leongamornlert DA, Tymrakiewicz M, et al. Identification of 23 new prostate cancer susceptibility loci using the iCOGS custom genotyping array. Nat Genet. 2013;45:385–91.
11. Lane JA, Donovan JL, Davis M, Walsh E, Dedman D, Down L, et al. Active monitoring, radical prostatectomy, or radiotherapy for localised prostate cancer: study design and diagnostic and baseline results of the ProtecT randomised phase 3 trial. Lancet Oncol. 2014;15:1109–18.
12. Kote-Jarai Z, Easton DF, Stanford JL, Ostrander EA, Schleutker J, Ingles SA, et al. Multiple novel prostate cancer predisposition loci confirmed by an international study: the PRACTICAL consortium. Cancer Epidemiol Biomarkers Prev. 2008;17:2052–61.
13. Szulkin R, Whitington T, Eklund M, Aly M, Eeles RA, Easton D, et al. Prediction of individual genetic risk to prostate cancer using a polygenic score. Prostate. 2015;75:1467–74.
14. Huynh-Le MP, Fan CC, Karunamuni R, Martinez ME, Eeles RA, Kote-Jarai Z, et al. Polygenic hazard score predicts aggressive and fatal prostate cancer in multi-ethnic populations. medRxiv. 2019. https://doi.org/10.1101/19012237.
15. Therneau TM, Li H. Computing the Cox Model for case cohort designs. Lifetime Data Anal. 1999;5:99–112.
This study was funded in part by grants from the United States National Institutes of Health/National Institute of Biomedical Imaging and Bioengineering (#K08EB026503, TMS), United States Department of Defense (#W81XWH-13-1-0391, AMD, TMS), the Research Council of Norway (#223273, OAA), KG Jebsen Stiftelsen (OAA) and South East Norway Health Authority (OAA). Funding support for the PRACTICAL consortium, from which the data were obtained, is detailed in the Supporting information A2.
Conflict of interest
TMS reports honoraria from Multimodal Imaging Services Corporation for imaging segmentation, honoraria from WebMD, Inc. for educational content, as well as a past research grant from Varian Medical Systems unrelated to this work. OAA reports speaker honoraria from Lundbeck and consultant for HealthLytix, Inc. OAA has a patent application # U.S. 20150356243 pending; AMD also applied for this patent application and assigned it to UC San Diego. AMD has additional disclosures outside the present work: founder, equity holder, and advisory board member for CorTechs Labs, Inc.; founder and equity holder in HealthLytix, Inc., advisory board member of Human Longevity, Inc.; recipient of nonfinancial research support from General Electric Healthcare.
Karunamuni, R.A., Huynh-Le, MP., Fan, C.C. et al. The effect of sample size on polygenic hazard models for prostate cancer. Eur J Hum Genet 28, 1467–1475 (2020). https://doi.org/10.1038/s41431-020-0664-2