Utility of polygenic scores across diverse diseases in a hospital cohort for predictive modeling

Sun, Ting-Hsuan; Wang, Chia-Chun; Liu, Ting-Yuan; Lo, Shih-Chang; Huang, Yi-Xuan; Chien, Shang-Yu; Chu, Yu-De; Tsai, Fuu-Jen; Hsu, Kai-Cheng

doi:10.1038/s41467-024-47472-5

Published: 12 April 2024

Nature Communications volume 15, Article number: 3168 (2024)

Abstract

Polygenic scores estimate genetic susceptibility to diseases. We systematically calculated polygenic scores across 457 phenotypes using genotyping array data from China Medical University Hospital. Logistic regression models assessed polygenic scores’ ability to predict disease traits. The polygenic score model with the highest accuracy, based on maximal area under the receiver operating characteristic curve (AUC), is provided on the GeneAnaBase website of the hospital. Our findings indicate 49 phenotypes with AUC greater than 0.6, predominantly linked to endocrine and metabolic diseases. Notably, hyperplasia of the prostate exhibited the highest disease prediction ability (P value = 1.01 × 10⁻¹⁹, AUC = 0.874), highlighting the potential of these polygenic scores in preventive medicine and diagnosis. This study offers a comprehensive evaluation of polygenic score performance across diverse human traits, identifying promising applications for precision medicine and personalized healthcare, thereby inspiring further research and development in this field.

Introduction

Significant progress in genetics has deepened our understanding of the genetic underpinnings of complex traits and diseases. A notable development is the creation and application of polygenic scores (PGSs), which predict an individual’s risk for specific traits or diseases based on their genetic profile¹,². The theory of polygenic inheritance, which posits that traits or diseases result from the interaction of multiple genes, has been a topic of discussion for many years. However, it was the introduction of genome-wide association studies (GWASs) in the early 2000s that brought the concept of PGSs into broader use³. GWASs have empowered researchers to identify thousands of genetic variants linked to various traits and diseases by examining the entire genome of large populations.

The initial PGSs were computed using a straightforward approach known as the “burden test” or the single nucleotide polymorphism (SNP)-based method. This method involves tallying the total number of risk alleles (i.e., genetic variants associated with increased risk) for an individual across multiple loci to generate a composite score⁴,⁵. However, this approach does not consider the varying effect sizes of different genetic variants, resulting in scores with limited predictive accuracy. As a result, more sophisticated statistical methods, such as the “weighted method”⁶ or linkage disequilibrium score regression⁷, have been developed. These methods account for the effect sizes of different variants and the correlations between variants through the measurement of linkage disequilibrium. By incorporating additional information, such as the effect sizes and frequencies of variants, these methods can produce accurate and robust PGSs⁸,⁹.

PGSs have found extensive application in various traits, including height, body mass index¹⁰, and intelligence¹¹, as well as diseases such as cardiovascular disease¹²,¹³, cancer¹⁴,¹⁵,¹⁶, and psychiatric disorders¹⁷,¹⁸,¹⁹. PGSs have facilitated investigations into the genetic foundations of complex traits and diseases, the identification of individuals at high risk for certain diseases and conditions²⁰, and the exploration of gene-environment interactions. The Polygenic Score Catalog (https://www.pgscatalog.org/)²¹ was developed to streamline the distribution of PGSs. This catalog, adhering to standardized procedures for quality control, data curation, and metadata annotation, serves as a centralized resource enabling researchers and clinicians to access and utilize PGSs for various applications, including risk prediction, personalized medicine, and genetic research.

In this study, we procured SNP array data from a cohort of 276,712 individuals, whose data were stored in the electronic health record system of China Medical University Hospital (CMUH) in Taiwan. Utilizing PGS files from the PGS Catalog, which contain data on genetic variants and their corresponding weights, we calculated PGSs for 457 disease traits. We evaluated their predictive performance using logistic regression models. The results were subsequently stored in the CMUH GeneAnaBase, a platform that enables population health research and facilitates investigations into the genetic basis of diseases, novel genetic associations, and heritability. This study offers valuable insights into the genetic similarities of diseases in Taiwan and contributes to our understanding of disease genetics in the context of population health.

Results

Distribution of performance metrics and ancestry cohorts

A comprehensive analysis was conducted on a total of 13,097 performance records available in the PGS Catalog, examining various conditions, including consideration of covariates and different ancestry cohorts (Fig. 1B). Among these records, three primary performance measurements were extracted, constituting 27.27% (3572 records) of the total. The most frequently utilized measurement was the AUC, comprising 2194 records, followed by the odds ratio, with 1513 records, and the hazard ratio, with 419 records. The remaining 73.73% (9657 records) utilized alternative calculation methods such as R², Nagelkerke’s R², the z-test, and Youden’s index (Fig. 1A).

**Fig. 1: Distribution of performance measurements and the number of individuals in the Polygenic Score Catalog.**

Regarding the sample cohort used to develop PGS, a total of 2153 records were available. Since 60.94% (1312 records) of the data lacked case/control values, the number of individuals was used for statistical analysis. From the data distribution (Fig. 1D), it was observed that 50% of PGS were developed using samples of less than 23,072 individuals. Following an initial screening step, 507 PGS were retained, and the cumulative distribution plot (Fig. 1E) illustrates that 50% of PGS used sample sizes larger than 269,704. This highlights a trend in our process to retain PGS with relatively larger sample sizes. Similar results were observed in the final 201 PGS used for optimized models.

In our study, we consistently employed the AUC as the primary evaluation metric, utilizing four covariate inclusion strategies during model training: age and sex, PGS alone, PGS combined with sex and age, and PGS combined with sex, age, and the first four principal components. Evaluating the outcomes for 457 phenotypes based on the AUC achieved by the PGS model (Fig. 2A), we found that the majority of models exhibited enhanced AUC values with the addition of covariates. Setting a threshold of AUC > 0.6 to indicate effective model performance, we observed an increase in the number of phenotypes surpassing this threshold as more covariates were incorporated. Specifically, 24 phenotypes achieved an AUC > 0.6 for models trained with age and sex, 26 phenotypes for PGS alone, and 47 phenotypes for both models trained with PGS combined with age and sex and PGS combined with age, sex, and the first four principal components.

**Fig. 2: Comparison of model performance with different covariate inclusion strategies.**

The distribution of data based on AUC values in the PGS catalog is illustrated in Fig. 2B. Examining the ancestry cohorts, we found that 58.352% of the data originated from calculations conducted on individuals of European descent, with East Asia and South Asia contributing 11.616% and 11.481% of the data, respectively. Among the 2194 records with AUC values, 1927 had covariates included in the calculations, while covariates were not considered in 267 records. Notably, regardless of the use of covariates, the disease identification effectiveness of the PGS model predominantly fell within the AUC range of 0.6–0.7 in the PGS catalog, whereas our models fell within the AUC range of 0.5–0.6 (Table 1, Fig. 1C).

**Table 1: AUC distribution in terms of covariate usage.**

We explored whether changes in AUC across the four covariate inclusion strategies are limited to specific disease classifications (Fig. 2B). The model trained with PGS, age, sex, and the first four principal components was most often the best performer, achieving the highest AUC for 213 of 457 (46.61%) phenotypes. The model trained with PGS alone performed best for 157 of 457 (34.35%) phenotypes, the model trained with PGS, age, and sex for 48 of 457 (10.50%) phenotypes, and the model trained with age and sex for 39 of 457 (8.53%) phenotypes. This trend persisted across the disease classifications.

Correlations between sample prevalence rates and other factors in the analysis

The allelic architecture of complex human diseases refers to the patterns of genetic variation and their roles in influencing the risk of, or susceptibility to, specific complex diseases. Studies have shown that low-frequency variants tend to exhibit greater penetrance in rare diseases, while common variants often display lower penetrance and require the presence of multiple variants, gene-gene interactions, or environmental factors to manifest as disease. Consequently, we hypothesized that as the prevalence rate of a disease in a sample increases, the complexity of predictive models for that disease is likely to increase as well, along with the need for optimal covariates in disease prediction models.

Based on our dataset, there is a significant association between the sample prevalence rate and the number of SNPs used for PGS calculations, with a Pearson correlation coefficient of 0.19 and a P value of 9.43 × 10⁻⁵ (Fig. 3A). Similarly, a significant association was observed between the sample prevalence rate and the -log10(p) obtained from the Wilcoxon rank sum test of PGS distributions between case and control populations, with a Pearson correlation coefficient of 0.13 and a P value of 7.26 × 10⁻³ (Fig. 3B). Furthermore, AUC values for the 457 phenotypes showed a more substantial association, with a Pearson correlation coefficient of 0.41 and a P value of 6.31 × 10⁻²⁰ (Fig. 3C).

**Fig. 3: Sample prevalence of the disease in CMUH correlation comparison.**

Although the p-values obtained from the Wilcoxon rank sum test <2.5 × 10⁻⁶ mainly fell within circulatory system diseases (n = 33/46) and endocrine/metabolic diseases (n = 30/36; Fig. 3D), a distinct pattern was still evident. Diseases with lower prevalence tend to have a lower -log10(p) value from Wilcoxon rank-sum test results, with optimal models relying largely on PGS alone. Conversely, diseases with higher prevalence typically yield a higher -log10(p) value from Wilcoxon rank sum results, necessitating additional covariates for the development of an optimal model.

Performance of the PGS model in the CMUH dataset

Among the 457 phenotypes analyzed, 192 exhibited significant differences in distribution, with P values obtained from the Wilcoxon rank sum test that were less than 2.5 × 10⁻⁶. We identified a notable positive correlation between the AUC values and the −log10(p) values obtained from the Wilcoxon rank sum test, with a Pearson correlation coefficient of 0.65 and a P value of 1.06 × 10⁻⁵⁵ (Fig. 4A). Although the majority of AUC values fell between 0.5 and 0.6, four phenotypes achieved AUC values between 0.7 and 0.8, and seven phenotypes achieved AUC values between 0.8 and 0.9 (Fig. 4B). The lowest predictive performance was observed for oral aphthae, with an AUC of 0.504 (Fig. 4C), while the highest predictive performance was recorded for hyperplasia of the prostate, with an AUC of 0.874 (Fig. 4D).

**Fig. 4: Differential relationship between PGS performance and disease prevalence.**

Upon examining the distribution of PGSs in individuals affected by the disease, we observed a normal distribution in hyperplasia of the prostate but a skewed distribution in oral aphthae, characterized by sudden increases or decreases. We further investigated the relationship between PGS percentiles and patient prevalence, calculated using 100 equally sized quantiles in PGSs. For hyperplasia of the prostate, the mean patient prevalence at each percentile increased with rising PGS percentiles, indicating an S-shaped curve and suggesting a strong nonlinear relationship between PGS percentiles and patient prevalence. Higher PGS quantiles corresponded to significantly elevated disease risk, highlighting the utility of PGS in identifying individuals at heightened risk of hyperplasia of the prostate. Conversely, plots for oral aphthae did not reveal a discernible relationship between PGS percentiles and patient prevalence at each percentile.
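The percentile analysis described above can be illustrated with a minimal Python sketch. It assumes each individual has one score and one 0/1 case label; the function name `prevalence_by_quantile` and the toy data are hypothetical and are not taken from the study's Supplementary Software.

```python
def prevalence_by_quantile(scores, is_case, n_bins=100):
    """Case fraction within each of n_bins equal-sized groups ordered by score.

    scores:  one polygenic score per individual.
    is_case: 1 for affected, 0 for unaffected, in the same order as scores.
    """
    # Indices of individuals sorted from lowest to highest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    prevalence = []
    for b in range(n_bins):
        members = order[b * n // n_bins:(b + 1) * n // n_bins]
        if members:
            prevalence.append(sum(is_case[i] for i in members) / len(members))
        else:
            # More bins than individuals: this bin is empty.
            prevalence.append(float("nan"))
    return prevalence


# Toy data (assumption): ten individuals, cases concentrated at high scores.
toy_scores = list(range(10))
toy_cases = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
toy_prevalence = prevalence_by_quantile(toy_scores, toy_cases, n_bins=5)
```

With real data, a rising sequence of values would correspond to the pattern reported for hyperplasia of the prostate, and a flat sequence to the pattern reported for oral aphthae.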

Comprehensive comparison with other evaluation metrics

In addition to assessing PGSs using the Wilcoxon rank sum test and AUC, we calculated sensitivity, specificity, accuracy, precision, and recall to provide a more comprehensive evaluation of the model’s performance. Among the 49 phenotypes identified with an AUC > 0.6 in the logistic regression model, six phenotypes did not meet the threshold of 2.5 × 10⁻⁶ in the Wilcoxon rank sum test (Table 2), highlighting exceptions between statistical and practical significance. The lower P value in the Wilcoxon rank sum test did not necessarily translate to superior model outcomes in the logistic regression model, and vice versa. Across the phenotypes, the logistic regression model exhibited high accuracy but low precision. Sensitivity, specificity, and recall showed varying results with no clear trends.

**Table 2: Results of evaluation metrics for 49 phenotypes.**

During the comparison of model validation results across different racial populations from the PGS Catalog records, we identified six diabetes-related phenotypes in our dataset that utilized the same PGS score file. Notably, the predictive performance for type 2 diabetes achieved an AUC of 0.825, surpassing the result in the PGS Catalog records for the East Asian dataset, which had an AUC of 0.810. However, the AUC for other complications stemming from type 2 diabetes ranged from 0.65 to 0.75, indicating the need for additional genetic factors or information such as age of onset, disease duration, and degree of control of type 2 diabetes to enhance predictive accuracy further.

Furthermore, regarding type 1 diabetes, our model achieved an AUC of 0.671. Interestingly, comparison with results from the PGS catalog records revealed that the model in the catalog, trained with European data, performed better on the East Asian population (AUC = 0.893) than on the European population (AUC = 0.705) in the testing dataset. Upon closer examination, we noted a significant difference in the dataset sizes used for these assessments. The dataset used for calculating AUC in the East Asian population included 5 cases and 1699 controls, while the European dataset comprised 186 cases and 24,719 controls. This discrepancy indicates that the assessments were strongly influenced by randomness or sampling bias.
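To make the sample-size argument concrete, the following Python sketch approximates the standard error of each AUC from the reported case and control counts. It uses the Hanley and McNeil (1982) approximation, which is a technique substituted here for illustration and is not the method used in the study; the function name is hypothetical.

```python
import math


def hanley_mcneil_se(auc, n_cases, n_controls):
    """Approximate standard error of an AUC (Hanley-McNeil approximation)."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    variance = (
        auc * (1.0 - auc)
        + (n_cases - 1) * (q1 - auc ** 2)
        + (n_controls - 1) * (q2 - auc ** 2)
    ) / (n_cases * n_controls)
    return math.sqrt(variance)


# AUCs and sample sizes as quoted in the text for the type 1 diabetes score.
se_east_asian = hanley_mcneil_se(0.893, n_cases=5, n_controls=1699)
se_european = hanley_mcneil_se(0.705, n_cases=186, n_controls=24719)
```

Under this approximation the East Asian estimate, based on only 5 cases, carries a standard error several times larger than the European one, which is consistent with the interpretation that the apparent difference may largely reflect sampling variability.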

Discussion

In recent years, PGSs have gained popularity in genomic research, with researchers typically allocating 60%–80% of a dataset for GWAS and the remaining 20%–40% for PGS calculations to develop personalized risk models⁸,⁹. However, evaluating the effects of PGS in independent datasets has been challenging due to the necessity of a sufficiently large sample size²². Fortunately, the introduction of the PGS Catalog²¹ in 2021 has addressed this issue, allowing for exploration of the clinical significance of PGS.

The PGS Catalog team curates PGSs from published studies using standardized formats and ontologies to ensure the consistency and comparability of PGS data. This enables researchers to compare different PGSs for the same trait, evaluate overall predictive performance, and assess applicability in new populations and contexts. The first study to construct PGSs using weights curated from the PGS Catalog was published in Epidemiology and Global Health Genetics and Genomics in March 2023²³. This study focused on PGSs for breast, prostate, colorectal, and lung cancers in 21,694 East Asian individuals. While the PGSs demonstrated predictive power, with AUCs ranging from 0.58 to 0.70, the study indicated that appropriate correction factors may be necessary to improve calibration.

To explore whether the predictive power of PGSs extends beyond cancer, we systematically calculated PGSs across 457 phenotypes using score files from the PGS Catalog. Our findings revealed a positive correlation between the ability of PGSs to predict disease risk and the prevalence rate of the relevant disease in a population. This correlation may stem from larger sample sizes, which enhance statistical power to unveil associations between genetic variants and diseases, thereby constructing more reliable PGSs²⁴. Additionally, diseases with a high prevalence rate may exhibit a genetic architecture influenced by specific traits or factors that affect the development and efficacy of PGSs²⁵. Certain polygenic structures may be associated with multiple generalized traits or diseases.

When comparing two common metrics for evaluating PGS, namely the P value obtained from a Wilcoxon rank sum test and the AUC of a logistic regression model, a strong correlation was observed between the two. However, differences were noted for certain phenotypes, likely attributed to underlying test assumptions, effect sizes, sample sizes, and methods of constructing PGSs or adjusting for covariates in the logistic regression model²⁶. Currently, no established criteria exist for determining the suitability of a PGS model for clinical use. Therefore, multiple metrics should be considered to provide a reliable assessment of PGS performance.

Among the 457 analyzed phenotypes, AUC values ranged from 0.504 to 0.874, with 408 (89.28%) falling within the 0.5–0.6 range. These findings indicate challenges in using PGSs in other studies, primarily due to most PGSs in the PGS Catalog being derived from individuals of European descent, raising concerns about heterogeneity across different populations. Unlike genome-wide association study (GWAS) meta-analyses, where data harmonization and integration across diverse populations are common practices, the SNPs recorded in these PGS score files have undergone various filtering procedures. These processes include addressing factors such as population stratification, linkage disequilibrium trimming, aggregation of summary statistics, and applying significance thresholding. As a result, the data in these PGS score files are optimized for their original purposes but may pose challenges when attempting to incorporate new data or make further adjustments with different population groups. Furthermore, although the PGS Catalog includes models incorporating environmental factors or gene-environment interactions, users can only obtain PGS scoring formulas without the impact/weight related to environmental factors, potentially complicating model application and reducing AUC²⁷.

Despite these challenges, we identified 49 phenotypes with AUC values > 0.6, indicating that certain genetic variants have consistent effects on traits or diseases across different populations. Notable examples include type 2 diabetes, type 1 diabetes, pathological obesity, gout, chronic thyroiditis, myocardial infarction, atrial fibrillation, ankylosing spondylitis, prostate cancer, thyroid cancer, and breast cancer, among others (Table 2). Previous research from our team has demonstrated significant associations between genetic variants and conditions such as gout²⁸, hyperthyroidism²⁹, obesity³⁰, and various types of cancer³¹. These findings underscore the potential for further investigation to identify and characterize these genetic variants and explore their implications for disease risk. Addressing population-specificity issues may involve exploring additional integration or adjustment methods. Incorporating environmental factors and gene-environment interactions into PGS models could enhance the accuracy and robustness of risk predictions. Further research is essential to fully leverage the potential of PGSs in advancing precision medicine and improving public health outcomes.

Methods

Study population and genetic data quality control

From 2018 to 2021, a cohort of 347,954 patients was recruited from the outpatient department of CMUH with the approval of the Institutional Review Board (IRB number: CMUH107-REC3-058 (AR−1); date of approval: 07/20/2018). Participant age was calculated as of December 31, 2021 (age = December 31, 2021 minus date of birth). Participant sex was determined from the SNP array genotyping result. Genotyping was performed using the Axiom Genome-Wide 1.0 customized array plate (Affymetrix, Santa Clara, CA, USA), in accordance with the manufacturer's guidelines and regulations. Pre-imputation quality control of the genotype data was conducted using PLINK 2.0 (https://www.cog-genomics.org/plink2)³². SNPs and individuals were excluded if the missing rate was >10%, the minor allele frequency was <0.01, the P value of the Hardy–Weinberg equilibrium exact test was <1 × 10⁻⁶, or the total call rate was <0.98³³. Phased genotype data were subsequently imputed using Beagle 5.4 (version: 22 Jul 22.46e) (https://faculty.washington.edu/browning/beagle/beagle.html)³⁴ with a Taiwan population-specific reference panel containing 1495 whole-genome sequencing samples. After quality control and imputation, a total of 276,712 individuals with 14,029,683 variants were included for analysis.
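The variant-level thresholds above can be expressed as a simple filter. The Python sketch below is illustrative only: it assumes that per-variant missing rate, minor allele frequency, and Hardy–Weinberg P value have already been computed, the variant identifiers are hypothetical, and it does not reproduce the PLINK 2.0 command the authors ran (which is not given in the text). In PLINK, thresholds of this kind are usually set through flags such as `--geno`, `--maf`, and `--hwe`.

```python
def passes_variant_qc(missing_rate, maf, hwe_p,
                      max_missing=0.10, min_maf=0.01, min_hwe_p=1e-6):
    """Return True if a variant survives the three variant-level filters."""
    return (
        missing_rate <= max_missing   # exclude if missing rate > 10%
        and maf >= min_maf            # exclude if minor allele frequency < 0.01
        and hwe_p >= min_hwe_p        # exclude if HWE exact-test P < 1e-6
    )


# Hypothetical per-variant summaries: (missing rate, MAF, HWE P value).
example_variants = {
    "rsA": (0.02, 0.25, 0.5),    # passes all filters
    "rsB": (0.15, 0.30, 0.5),    # too many missing genotypes
    "rsC": (0.01, 0.005, 0.5),   # too rare
    "rsD": (0.00, 0.10, 1e-8),   # departs from Hardy-Weinberg equilibrium
}
kept_variants = [
    name for name, stats in example_variants.items()
    if passes_variant_qc(*stats)
]
```

The sample-level call-rate filter (<0.98) would be applied analogously per individual and is omitted here for brevity.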

Phenotype identification and PGS trait pairing

Since 1980, CMUH has provided treatment to over 3 million patients. Patient data, encompassing demographic information and clinical details such as medical history, medication history, and diagnostic test results, are systematically recorded in an electronic health record system. To identify phenotypes for analysis, we sourced International Classification of Diseases (ICD) codes, 9th and 10th revisions, from medical records, along with patient demographics such as age, sex, and other identifying information (Fig. 5). Utilizing the createPhenotypes function in PheWAS (https://github.com/PheWAS/PheWAS)³⁵, we grouped ICD codes to identify 1836 phenotypes (Supplementary Data 1). The inclusion and exclusion criteria were established using the Clinical Classification Software grouping schema and the incidence of codes in the electronic health records of several medical facilities, accessible at https://www.phewascatalog.org/phecodes.

**Fig. 5: Data collection and processing workflow for data from the China Medical University Hospital and the PGS Catalog.**

We gathered variant and weight data files for 544 PGS traits from the PGS Catalog²¹. To link the phenotypes and PGS traits, we employed a surjective pairing method implemented with BeautifulSoup (https://git.launchpad.net/beautifulsoup)³⁶, a widely used Python library for web scraping. BeautifulSoup enabled us to search for and extract phenotype-related keywords from the HTML content of the PGS Catalog website, including specific HTML elements such as tables or div tags. This effort resulted in a total of 362,605 phenotype−PGS pairs (Supplementary Data 2). However, only 100,869 of these phenotype−PGS pairs met the threshold for PGS performance analysis, based on the requirement of a sufficient sample size (>1000 cases) (Fig. 5). All the code used in this section is recorded in Supplementary Software 1.

We further scrutinized the retained phenotype-PGS pairs based on the P-value derived from the Wilcoxon rank sum test. This allowed for an initial screening of PGS traits, with up to 5 candidates considered per phenotype. This rigorous process led to the identification of 1730 phenotype-PGS pairs, which constituted the foundational dataset for our study.
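The screening step can be sketched in Python as follows. The sketch assumes that candidates are ranked by ascending Wilcoxon P value, which the text implies but does not state outright; the function name and the PGS labels are hypothetical.

```python
def top_candidates(pairs, max_per_phenotype=5):
    """Keep the PGS candidates with the smallest P values for each phenotype.

    pairs: iterable of (phenotype, pgs_id, p_value) tuples.
    Returns a dict mapping phenotype -> list of pgs_id, best first.
    """
    by_phenotype = {}
    for phenotype, pgs_id, p_value in pairs:
        by_phenotype.setdefault(phenotype, []).append((p_value, pgs_id))
    return {
        phenotype: [pgs_id for _, pgs_id in sorted(candidates)[:max_per_phenotype]]
        for phenotype, candidates in by_phenotype.items()
    }


# Hypothetical phenotype-PGS pairs with Wilcoxon rank sum P values.
example_pairs = [
    ("gout", "PGS_A", 1e-8),
    ("gout", "PGS_B", 1e-3),
    ("gout", "PGS_C", 1e-12),
    ("asthma", "PGS_D", 0.2),
]
example_selection = top_candidates(example_pairs, max_per_phenotype=2)
```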

PGS model construction and predictive performance

The PGS Catalog offers an online user interface (https://www.pgscatalog.org/) and provides score files for various traits, featuring uniformly formatted columns for variations, alleles, and weights. The PGSs were computed using PRSice-2 (https://choishingwan.github.io/PRSice/)³⁷. A weighted PGS model was employed, expressed as follows³⁸:

$$\mathrm{PRS}_{w}=\hat{\beta}_{1}G_{1}+\ldots+\hat{\beta}_{K}G_{K} \tag{1}$$

where ${G}_{k}\left(k=1,\ldots,K\right)$ represents the number of risk alleles for each genetic variant, which are coded as 0, 1, or 2 under the additive genetic model. The estimate of marginal genetic effects in the weighted SNP list is denoted by ${\hat{\beta }}_{k}(k=1,\ldots,K)$.
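As an illustration of the weighted score for a single individual, a minimal Python sketch is given below. The weights and genotype vector are made-up values, and the function is not PRSice-2: tools of that kind additionally handle missing genotypes, allele matching, and possible rescaling of the sum, none of which is reproduced here.

```python
def weighted_pgs(dosages, weights):
    """Return the weighted sum of risk-allele counts for one individual.

    dosages: risk-allele counts G_k (0, 1, or 2), one per variant.
    weights: effect-size estimates beta-hat_k, in the same variant order.
    """
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    if any(g not in (0, 1, 2) for g in dosages):
        raise ValueError("dosages must be coded 0, 1, or 2")
    return sum(b * g for b, g in zip(weights, dosages))


# Hypothetical three-variant score file and one genotype vector.
example_weights = [0.5, -0.25, 1.0]
example_dosages = [2, 1, 0]
example_score = weighted_pgs(example_dosages, example_weights)  # 0.5*2 - 0.25*1 + 1.0*0
```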

Following the computation, PGS distribution plots, stratified by disease case status (case-control groups), were generated using the ggplot2 R package. To assess the predictive capability of the PGS model, we implemented age- and sex-matching procedures at ratios of 1:8 for case-control pairs (refer to Supplementary Result, Supplementary Fig. 1), facilitated by the MatchIt package in R³⁹. Subsequently, we conducted statistical tests to evaluate differences in PGS distributions between cases and controls. This involved performing a two-sided Welch’s two-sample t-test and a two-sided Wilcoxon rank sum test. The dataset was then split into training and testing sets in an 8:2 ratio, employing four different covariate inclusion strategies for training models: age and sex, PGS alone, PGS combined with sex and age, and PGS combined with sex, age, and the first four principal components. The significance of the area under the curve (AUC) was assessed using DeLong’s method⁴⁰, and Youden’s index (J)⁴¹ was used to determine the optimal cutoff point for the PGS. In the context of survival analysis, Cox proportional hazards models⁴²,⁴³ were employed, utilizing age as the time scale to investigate the association between PGSs and disease endpoints⁴⁴. Additionally, disease distribution plots were created, stratifying individuals based on their PGS percentiles. These plots were generated using the ggplot2 package in R (version 4.1.1) and compiled into the GeneAnaBase website (https://pgscatalog.azure.nihxcmuh.org/#/) for review (Fig. 6). All the code used in this section is recorded in Supplementary Software 1.
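Two of the quantities named above, the AUC and the Youden-optimal cutoff, can be sketched in a few lines of Python. The sketch operates on raw scores rather than on fitted logistic regression predictions, uses toy values, and is not the authors' R code; the function names are hypothetical. The AUC is computed as the fraction of case-control pairs in which the case has the higher score, which equals the Wilcoxon rank sum (Mann–Whitney) U statistic divided by the number of pairs.

```python
def auc_from_scores(case_scores, control_scores):
    """Probability that a random case outscores a random control (ties count 0.5)."""
    wins = 0.0
    for c in case_scores:
        for d in control_scores:
            if c > d:
                wins += 1.0
            elif c == d:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def youden_cutoff(case_scores, control_scores):
    """Return (cutoff, J) maximising J = sensitivity + specificity - 1.

    An individual is classified as positive when score >= cutoff.
    """
    best_cutoff, best_j = None, -1.0
    for cutoff in sorted(set(case_scores) | set(control_scores)):
        sensitivity = sum(s >= cutoff for s in case_scores) / len(case_scores)
        specificity = sum(s < cutoff for s in control_scores) / len(control_scores)
        j = sensitivity + specificity - 1.0
        if j > best_j:
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j


# Toy scores (assumption): three cases and four controls.
toy_case_scores = [0.9, 0.8, 0.4]
toy_control_scores = [0.7, 0.3, 0.2, 0.1]
toy_auc = auc_from_scores(toy_case_scores, toy_control_scores)
toy_cutoff, toy_j = youden_cutoff(toy_case_scores, toy_control_scores)
```

The pairwise loop is quadratic in sample size and is meant only to show the definition; a rank-based implementation would be preferable for cohorts of the size analysed here.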

**Fig. 6: Calculation, measurement, and display of the polygenic risk score model evaluation results.**

Evaluating the consistency of PGSs across SNP detection methods

During the calculation of PGS using SNP array data, we encountered a challenge: the 14,029,683 available variants did not cover all of the variants required by the PGS score files. To ensure the integrity of PGS values, we conducted a comparative analysis involving 353 individuals who had both whole-genome sequencing and SNP array data.

In brief, the in-house whole-genome sequencing data was obtained using a 30X depth 150 bp paired-end sequencing method. This was followed by alignment to the GRCh38 human genome and completed variant calling using the DRAGEN genome pipeline (Illumina, Inc., San Diego, CA, USA). PGSs were calculated using PRSice-2, the same method used for the SNP array data. To assess the reliability of the PGS scoring from SNP array data, we utilized the intraclass correlation coefficient (ICC) approach, denoted by the formula⁴⁵:

$$\mathrm{ICC}=\frac{\mathrm{MSBS}-\mathrm{MSE}}{\mathrm{MSBS}+(k-1)\,\mathrm{MSE}} \tag{2}$$

where MSBS represents the mean square between subjects, MSE represents the mean square error, and k represents the number of methods under consideration. This approach enables the evaluation of agreement between the two methods while treating them as fixed effects, thereby eliminating systematic errors and focusing on the random residual error. All the code used in this section is recorded in Supplementary Software 1.
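The ICC formula can be illustrated with a short Python sketch that derives the two mean squares from a two-way layout (subjects in rows, methods in columns). The function name and the toy scores are hypothetical, and the sketch is an assumption about how the mean squares are obtained rather than a copy of the study's code.

```python
def icc_consistency(ratings):
    """ICC = (MSBS - MSE) / (MSBS + (k - 1) * MSE) for a subjects x methods table."""
    n = len(ratings)       # number of subjects
    k = len(ratings[0])    # number of methods (e.g., WGS and SNP array)
    grand = sum(sum(row) for row in ratings) / (n * k)
    subject_means = [sum(row) / k for row in ratings]
    method_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Between-subject sum of squares.
    ss_subjects = k * sum((m - grand) ** 2 for m in subject_means)
    # Residual sum of squares after removing subject and method effects.
    ss_error = sum(
        (ratings[i][j] - subject_means[i] - method_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    msbs = ss_subjects / (n - 1)
    mse = ss_error / ((n - 1) * (k - 1))
    return (msbs - mse) / (msbs + (k - 1) * mse)


# Toy scores (assumption): the second method is the first plus a constant offset.
offset_scores = [[1.0, 1.5], [2.0, 2.5], [3.0, 3.5]]
offset_icc = icc_consistency(offset_scores)

# Toy scores with some disagreement between the two methods.
noisy_scores = [[1.0, 1.2], [2.0, 1.8], [3.0, 3.1]]
noisy_icc = icc_consistency(noisy_scores)
```

Because a constant offset between methods is absorbed by the method effect, the first toy table yields perfect consistency, which mirrors the statement that systematic differences between the two platforms are removed and only random residual error is penalised.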

The PGS score files exhibited a wide range of required SNPs, with an average of 28147.41 ± 73980.64 and a median of 420.00 variants. The average SNP missing rate in the SNP array data was 16.52% ± 14.05%, with a median of 11.28%. The average ICC score was computed at 0.82 ± 0.12, with a median value of 0.82. In our study, this ICC score ranged between 0.75 and 1, indicating a high level of consistency across most comparisons.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The raw SNP array data are protected and are not available due to data privacy laws. However, we are committed to fostering collaboration and promoting transparency in scientific research. As such, we welcome collaborative projects and discussions with fellow researchers who may require access to the data. Please provide your identity, employer, purpose of data access, and IRB approval to the mailbox of the corresponding author, Dr. F.-J.T. (000704@tool.caaumed.org.tw). We will meet within one month to discuss and respond. The statistical results generated in this study are provided on the GeneAnaBase website, which can be found at https://pgscatalog.azure.nihxcmuh.org/#/. Source data are provided with this paper.

Code availability

All codes used for data download, processing, calculation, and graphing are recorded in the Supplementary Software. Please refer to this document for detailed information.

References

Lewis, C. M. & Vassos, E. Polygenic risk scores: from research tools to clinical instruments. Genome Med 12, 44 (2020).
Lambert, S. A., Abraham, G. & Inouye, M. Towards clinical utility of polygenic risk scores. Hum. Mol. Genet 28, R133–R142 (2019).
Visscher, P. M. et al. 10 years of GWAS discovery: biology, function, and translation. Am. J. Hum. Genet 101, 5–22 (2017).
Pergament, E. et al. Single-nucleotide polymorphism-based noninvasive prenatal screening in a high-risk and low-risk cohort. Obstet. Gynecol. 124, 210–218 (2014).
Conran, C. A. et al. Population-standardized genetic risk score: the SNP-based method of choice for inherited risk assessment of prostate cancer. Asian J. Androl. 18, 520–524 (2016).
So, H. C. & Sham, P. C. Improving polygenic risk prediction from summary statistics by an empirical Bayes approach. Sci. Rep. 7, 41262 (2017).
Newcombe, P. J., Nelson, C. P., Samani, N. J. & Dudbridge, F. A flexible and parallelizable approach to genome-wide polygenic risk scores. Genet Epidemiol. 43, 730–741 (2019).
Choi, S. W., Mak, T. S.-H. & O’Reilly, P. F. Tutorial: a guide to performing polygenic risk score analyses. Nat. Protoc. 15, 2759–2772 (2020).
Chatterjee, N., Shi, J. & Garcia-Closas, M. Developing and evaluating polygenic risk prediction models for stratified disease prevention. Nat. Rev. Genet 17, 392–406 (2016).
Bann, D., Wright, L., Hardy, R., Williams, D. M. & Davies, N. M. Polygenic and socioeconomic risk for high body mass index: 69 years of follow-up across life. PLoS Genet 18, e1010233 (2022).
Allegrini, A. G. et al. Genomic prediction of cognitive traits in childhood and adolescence. Mol. Psychiatry 24, 819–827 (2019).
Sharifi, M., Futema, M., Nair, D. & Humphries, S. E. Polygenic hypercholesterolemia and cardiovascular disease risk. Curr. Cardiol. Rep. 21, 43 (2019).
Zhang, J., Johnsen, S. P., Guo, Y. & Lip, G. Y. H. Epidemiology of atrial fibrillation: geographic/ecological risk factors, age, sex, genetics. Card. Electrophysiol. Clin. 13, 1–23 (2021).
Oh, J. J. & Hong, S. K. Polygenic risk score in prostate cancer. Curr. Opin. Urol. 32, 466–471 (2022).
Junior, H. L. R., Novaes, L. A. C., Datorre, J. G., Moreno, D. A. & Reis, R. M. Role of polygenic risk score in cancer precision medicine of non-European populations: a systematic review. Curr. Oncol. 29, 5517–5530 (2022).
Song, S. H. & Byun, S. S. Polygenic risk score for genetic evaluation of prostate cancer risk in Asian populations: A narrative review. Investig. Clin. Urol. 62, 256–266 (2021).
Ni, G. et al. A comparison of ten polygenic score methods for psychiatric disorders applied across multiple cohorts. Biol. Psychiatry 90, 611–620 (2021).
Wang, S. C., Chen, Y. C., Lee, C. H. & Cheng, C. M. Opioid addiction, genetic susceptibility, and medical treatments: a review. Int J. Mol. Sci. 20, 4294 (2019).
Wahbeh, M. H. & Avramopoulos, D. Gene-environment interactions in schizophrenia: a literature review. Genes (Basel) 12, 1850 (2021).
Khera, A. V. et al. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat. Genet 50, 1219–1224 (2018).
Lambert, S. A. et al. The polygenic score catalog as an open database for reproducibility and systematic evaluation. Nat. Genet 53, 420–425 (2021).
Wang, Y., Tsuo, K., Kanai, M., Neale, B. M. & Martin, A. R. Challenges and opportunities for developing more generalizable polygenic risk scores. Annu Rev. Biomed. Data Sci. 5, 293–320 (2022).
Ho, P. J. et al. Polygenic risk scores for the prediction of common cancers in East Asians: A population-based prospective cohort study. Elife 12, e82608 (2023).
Dudbridge, F. Power and predictive accuracy of polygenic risk scores. PLoS Genet 9, e1003348 (2013).
Li, L. et al. Disease risk factors identified through shared genetic architecture and electronic medical records. Sci. Transl. Med 6, 234ra57 (2014).
Clémençon, S., Vayatis, N. & Depecker, M. AUC optimization and the two-sample problem in Neural Information Processing Systems 22: 23rd Annual Conference on Neural Information Processing Systems 2009. 360–368 (Curran Associates, Inc., 2009).
Saurabh, R., Fouodo, C. J. K., König, I. R., Busch, H. & Wohlers, I. A survey of genome-wide association studies, polygenic scores and UK Biobank highlights resources for autoimmune disease genetics. Front Immunol. 13, 972107 (2022).
Chang, Y.-S. et al. Polygenic risk score trend and new variants on chromosome 1 are associated with male gout in genome-wide association study. Arthritis Res. Ther. 24, 229 (2022).
Liu, T. Y. et al. Genome-wide association study of hyperthyroidism based on electronic medical record from Taiwan. Front Med (Lausanne) 9, 830621 (2022).
Chiou, J.-S. et al. Your height affects your health: genetic determinants and health-related outcomes in Taiwan. BMC Med. 20, 250 (2022).
Bau, D.-T. et al. A genome-wide association study identified novel genetic susceptibility loci for oral cancer in Taiwan. Int. J. Mol. Sci. 24, 2789 (2023).
Marees, A. T. et al. A tutorial on conducting genome-wide association studies: Quality control and statistical analysis. Int J. Methods Psychiatr. Res 27, e1608 (2018).
Liu, T. Y. et al. Comparison of multiple imputation algorithms and verification using whole-genome sequencing in the CMUH genetic biobank. Biomedicine (Taipei) 11, 57–65 (2021).
Loh, P. R. et al. Insights into clonal haematopoiesis from 8342 mosaic chromosomal alterations. Nature 559, 350–355 (2018).
Denny, J. C. et al. Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nat. Biotechnol. 31, 1102–1110 (2013).
Richardson, L. Beautiful soup documentation. April (2007).
Choi, S. W. & O’Reilly, P. F. PRSice-2: Polygenic Risk Score software for biobank-scale data. Gigascience 8, giz082 (2019).
Liu, W., Zhuang, Z., Wang, W., Huang, T. & Liu, Z. An improved genome-wide polygenic score model for predicting the risk of type 2 diabetes. Front Genet 12, 632385 (2021).
Zhao, Q. Y. et al. Propensity score matching with R: conventional methods and new features. Ann. Transl. Med 9, 812 (2021).
DeLong, E. R., DeLong, D. M. & Clarke-Pearson, D. L. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44, 837–845 (1988).
Fluss, R., Faraggi, D. & Reiser, B. Estimation of the Youden Index and its associated cutoff point. Biom. J. 47, 458–472 (2005).
Schemper, M. Cox analysis of survival data with non-proportional hazard functions. J. R. Stat. Soc. Ser. D. (Statistician) 41, 455–465 (1992).
Therneau, T. A package for survival analysis in R. R package version 3.1-12, https://CRAN.R-project.org/package=survival (2020).
Vaura, F. et al. Polygenic risk scores predict hypertension onset and cardiovascular risk. Hypertension 77, 1119–1127 (2021).
Liljequist, D., Elfving, B. & Skavberg Roaldsen, K. Intraclass correlation - A discussion and demonstration of basic features. PLoS One 14, e0219854 (2019).

Acknowledgements

We are grateful for permission to use the genotyping microarray data from the Million Person Precision Medicine Initiative at China Medical University Hospital. We also thank the Health Data Science Center, China Medical University Hospital, for providing administrative and technical support. This study was supported in part by the Taiwan Ministry of Health and Welfare Clinical Trial Center (MOHW112-TDU-B-212-144004), and was funded by China Medical University and China Medical University Hospital (DMR-112-124, DMR-111-227, MOST 110-2314-B-039-010-MY2, MOST 111-2321-B-039-005, MOST 111-2321-B-030-004, and MOST 111-2622-8-039-001-IE).

Author information

These authors contributed equally: Fuu-Jen Tsai, Kai-Cheng Hsu.

Authors and Affiliations

Artificial Intelligence Center, China Medical University Hospital, Taichung, 40447, Taiwan
Ting-Hsuan Sun, Chia-Chun Wang, Shih-Chang Lo, Yi-Xuan Huang, Shang-Yu Chien, Yu-De Chu & Kai-Cheng Hsu
Million-person Precision Medicine Initiative, Department of Medical Research, China Medical University Hospital, Taichung, 40447, Taiwan
Ting-Yuan Liu
Department of Medical Research, China Medical University Hospital, Taichung, 40447, Taiwan
Fuu-Jen Tsai
School of Chinese Medicine, China Medical University, Taichung, 40402, Taiwan
Fuu-Jen Tsai
Division of Pediatric Genetics, Children’s Hospital of China Medical University, Taichung, 40447, Taiwan
Fuu-Jen Tsai
Department of Biotechnology and Bioinformatics, Asia University, Taichung, 41354, Taiwan
Fuu-Jen Tsai
Department of Neurology, China Medical University Hospital, Taichung, 40447, Taiwan
Kai-Cheng Hsu
Department of Medicine, China Medical University, Taichung, 40402, Taiwan
Kai-Cheng Hsu

Authors

Ting-Hsuan Sun
Chia-Chun Wang
Ting-Yuan Liu
Shih-Chang Lo
Yi-Xuan Huang
Shang-Yu Chien
Yu-De Chu
Fuu-Jen Tsai
Kai-Cheng Hsu

Contributions

F.-J.T. and K.-C.H. contributed equally to the study's conception and provided essential resources. T.-H.S. originated and formulated the study design. C.-C.W. conducted the data analysis. T.-H.S. and C.-C.W. wrote the manuscript. T.-Y.L. and Y.-X.H. were involved in data collection and preprocessing. S.-C.L. and Y.-D.C. offered specialized knowledge in website and database development. S.-Y.C. provided research recommendations. All authors reviewed and provided feedback on each draft of the manuscript, and all have read and approved the final version.

Corresponding authors

Correspondence to Fuu-Jen Tsai or Kai-Cheng Hsu.

Ethics declarations

Competing interests

The authors declare no competing interests, financial or non-financial, that could have appeared to influence the work reported in this paper.

Peer review

Peer review information

Nature Communications thanks Shing Wan Choi and Jibril Hirbo for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Peer Review File

Description of Additional Supplementary Files

Supplementary Data 1

Supplementary Data 2

Supplementary Software 1

Reporting Summary

Source data

Source Data

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Sun, TH., Wang, CC., Liu, TY. et al. Utility of polygenic scores across diverse diseases in a hospital cohort for predictive modeling. Nat Commun 15, 3168 (2024). https://doi.org/10.1038/s41467-024-47472-5

Received: 07 June 2023
Accepted: 29 March 2024
Published: 12 April 2024
DOI: https://doi.org/10.1038/s41467-024-47472-5
