Abstract
Polygenic risk scores suffer reduced accuracy in non-European populations, exacerbating health disparities. We propose PolyPred, a method that improves cross-population polygenic risk scores by combining two predictors: a new predictor that leverages functionally informed fine-mapping to estimate causal effects (instead of tagging effects), addressing linkage disequilibrium differences, and BOLT-LMM, a published predictor. When a large training sample is available in the non-European target population, we propose PolyPred+, which further incorporates the non-European training data. We applied PolyPred to 49 diseases/traits in four UK Biobank populations using UK Biobank British training data, and observed relative improvements versus BOLT-LMM ranging from +7% in south Asians to +32% in Africans, consistent with simulations. We applied PolyPred+ to 23 diseases/traits in UK Biobank east Asians using both UK Biobank British and Biobank Japan training data, and observed improvements of +24% versus BOLT-LMM and +12% versus PolyPred. Summary statistics-based analogs of PolyPred and PolyPred+ attained similar improvements.
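The combination of two predictors described above can be pictured as a weighted sum of two per-individual scores, with the weights fitted in a small sample from the target population. The following is a minimal sketch under assumptions not stated on this page: the weights are fitted by ordinary least squares with an intercept, and the function names `estimate_mixing_weights` and `combine` are hypothetical. It is not the PolyFun implementation, which may estimate the weights differently.

```python
from statistics import fmean


def estimate_mixing_weights(prs_a, prs_b, phenotype):
    """Fit phenotype ~ w_a * prs_a + w_b * prs_b + intercept by least squares.

    Assumption: ordinary least squares on a small target-population sample.
    Returns the two mixing weights (w_a, w_b); the intercept is dropped.
    """
    # Centering every variable absorbs the intercept.
    mean_a, mean_b, mean_y = fmean(prs_a), fmean(prs_b), fmean(phenotype)
    a = [x - mean_a for x in prs_a]
    b = [x - mean_b for x in prs_b]
    y = [v - mean_y for v in phenotype]

    # Entries of the 2x2 normal equations.
    s_aa = sum(x * x for x in a)
    s_bb = sum(x * x for x in b)
    s_ab = sum(x * z for x, z in zip(a, b))
    s_ay = sum(x * v for x, v in zip(a, y))
    s_by = sum(x * v for x, v in zip(b, y))

    det = s_aa * s_bb - s_ab * s_ab
    if det == 0:
        raise ValueError("scores are collinear; weights are not identifiable")

    # Cramer's rule for the 2x2 system.
    w_a = (s_ay * s_bb - s_by * s_ab) / det
    w_b = (s_by * s_aa - s_ay * s_ab) / det
    return w_a, w_b


def combine(prs_a, prs_b, w_a, w_b):
    """Apply the fitted weights to obtain one combined score per individual."""
    return [w_a * x + w_b * z for x, z in zip(prs_a, prs_b)]


# Toy data: the phenotype is exactly 2 * prs_a + 3 * prs_b + 5.
weights = estimate_mixing_weights([0, 1, 2, 3], [1, 0, 1, 0], [8, 7, 12, 11])
combined = combine([1, 2], [3, 4], *weights)
```

In this picture the two inputs would be the fine-mapping-based score and the BOLT-LMM score for the same individuals. A third score trained in the target population could be added as one more column in the same regression.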
Data availability
Access to the UK Biobank resource is available via application (http://www.ukbiobank.ac.uk). PRS coefficients generated in the present study are available for public download at http://data.broadinstitute.org/alkesgroup/polypred_results. Summary LD information of n = 337,000 British-ancestry UK Biobank individuals for 2,763 overlapping 3-Mb loci is available at https://data.broadinstitute.org/alkesgroup/UKBB_LD. Summary LD information of n = 50,000 UK Biobank individuals for SBayesR is available at https://zenodo.org/record/3350914. Summary LD information used by PRS-CS is available at https://github.com/getian107/PRScs. Baseline-LF v.2.2.UKB annotations and LD scores for UK Biobank SNPs are available at https://data.broadinstitute.org/alkesgroup/LDSCORE/baselineLF_v2.2.UKB.tar.gz. Source data are provided with this paper.
Code availability
PolyPred and PolyPred+ are provided as part of the open-source software package PolyFun, which is freely available at https://doi.org/10.5281/zenodo.6139679 (ref. 89) and https://github.com/omerwe/polyfun. BOLT-LMM is available at https://data.broadinstitute.org/alkesgroup/BOLT-LMM. SBayesR is available at https://cnsgenomics.com/software/gctb. PRS-CS is available at https://github.com/getian107/PRScs.
References
Chatterjee, N., Shi, J. & García-Closas, M. Developing and evaluating polygenic risk prediction models for stratified disease prevention. Nat. Rev. Genet. 17, 392–406 (2016).
Khera, A. V. et al. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat. Genet. 50, 1219–1224 (2018).
Torkamani, A., Wineinger, N. E. & Topol, E. J. The personal and clinical utility of polygenic risk scores. Nat. Rev. Genet. 19, 581–590 (2018).
Khera, A. V. et al. Polygenic prediction of weight and obesity trajectories from birth to adulthood. Cell 177, 587–596 (2019).
Mavaddat, N. et al. Polygenic risk scores for prediction of breast cancer and breast cancer subtypes. Am. J. Hum. Genet. 104, 21–34 (2019).
Li, R., Chen, Y., Ritchie, M. D. & Moore, J. H. Electronic health records and polygenic risk scores for predicting disease risk. Nat. Rev. Genet. 21, 493–502 (2020).
Márquez-Luna, C., Loh, P.-R., South Asian Type 2 Diabetes (SAT2D) Consortium, SIGMA Type 2 Diabetes Consortium & Price, A. L. Multiethnic polygenic risk scores improve risk prediction in diverse populations. Genet. Epidemiol. 41, 811–823 (2017).
Grinde, K. E. et al. Generalizing polygenic risk scores from Europeans to Hispanics/Latinos. Genet. Epidemiol. 43, 50–62 (2019).
Peterson, R. E. et al. Genome-wide association studies in ancestrally diverse populations: opportunities, methods, pitfalls, and recommendations. Cell 179, 589–603 (2019).
Sirugo, G., Williams, S. M. & Tishkoff, S. A. The missing diversity in human genetic studies. Cell 177, 26–31 (2019).
Duncan, L. et al. Analysis of polygenic risk score usage and performance in diverse human populations. Nat. Commun. 10, 3328 (2019).
Gurdasani, D., Barroso, I., Zeggini, E. & Sandhu, M. S. Genomics of disease risk in globally diverse populations. Nat. Rev. Genet. 20, 520–535 (2019).
Martin, A. R. et al. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat. Genet. 51, 584–591 (2019).
Wang, Y. et al. Theoretical and empirical quantification of the accuracy of polygenic scores in ancestry divergent populations. Nat. Commun. 11, 3865 (2020).
Amariuta, T. et al. Improving the trans-ancestry portability of polygenic risk scores by prioritizing variants in predicted cell-type-specific regulatory elements. Nat. Genet. 52, 1346–1354 (2020).
Marnetto, D. et al. Ancestry deconvolution and partial polygenic score can improve susceptibility predictions in recently admixed individuals. Nat. Commun. 11, 1628 (2020).
Bitarello, B. D. & Mathieson, I. Polygenic scores for height in admixed populations. G3 10, 4027–4036 (2020).
Chen, M.-H. et al. Trans-ethnic and ancestry-specific blood-cell genetics in 746,667 individuals from 5 global populations. Cell 182, 1198–1213 (2020).
Mahajan, A. et al. Trans-ancestry genetic study of type 2 diabetes highlights the power of diverse populations for discovery and translation. Preprint at medRxiv https://www.medrxiv.org/content/10.1101/2020.09.22.20198937v1 (2020).
Cavazos, T. B. & Witte, J. S. Inclusion of variants discovered from diverse populations improves polygenic risk score transferability. Hum. Genet. Genom. Adv. 2, 100017 (2021).
Mills, M. C. & Rahal, C. The GWAS diversity monitor tracks diversity by disease in real time. Nat. Genet. 52, 242–243 (2020).
Lehmann, B. C., Mackintosh, M., McVean, G. & Holmes, C. C. High trait variability in optimal polygenic prediction strategy within multiple-ancestry cohorts. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/2021.01.15.426781v2 (2021).
Ji, Y. et al. Incorporating European GWAS findings improve polygenic risk prediction accuracy of breast cancer among East Asians. Genet. Epidemiol. https://doi.org/10.1002/gepi.22382 (2021).
Ruan, Y. et al. Improving polygenic prediction in ancestrally diverse populations. Preprint at medRxiv https://www.medrxiv.org/content/10.1101/2020.12.27.20248738v2 (2020).
Cai, M. et al. A unified framework for cross-population trait prediction by leveraging the genetic correlation of polygenic traits. Am. J. Hum. Genet. https://doi.org/10.1016/j.ajhg.2021.03.002 (2021).
Huang, Q. Q. et al. Transferability of genetic loci and polygenic scores for cardiometabolic traits in British Pakistanis and Bangladeshis. Preprint at medRxiv https://www.medrxiv.org/content/10.1101/2020.12.27.20248738v2 (2021).
Durvasula, A. & Lohmueller, K. E. Negative selection on complex traits limits phenotype prediction accuracy between populations. Am. J. Hum. Genet. 108, 620–631 (2021).
Coram, M. A., Fang, H., Candille, S. I., Assimes, T. L. & Tang, H. Leveraging multi-ethnic evidence for risk assessment of quantitative traits in minority populations. Am. J. Hum. Genet. 101, 218–226 (2017).
Wojcik, G. L. et al. Genetic analyses of diverse populations improves discovery for complex traits. Nature 570, 514–518 (2019).
Shi, H. et al. Population-specific causal disease effect sizes in functionally important regions impacted by selection. Nat. Commun. 12, 1098 (2021).
Kuchenbaecker, K. et al. The transferability of lipid loci across African, Asian and European cohorts. Nat. Commun. 10, 4330 (2019).
Mostafavi, H. et al. Variable prediction accuracy of polygenic scores within an ancestry group. eLife 9, e48376 (2020).
Vilhjálmsson, B. J. et al. Modeling linkage disequilibrium increases accuracy of polygenic risk scores. Am. J. Hum. Genet. 97, 576–592 (2015).
Schaid, D. J., Chen, W. & Larson, N. B. From genome-wide associations to candidate causal variants by statistical fine-mapping. Nat. Rev. Genet. 19, 491–504 (2018).
Weissbrod, O. et al. Functionally informed fine-mapping and polygenic localization of complex trait heritability. Nat. Genet. 52, 1355–1363 (2020).
Loh, P.-R. et al. Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat. Genet. 47, 284–290 (2015).
Loh, P.-R., Kichaev, G., Gazal, S., Schoech, A. P. & Price, A. L. Mixed-model association for biobank-scale datasets. Nat. Genet. 50, 906–908 (2018).
Lloyd-Jones, L. R. et al. Improved polygenic prediction by Bayesian multiple regression on summary statistics. Nat. Commun. 10, 5086 (2019).
Ge, T., Chen, C.-Y., Ni, Y., Feng, Y.-C. A. & Smoller, J. W. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat. Commun. 10, 1776 (2019).
Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
Nagai, A. et al. Overview of the BioBank Japan project: study design and profile. J. Epidemiol. 27, S2–S8 (2017).
Asiki, G. et al. The general population cohort in rural south-western Uganda: a platform for communicable and non-communicable disease studies. Int. J. Epidemiol. 42, 129–141 (2013).
Heckerman, D. et al. Linear mixed model for heritability estimation that explicitly addresses environmental variation. Proc. Natl Acad. Sci. USA 113, 7377–7382 (2016).
Duan, S., Zhang, W., Cox, N. J. & Dolan, M. E. FstSNP-HapMap3: a database of SNPs with high population differentiation for HapMap3. Bioinformation 3, 139–141 (2008).
Purcell, S. M. et al. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 460, 748–752 (2009).
Stahl, E. A. et al. Bayesian inference analyses of the polygenic architecture of rheumatoid arthritis. Nat. Genet. 44, 483–489 (2012).
Gazal, S. et al. Functional architecture of low-frequency variants highlights strength of negative selection across coding and non-coding annotations. Nat. Genet. 50, 1600–1607 (2018).
Lam, M. et al. Comparative genetic architectures of schizophrenia in East Asian and European populations. Nat. Genet. 51, 1670–1678 (2019).
Nievergelt, C. M. et al. International meta-analysis of PTSD genome-wide association studies identifies sex- and ancestry-specific genetic risk loci. Nat. Commun. 10, 4558 (2019).
Sakaue, S. et al. Trans-biobank analysis with 676,000 individuals elucidates the association of polygenic risk scores of complex traits with human lifespan. Nat. Med. 26, 542–548 (2020).
Vuckovic, D. et al. The polygenic and monogenic basis of blood traits and diseases. Cell 182, 1214–1231.e11 (2020).
Guo, J. et al. Global genetic differentiation of complex traits shaped by natural selection in humans. Nat. Commun. 9, 1865 (2018).
Sved, J. A., McRae, A. F. & Visscher, P. M. Divergence between human populations estimated from linkage disequilibrium. Am. J. Hum. Genet. 83, 737–743 (2008).
Budin-Ljøsne, I. et al. Data sharing in large research consortia: experiences and recommendations from ENGAGE. Eur. J. Hum. Genet. 22, 317–321 (2014).
Surakka, I. et al. The impact of low-frequency and rare variants on lipid levels. Nat. Genet. 47, 589–597 (2015).
Horikoshi, M. et al. Discovery and fine-mapping of glycaemic and obesity-related trait loci using high-density imputation. PLoS Genet. 11, e1005230 (2015).
Pain, O. et al. Evaluation of polygenic prediction methodology within a reference-standardized framework. PLoS Genet. 17, e1009021 (2021).
Chung, W. et al. Efficient cross-trait penalized regression increases prediction accuracy in large cohorts using secondary phenotypes. Nat. Commun. 10, 569 (2019).
Chun, S. et al. Non-parametric polygenic risk prediction via partitioned GWAS summary statistics. Am. J. Hum. Genet. 107, 46–59 (2020).
Im, C. et al. Generalizability of ‘GWAS hits’ in clinical populations: lessons from childhood cancer survivors. Am. J. Hum. Genet. 107, 636–653 (2020).
Daetwyler, H. D., Villanueva, B. & Woolliams, J. A. Accuracy of predicting the genetic risk of disease using a genome-wide approach. PLoS ONE 3, e3395 (2008).
Visscher, P. M. & Hill, W. G. The limits of individual identification from sample allele frequencies: theory and statistical analysis. PLoS Genet. 5, e1000628 (2009).
Galinsky, K. J. et al. Estimating cross-population genetic correlations of causal effect sizes. Genet. Epidemiol. 43, 180–188 (2019).
Brown, B. C., Ye, C. J., Price, A. L. & Zaitlen, N. Transethnic genetic-correlation estimates from summary statistics. Am. J. Hum. Genet. 99, 76–88 (2016).
1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Zeng, J. et al. Signatures of negative selection in the genetic architecture of human complex traits. Nat. Genet. 50, 746–753 (2018).
Schoech, A. P. et al. Quantification of frequency-dependent genetic architectures in 25 UK Biobank traits reveals action of negative selection. Nat. Commun. 10, 790 (2019).
Zhang, Q., Privé, F., Vilhjálmsson, B. & Speed, D. Improved genetic prediction of complex traits from individual-level data or summary statistics. Nat. Commun. 12, 4192 (2021).
Hu, Y. et al. Leveraging functional annotations in genetic risk prediction for human complex diseases. PLoS Comput. Biol. 13, e1005589 (2017).
Márquez-Luna, C. et al. Incorporating functional priors improves polygenic prediction accuracy in UK Biobank and 23andMe data sets. Nat. Commun. 12, 6052 (2021).
Mak, T. S. H., Porsch, R. M., Choi, S. W., Zhou, X. & Sham, P. C. Polygenic scores via penalized regression on summary statistics. Genet. Epidemiol. 41, 469–480 (2017).
Yang, S. & Zhou, X. Accurate and scalable construction of polygenic scores in large biobank data sets. Am. J. Hum. Genet. 106, 679–693 (2020).
Qian, J. et al. A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank. PLoS Genet. 16, e1009141 (2020).
Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228–1235 (2015).
Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
Akiyama, M. et al. Characterizing rare and low-frequency height-associated variants in the Japanese population. Nat. Commun. 10, 4393 (2019).
Loh, P.-R. et al. Reference-based phasing using the Haplotype Reference Consortium panel. Nat. Genet. 48, 1443–1448 (2016).
Das, S. et al. Next-generation genotype imputation service and methods. Nat. Genet. 48, 1284–1287 (2016).
Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006).
Sakaue, S. et al. A cross-population atlas of genetic associations for 220 human phenotypes. Nat. Genet. 53, 1415–1424 (2021).
Gurdasani, D. et al. Uganda genome resource enables insights into population history and genomic discovery in Africa. Cell 179, 984–1002 (2019).
Lam, M. et al. RICOPILI: Rapid Imputation for COnsortias PIpeLIne. Bioinformatics 36, 930–933 (2020).
Lloyd-Jones, L. GCTB SBayesR shrunk sparse linkage disequilibrium matrices for HM3 variants, summary statistics and predictors generated from ‘Improved polygenic prediction by Bayesian multiple regression on summary statistics’ by Lloyd-Jones, Zeng et al. 2019. Zenodo https://doi.org/10.5281/ZENODO.3350914 (2019).
Yang, J., Lee, S. H., Goddard, M. E. & Visscher, P. M. GCTA: a tool for genome-wide complex trait analysis. Am. J. Hum. Genet. 88, 76–82 (2011).
Gazal, S., Marquez-Luna, C., Finucane, H. K. & Price, A. L. Reconciling S-LDSC and LDAK functional enrichment estimates. Nat. Genet. 51, 1202–1204 (2019).
Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience 4, 7 (2015).
Purcell, S. & Chang, C. PLINK v2.00a3LM. www.cog-genomics.org/plink/2.0/
The UK10K Consortium et al. The UK10K project identifies rare variants in health and disease. Nature 526, 82–90 (2015).
Weissbrod, O. Source code for PolyFun. Zenodo https://doi.org/10.5281/zenodo.6139679 (2022).
Acknowledgements
We thank A. Schoech and C. Márquez-Luna for helpful discussions. This research was conducted using the UK Biobank resource under application no. 16549 and was funded by the National Institutes of Health (NIH; grant nos. U01 HG009379, U01 HG012009, R37 MH107649, R01 MH101244 and R01 HG006399). M.K. was supported by a Nakajima Foundation Fellowship and the Masason Foundation. W.J.P. was supported by an NWO Veni grant (no. 91619152). A.R.M. was supported by the National Institute of Mental Health (grant no. K99/R00MH117229). H.K.F. was supported by E. and W. Schmidt. A.V.K. was supported by grants (nos. 1K08HG010155 and 1U01HG011719) from the National Human Genome Research Institute and a sponsored research agreement from IBM Research. Y.O. was supported by JSPS KAKENHI (grant nos. 19H01021 and 20K21834), AMED (grant nos. JP21km0405211, JP21ek0109413, JP21ek0410075, JP21gm4010006 and JP21km0405217) and JST Moonshot R&D (grant nos. JPMJMS2021 and JPMJMS2024). Computational analyses were performed on the O2 High-Performance Compute Cluster at Harvard Medical School.
Author information
Contributions
O.W., M.K., H.S. and A.L.P. designed the study. O.W., M.K., H.S. and S.G. analyzed the data. O.W., M.K., H.S. and A.L.P. wrote the manuscript with assistance from S.G., W.J.P., A.V.K., Y.O., A.R.M. and H.K.F.
Ethics declarations
Competing interests
O.W. is an employee of and holds equity in Eleven Therapeutics. H.S. is an employee of Genentech and holds stock in Roche. A.V.K. is an employee of and holds equity in Verve Therapeutics, and has served as a scientific advisor to Sanofi, Amgen, Maze Therapeutics, Navitor Pharmaceuticals, Sarepta Therapeutics, Novartis, Silence Therapeutics, Korro Bio, Veritas International, Color Health, Third Rock Ventures, Foresite Labs and Columbia University (NIH); A.V.K. received speaking fees from Illumina, MedGenome, Amgen and the Novartis Institute for Biomedical Research, and also received a sponsored research agreement from IBM Research. All other authors declare no competing interests.
Peer review
Peer review file
Nature Genetics thanks Marylyn Ritchie and Vincent Plagnol for their contribution to the peer review of this work. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Cross-population PRS results for real UK Biobank traits, using summary statistics from a meta-analysis of many cohorts.
We report average prediction accuracy (relative-R², but computed with respect to PRS-CS instead of BOLT-LMM; see main text), meta-analyzed across 4 well-powered, approximately independent traits, for PRS trained in European Network for Genetic and Genomic Epidemiology (ENGAGE) samples (average N = 61,365) and applied to four UK Biobank populations. Target population sample sizes are indicated in parentheses; PolyPred and its summary statistic-based analogues used 500 additional training samples from each target population to estimate mixing weights. Asterisks above each bar denote statistical significance of the difference vs. PRS-CS, with red asterisks denoting a disadvantage (*P < 0.05; **P < 0.001). P-values were computed using a two-sided Wald test and were not adjusted for multiple comparisons. Error bars denote standard errors. Numerical results, results for all 4 traits analyzed, absolute prediction accuracies (R²), and P-values of relative improvements vs. PRS-CS are reported in Supplementary Table 5 and Supplementary Table 8.
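The exact definition of relative-R² is given in the main text and is not reproduced here. As a rough illustration only, the sketch below assumes it can be read as the ratio between a method's R² and the R² of a reference method on the same individuals, with R² taken as the squared Pearson correlation between score and phenotype. The function names `r_squared` and `relative_r2` are hypothetical, and the published normalization may differ.

```python
def r_squared(prediction, phenotype):
    """Squared Pearson correlation between a score and a phenotype."""
    n = len(phenotype)
    mean_p = sum(prediction) / n
    mean_y = sum(phenotype) / n
    s_py = sum((p - mean_p) * (y - mean_y) for p, y in zip(prediction, phenotype))
    s_pp = sum((p - mean_p) ** 2 for p in prediction)
    s_yy = sum((y - mean_y) ** 2 for y in phenotype)
    return (s_py * s_py) / (s_pp * s_yy)


def relative_r2(method_prediction, reference_prediction, phenotype):
    """Assumed reading: the method's R-squared divided by the reference's."""
    return r_squared(method_prediction, phenotype) / r_squared(
        reference_prediction, phenotype
    )


# Toy data: the method predicts perfectly, the reference swaps two individuals.
phenotype = [1, 2, 3, 4]
ratio = relative_r2([1, 2, 3, 4], [1, 2, 4, 3], phenotype)
```

Under this reading, a value above 1 means the method explains more phenotypic variance than the reference in that population.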
Supplementary information
Supplementary Information
Supplementary Tables 1–11 and Note.
Supplementary Tables
Supplementary Tables 1–11.
Source data
Source Data Fig. 3
Source data for Fig. 3.
Source Data Fig. 4
Source data for Fig. 4.
Source Data Fig. 5
Source data for Fig. 5.
Source Data Fig. 6
Source data for Fig. 6.
Source Data Extended Data Fig. 1
Source data for Extended Data Fig. 1.
About this article
Cite this article
Weissbrod, O., Kanai, M., Shi, H. et al. Leveraging fine-mapping and multipopulation training data to improve cross-population polygenic risk scores. Nat Genet 54, 450–458 (2022). https://doi.org/10.1038/s41588-022-01036-9
DOI: https://doi.org/10.1038/s41588-022-01036-9