Improving polygenic prediction in ancestrally diverse populations

An Author Correction to this article was published on 04 July 2022

This article has been updated

Abstract

Polygenic risk scores (PRS) have attenuated cross-population predictive performance. As existing genome-wide association studies (GWAS) have been conducted predominantly in individuals of European descent, the limited transferability of PRS reduces their clinical value in non-European populations, and may exacerbate healthcare disparities. Recent efforts to level ancestry imbalance in genomic research have expanded the scale of non-European GWAS, although most remain underpowered. Here, we present a new PRS construction method, PRS-CSx, which improves cross-population polygenic prediction by integrating GWAS summary statistics from multiple populations. PRS-CSx couples genetic effects across populations via a shared continuous shrinkage (CS) prior, enabling more accurate effect size estimation by sharing information between summary statistics and leveraging linkage disequilibrium diversity across discovery samples, while inheriting computational efficiency and robustness from PRS-CS. We show that PRS-CSx outperforms alternative methods across traits with a wide range of genetic architectures, cross-population genetic overlaps and discovery GWAS sample sizes in simulations, and improves the prediction of quantitative traits and schizophrenia risk in non-European populations.
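The coupling idea in the abstract can be illustrated with a minimal sketch. It assumes a gamma-gamma continuous shrinkage prior in which each SNP has one local scale `psi` shared by all populations, while the effects themselves are drawn separately for each population. The function name, the hyperparameter values (`phi`, `a`, `b`) and the omission of sample-size and residual-variance scaling are assumptions made for illustration. This is not the PRS-CSx implementation, which operates on GWAS summary statistics and LD reference panels.

```python
import numpy as np

def sample_coupled_effects(n_snps, n_pops, phi=1e-2, a=1.0, b=0.5, seed=0):
    """Draw per-population SNP effects that share one local shrinkage scale per SNP.

    Hypothetical illustration only: hyperparameters are arbitrary choices.
    """
    rng = np.random.default_rng(seed)
    # delta_j ~ Gamma(shape=b, rate=1); psi_j ~ Gamma(shape=a, rate=delta_j)
    delta = rng.gamma(shape=b, scale=1.0, size=n_snps)
    psi = rng.gamma(shape=a, scale=1.0 / delta)
    # The same psi_j scales the effect of SNP j in every population,
    # but each population receives its own independent normal draw.
    sd = np.sqrt(phi * psi)
    beta = rng.normal(0.0, 1.0, size=(n_pops, n_snps)) * sd
    return psi, beta

psi, beta = sample_coupled_effects(n_snps=5000, n_pops=2, seed=1)
```

Under this sketch a SNP that is strongly shrunk in one population is also strongly shrunk in the other, while the effect sizes and signs remain free to differ between populations.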

Fig. 1: Overview of polygenic prediction methods.
Fig. 2: Prediction accuracy of single-discovery and multi-discovery polygenic prediction methods in simulations.
Fig. 3: Relative prediction accuracy for quantitative traits in each target population.
Fig. 4: Prediction accuracy of schizophrenia risk in EAS cohorts.

Data availability

Publicly available data can be obtained from the following sites: 1KG Phase 3 reference panels: https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.html; Genetic map for each subpopulation: ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130507_omni_recombination_rates; UKBB summary statistics: http://www.nealelab.is/uk-biobank (‘GWAS round 2’ was used in this study); BBJ summary statistics were downloaded from PheWeb: https://pheweb.jp; PAGE summary statistics were downloaded from the GWAS Catalog: https://www.ebi.ac.uk/gwas/downloads/summary-statistics; PGC wave 2 schizophrenia GWAS (49 EUR cohorts): https://www.med.unc.edu/pgc/download-results/; leave-one-out schizophrenia EAS summary statistics are available upon request to the Schizophrenia Working Group of the PGC (https://www.med.unc.edu/pgc/pgc-workgroups/schizophrenia/). These leave-one-out summary statistics are under controlled access per the data use limitations imposed by compliance, participant consent and/or national laws. Application to access such data requires a short research proposal that will go through the review and approval process of the PGC. This process takes 2 weeks. Individual-level schizophrenia data of East Asian ancestry are available upon application to the Stanley Global Asia Initiatives: SGAI@broadinstitute.org. These data are under controlled access due to the data use limitations imposed by compliance, participant consent and national laws. Application to access such data requires a short research proposal that will be reviewed by the principal investigator of the constituent study and, if necessary, by the respective ethics committee. The principal investigator review process takes 2 weeks. TWB data used in this study contain protected health information and are thus under controlled access. Application to access such data can be made to the TWB (https://www.twbiobank.org.tw/new_web_en/).
Posterior SNP effect size estimates generated by PRS-CSx for the traits examined in this work: https://github.com/getian107/PRScsx.

Code availability

The code used in this study is available from the following websites: PRS-CSx: https://github.com/getian107/PRScsx (https://doi.org/10.5281/zenodo.5893746); PRS-CS: https://github.com/getian107/PRScs (https://doi.org/10.5281/zenodo.5893748); LDpred2: https://privefl.github.io/bigsnpr/articles/LDpred2; PRSice-2: https://www.prsice.info; HAPGEN2: https://mathgen.stats.ox.ac.uk/genetics_software/hapgen/hapgen2.html; PLINK 1.9: https://www.cog-genomics.org/plink; PLINK 2.0: https://www.cog-genomics.org/plink/2.0/; LD score regression: https://github.com/bulik/ldsc; POPCORN: https://github.com/brielin/Popcorn; Interpolation of genetic maps: https://github.com/joepickrell/1000-genomes-genetic-maps; Population assignment: https://github.com/Annefeng/PBK-QC-pipeline.

Acknowledgements

We thank B. Neale, M. Daly, R. Do and A. Bloemendal for helpful discussions. We thank the Neale Laboratory and BBJ for releasing the genome-wide association summary statistics from UKBB and BBJ. Individual-level phenotypes and genotypes for UKBB samples were obtained under application 32568. We thank the Schizophrenia Working Group of the PGC for providing the GWAS summary statistics for schizophrenia. T.G. is supported by National Institute on Aging (NIA) K99/R00AG054573, National Human Genome Research Institute (NHGRI) U01HG008685 and NHGRI U01HG011723. H.H. acknowledges support from National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) K01DK114379, National Institute of Mental Health (NIMH) U01MH109539, Brain and Behavior Research Foundation Young Investigator Grant (28450), the Zhengxu and Ying He Foundation, and the Stanley Center for Psychiatric Research. L.H. and S.Q. are supported by Shanghai Municipal Science and Technology Major Project (2017SHZDZX01). A.R.M. is supported by NIMH K99/R00MH117229. A.S. is supported by NIMH P50MH094268. Y.A.F. is supported by the ‘National Taiwan University Higher Education Sprout Project (NTU-110L8810)’ within the framework of the Higher Education Sprout Project by the Ministry of Education (MOE) in Taiwan. Y.F.L. is supported by the National Health Research Institutes (NP-109-PP-09), and the Ministry of Science and Technology (109-2314-B-400-017) of Taiwan.

Author information

Contributions

H.H. and T.G. designed the project; T.G. developed the statistical methods and programmed the code for PRS-CSx. Y.R. and T.G. conducted simulation studies. Y.R. and T.G. performed the analysis in the UK Biobank; Y.-F.L. performed the analysis in the Taiwan Biobank. Y.R. performed the analysis in the schizophrenia cohorts. Y.-C.A.F. assigned the UKBB samples into superpopulation groups. C.-Y.C. provided critical suggestions for the study design. M.L. took part in the testing of the code and preprocessed the schizophrenia East Asian cohorts. Z.G., L.H., A.S. and S.Q. contributed to the generation and preprocessing of schizophrenia East Asian data. Y.R., H.H. and T.G. wrote the manuscript; Y.-C.A.F., C.-Y.C. and A.R.M. provided critical revision for the manuscript. All the authors reviewed and approved the final version of the manuscript.

Corresponding authors

Correspondence to Shengying Qin, Hailiang Huang or Tian Ge.

Ethics declarations

Competing interests

C.Y.C. is an employee of Biogen. The other authors declare no competing interests.

Peer review

Peer review information

Nature Genetics thanks Yixuan Ye, Shing Wan Choi and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Prediction accuracy of different polygenic prediction methods across different genetic architectures.

Phenotypes were simulated using 0.1%, 1% or 10% of randomly sampled causal variants (shared across populations), a cross-population genetic correlation of 0.7, and SNP heritability of 50%. PRS were trained using 100 K EUR samples and 20 K non-EUR (EAS or AFR) samples. Numerical results are reported in Supplementary Table 2.
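The simulation design in this caption can be sketched as follows. This is a minimal illustration that assumes standardized genotypes and ignores LD, so that the summed squared effects approximate the SNP heritability; the study itself simulated genotypes with HAPGEN2. The function name and its parameters are hypothetical.

```python
import numpy as np

def simulate_effects(n_snps, causal_frac, rg, h2, seed=0):
    """Draw effects for two populations at a shared set of causal SNPs.

    Assumption: per-SNP effect variance is h2 / n_causal in both populations,
    and the cross-population correlation of causal effects equals rg.
    """
    rng = np.random.default_rng(seed)
    n_causal = max(1, int(round(n_snps * causal_frac)))
    # Causal variants are sampled at random and shared across populations.
    causal = rng.choice(n_snps, size=n_causal, replace=False)
    cov = np.array([[1.0, rg], [rg, 1.0]]) * (h2 / n_causal)
    beta = np.zeros((2, n_snps))
    beta[:, causal] = rng.multivariate_normal(np.zeros(2), cov, size=n_causal).T
    return causal, beta

# 1% causal variants, genetic correlation 0.7, SNP heritability 50%
causal, beta = simulate_effects(n_snps=100_000, causal_frac=0.01, rg=0.7, h2=0.5)
```

Changing `causal_frac` to 0.001 or 0.1 gives the other two polygenicity settings listed in the caption.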

Extended Data Fig. 2 Prediction accuracy of different polygenic prediction methods across different cross-population genetic correlations.

Phenotypes were simulated using 1% of randomly sampled causal variants (shared across populations), a cross-population genetic correlation of 0.4, 0.7 or 1.0, and SNP heritability of 50%. PRS were trained using 100 K EUR samples and 20 K non-EUR (EAS or AFR) samples. Numerical results are reported in Supplementary Table 3.

Extended Data Fig. 3 Prediction accuracy of different polygenic prediction methods across different discovery GWAS sample sizes.

Phenotypes were simulated using 1% of randomly sampled causal variants (shared across populations), a cross-population genetic correlation of 0.7, and SNP heritability of 50%. PRS were trained using 50 K EUR and 10 K non-EUR (EAS or AFR) samples, 100 K EUR and 20 K non-EUR samples, 200 K EUR and 40 K non-EUR samples, or 300 K EUR and 60 K non-EUR samples. Numerical results are reported in Supplementary Table 4.

Extended Data Fig. 4 Prediction accuracy of different polygenic prediction methods across different ratios of EUR vs. non-EUR GWAS sample sizes.

Phenotypes were simulated using 1% of randomly sampled causal variants (shared across populations), a cross-population genetic correlation of 0.7, and SNP heritability of 50%. PRS were trained using 120 K EUR samples without non-EUR samples, 100 K EUR and 20 K non-EUR (EAS or AFR) samples, 80 K EUR and 40 K non-EUR samples, or 60 K EUR and 60 K non-EUR samples. Numerical results are reported in Supplementary Table 5.

Extended Data Fig. 5 Prediction accuracy of different polygenic prediction methods across different SNP heritability.

Phenotypes were simulated using 1% of randomly sampled causal variants (shared across populations) and a cross-population genetic correlation of 0.7. SNP heritability was fixed at 50% in each population, 50% in the EUR population and 25% in the non-EUR population, or 25% in the EUR population and 50% in the non-EUR population. PRS were trained using 100 K EUR samples and 20 K non-EUR (EAS or AFR) samples. Numerical results are reported in Supplementary Table 6.

Extended Data Fig. 6 Prediction accuracy of different polygenic prediction methods across different proportions of shared causal variants between populations.

Phenotypes were simulated using 1% of randomly sampled causal variants. 100%, 70% or 40% of the causal variants were shared across populations. Shared causal variants had a cross-population genetic correlation of 0.7. SNP heritability was fixed at 50%. PRS were trained using 100 K EUR samples and 20 K non-EUR (EAS or AFR) samples. Numerical results are reported in Supplementary Table 7.

Extended Data Fig. 7 Prediction accuracy of different polygenic prediction methods when SNP effect sizes are minor allele frequency (MAF) and linkage disequilibrium (LD) dependent.

Phenotypes were simulated using 1% of randomly sampled causal variants (shared across populations), a cross-population genetic correlation of 0.7, and SNP heritability of 50%. SNP effect sizes were dependent on MAF and LD scores such that SNPs with lower MAF and located in lower LD regions tended to have larger effect sizes. PRS were trained using 100 K EUR samples and 20 K non-EUR (EAS or AFR) samples. Numerical results are reported in Supplementary Table 8.

Extended Data Fig. 8 Relative prediction accuracy for quantitative traits across target populations.

Relative prediction performance for single-discovery and multi-discovery PRS construction methods using discovery GWAS summary statistics a, from UKBB and BBJ, across 33 traits, in different UKBB target populations (EUR, EAS and AFR); b, from UKBB and BBJ, across 21 traits, in the Taiwan Biobank (TWB); c, from UKBB, BBJ and PAGE, across 14 traits, in different UKBB target populations (EUR, EAS and AFR). Each data point shows the relative increase of prediction performance, defined as $R^2 / R^2_{\text{PRS-CS (UKBB)-EUR}} - 1$, in which $R^2_{\text{PRS-CS (UKBB)-EUR}}$ is the $R^2$ of the trait in the EUR population using PRS-CS trained on the UKBB GWAS summary statistics. In UKBB target populations (panels a and c), $R^2$ were averaged across 100 random splits of the target samples into validation and testing datasets. The crossbar indicates the median of the relative increase of predictive performance across the traits examined. ‘median N’ indicates the median sample size across the respective discovery GWAS.
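The relative increase defined in this caption is simple arithmetic, and a short sketch makes the sign convention explicit. The function name and the numbers below are made up for illustration and are not results from the study.

```python
from statistics import median

def relative_increase(r2, r2_reference):
    """Return r2 / r2_reference - 1, the relative change against a reference R^2."""
    if r2_reference <= 0:
        raise ValueError("reference R^2 must be positive")
    return r2 / r2_reference - 1.0

# Hypothetical (method R^2, reference R^2) pairs for three traits.
pairs = [(0.06, 0.10), (0.05, 0.10), (0.12, 0.10)]
increases = [relative_increase(r2, ref) for r2, ref in pairs]
# The crossbar in the figure corresponds to the median across traits.
crossbar = median(increases)
```

A negative value means the method explains less variance than the reference, and a value of 0 means parity.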

Extended Data Fig. 9 Trace plots and autocorrelation functions (ACFs) for assessing the convergence and mixing of the Gibbs sampler used in PRS-CSx.

Left panels: Trace plots, after discarding the burn-in iterations and thinning the Markov chain by a factor of 5, for the posterior effects of rs7412 on low-density lipoprotein cholesterol when integrating UKBB, BBJ and PAGE GWAS summary statistics using PRS-CSx. Right panels: The autocorrelation functions (ACFs) for the traces shown on the left.
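The diagnostics described in this caption can be reproduced in outline as follows. This is a minimal sketch that uses a synthetic AR(1) series as a stand-in for posterior samples of one SNP effect; the helper names, the burn-in length and the AR(1) coefficient are assumptions, and only the thinning factor of 5 comes from the caption.

```python
import numpy as np

def thin_chain(samples, burn_in, thin=5):
    """Discard the first burn_in draws, then keep every thin-th draw."""
    return np.asarray(samples, dtype=float)[burn_in::thin]

def autocorrelation(chain, max_lag=20):
    """Sample autocorrelation function for lags 0..max_lag."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    n = x.size
    denom = np.dot(x, x)
    return np.array([np.dot(x[: n - k], x[k:]) / denom for k in range(max_lag + 1)])

# Synthetic, strongly autocorrelated chain (stand-in for Gibbs sampler output).
rng = np.random.default_rng(0)
raw = np.zeros(20000)
for t in range(1, raw.size):
    raw[t] = 0.9 * raw[t - 1] + rng.normal()

kept = thin_chain(raw, burn_in=1000, thin=5)
acf_raw = autocorrelation(raw[1000:])
acf_kept = autocorrelation(kept)
```

Plotting `kept` against its index gives a trace plot, and plotting `acf_kept` against lag gives the ACF. An ACF that decays quickly toward zero is consistent with good mixing.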

Supplementary information

Supplementary Information

Supplementary Methods, Discussions and Tables 1–18.

Reporting Summary

Peer Review File

Supplementary Table 1

18 Supplementary Tables

About this article

Cite this article

Ruan, Y., Lin, YF., Feng, YC.A. et al. Improving polygenic prediction in ancestrally diverse populations. Nat Genet 54, 573–580 (2022). https://doi.org/10.1038/s41588-022-01054-7

  • DOI: https://doi.org/10.1038/s41588-022-01054-7
