Abstract
Polygenic indexes (PGIs) are DNA-based predictors. Their value for research in many scientific disciplines is growing rapidly. As a resource for researchers, we used a consistent methodology to construct PGIs for 47 phenotypes in 11 datasets. To maximize the PGIs’ prediction accuracies, we constructed them using genome-wide association studies—some not previously published—from multiple data sources, including 23andMe and UK Biobank. We present a theoretical framework to help interpret analyses involving PGIs. A key insight is that a PGI can be understood as an unbiased but noisy measure of a latent variable we call the ‘additive SNP factor’. Regressions in which the true regressor is this factor but the PGI is used as its proxy therefore suffer from errors-in-variables bias. We derive an estimator that corrects for the bias, illustrate the correction, and make a Python tool for implementing it publicly available.
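The errors-in-variables bias described above can be illustrated with a small simulation. This is only a sketch of the classical attenuation argument, not the estimator derived in the paper or the interface of the authors' Python tool. The function names (simulate, ols_slope), the parameter values, and the assumption that the proxy's reliability is known exactly are all hypothetical choices made for illustration.

```python
import numpy as np


def simulate(n=200_000, beta=0.5, noise_sd=1.0, seed=0):
    """Draw a latent factor, a noisy proxy for it, and an outcome.

    All names and values here are illustrative assumptions, not taken
    from the paper.
    """
    rng = np.random.default_rng(seed)
    factor = rng.standard_normal(n)                    # latent true regressor
    pgi = factor + noise_sd * rng.standard_normal(n)   # unbiased but noisy proxy
    outcome = beta * factor + rng.standard_normal(n)   # outcome depends on the factor
    return factor, pgi, outcome


def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x (with intercept)."""
    x_centered = x - x.mean()
    return float(x_centered @ (y - y.mean()) / (x_centered @ x_centered))


factor, pgi, outcome = simulate()

# Regressing on the proxy attenuates the slope towards zero.
naive = ols_slope(pgi, outcome)

# Reliability = var(factor) / var(proxy). It is known here only because
# the data were simulated with var(factor) = 1 and noise variance = 1.
reliability = 1.0 / (1.0 + 1.0**2)

# Classical disattenuation: divide the naive slope by the reliability.
corrected = naive / reliability

print(f"naive slope:     {naive:.3f}")
print(f"corrected slope: {corrected:.3f}")
```

In this setup the naive slope is pulled towards half of the true coefficient, and dividing by the reliability moves it back towards the true value. In real data the reliability is not known and has to be estimated, which is the part this sketch leaves out.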
Data availability
For how to access the repository PGIs and other data from each participating dataset, see Supplementary Note; an up-to-date list of participating datasets and data access procedures is maintained at https://www.thessgac.org/pgi-repository. For each phenotype that we analyse, we report GWAS and MTAG summary statistics and PGI (LDpred) weights for all SNPs from the largest discovery sample for that analysis, unless the sample includes 23andMe. SNP-level summary statistics from analyses based entirely or in part on 23andMe data can only be reported for up to 10,000 SNPs. Therefore, if the largest GWAS or MTAG analysis for a phenotype includes 23andMe, we report summary statistics for only the genome-wide significant SNPs from that analysis. In addition, we report summary statistics for all SNPs from a version of the largest GWAS analysis that excludes 23andMe. Finally, we also report summary statistics and PGI (LDpred) weights on which the ‘public PGIs’ are based. These summary statistics and PGI weights can be downloaded from https://www.thessgac.org/pgi-repository. The data underlying Fig. 1 are also available at https://www.thessgac.org/pgi-repository. Researchers at non-profit institutions can obtain access to the genome-wide summary statistics from 23andMe used in this paper by completing the 23andMe publication dataset access request form, available at https://research.23andme.com/dataset-access/. Source data are provided with this paper.
Code availability
The software used for the measurement-error correction is available at https://github.com/JonJala/pgi_correct. The code for constructing PGIs and principal components, the code for the illustrative application and the code for analysing the data displayed in Fig. 1 are at https://www.thessgac.org/pgi-repository.
Acknowledgements
The authors thank C. Shulman for helpful comments. This research was carried out under the auspices of the SSGAC. This research was conducted using the UKB resource under application number 11425. J.B. was supported by the Pershing Square Fund of the Foundations of Human Behavior, awarded to D.L.; H.J., M.B., D.C. and P.T. by the Ragnar Söderberg Foundation (E42/15), to D.C.; C.A.P.B., P.K. and A.O. by an ERC Consolidator Grant (647648 EdGe), to P.K.; H.J., M.B., A.Y., J.P.B., M.N.M., D.C., D.J.B. and P.T. by Open Philanthropy (010623-00001), to D.J.B.; C.A.P.B., R.A. and S.O. by Riksbankens Jubileumsfond (P18-0782:1), to S.O.; C.A.P.B. and S.O. by the Swedish Research Council (2019-00244), to S.O.; G.G., N.W. and D.J.B. by the NIA/NIH (R24-AG065184 and R01-AG042568), to D.J.B.; D.J.B. by the NIA/NIH (R56-AG058726), to T. Galama; T.T.M., K.P.H., and E.M.T.-D. by the NIH/NICHD R01-HD083613 to E.M.T.-D. and R01-HD092548 to K.P.H.; P.T. by the NIA/NIMH (R01-MH101244-02 and U01-MH109539-02), to B. Neale. The study was also supported by the NIA/NIH (K99-AG062787-01, P.T.); Netherlands Organisation for Scientific Research VENI (016.Veni.198.058, A.O.); the Swedish Research Council (421-2013-1061, M.J.); the Government of Canada through Genome Canada and the Ontario Genomics Institute (OGI-152, J.P.B.); the Social Sciences and Humanities Research Council of Canada (J.P.B.); the European Union (MP1GI18418R, T.E.); the Estonian Research Council (PRG1291, T.E.); the National Health and Medical Research Council (GNT113400, P.M.V.); and the Australian Research Council (P.M.V.).
The authors thank the following consortia for sharing GWAS summary statistics: Reproductive Genetics (ReproGen) Consortium for age at first menses; Genetics of Personality Consortium (GPC) for neuroticism, extraversion and openness; Psychiatric Genomics Consortium (PGC) for ADHD and depressive symptoms; Tobacco and Alcohol Genetics (TAG) Consortium for cigarettes per day and ever smoker; International Genomics of Alzheimer’s Project (IGAP) for Alzheimer’s disease; GWAS & Sequencing Consortium of Alcohol and Nicotine Use (GSCAN) for cigarettes per day, ever smoker and drinks per week; Genetic Investigation of Anthropometric Traits (GIANT) Consortium for height and body mass index; and Cognitive Genomics (COGENT) Consortium for cognitive performance. The authors thank the Neale Lab for making UKB GWAS results available for asthma, cannabis use, COPD, hayfever, life satisfaction (family, finance, friend and work), loneliness, migraine, nearsightedness, number ever born (men, women), religious attendance, self-rated health and subjective well-being. The authors thank the research participants and employees of 23andMe for making this work possible. A full list of acknowledgements is provided in the Supplementary Note.
Author information
Contributions
D.J.B., D.C., A.O., and P.T. designed and oversaw the study. A.O. supervised all analyses and led the writing of the manuscript. J.B. was the lead analyst, responsible for the GWAS and MTAG analyses, quality control of GWAS summary statistics and the PGI validation analyses. C.A.P.B. was responsible for quality control of genotype data and the construction of PGIs. G.G., N.W., H.J. and M.B. assisted with analyses. G.G. conducted the illustrative application and wrote the Python code. N.W. designed and implemented the algorithm used to generate Fig. 1. R.K.L. ran a meta-analysis of general risk tolerance omitting validation datasets. P.T. derived the measurement-error-correction estimator. A.K., D.A.H. and the 23andMe Research Group conducted genome-wide association analyses for 23andMe. The following authors shared genotype data to enable dataset participation in the Repository: K.M.H. for Add Health; D.W.B., A.C., D.L.C., T.E.M., R.P., K.S. and B.S.W. for Dunedin and E-Risk; A.S. and O.A. for ELSA; L.M. and T.E. for ECGUT; W.G.I. and M.M. for MCTFR; R.A. and P.K.E.M. for STR; T.T.M., K.P.H. and E.M.T.-D. for TTP; and P.H. and J.F. for WLS. More details about dataset-level contributions are given in the Supplementary Note. D.W.B. conducted PGI validation analyses in Dunedin and E-Risk. R.A., A.Y., J.P.B., P.K., S.O., M.J., P.M.V., M.N.M., and D.L. contributed to study design. All authors contributed to and critically reviewed the manuscript. D.J.B., A.O., D.C. and P.T. made especially major contributions to the writing and editing.
Ethics declarations
Competing interests
D.A.H., A.K. and members of the 23andMe Research Team are current or former employees of 23andMe, Inc. and hold stock or stock options in 23andMe. The authors declare no other competing interests.
Additional information
Peer review information: Nature Human Behaviour thanks the anonymous reviewers for their contribution to the peer review of this work.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Methods, Supplementary Note and Supplementary Fig. 1.
Supplementary data 1
Source data for Becker et al. Supplementary Fig. 1 (predictive power of repository multi-trait PGIs).
Supplementary Table 1
Supplementary Tables 1–13 for Becker et al.
Source data
Source Data Fig. 1
Statistical source data.
Source Data Fig. 3
Statistical source data.
About this article
Cite this article
Becker, J., Burik, C.A.P., Goldman, G. et al. Resource profile and user guide of the Polygenic Index Repository. Nat Hum Behav 5, 1744–1758 (2021). https://doi.org/10.1038/s41562-021-01119-3
DOI: https://doi.org/10.1038/s41562-021-01119-3