Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations



A key public health need is to identify individuals at high risk for a given disease to enable enhanced screening or preventive therapies. Because most common diseases have a genetic component, one important approach is to stratify individuals based on inherited DNA variation [1]. Proposed clinical applications have largely focused on finding carriers of rare monogenic mutations at several-fold increased risk. Although most disease risk is polygenic in nature [2–5], it has not yet been possible to use polygenic predictors to identify individuals at risk comparable to monogenic mutations. Here, we develop and validate genome-wide polygenic scores for five common diseases. The approach identifies 8.0, 6.1, 3.5, 3.2, and 1.5% of the population at greater than threefold increased risk for coronary artery disease, atrial fibrillation, type 2 diabetes, inflammatory bowel disease, and breast cancer, respectively. For coronary artery disease, this prevalence is 20-fold higher than the carrier frequency of rare monogenic mutations conferring comparable risk [6]. We propose that it is time to contemplate the inclusion of polygenic risk prediction in clinical care, and discuss relevant issues.


Fig. 1: Study design and workflow.
Fig. 2: Risk for CAD according to GPS.
Fig. 3: Risk gradient for disease according to the GPS percentile.


  1. Green, E. D. & Guyer, M. S., National Human Genome Research Institute. Charting a course for genomic medicine from base pairs to bedside. Nature 470, 204–213 (2011).

  2. Fisher, R. A. The correlation between relatives on the supposition of Mendelian inheritance. Trans. R. Soc. Edinb. 52, 399–433 (1918).

  3. Gibson, G. Rare and common variants: twenty arguments. Nat. Rev. Genet. 13, 135–145 (2012).

  4. Golan, D., Lander, E. S. & Rosset, S. Measuring missing heritability: inferring the contribution of common variants. Proc. Natl Acad. Sci. USA 111, E5272–E5281 (2014).

  5. Fuchsberger, C. et al. The genetic architecture of type 2 diabetes. Nature 536, 41–47 (2016).

  6. Abul-Husn, N. S. et al. Genetic identification of familial hypercholesterolemia within a single U.S. health care system. Science 354, pii: aaf7000 (2016).

  7. Nordestgaard, B. G. et al. Familial hypercholesterolaemia is underdiagnosed and undertreated in the general population: guidance for clinicians to prevent coronary heart disease: consensus statement of the European Atherosclerosis Society. Eur. Heart J. 34, 3478–3490a (2013).

  8. Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).

  9. Estrada, K. et al. Association of a low-frequency variant in HNF1A with type 2 diabetes in a Latino population. JAMA 311, 2305–2314 (2014).

  10. Chatterjee, N. et al. Projecting the performance of risk prediction based on polygenic analyses of genome-wide association studies. Nat. Genet. 45, 400–405 (2013).

  11. Zhang, Y. et al. Estimation of complex effect-size distributions using summary-level statistics from genome-wide association studies across 32 complex traits and implications for the future. Preprint at (2017).

  12. Ripatti, S. et al. A multilocus genetic risk score for coronary heart disease: case-control and prospective cohort analyses. Lancet 376, 1393–1400 (2010).

  13. Vilhjálmsson, B. J. et al. Modeling linkage disequilibrium increases accuracy of polygenic scores. Am. J. Hum. Genet. 97, 576–592 (2015).

  14. Sudlow, C. et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015).

  15. Bycroft, C. et al. Genome-wide genetic data on ~500,000 UK Biobank participants. Preprint at (2017).

  16. Nikpay, M. et al. A comprehensive 1,000 Genomes-based genome-wide association meta-analysis of coronary artery disease. Nat. Genet. 47, 1121–1130 (2015).

  17. Tada, H. et al. Risk prediction by genetic risk scores for coronary heart disease is independent of self-reported family history. Eur. Heart J. 37, 561–567 (2016).

  18. Abraham, G. et al. Genomic prediction of coronary heart disease. Eur. Heart J. 37, 3267–3278 (2016).

  19. Khera, A. V. et al. Genetic risk, adherence to a healthy lifestyle, and coronary disease. N. Engl. J. Med. 375, 2349–2358 (2016).

  20. Mega, J. L. et al. Genetic risk, coronary heart disease events, and the clinical benefit of statin therapy: an analysis of primary and secondary prevention trials. Lancet 385, 2264–2271 (2015).

  21. Natarajan, P. et al. Polygenic risk score identifies subgroup with higher burden of atherosclerosis and greater relative benefit from statin therapy in the primary prevention setting. Circulation 135, 2091–2101 (2017).

  22. January, C. T. et al. 2014 AHA/ACC/HRS guideline for the management of patients with atrial fibrillation: a report of the American College of Cardiology/American Heart Association Task Force on practice guidelines and the Heart Rhythm Society. Circulation 130, e199–e267 (2014).

  23. GBD 2015 Disease and Injury Incidence and Prevalence Collaborators. Global, regional, and national incidence, prevalence, and years lived with disability for 310 diseases and injuries, 1990–2015: a systematic analysis for the Global Burden of Disease Study 2015. Lancet 388, 1545–1602 (2016).

  24. Knowler, W. C. et al. Reduction in the incidence of type 2 diabetes with lifestyle intervention or metformin. N. Engl. J. Med. 346, 393–403 (2002).

  25. Abraham, C. & Cho, J. H. Inflammatory bowel disease. N. Engl. J. Med. 361, 2066–2078 (2009).

  26. Pharoah, P. D., Antoniou, A. C., Easton, D. F. & Ponder, B. A. Polygenes, risk prediction, and targeted prevention of breast cancer. N. Engl. J. Med. 358, 2796–2803 (2008).

  27. Fry, A. et al. Comparison of sociodemographic and health-related characteristics of UK Biobank participants with those of the general population. Am. J. Epidemiol. 186, 1026–1034 (2017).

  28. Khera, A. V. & Kathiresan, S. Is coronary atherosclerosis one disease or many? Setting realistic expectations for precision medicine. Circulation 135, 1005–1007 (2017).

  29. Martin, A. R. et al. Human demographic history impacts genetic risk prediction across diverse populations. Am. J. Hum. Genet. 100, 635–649 (2017).

  30. Christophersen, I. E. et al. Large-scale analyses of common and rare variants identify 12 new loci associated with atrial fibrillation. Nat. Genet. 49, 946–952 (2017).

  31. Scott, R. A. et al. An expanded genome-wide association study of type 2 diabetes in Europeans. Diabetes 66, 2888–2902 (2017).

  32. Liu, J. Z. et al. Association analyses identify 38 susceptibility loci for inflammatory bowel disease and highlight shared genetic risk across populations. Nat. Genet. 47, 979–986 (2015).

  33. Michailidou, K. et al. Association analysis identifies 65 new breast cancer risk loci. Nature 551, 92–94 (2017).

  34. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).

  35. Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience 4, 7 (2015).

  36. Ganna, A. et al. Ultra-rare disruptive and damaging mutations influence educational attainment in the general population. Nat. Neurosci. 19, 1563–1565 (2016).



UK Biobank analyses were conducted via application 7089 using a protocol approved by the Partners HealthCare Institutional Review Board. The analysis was supported by a KL2/Catalyst Medical Research Investigator Training award from Harvard Catalyst funded by the National Institutes of Health (TR001100 to A.V.K.), a Junior Faculty Research Award from the National Lipid Association (to A.V.K.), the National Heart, Lung, and Blood Institute of the US National Institutes of Health under award numbers T32 HL007208 (to K.G.A.), K23HL114724 (to S.A.L.), R01HL139731 (to S.A.L.), R01HL092577 (to P.T.E.), R01HL128914 (to P.T.E.), K24HL105780 (to P.T.E.), and R01 HL127564 (to S.K.), the National Human Genome Research Institute of the US National Institutes of Health under award number 5UM1HG008895 (to E.S.L. and S.K.), the Doris Duke Charitable Foundation under award number 2014105 (to S.A.L.), the Foundation Leducq under award number 14CVD01 (to P.T.E.), and the Ofer and Shelly Nemirovsky Research Scholar Award from Massachusetts General Hospital (to S.K.). The authors thank D. Altshuler (Vertex Pharmaceuticals, Boston, MA) for comments on an earlier version of this manuscript.

Author information

A.V.K., M.C., and S.K. conceived and designed the study. A.V.K., M.C., K.G.A., M.E.H., C.R., S.H.C., and S.A.L. acquired, analyzed, and interpreted the data. A.V.K., M.C., E.S.L., and S.K. drafted the manuscript. A.V.K., M.C., P.N., E.S.L., P.T.E., and S.K. critically revised the manuscript for important intellectual content.

Correspondence to Sekar Kathiresan.

Ethics declarations

Competing interests

A.V.K. and S.K. are listed as co-inventors on a patent application for the use of genetic risk scores to determine risk and guide therapy. S.K. and P.T.E. are supported by a grant from Bayer AG to the Broad Institute focused on the genetics and therapeutics of myocardial infarction and atrial fibrillation.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Risk gradient for coronary artery disease across the distribution of the genome-wide polygenic score and two previously published scores.

a–c, Three polygenic scores for coronary artery disease were calculated within the UK Biobank testing dataset of 288,978 participants: a previously published score comprising 50 variants that had achieved genome-wide levels of statistical significance in previous studies (Eur. Heart J. 37, 561–567, 2016) (a); a previously published score comprising 49,310 variants derived from a Metabochip GWAS (Eur. Heart J. 37, 3267–3278, 2016) (b); and the newly derived genome-wide polygenic score comprising 6,630,150 variants (c). For each score, the population was divided into 100 bins according to percentile of the score, and the prevalence of coronary artery disease within each bin was plotted. The prevalence of coronary artery disease across score percentiles ranged from 1.4% to 5.9% for the 50-variant score, 1.0% to 7.2% for the 49,310-variant score, and 0.8% to 11.1% for the 6,630,150-variant genome-wide polygenic score.
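The percentile binning described in this legend can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' analysis code: the function name `prevalence_by_percentile`, the rank-based assignment of individuals to near-equal-sized bins, and the synthetic scores and case labels are all hypothetical choices made for the example.

```python
import random


def prevalence_by_percentile(scores, is_case, n_bins=100):
    """Observed disease prevalence in each score-percentile bin (lowest bin first)."""
    if len(scores) != len(is_case):
        raise ValueError("scores and is_case must have the same length")
    n = len(scores)
    # Rank individuals from lowest to highest score.
    order = sorted(range(n), key=lambda i: scores[i])
    cases = [0] * n_bins
    totals = [0] * n_bins
    for rank, i in enumerate(order):
        # Assumption: bins are formed by rank, giving near-equal bin sizes.
        b = rank * n_bins // n
        totals[b] += 1
        cases[b] += 1 if is_case[i] else 0
    # Prevalence is the fraction of cases in each bin; empty bins yield NaN.
    return [c / t if t else float("nan") for c, t in zip(cases, totals)]


# Synthetic illustration only: these are not UK Biobank data.
rng = random.Random(0)
demo_scores = [rng.gauss(0.0, 1.0) for _ in range(20000)]
# Hypothetical risk model in which disease probability rises with the score.
demo_cases = [rng.random() < 0.02 * 2.0 ** s for s in demo_scores]
demo_prev = prevalence_by_percentile(demo_scores, demo_cases)
print("lowest-bin prevalence:", demo_prev[0])
print("highest-bin prevalence:", demo_prev[-1])
```

Plotting the returned list against bin index 1 to 100 gives a risk-gradient curve of the kind the legend describes.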

Supplementary Figure 2 Predicted versus observed prevalence of coronary artery disease according to genome-wide polygenic score percentile.

For each individual within the UK Biobank testing dataset, the predicted probability of disease was calculated using a logistic regression model with only the genome-wide polygenic score (GPS) as a predictor. The predicted prevalence of disease within each percentile bin of the GPS distribution was calculated as the average predicted probability of all individuals within that bin. The shape of the predicted risk gradient was consistent with the empirically observed risk gradient, reflected by black and blue dots, respectively.
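The predicted-prevalence calculation described above can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the Newton-Raphson fitting routine, the function names `fit_logistic` and `predicted_prevalence_by_percentile`, the rank-based binning, and the synthetic data are all assumptions introduced for the example. In practice a statistics package would normally fit the model.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(x, y, n_iter=25, tol=1e-10):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by Newton-Raphson; returns (b0, b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p          # gradient of the log-likelihood wrt b0
            g1 += (yi - p) * xi   # gradient of the log-likelihood wrt b1
            h00 += w              # entries of the information matrix X'WX
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det == 0.0:
            break
        # Newton step: solve the 2x2 system (X'WX) d = g.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


def predicted_prevalence_by_percentile(scores, b0, b1, n_bins=100):
    """Average predicted probability within each score-percentile bin."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for rank, i in enumerate(order):
        b = rank * n_bins // n  # assumption: rank-based, near-equal bins
        sums[b] += sigmoid(b0 + b1 * scores[i])
        counts[b] += 1
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]


# Synthetic illustration only: these are not UK Biobank data.
rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(2000)]
y = [1 if rng.random() < sigmoid(-2.0 + 0.8 * xi) else 0 for xi in x]
b0, b1 = fit_logistic(x, y)
pred = predicted_prevalence_by_percentile(x, b0, b1)
print("fitted coefficients:", b0, b1)
```

Comparing `pred` bin by bin with the observed prevalence in the same bins gives the predicted-versus-observed comparison that the figure reports.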

Supplementary Figure 3 Predicted versus observed prevalence of four diseases according to genome-wide polygenic score percentile.

a–d, For each individual within the UK Biobank testing dataset, the predicted probability of disease was calculated using a logistic regression model with only the genome-wide polygenic score (GPS) as a predictor. The predicted prevalence of disease within each percentile bin of the GPS distribution was calculated as the average predicted probability of all individuals within that bin. The shape of the predicted risk gradient was consistent with the empirically observed risk gradient, reflected by black and blue dots, respectively, for each of four diseases: atrial fibrillation (a), type 2 diabetes (b), inflammatory bowel disease (c), and breast cancer (d). Breast cancer analysis was restricted to female participants.

Supplementary Information

Supplementary Text and Figures

Supplementary Figures 1–3 and Supplementary Tables 1–10

Reporting Summary
