Evaluating and improving heritability models using summary statistics

Speed, Doug; Holmes, John; Balding, David J.

doi:10.1038/s41588-020-0600-y

Analysis
Published: 23 March 2020

Nature Genetics volume 52, pages 458–462 (2020)

Abstract

There is currently much debate regarding the best model for how heritability varies across the genome. The authors of GCTA recommend the GCTA-LDMS-I model, the authors of LD Score Regression recommend the Baseline LD model, and we have recommended the LDAK model. Here we provide a statistical framework for assessing heritability models using summary statistics from genome-wide association studies. Based on 31 studies of complex human traits (average sample size 136,000), we show that the Baseline LD model is more realistic than other existing heritability models, but that it can be improved by incorporating features from the LDAK model. Our framework also provides a method for estimating the selection-related parameter α from summary statistics. We find strong evidence (P < 1 × 10⁻⁶) of negative genome-wide selection for traits, including height, systolic blood pressure and college education, and that the impact of selection is stronger inside functional categories, such as coding SNPs and promoter regions.

**Fig. 1: Genetic architecture estimates from different heritability models.**

Data availability

We performed the UKBb GWAS using data applied for and downloaded via the UK Biobank website (www.ukbiobank.ac.uk). We obtained summary statistics for the 17 public GWAS from the websites of the corresponding studies. We downloaded the 1000 Genomes Project data from the LDSC website (www.github.com/bulik/ldsc).

References

1. Speed, D., Cai, N., Johnson, M. R., Nejentsev, S. & Balding, D. J. Reevaluation of SNP heritability in complex human traits. Nat. Genet. 49, 986–992 (2017).
2. Yang, J., Lee, S. H., Goddard, M. E. & Visscher, P. M. GCTA: a tool for genome-wide complex trait analysis. Am. J. Hum. Genet. 88, 76–82 (2011).
3. Bulik-Sullivan, B. et al. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 47, 291–295 (2015).
4. Speed, D. & Balding, D. Better estimation of SNP heritability from summary statistics provides a new understanding of the genetic architecture of complex traits. Nat. Genet. 51, 277–284 (2019).
5. Yang, J. et al. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 42, 565–569 (2010).
6. Speed, D., Hemani, G., Johnson, M. & Balding, D. Improved heritability estimation from genome-wide SNP data. Am. J. Hum. Genet. 91, 1011–1021 (2012).
7. Yang, J. et al. Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index. Nat. Genet. 47, 1114–1120 (2015).
8. Evans, L. M. et al. Comparison of methods that use whole genome data to estimate the heritability and genetic architecture of complex traits. Nat. Genet. 50, 737–745 (2018).
9. Finucane, H. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228–1235 (2015).
10. Gazal, S. et al. Linkage disequilibrium-dependent architecture of human complex traits shows action of negative selection. Nat. Genet. 49, 1421–1427 (2017).
11. Corbeil, R. R. & Searle, S. R. Restricted maximum likelihood (REML) estimation of variance components in the mixed model. Technometrics 18, 31–38 (1976).
12. Gusev, A. et al. Partitioning heritability of regulatory and cell-type-specific variants across 11 common diseases. Am. J. Hum. Genet. 95, 535–552 (2014).
13. Sudlow, C. et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015).
14. Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
15. Pasaniuc, B. & Price, A. L. Dissecting the genetics of complex traits using summary association statistics. Nat. Rev. Genet. 18, 117–127 (2017).
16. The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073 (2010).
17. Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control 19, 716–723 (1974).
18. Zeng, J. et al. Signatures of negative selection in the genetic architecture of human complex traits. Nat. Genet. 50, 746–753 (2018).
19. Yang, J., Zeng, J., Goddard, M. E., Wray, N. R. & Visscher, P. M. Concepts, estimation and interpretation of SNP-based heritability. Nat. Genet. 49, 1304–1310 (2017).
20. Gazal, S., Marquez-Luna, C., Finucane, H. K. & Price, A. L. Reconciling S-LDSC and LDAK models and functional enrichment estimates. Nat. Genet. 51, 1202–1204 (2019).
21. Hou, K. et al. Accurate estimation of SNP-heritability from biobank-scale data irrespective of genetic architecture. Nat. Genet. 51, 1244–1251 (2019).
22. Zheng, J. et al. LD Hub: a centralized database and web interface to perform LD score regression that maximizes the potential of summary level GWAS data for SNP heritability and genetic correlation analysis. Bioinformatics 33, 272–279 (2016).
23. Ypma, T. Historical development of the Newton–Raphson method. SIAM Rev. 37, 531–551 (1995).
24. Efron, B. & Stein, C. The jackknife estimate of variance. Ann. Stat. 9, 586–596 (1981).
25. Speed, D. et al. Describing the genetic architecture of epilepsy through heritability analysis. Brain 137, 2680–2689 (2014).
26. Schunkert, H. et al. Large-scale association analysis identifies 13 new susceptibility loci for coronary artery disease. Nat. Genet. 43, 333–338 (2011).
27. Liu, J. et al. Association analyses identify 38 susceptibility loci for inflammatory bowel disease and highlight shared genetic risk across populations. Nat. Genet. 47, 979–986 (2015).
28. The Tobacco and Genetics Consortium. Genome-wide meta-analyses identify multiple loci associated with smoking behavior. Nat. Genet. 42, 441–447 (2010).
29. Okada, Y. et al. Genetics of rheumatoid arthritis contributes to biology and drug discovery. Nature 506, 376–381 (2014).
30. Schizophrenia Working Group of the Psychiatric Genomics Consortium. Biological insights from 108 schizophrenia-associated genetic loci. Nature 511, 421–427 (2014).
31. Scott, R. et al. An expanded genome-wide association study of type 2 diabetes in Europeans. Diabetes 66, 2888–2902 (2017).
32. Zheng, H. et al. Whole-genome sequencing identifies EN1 as a determinant of bone density and fracture. Nature 526, 112–117 (2015).
33. Locke, A. et al. Genetic studies of body mass index yield new insights for obesity biology. Nature 518, 197–206 (2015).
34. Okbay, A. et al. Genetic variants associated with subjective well-being, depressive symptoms, and neuroticism identified through genome-wide analyses. Nat. Genet. 48, 626–633 (2016).
35. Wood, A. et al. Defining the role of common variation in the genomic and biological architecture of adult human height. Nat. Genet. 46, 1173–1186 (2014).
36. Perry, J. et al. Parent-of-origin-specific allelic associations among 106 genomic loci for age at menarche. Nature 514, 92–97 (2014).
37. Day, F. et al. Large-scale genomic analyses link reproductive aging to hypothalamic signaling, breast cancer susceptibility and BRCA1-mediated DNA repair. Nat. Genet. 47, 1294–1303 (2015).
38. Shungin, D. et al. New genetic loci link adipose and insulin biology to body fat distribution. Nature 518, 187–196 (2015).
39. Okbay, A. et al. Genome-wide association study identifies 74 loci associated with educational attainment. Nature 533, 539–542 (2016).
40. Bulik-Sullivan, B. et al. An atlas of genetic correlations across human diseases and traits. Nat. Genet. 47, 1236–1241 (2015).
41. The International HapMap 3 Consortium. Integrating common and rare genetic variation in diverse human populations. Nature 467, 52–58 (2010).

Acknowledgements

We thank B. Shaban for help with the LDAK website, and A. Price, S. Gazal and H. Finucane for helpful discussions. D.S. is funded by the European Union’s Horizon 2020 Research and Innovation Programme under the Marie Skłodowska-Curie grant agreement no. 754513, by Aarhus University Research Foundation and by the Independent Research Fund Denmark under project no. 7025-00094B. D.S. and D.J.B. are funded by the Australian Research Council under project no. DP190103188.

Author information

Authors and Affiliations

Aarhus Institute of Advanced Studies, Aarhus University, Aarhus, Denmark
Doug Speed
Bioinformatics Research Centre, Aarhus University, Aarhus, Denmark
Doug Speed
UCL Genetics Institute, University College London, London, UK
Doug Speed & David J. Balding
Melbourne Integrative Genomics, School of BioSciences and School of Mathematics & Statistics, University of Melbourne, Parkville, Victoria, Australia
John Holmes & David J. Balding

Contributions

D.S. and J.H. performed the analyses. D.S. and D.J.B. wrote the manuscript.

Corresponding author

Correspondence to Doug Speed.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Comparison of likelihoods.

Plots compare likelihood ratio test (LRT) statistics (twice the improvement in log likelihood relative to the null model) computed using the likelihood from restricted maximum likelihood (REML) with those from logl_SS, our new approximate likelihood, and logl_Old, the approximate likelihood we reported in the original version of SumHer (see Supplementary Note for details). We analyze only the 14 UKBb GWAS, because REML requires individual-level data, and we consider only the GCTA, LDAK and LDAK-Thin Models, because REML is feasible only for simple heritability models. To ensure a fair comparison, when running SumHer we restrict the reference panel to the 4.7M GWAS SNPs. The bottom plots are zoomed versions of the top plots (obtained by excluding height, the most heritable trait). We see that the LRT statistics from logl_SS are highly concordant with those from REML, indicating that the weights used when calculating logl_SS perform well. We observe lower concordance between the LRT statistics from logl_Old and those from REML, reflecting that logl_Old was based on the assumption that test statistics were Gaussian distributed, rather than Gamma distributed.

Extended Data Fig. 2 Estimated proportions of SNP heritability.

This is an expanded version of Fig. 1d, and shows that estimates of functional enrichments tend to converge as the heritability model becomes more complex. Plots report the estimated proportion of SNP heritability contributed by each category of SNPs, averaged across either the 14 UKBb or 17 Public GWAS (vertical segments indicate 95% confidence intervals). Bars indicate the heritability model used and are ordered by number of parameters (see Supplementary Table 13 for definitions): GCTA + 1Fun Model (two parameters, used by Gusev et al.¹²), LDAK + 1Fun Model (two parameters, Speed et al.¹), LDAK + 24Fun Model (25 parameters, Speed et al.⁴), Baseline Model (53 parameters, Finucane et al.⁹), BLD-LDAK and BLD-LDAK + Alpha Models (66 and 67 parameters, this paper) and Baseline LD Model (75 parameters, Gazal et al.¹⁰). The estimated enrichment of a category is obtained by dividing its estimated proportion of SNP heritability by the proportion of SNPs it contains (horizontal dashed lines). Numerical values are provided in Supplementary Tables 5 & 6.
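The enrichment calculation described in this caption is a simple ratio. The following is a minimal sketch in Python; the function name and the example numbers are illustrative assumptions and are not taken from the paper or from SumHer.

```python
def enrichment(h2_share: float, snp_share: float) -> float:
    """Estimated enrichment of a SNP category.

    h2_share:  estimated proportion of SNP heritability contributed by the category
    snp_share: proportion of SNPs the category contains (must lie in (0, 1])
    """
    if not 0.0 < snp_share <= 1.0:
        raise ValueError("snp_share must be in (0, 1]")
    return h2_share / snp_share


# Hypothetical values for illustration only: a category holding 2% of SNPs
# that is estimated to contribute 10% of SNP heritability.
example = enrichment(h2_share=0.10, snp_share=0.02)
```

An enrichment of 1 means the category contributes heritability in proportion to its size; this corresponds to a bar sitting exactly on the horizontal dashed line.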

Extended Data Fig. 3 Reduced-complexity heritability models.

The seven-parameter BLD-LDAK-Lite is a reduced version of the BLD-LDAK Model, obtained by removing two of the nine continuous annotations and all 57 binary annotations (Supplementary Table 8 explains how we used forward stepwise selection to decide which of the continuous annotations to retain). The nine-parameter BLD-LDAK-Lite+1Fun Model adds to the BLD-LDAK-Lite Model one function indicator and the corresponding 500 base pair buffer, while the eight-parameter BLD-LDAK-Lite+Alpha Model is the same as the BLD-LDAK-Lite Model, except annotations are scaled by [f_j(1-f_j)]^(1+α). These plots show that estimates of SNP heritability and confounding bias from the BLD-LDAK-Lite Model, and average estimates of functional enrichments from the BLD-LDAK-Lite+1Fun Model, are close to those from the BLD-LDAK Model, while estimates of α from the BLD-LDAK-Lite+Alpha Model are close to those from the BLD-LDAK + Alpha Model. Numbers indicate how many of the pairs of estimates are inconsistent either nominally or after Bonferroni correction. Numerical values are provided in Supplementary Tables 3–7.
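The frequency scaling used by the +Alpha models can be sketched as follows. This is an illustration only: we assume f_j denotes the allele frequency of SNP j, and the function names are hypothetical rather than part of LDAK or SumHer.

```python
def alpha_scale(freq: float, alpha: float) -> float:
    """Return [f(1 - f)]^(1 + alpha) for an allele frequency f in (0, 1)."""
    if not 0.0 < freq < 1.0:
        raise ValueError("freq must be strictly between 0 and 1")
    return (freq * (1.0 - freq)) ** (1.0 + alpha)


def scale_annotations(annotations, freqs, alpha):
    """Scale each SNP's annotation value by its frequency-dependent factor."""
    return [a * alpha_scale(f, alpha) for a, f in zip(annotations, freqs)]
```

With α = −1 the factor equals 1 for every SNP, so the annotations are unchanged; with α = 0 each annotation is multiplied by f_j(1-f_j), which gives more weight to common SNPs.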

Extended Data Fig. 4 Comparison with GRE.

Hou et al.²¹ proposed GRE, a method for estimating SNP heritability without specifying a heritability model. GRE requires individual-level data and that there are more individuals than the number of SNPs on the largest chromosome. Here we compare estimates from GRE to those from SumHer for the 14 UKBb GWAS. To run GRE, we follow the instructions at www.github.com/bogdanlab/h2-GRE; to satisfy the sample size requirement, we use only the 623k directly genotyped SNPs (Hou et al. did likewise). For SumHer, we consider ten heritability models; to enable a fair comparison with GRE, we always restrict the reference panel to genotyped SNPs. The first three plots compare estimates of SNP heritability from GRE and SumHer. It is noticeable that when using only genotyped SNPs, changing the heritability model has a much smaller impact on estimates of SNP heritability than when using imputed SNPs (Supplementary Table 3); this reflects that with fewer SNPs, the impact of the prior assumptions is reduced. Nonetheless, if we consider GRE estimates to be the ‘gold standard’, then this analysis indicates that the LDAK-Thin, GCTA-LDMS-R, GCTA-LDMS-I, BLD-LDAK, BLD-LDAK + Alpha and Baseline LD Models produce more accurate estimates of SNP heritability than the GCTA, LDAK, LDAK + 24Fun and Baseline Models. In the fourth plot, the solid and dashed lines mark the point estimate and 95% confidence intervals for the gradient when regressing the absolute difference between estimates from SumHer and GRE on the Akaike Information Criterion (AIC); when performing this regression, we include an indicator for trait, to reflect that AIC will tend to be lower for more heritable traits. If we again consider GRE estimates to be the gold standard, then the fact that the gradient is significantly positive (P < 10⁻⁶) indicates that lower AIC implies more accurate estimates of SNP heritability.
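The regression behind the fourth plot can be sketched as below. This is a minimal illustration, not the authors' code: it uses the standard definition AIC = 2k − 2 logl, and it handles the trait indicator by centring both variables within each trait, which yields the same gradient as including one intercept per trait. The function names and the row layout are assumptions, and standard errors are omitted.

```python
from collections import defaultdict


def aic(logl: float, n_params: int) -> float:
    """Akaike Information Criterion: 2k - 2 * log likelihood."""
    return 2.0 * n_params - 2.0 * logl


def within_trait_gradient(rows):
    """Least-squares gradient of abs_diff on AIC with a trait-specific intercept.

    rows: iterable of (trait, aic_value, abs_diff), where abs_diff is the
    absolute difference between two heritability estimates for that trait.
    """
    groups = defaultdict(list)
    for trait, x, y in rows:
        groups[trait].append((x, y))
    sxy = sxx = 0.0
    for pairs in groups.values():
        mean_x = sum(x for x, _ in pairs) / len(pairs)
        mean_y = sum(y for _, y in pairs) / len(pairs)
        for x, y in pairs:
            # Centring within trait removes the trait-level intercept.
            sxy += (x - mean_x) * (y - mean_y)
            sxx += (x - mean_x) ** 2
    if sxx == 0.0:
        raise ValueError("AIC does not vary within any trait")
    return sxy / sxx
```

A positive gradient from such a fit means that, within a trait, models with lower AIC tend to give estimates closer to the reference estimate.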

Extended Data Fig. 5 Comparison of weighted least-squares and maximum likelihood solvers.

The plots compare likelihood ratio test (LRT) statistics (twice the improvement in log likelihood relative to the null model), computed using logl_SS, our approximate model likelihood. We consider six heritability models (see Supplementary Table 13 for definitions), estimating parameters using either maximum likelihood (our recommended approach) or weighted least-squares regression (the approach used by LDSC and previously by SumHer). Note that when we estimate parameters for the Baseline and Baseline LD Models using weighted least-squares regression, we frequently obtain negative E[S_j]; so that we can compute logl_SS, we replace these with 10⁻⁶. These plots show that for the GCTA, GCTA-LDMS-I, LDAK and LDAK + 24Fun Models (the simpler models), the two solvers result in near-identical model fit. However, for the Baseline and Baseline LD Models (the more complex models), weighted least-squares regression often results in a worse fit, because it does not respect that test statistics are approximately Gamma distributed. Note that we observe discordance between the weighted least-squares estimates from LDSC and SumHer (mainly evident for the Baseline Model) because the SumHer weighted least-squares solver is always iterative, whereas the LDSC solver is iterative when provided with a single-parameter heritability model, but one-step when provided with a multi-parameter model.
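To make the role of the Gamma assumption concrete, the sketch below evaluates a weighted Gamma log likelihood and the resulting LRT statistic. It rests on assumptions that are ours, not the caption's: each test statistic S_j is treated as Gamma distributed with shape 1/2 and scale 2E[S_j] (a scaled chi-squared distribution with one degree of freedom), the per-SNP weights are a hypothetical input, and the exact form of logl_SS is given only in the Supplementary Note. The replacement of non-positive E[S_j] by 10⁻⁶ follows the caption.

```python
import math

FLOOR = 1e-6  # replacement for non-positive expected statistics, as in the caption


def gamma_logl(stats, expected, weights=None):
    """Weighted log likelihood, assuming S_j ~ Gamma(shape=1/2, scale=2*E[S_j]).

    stats:    observed test statistics (each must be > 0)
    expected: model-implied E[S_j]; non-positive values are replaced by FLOOR
    weights:  optional per-SNP weights (hypothetical; defaults to 1 for every SNP)
    """
    if weights is None:
        weights = [1.0] * len(stats)
    total = 0.0
    for s, e, w in zip(stats, expected, weights):
        e = e if e > 0.0 else FLOOR
        # Log density of the assumed Gamma distribution evaluated at s.
        total += w * (-0.5 * math.log(2.0 * math.pi * e * s) - s / (2.0 * e))
    return total


def lrt_statistic(logl_model: float, logl_null: float) -> float:
    """Twice the improvement in log likelihood relative to the null model."""
    return 2.0 * (logl_model - logl_null)
```

Under this sketch, a fit that returns negative E[S_j] is penalized heavily once the floor is applied, because a statistic of ordinary size is extremely unlikely when its expectation is 10⁻⁶.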

Extended Data Fig. 6 Reduced quality control for UKBb GWAS.

For our main analysis of the UKBb GWAS, we first identified individuals with values for all 14 phenotypes, then filtered so that no pair remained with allelic correlation >0.02 (Supplementary Note 6). As a secondary analysis, we instead identified individuals with values for any of the 14 phenotypes, then filtered so that no pair remained with allelic correlation >0.03125. This increased the number of individuals from 130,080 to 246,655, with on average 236k phenotypic values per GWAS (range 201k to 247k). The first plot shows that increasing the sample size does not change the ranking of models based on the Akaike Information Criterion. The remaining three plots show that it does not significantly change estimates of SNP heritability or average functional enrichments from the BLD-LDAK Model, nor estimates of the selection-related parameter α from the BLD-LDAK + Alpha Model (horizontal and vertical segments indicate 95% confidence intervals; numbers indicate how many of the pairs of estimates are inconsistent either nominally or after Bonferroni correction).
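The relatedness filter can be illustrated with a simple greedy procedure. This is a sketch under our own assumptions, not the procedure described in Supplementary Note 6: the function name and input layout are hypothetical, and the greedy rule (repeatedly drop the individual involved in the most remaining related pairs) is one common heuristic that does not guarantee the largest possible retained sample.

```python
def filter_related(n_individuals, correlations, threshold):
    """Drop individuals until no pair has allelic correlation above threshold.

    correlations: dict mapping (i, j) index pairs to allelic correlation
    Returns the sorted indices of the individuals that are kept.
    """
    neighbours = {i: set() for i in range(n_individuals)}
    for (i, j), r in correlations.items():
        if r > threshold:
            neighbours[i].add(j)
            neighbours[j].add(i)
    kept = set(range(n_individuals))
    while kept:
        # Individual with the most related partners (lowest index on ties).
        worst = max(sorted(kept), key=lambda i: len(neighbours[i]))
        if not neighbours[worst]:
            break  # no related pairs remain
        kept.remove(worst)
        for j in neighbours.pop(worst):
            neighbours[j].discard(worst)
    return sorted(kept)
```

Raising the threshold, as in the secondary analysis (0.03125 rather than 0.02), classes fewer pairs as related and so retains more individuals.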

Supplementary information

Supplementary Information

Supplementary Note and Tables 1–16

Reporting Summary

About this article

Cite this article

Speed, D., Holmes, J. & Balding, D.J. Evaluating and improving heritability models using summary statistics. Nat Genet 52, 458–462 (2020). https://doi.org/10.1038/s41588-020-0600-y

Received: 30 July 2019
Accepted: 24 February 2020
Published: 23 March 2020
Issue Date: April 2020
DOI: https://doi.org/10.1038/s41588-020-0600-y
