Abstract
The genetic effect-size distribution of a disease describes the number of risk variants, the range of their effect sizes and the sample sizes that will be required to discover them. Accurate estimation has been a challenge. Here I propose Fourier Mixture Regression (FMR), validating that it accurately estimates real and simulated effect-size distributions. Applied to summary statistics for ten diseases (average \(N_{\textrm{eff}} = 169,000\)), FMR estimates that 100,000–1,000,000 cases will be required for genome-wide significant SNPs to explain 50% of SNP heritability. In such large studies, genome-wide significance becomes increasingly conservative, and less stringent thresholds achieve high true positive rates if confounding is controlled. Across traits, polygenicity varies, but the range of their effect sizes is similar. Compared with effect sizes in the top 10% of heritability, including most discovered thus far, those in the bottom 10–50% are orders of magnitude smaller and more numerous, spanning a large fraction of the genome.
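The link between sample size and discovery can be illustrated with a standard power approximation, which is not part of FMR itself. The sketch below assumes that a SNP explaining a fraction `var_explained` of trait variance has a z statistic that is approximately normal with mean `sqrt(n_eff * var_explained)` and unit variance; the function name, parameters and example values are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and variance 1


def gwas_power(n_eff, var_explained, alpha=5e-8):
    """Approximate power of a two-sided 1-df association test for one SNP.

    Assumption (not taken from the paper): the z statistic is normal with
    mean sqrt(n_eff * var_explained) and unit variance.
    """
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)  # about 5.45 for alpha = 5e-8
    z_mean = (n_eff * var_explained) ** 0.5
    # Probability that |z| exceeds the critical value, summing both tails.
    return STD_NORMAL.cdf(z_mean - z_crit) + STD_NORMAL.cdf(-z_mean - z_crit)


if __name__ == "__main__":
    # Illustrative values only: two sample sizes and two per-SNP effect sizes.
    for n_eff in (100_000, 1_000_000):
        for q in (1e-4, 1e-5):
            print(f"N_eff={n_eff:>9,} q={q:.0e} power={gwas_power(n_eff, q):.3g}")
```

Under this approximation, a tenfold increase in sample size moves a SNP of fixed effect size from rarely detected to almost always detected, which is why smaller effects only become discoverable in very large studies.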
Data availability
GWAS summary statistics are available at https://alkesgroup.broadinstitute.org/. Numerical results for Figs. 2–5 are reported in the Supplementary Tables.
Code availability
Open-source software is available at https://github.com/lukejoconnor (ref. 63). GENESIS (ref. 14) software is available at https://github.com/yandorazhang/GENESIS.
References
International Schizophrenia Consortium. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 460, 748–752 (2009).
Lango Allen, H. et al. Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature 467, 832–838 (2010).
Visscher, P. M. et al. 10 years of GWAS discovery: biology, function, and translation. Am. J. Hum. Genet. 101, 5–22 (2017).
Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
Yang, J. et al. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 42, 565–569 (2010).
Khera, A. V. et al. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat. Genet. 50, 1219–1224 (2018).
Huang, H. et al. Fine-mapping inflammatory bowel disease loci to single-variant resolution. Nature 547, 173–178 (2017).
Ulirsch, J. C. et al. Interrogation of human hematopoiesis at single-cell and single-variant resolution. Nat. Genet. 51, 683–693 (2019).
Wang, G., Sarkar, A., Carbonetto, P. & Stephens, M. A simple new approach to variable selection in regression, with application to genetic fine mapping. J. R. Stat. Soc. B 82, 1273–1300 (2020).
Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228–1235 (2015).
Finucane, H. K. et al. Heritability enrichment of specifically expressed genes identifies disease-relevant tissues and cell types. Nat. Genet. 50, 621–629 (2018).
Yang, J. et al. Genome partitioning of genetic variation for complex traits using common SNPs. Nat. Genet. 43, 519–525 (2011).
Stahl, E. A. et al. Bayesian inference analyses of the polygenic architecture of rheumatoid arthritis. Nat. Genet. 44, 483–489 (2012).
Zhang, Y., Qi, G., Park, J.-H. & Chatterjee, N. Estimation of complex effect-size distributions using summary-level statistics from genome-wide association studies across 32 complex traits. Nat. Genet. 50, 1318–1326 (2018).
O’Connor, L. J. et al. Extreme polygenicity of complex traits is explained by negative selection. Am. J. Hum. Genet. 105, 456–476 (2019).
Park, J.-H. et al. Estimation of effect size distribution from genome-wide association studies and implications for future discoveries. Nat. Genet. 42, 570–575 (2010).
Palla, L. & Dudbridge, F. A fast method that uses polygenic scores to estimate the variance explained by genome-wide marker panels and the proportion of variants affecting a trait. Am. J. Hum. Genet. 97, 250–259 (2015).
Zeng, J. et al. Signatures of negative selection in the genetic architecture of human complex traits. Nat. Genet. 50, 746–753 (2018).
Zhu, X. & Stephens, M. Large-scale genome-wide enrichment analyses identify new trait-associated genes and pathways across 31 human phenotypes. Nat. Commun. 9, 4361 (2018).
Holland, D. et al. Beyond SNP heritability: polygenicity and discoverability of phenotypes estimated with a univariate Gaussian mixture model. PLoS Genet. 16, e1008612 (2020).
Vilhjálmsson, B. J. et al. Modeling linkage disequilibrium increases accuracy of polygenic risk scores. Am. J. Hum. Genet. 97, 576–592 (2015).
Moser, G. et al. Simultaneous discovery, estimation and prediction analysis of complex traits using a Bayesian mixture model. PLoS Genet. 11, e1004969 (2015).
Loh, P. R. et al. Contrasting genetic architectures of schizophrenia and other complex diseases using fast variance-components analysis. Nat. Genet. 47, 1385–1392 (2015).
Shi, H., Kichaev, G. & Pasaniuc, B. Contrasting the genetic architecture of 30 complex traits from summary association data. Am. J. Hum. Genet. 99, 139–153 (2016).
Weissbrod, O. et al. Functionally informed fine-mapping and polygenic localization of complex trait heritability. Nat. Genet. 52, 1355–1363 (2020).
Bulik-Sullivan, B. K. et al. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 47, 291–295 (2015).
Price, A. L., Zaitlen, N. A., Reich, D. & Patterson, N. New approaches to population stratification in genome-wide association studies. Nat. Rev. Genet. 11, 459–463 (2010).
Kong, A. et al. The nature of nurture: effects of parental genotypes. Science 359, 424–428 (2018).
Morris, T. T., Davies, N. M., Hemani, G. & Smith, G. D. Population phenomena inflate genetic associations of complex social traits. Sci. Adv. 6, eaay0328 (2020).
Hormozdiari, F. et al. Widespread allelic heterogeneity in complex traits. Am. J. Hum. Genet. 100, 789–802 (2017).
Schaid, D. J., Chen, W. & Larson, N. B. From genome-wide associations to candidate causal variants by statistical fine-mapping. Nat. Rev. Genet. 19, 491–504 (2018).
Nelson, C. P. et al. Genetically determined height and coronary artery disease. N. Engl. J. Med. 372, 1608–1618 (2015).
Palmer, C. & Pe’er, I. Statistical correction of the Winner’s Curse explains replication variability in quantitative trait genome-wide association studies. PLoS Genet. 13, e1006916 (2017).
Galinsky, K. J. et al. Fast principal-component analysis reveals convergent evolution of ADH1B in Europe and East Asia. Am. J. Hum. Genet. 98, 456–472 (2016).
Loh, P.-R., Kichaev, G., Gazal, S., Schoech, A. P. & Price, A. L. Mixed-model association for biobank-scale datasets. Nat. Genet. 50, 906–908 (2018).
Loh, P.-R. et al. Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat. Genet. 47, 284–290 (2015).
Turchin, M. C. et al. Evidence of widespread selection on standing variation in Europe at height-associated SNPs. Nat. Genet. 44, 1015–1019 (2012).
Sohail, M. et al. Polygenic adaptation on height is overestimated due to uncorrected stratification in genome-wide association studies. eLife 8, e39702 (2019).
Yang, J., Zaitlen, N. A., Goddard, M. E., Visscher, P. M. & Price, A. L. Advantages and pitfalls in the application of mixed-model association methods. Nat. Genet. 46, 100–106 (2014).
Devlin, B. & Roeder, K. Genomic control for association studies. Biometrics 55, 997–1004 (1999).
Stephens, M. False discovery rates: a new deal. Biostatistics 18, 275–294 (2017).
Zhu, X. & Stephens, M. Large-scale genome-wide enrichment analyses identify new trait-associated genes and pathways across 31 human phenotypes. Nat. Commun. 9, 4361 (2018).
Zuk, O. et al. Searching for missing heritability: designing rare variant association studies. Proc. Natl Acad. Sci. USA 111, E455–E464 (2014).
Asgari, S. et al. A positively selected FBN1 missense variant reduces height in Peruvian individuals. Nature 582, 234–239 (2020).
Martin, A. R. et al. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat. Genet. 51, 584–591 (2019).
Kichaev, G. & Pasaniuc, B. Leveraging functional-annotation data in trans-ethnic fine-mapping studies. Am. J. Hum. Genet. 97, 260–271 (2015).
Vilhjálmsson, B. J. et al. Modeling linkage disequilibrium increases accuracy of polygenic risk scores. Am. J. Hum. Genet. 97, 576–592 (2015).
Chung, W. et al. Efficient cross-trait penalized regression increases prediction accuracy in large cohorts using secondary phenotypes. Nat. Commun. 10, 569 (2019).
Chun, S. et al. Non-parametric polygenic risk prediction via partitioned GWAS summary statistics. Am. J. Hum. Genet. 107, 46–59 (2020).
Lloyd-Jones, L. R. et al. Improved polygenic prediction by Bayesian multiple regression on summary statistics. Nat. Commun. 10, 5086 (2019).
Chen, W., McDonnell, S. K., Thibodeau, S. N., Tillmans, L. S. & Schaid, D. J. Incorporating functional annotations for fine-mapping causal variants in a Bayesian framework using summary statistics. Genetics 204, 933–958 (2016).
Kichaev, G. et al. Improved methods for multi-trait fine mapping of pleiotropic risk loci. Bioinformatics 33, 248–255 (2017).
Benner, C., Havulinna, A., Salomaa, V., Ripatti, S. & Pirinen, M. Refining fine-mapping: effect sizes and regional heritability. Preprint at bioRxiv https://doi.org/10.1101/318618 (2018).
Boyle, E. A., Li, Y. I. & Pritchard, J. K. An expanded view of complex traits: from polygenic to omnigenic. Cell 169, 1177–1186 (2017).
Simons, Y. B., Bullaughey, K., Hudson, R. R. & Sella, G. A population genetic interpretation of GWAS findings for human quantitative traits. PLoS Biol. 16, e2002985 (2018).
Brown, B. C., Ye, C. J., Price, A. L. & Zaitlen, N. Transethnic genetic-correlation estimates from summary statistics. Am. J. Hum. Genet. 99, 76–88 (2016).
Gazal, S. et al. Linkage disequilibrium-dependent architecture of human complex traits shows action of negative selection. Nat. Genet. 49, 1421–1427 (2017).
Speed, D. et al. Reevaluation of SNP heritability in complex human traits. Nat. Genet. 49, 986–992 (2017).
Gazal, S., Marquez-Luna, C., Finucane, H. K. & Price, A. L. Reconciling S-LDSC and LDAK functional enrichment estimates. Nat. Genet. 51, 1202–1204 (2019).
Belmont, J. W. et al. A haplotype map of the human genome. Nature 437, 1299–1320 (2005).
Conneely, K. N. & Boehnke, M. So many correlated tests, so little time! Rapid adjustment of P values for multiple correlated tests. Am. J. Hum. Genet. 81, 1158–1168 (2007).
Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
O’Connor, L. J. lukejoconnor/FMR: initial release of FMR software. Zenodo https://doi.org/10.5281/zenodo.4670516 (2021).
Acknowledgements
I am grateful to A. Price, J. Ballard, A. Nadig, D. Weiner, O. Weissbrod, G. Getz, B. Neale and E. Lander for suggestions and discussions.
Author information
Contributions
L.J.O. wrote the manuscript.
Ethics declarations
Competing interests
The author declares no competing interests.
Additional information
Peer review information: Nature Genetics thanks Yan Zhang for their contribution to the peer review of this work.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Performance of FMR in simulations at different sample sizes.
I show the true HDM (yellow), estimates for 10 individual simulation replicates (grey), the mean estimate across 20 replicates (blue), and the mean uncorrected estimate. The uncorrected estimate is obtained by running FMR without any correction for sampling variation in the GWAS summary statistics (see Supplementary Note). Data were simulated under a point-normal model with either 1% or 10% of SNPs having nonzero causal effect sizes.
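As a rough illustration of the point-normal model named in this caption, the sketch below draws causal effect sizes for a chosen fraction of SNPs from a single normal distribution and sets the rest to zero. It assumes standardized, independent SNPs (no LD), so that heritability is the sum of squared effects; the simulations in the paper are described in its Supplementary Note and may differ. The function name and arguments are hypothetical.

```python
import random


def simulate_point_normal(m_snps, causal_fraction, h2, seed=0):
    """Draw per-SNP effect sizes under a point-normal model.

    Assumptions (for illustration only): SNPs are standardized and
    independent, and all causal SNPs share one normal distribution whose
    variance is chosen so that expected total heritability equals h2.
    """
    rng = random.Random(seed)
    n_causal = round(m_snps * causal_fraction)
    if n_causal == 0:
        raise ValueError("causal_fraction is too small for this many SNPs")
    causal = set(rng.sample(range(m_snps), n_causal))
    sd = (h2 / n_causal) ** 0.5  # per-SNP effect standard deviation
    return [rng.gauss(0.0, sd) if j in causal else 0.0 for j in range(m_snps)]


if __name__ == "__main__":
    # The caption's two settings: 1% or 10% of SNPs with nonzero effects.
    for fraction in (0.01, 0.10):
        beta = simulate_point_normal(10_000, fraction, h2=0.5, seed=1)
        n_nonzero = sum(1 for b in beta if b != 0.0)
        print(fraction, n_nonzero, sum(b * b for b in beta))
```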
Extended Data Fig. 2 Calibration of FMR jackknife standard errors.
Simulations were performed under a normal mixture model with small-, medium- and large-effect SNPs (similar to Fig. 1d), at sample size N=460k, N=145k or N=50k. For different effect-size thresholds, I calculated the standard error of the proportion of random-effect heritability explained by SNPs with effect sizes less than that threshold. Bar plots show root-mean-squared jackknife standard errors (blue) and empirical standard errors (orange) based on 25 replicates. At large sample size (N=460k), standard errors were sometimes underestimated, probably due to the nonnegativity constraints in the regression. Caution is needed when making comparisons between the genetic architecture of different traits, as underestimated standard errors could lead to false-positive differences.
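The calibration check described in this caption can be sketched generically. The code below is a minimal delete-one-block jackknife, assuming the data can be split into blocks and re-estimated with each block left out; how FMR defines its blocks is not stated in this caption, so `blocks` and `estimator` are hypothetical placeholders. The `rms` helper corresponds to the root-mean-squared jackknife standard error that is compared against the empirical standard error across replicates.

```python
from statistics import mean, stdev


def block_jackknife_se(blocks, estimator):
    """Delete-one-block jackknife standard error of `estimator`.

    `blocks` is a list of lists of observations; `estimator` maps a flat
    list of observations to a number. Both are illustrative placeholders.
    """
    n = len(blocks)
    leave_one_out = []
    for i in range(n):
        kept = [x for j, block in enumerate(blocks) if j != i for x in block]
        leave_one_out.append(estimator(kept))
    centre = sum(leave_one_out) / n
    variance = (n - 1) / n * sum((e - centre) ** 2 for e in leave_one_out)
    return variance ** 0.5


def rms(values):
    """Root mean square, used to pool jackknife SEs across replicates."""
    return (sum(v * v for v in values) / len(values)) ** 0.5


if __name__ == "__main__":
    # Toy check: with one observation per block and the mean as estimator,
    # the jackknife SE matches the usual standard error of the mean.
    data = [1.0, 2.0, 3.0, 4.0, 5.0]
    blocks = [[x] for x in data]
    print(block_jackknife_se(blocks, mean), stdev(data) / len(data) ** 0.5)
```

In a calibration experiment of this kind, `rms` would be applied to the jackknife standard errors from each replicate and compared with `stdev` of the point estimates across replicates; a pooled jackknife value below the empirical one indicates underestimated standard errors.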
Extended Data Fig. 3 Effect of changing the FMR sampling times and mixture components in simulations.
Simulations were performed under a normal mixture model with small-, medium- and large-effect SNPs (similar to Fig. 1d), at sample size N=460k. I specified a set of 17 mixture components (\(\sigma ^2 = [2^{ - 9},2^{ - 8}, \ldots ,2^7]\)) and 17 sampling times (\(t_k = 1/\sigma _k\)), and performed simulations with various subsets of the respective values. In panels a-d, I use the same values of \(\sigma ^2\) (\(\sigma _3^2,\sigma _4^2, \ldots ,\sigma _{15}^2\), which correspond to the default FMR model) and various values of \(t\). In panels e-f, I vary the values of \(\sigma ^2\). In most cases, very similar results are obtained, except when too few sampling times are used (panel d). 25 replicates are performed (identical between the figure panels), the first 10 of which are plotted in grey. The mean and standard deviation across replicates are shown in blue. (a) \(\sigma ^2 = [\sigma _3^2,\sigma _4^2, \ldots ,\sigma _{15}^2]\), \({\boldsymbol{t}} = [t_3,t_4, \ldots ,t_{15}]\); (b) \(\sigma ^2 = [\sigma _3^2,\sigma _4^2, \ldots ,\sigma _{15}^2]\), \({\boldsymbol{t}} = [t_1,t_2, \ldots ,t_{17}]\); (c) \(\sigma ^2 = [\sigma _3^2,\sigma _4^2, \ldots ,\sigma _{15}^2]\), \({\boldsymbol{t}} = [t_5,t_6, \ldots ,t_{11}]\); (d) \(\sigma ^2 = [\sigma _3^2,\sigma _4^2, \ldots ,\sigma _{15}^2]\), \({\boldsymbol{t}} = [t_5,t_7, \ldots ,t_{15}]\); (e) \(\sigma ^2 = [\sigma _1^2,\sigma _2^2, \ldots ,\sigma _{17}^2]\), \({\boldsymbol{t}} = [t_1,t_2, \ldots ,t_{17}]\); (f) \(\sigma ^2 = [\sigma _5^2,\sigma _6^2, \ldots ,\sigma _{13}^2]\), \({\boldsymbol{t}} = [t_5,t_6, \ldots ,t_{13}]\). I recommend using 13 mixture components and 13 sampling times, even though a smaller number may suffice (Extended Data Fig. 3f).
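The grids in this caption can be written out explicitly. The sketch below builds the 17 candidate variances and the matching sampling times, then selects the subsets used in each panel. It assumes that the subscripts are 1-based and follow the order in which the variances are listed, and that panel d takes every second sampling time from \(t_5\) to \(t_{15}\); the variable and function names are hypothetical and this is not the FMR software interface.

```python
EXPONENTS = range(-9, 8)                 # -9, -8, ..., 7 (17 values)
SIGMA2 = [2.0 ** e for e in EXPONENTS]   # candidate mixture variances
T = [1.0 / s2 ** 0.5 for s2 in SIGMA2]   # sampling times, t_k = 1 / sigma_k


def pick(values, first, last, step=1):
    """Select entries first..last using 1-based, inclusive indices."""
    return values[first - 1:last:step]


# (mixture variances, sampling times) for each panel of the figure.
PANELS = {
    "a": (pick(SIGMA2, 3, 15), pick(T, 3, 15)),     # default: 13 and 13
    "b": (pick(SIGMA2, 3, 15), pick(T, 1, 17)),
    "c": (pick(SIGMA2, 3, 15), pick(T, 5, 11)),
    "d": (pick(SIGMA2, 3, 15), pick(T, 5, 15, 2)),  # assumed: every second t
    "e": (pick(SIGMA2, 1, 17), pick(T, 1, 17)),
    "f": (pick(SIGMA2, 5, 13), pick(T, 5, 13)),
}

if __name__ == "__main__":
    for name, (variances, times) in PANELS.items():
        print(name, len(variances), "components,", len(times), "sampling times")
```

Under these assumptions the default model (panel a) spans variances from \(2^{-7}\) to \(2^{5}\), and panel d is the configuration with the fewest sampling times.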
Extended Data Fig. 4 Observed number of genome-wide significant SNPs and proportion of heritability explained at N=145k vs 460k.
For numerical results, see Supplementary Table 2.
Extended Data Fig. 5 Predicted vs. observed heritability explained by genome-wide significant SNPs at different significance thresholds.
\(\% h_{{\mathrm{GWAS}}}^2\) was predicted using interim-release UK Biobank summary statistics (maximum N=145k) and evaluated in the full release (maximum N=460k). Squared correlations between predicted and observed values were 0.94, 0.95, 0.93 and 0.88 in panels a-d, respectively. Lower \(r^2\) at \(\chi ^2 > 1000\) (panel d) could result from the small number of loci with large effect sizes, which may increase the sampling variance of both the FMR predictions and the observed values. In panel d, the data points for several traits are superimposed near the origin. For numerical results, see Supplementary Table 2.
Extended Data Fig. 6 Predicted vs. observed number of genome-wide significant SNPs at different significance thresholds.
\(M_{{\mathrm{GWAS}}}\) was predicted using interim-release UK Biobank summary statistics (maximum N=145k) and evaluated in the full release (maximum N=460k). Squared correlations between predicted and observed values were 0.92, 0.97, 0.92 and 0.91 in panels a-d, respectively. In panel d, the data points for several traits are superimposed near the origin. For numerical results, see Supplementary Table 2.
Extended Data Fig. 7 Consistency of FMR predictions at different sample sizes.
\(\% h_{{\mathrm{GWAS}}}^2\) and \(M_{{\mathrm{GWAS}}}\) were predicted for 22 traits based on N=145k vs. N=460k summary statistics, with target sample size equal to 460k, 2M or 10M. Predictions assume that the LD score regression intercept will be equal to what was observed at N=145k for both sets of estimates. Numerical results are presented in Supplementary Table 2.
Extended Data Fig. 8 Performance of GENESIS vs. FMR predictions in UK Biobank.
FMR and GENESIS were applied to interim-release UK Biobank summary statistics (maximum N=145k) for 22 traits in order to predict the results of the full release (maximum N=460k). Numerical results are presented in Supplementary Table 1.
Extended Data Fig. 9 Consistency of GENESIS predictions at different sample sizes.
\(\% h_{{\mathrm{GWAS}}}^2\) and \(M_{{\mathrm{GWAS}}}\) were predicted for 19 traits (Supplementary Table 3) based on N=145k vs. N=460k summary statistics, with target sample size equal to 460k, 2M or 10M. At N=460k, predictions of large-N \(\% h_{{\mathrm{GWAS}}}^2\) were slightly smaller (panels b-c), while predictions of \(M_{{\mathrm{GWAS}}}\) were slightly larger (panel f). This difference could result from a less severe form of the power-dependent bias that is known to affect the point-normal (2-component) model when it is misspecified: as sample size increases, SNPs with smaller effect sizes become detectable, and estimates shift toward a larger number of causal SNPs with smaller effect sizes. (This only occurs when the model is misspecified, with a larger-than-expected number of small-effect SNPs.) The 3-component model ameliorates this bias by including a small-effect heritability component even at small sample sizes. However, if this model too is misspecified (for example, when there is a mixture of small-, medium- and large-effect SNPs), then it would be affected in the same way as the point-normal model, but to a lesser degree. Numerical results are presented in Supplementary Table 3. The same analysis using FMR is presented in Extended Data Fig. 7.
Extended Data Fig. 10 Estimated HDM of height using summary statistics from GIANT vs. UK Biobank.
If results were biased by population stratification, the bottom-left portion of the curve (corresponding to small-effect SNPs) would be inflated for estimates based on GIANT.
Supplementary information
Supplementary Information
Supplementary Tables 1, 4, 6 and 8, Figs. 1–7 and Note
Supplementary Tables
Supplementary Tables 2, 3, 5, 7 and 9–12
About this article
Cite this article
O’Connor, L.J. The distribution of common-variant effect sizes. Nat Genet 53, 1243–1249 (2021). https://doi.org/10.1038/s41588-021-00901-3
DOI: https://doi.org/10.1038/s41588-021-00901-3