A generalized linear mixed model association tool for biobank-scale data

Abstract

Compared with linear mixed model-based genome-wide association (GWA) methods, generalized linear mixed model (GLMM)-based methods have better statistical properties when applied to binary traits but are computationally much slower. In the present study, leveraging efficient sparse matrix-based algorithms, we developed a GLMM-based GWA tool, fastGWA-GLMM, that is severalfold to orders of magnitude faster than the state-of-the-art tools when applied to the UK Biobank (UKB) data and scalable to cohorts with millions of individuals. We show by simulation that the fastGWA-GLMM test statistics of both common and rare variants are well calibrated under the null, even for traits with extreme case–control ratios. We applied fastGWA-GLMM to the UKB data of 456,348 individuals, 11,842,647 variants and 2,989 binary traits (full summary statistics available at http://fastgwa.info/ukbimpbin), and identified 259 rare variants associated with 75 traits, demonstrating the use of imputed genotype data in a large cohort to discover rare variants for binary complex traits.

Fig. 1: Comparison of runtime and memory usage between fastGWA-GLMM and SAIGE.
Fig. 2: Runtime and memory usage of fastGWA-GLMM in the pseudo-cohort of two million individuals.
Fig. 3: FPR computed from the null variants.

Data availability

The individual-level genotype and phenotype data are available through formal application to the UKB (http://www.ukbiobank.ac.uk). GWAS summary statistics for the 2,989 binary traits from our analysis of the UKB data are fully available at http://fastgwa.info/ukbimpbin and the GWAS Catalog (GCP ID: GCP000224). Source data are provided with this paper.

Code availability

FastGWA-GLMM, fastGWA-BB and ACAT-V are integrated in the GCTA software package (v.1.93.3), available at https://yanglab.westlake.edu.cn/software/gcta. The source code of GCTA v.1.93.3 is available at https://doi.org/10.5281/zenodo.5226943, and the analysis code to produce the major results presented in the paper is available at https://doi.org/10.5281/zenodo.5501110.

Acknowledgements

We thank T. Qi for helpful discussion about the gene-based test. We thank J. Sidorenko for assistance in preparation of the UK Biobank data, and Alibaba Cloud—Australia and New Zealand for hosting the online tool. We thank the University of Queensland Research Computing Centre and the Westlake University High-Performance Computing Center for assistance in computing. J.Y. was supported by the Australian Research Council (grant no. FT180100186), the Australian National Health and Medical Research Council (grant no. 1113400) and the Westlake Education Foundation (grant no. 101566022001). The present study makes use of data from the UKB (applications: 12505 and 66982). UKB was established by the Wellcome Trust medical charity, Medical Research Council, Department of Health, Scottish Government and the Northwest Regional Development Agency. It has also had funding from the Welsh Assembly Government, British Heart Foundation and Diabetes UK. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author information

Contributions

J.Y. conceived and supervised the study. J.Y., L.J. and Z.Z. designed the experiment and developed the methods. Z.Z. developed the software tools with input from L.J., H.F. and J.Y. L.J. and Z.Z. performed the simulations and data analyses under the assistance and guidance of J.Y. L.J. and J.Y. wrote the manuscript with the participation of Z.Z. All the authors reviewed and approved the final manuscript.

Corresponding author

Correspondence to Jian Yang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Genetics thanks Bjarni Vilhjálmsson and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer review reports are available.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Runtime of fastGWA-GLMM for 8 traits with different prevalence levels.

The x-axis represents the sample size, and the y-axis represents the total runtime in hours. Different traits are labelled with different colours. The data used in this test consisted of 11,842,647 variants. All tests were performed in the same computing environment: 80 GB memory and 8 CPUs (Intel Xeon Gold 6148). Each test was repeated five times and the runtimes were averaged.

Extended Data Fig. 2 FPR for SAIGE, fastGWA-GLMM and REGENIE quantified using the null common variants in simulations.

Three methods, SAIGE, fastGWA-GLMM, and REGENIE, are compared. The y-axis represents the FPR computed from the null common variants (that is, all the common variants on the even chromosomes), and the x-axis represents different levels of prevalence of the simulated binary phenotypes (prevalence \(= n_{case}/(n_{case} + n_{control})\)). FPR is evaluated at five different alpha levels (α=0.05, 0.005, 5×10⁻⁴, 5×10⁻⁵, and 5×10⁻⁶), as shown in panels from a) to e), respectively. The dashed lines indicate the expected FPRs (that is, the alpha levels). Each boxplot represents the distribution of FPR across 100 simulation replicates. The line inside each box indicates the median value, notches indicate the 95% confidence interval, central box indicates the interquartile range (IQR), whiskers indicate data up to 1.5 times the IQR, and outliers are shown as separate dots. In all the analyses, we used a one-sided \(\chi _{\mathrm{d.f.} = 1}^2\) statistic to test against the null hypothesis of no association.
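The FPR described in this legend can be illustrated with a minimal sketch. It is not the authors' analysis code: the function names, the toy null statistics drawn from a squared standard normal, and the random seed are all assumptions made for illustration. It converts a \(\chi _{\mathrm{d.f.} = 1}^2\) statistic to an upper-tail P value and reports the proportion of null variants whose P value falls below each alpha level.

```python
import math
import random

# Alpha levels listed in the legend.
ALPHAS = (0.05, 0.005, 5e-4, 5e-5, 5e-6)


def chi2_1df_pvalue(stat):
    """Upper-tail P value of a chi-squared statistic with 1 d.f.

    For 1 d.f., P(X > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(stat / 2.0))


def prevalence(n_case, n_control):
    """Prevalence as defined in the legend: n_case / (n_case + n_control)."""
    return n_case / (n_case + n_control)


def false_positive_rate(null_pvalues, alpha):
    """Proportion of null variants with a P value below alpha."""
    hits = sum(1 for p in null_pvalues if p < alpha)
    return hits / len(null_pvalues)


# Toy data (an assumption, not the paper's simulation): well-calibrated
# null statistics are squared standard normal draws.
rng = random.Random(0)
null_stats = [rng.gauss(0.0, 1.0) ** 2 for _ in range(100_000)]
null_pvalues = [chi2_1df_pvalue(s) for s in null_stats]

fpr_by_alpha = {a: false_positive_rate(null_pvalues, a) for a in ALPHAS}
for alpha, fpr in fpr_by_alpha.items():
    print(f"alpha={alpha:g}  FPR={fpr:.2e}")
```

For a well-calibrated test, each observed FPR should lie close to its alpha level, which is what the dashed lines in the figure represent.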

Extended Data Fig. 3 FPR for SAIGE, fastGWA-GLMM and REGENIE quantified using the rare null variants in simulations.

Three methods, SAIGE, fastGWA-GLMM, and REGENIE, are compared. The y-axis represents the FPR computed from the null rare variants (that is, all the rare variants on the even chromosomes), and the x-axis represents different levels of prevalence of the simulated binary phenotypes (prevalence \(= n_{case}/(n_{case} + n_{control})\)). FPR is evaluated at five different alpha levels (α=0.05, 0.005, 5×10⁻⁴, 5×10⁻⁵, and 5×10⁻⁶), as shown in panels from a) to e), respectively. The dashed lines indicate the expected FPRs (that is, the alpha levels). Each boxplot represents the distribution of FPR across 100 simulation replicates. The line inside each box indicates the median value, notches indicate the 95% confidence interval, central box indicates the interquartile range (IQR), whiskers indicate data up to 1.5 times the IQR, and outliers are shown as separate dots. In all the analyses, we used a one-sided \(\chi _{\mathrm{d.f.} = 1}^2\) statistic to test against the null hypothesis of no association.

Extended Data Fig. 4 Comparison of power (as measured by the mean χ² value of the causal variants) between SAIGE, fastGWA-GLMM and REGENIE.

The y-axis represents the mean χ² value of the causal variants (10,000 common and 1,000 rare causal variants on the odd chromosomes), and the x-axis represents different levels of prevalence of the simulated binary phenotypes (prevalence \(= n_{case}/(n_{case} + n_{control})\)). Apart from being evaluated for the 11,000 variants altogether in panel a), the mean χ² value is also evaluated for common (MAF ≥ 0.01) and rare (MAF < 0.01) causal variants separately, as shown in panels b) and c), respectively. Each boxplot represents the distribution of mean χ² across 100 simulation replicates. The line inside each box indicates the median value, notches indicate the 95% confidence interval, central box indicates the interquartile range (IQR), whiskers indicate data up to 1.5 times the IQR, and outliers are shown as separate dots. In all the analyses, we used a one-sided \(\chi _{\mathrm{d.f.} = 1}^2\) statistic to test against the null hypothesis of no association.

Extended Data Fig. 5 FPR for fastGWA-GLMM and other methods quantified using all the null variants in simulations.

The y-axis represents the FPR computed from the null variants (that is, all the variants on the even chromosomes), and the x-axis represents different levels of prevalence of the simulated binary phenotypes (prevalence \(= n_{case}/(n_{case} + n_{control})\)). FPR is evaluated at five different alpha (P value threshold) levels (α=0.05, 0.005, 5×10⁻⁴, 5×10⁻⁵, and 5×10⁻⁶), as shown in panels from a) to e), respectively. The dashed lines indicate the expected FPRs (that is, the alpha levels). Each boxplot represents the distribution of FPR across 100 simulation replicates. The line inside each box indicates the median value, notches indicate the 95% confidence interval, central box indicates the interquartile range (IQR), whiskers indicate data up to 1.5 times the IQR, and outliers are shown as separate dots. In all the analyses, we used a one-sided \(\chi _{\mathrm{d.f.} = 1}^2\) statistic to test against the null hypothesis of no association.

Extended Data Fig. 6 FPR for fastGWA-GLMM and fastGWA-GLMM-Ped quantified using all the null variants in simulations.

FastGWA-GLMM-Ped: fastGWA-GLMM using the pedigree relatedness matrix. fastGWA-GLMM: fastGWA-GLMM using the sparse GRM. The y-axis represents the FPR computed from the null variants (that is, all the variants on the even chromosomes), and the x-axis represents different levels of prevalence of the simulated binary phenotypes (prevalence \(= n_{case}/(n_{case} + n_{control})\)). FPR is evaluated at five different alpha levels (α=0.05, 0.005, 5×10⁻⁴, 5×10⁻⁵, and 5×10⁻⁶), as shown in panels from a) to e), respectively. The dashed lines indicate the expected FPRs (that is, the alpha levels). Each boxplot represents the distribution of FPR across 100 simulation replicates. The line inside each box indicates the median value, notches indicate the 95% confidence interval, central box indicates the interquartile range (IQR), whiskers indicate data up to 1.5 times the IQR, and outliers are shown as separate dots. In all the analyses, we used a one-sided \(\chi _{\mathrm{d.f.} = 1}^2\) statistic to test against the null hypothesis of no association.

Extended Data Fig. 7 Mean χ² value of the causal variants for fastGWA-GLMM and fastGWA-GLMM-Ped under different simulation scenarios.

FastGWA-GLMM-Ped: fastGWA-GLMM using the pedigree relatedness matrix. fastGWA-GLMM: fastGWA-GLMM using the sparse GRM. The y-axis represents the mean χ² value of the causal variants (10,000 common and 1,000 rare causal variants on the odd chromosomes), and the x-axis represents different levels of prevalence of the simulated binary phenotypes (prevalence \(= n_{case}/(n_{case} + n_{control})\)). Apart from being evaluated for the 11,000 variants altogether in panel a), the mean χ² value is also evaluated for common (MAF ≥ 0.01) and rare (MAF < 0.01) causal variants separately, as shown in panels b) and c), respectively. Each boxplot represents the distribution of mean χ² across 100 simulation replicates. The line inside each box indicates the median value, notches indicate the 95% confidence interval, central box indicates the interquartile range (IQR), whiskers indicate data up to 1.5 times the IQR, and outliers are shown as separate dots. In all the analyses, we used a one-sided \(\chi _{\mathrm{d.f.} = 1}^2\) statistic to test against the null hypothesis of no association.

Extended Data Fig. 8 False positive rate (FPR) for ACAT-V, fastGWA-BB, and REGENIE-Burden under different simulation scenarios.

Three gene-based test methods are compared in this analysis, that is, ACAT-V (implemented in GCTA), fastGWA-BB, and REGENIE-Burden. The y-axis represents the FPR computed from the null genes (that is, all the 1,224 genes on chromosome 1 under the null simulation scenarios), and “Prev” on the x-axis represents different levels of simulated prevalence of the binary trait. The prevalence is defined as \(n_{case}/(n_{case} + n_{control})\). FPR is evaluated at five different alpha levels (α=0.05, 0.01, 0.005, 0.001, and 5×10⁻⁴), as shown in panels from a) to e), respectively. The dashed lines indicate the expected FPRs (that is, the alpha levels). Each boxplot represents the distribution of FPR across 100 simulation replicates. The line inside each box indicates the median value, notches indicate the 95% confidence interval, central box indicates the interquartile range (IQR), whiskers indicate data up to 1.5 times the IQR, and outliers are shown as separate dots. In all the analyses, we used a one-sided \(\chi _{\mathrm{d.f.} = 1}^2\) statistic to test against the null hypothesis of no association.

Extended Data Fig. 9 Statistical power for ACAT-V, fastGWA-BB, and REGENIE-Burden under different simulation scenarios.

Three gene-based test methods are compared in this analysis, that is, ACAT-V (implemented in GCTA), fastGWA-BB, and REGENIE-Burden. The y-axis represents the power, defined as the proportion of the 100 simulated causal genes on chromosome 1 with P values less than the significance threshold after Bonferroni correction (that is, 0.05/1,224 = 4.1×10⁻⁵, where 1,224 is the number of genes used in the simulation), and “Prev” on the x-axis represents different levels of simulated prevalence of the binary trait. The prevalence is defined as \(n_{case}/(n_{case} + n_{control})\). We varied the proportion of variants being causal in a gene (5%, 20%, or 50%) and the directions of variant effects (random or consistent), as labelled in the title of each panel. Each boxplot represents the distribution of power across 100 simulation replicates. The line inside each box indicates the median value, notches indicate the 95% confidence interval, central box indicates the interquartile range (IQR), whiskers indicate data up to 1.5 times the IQR, and outliers are shown as separate dots. In all the analyses, we used a one-sided \(\chi _{\mathrm{d.f.} = 1}^2\) statistic to test against the null hypothesis of no association.
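The power definition in this legend reduces to simple arithmetic, sketched below under stated assumptions. The function names and the toy P values are hypothetical and are not taken from the paper's analysis code. Only the alpha level (0.05) and the number of genes (1,224) come from the legend.

```python
def bonferroni_threshold(alpha, n_tests):
    """Significance threshold after Bonferroni correction."""
    return alpha / n_tests


def power(causal_pvalues, threshold):
    """Proportion of causal genes with a P value below the threshold."""
    detected = sum(1 for p in causal_pvalues if p < threshold)
    return detected / len(causal_pvalues)


# Values stated in the legend: alpha = 0.05 and 1,224 genes.
threshold = bonferroni_threshold(0.05, 1224)
print(f"Bonferroni threshold: {threshold:.1e}")

# Hypothetical P values for four causal genes (illustration only).
toy_pvalues = [1e-8, 3e-5, 5e-5, 0.2]
print(f"Power on toy data: {power(toy_pvalues, threshold):.2f}")
```

On the toy input, two of the four P values fall below the threshold, so the estimated power is 0.5.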

Extended Data Fig. 10 Area under the curve (AUC) for ACAT-V, fastGWA-BB, and REGENIE-Burden under different simulation scenarios.

Three gene-based test methods are compared in this analysis, that is, ACAT-V (implemented in GCTA), fastGWA-BB, and REGENIE-Burden. The y-axis represents the AUC (that is, the area under the receiver operating characteristic (ROC) curve), and “Prev” on the x-axis represents different levels of simulated prevalence of the binary trait. The prevalence is defined as \(n_{case}/(n_{case} + n_{control})\). We varied the proportion of variants being causal in a gene (5%, 20%, or 50%) and the directions of variant effects (random or consistent), as labelled in the title of each panel. Each boxplot represents the distribution of AUC across 100 simulation replicates. The line inside each box indicates the median value, notches indicate the 95% confidence interval, central box indicates the interquartile range (IQR), whiskers indicate data up to 1.5 times the IQR, and outliers are shown as separate dots. In all the analyses, we used a one-sided \(\chi _{\mathrm{d.f.} = 1}^2\) statistic to test against the null hypothesis of no association.
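As a rough illustration of the AUC in this legend, the sketch below computes the area under the ROC curve through its rank-based (Mann-Whitney) form. This is a generic technique, not necessarily how the authors computed it. The function name `roc_auc`, the use of −log10(P) as the gene score and the toy inputs are assumptions.

```python
import math


def roc_auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity.

    scores: higher means stronger evidence of association.
    labels: 1 for a causal gene, 0 for a null gene.
    Tied scores receive their average rank, so a tie counts as 0.5.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one causal and one null gene")

    pairs = sorted(zip(scores, labels))
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        # Find the block of tied scores starting at position i.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1..j
        rank_sum_pos += avg_rank * sum(pairs[k][1] for k in range(i, j))
        i = j

    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


# Toy gene-level P values and causal status (illustration only).
toy_pvalues = [1e-9, 2e-4, 0.03, 0.4, 0.7]
toy_labels = [1, 1, 0, 1, 0]
toy_scores = [-math.log10(p) for p in toy_pvalues]
print(f"AUC on toy data: {roc_auc(toy_scores, toy_labels):.3f}")
```

An AUC of 0.5 corresponds to a method that ranks causal and null genes no better than chance, and 1.0 to perfect separation.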

Supplementary information

Supplementary Information

Supplementary Notes 1–14, Tables 1–11 and 13–14, Figs. 1–17 and References.

Reporting Summary

Peer Review Information

Supplementary Table

Supplementary Table 12 Quasi-independent association signals identified by fastGWA-GLMM for the 2,989 binary traits in the UK Biobank.

Source data

Source Data Fig. 3

Statistical source data.

Source Data Extended Data Fig. 2

Statistical source data.

Source Data Extended Data Fig. 3

Statistical source data.

Source Data Extended Data Fig. 4

Statistical source data.

Source Data Extended Data Fig. 5

Statistical source data.

Source Data Extended Data Fig. 6

Statistical source data.

Source Data Extended Data Fig. 7

Statistical source data.

Source Data Extended Data Fig. 8

Statistical source data.

Source Data Extended Data Fig. 9

Statistical source data.

Source Data Extended Data Fig. 10

Statistical source data.

About this article

Cite this article

Jiang, L., Zheng, Z., Fang, H. et al. A generalized linear mixed model association tool for biobank-scale data. Nat Genet 53, 1616–1621 (2021). https://doi.org/10.1038/s41588-021-00954-4
