# A generalized linear mixed model association tool for biobank-scale data

## Abstract

Compared with linear mixed model-based genome-wide association (GWA) methods, generalized linear mixed model (GLMM)-based methods have better statistical properties when applied to binary traits but are computationally much slower. In the present study, leveraging efficient sparse matrix-based algorithms, we developed a GLMM-based GWA tool, fastGWA-GLMM, that is severalfold to orders of magnitude faster than the state-of-the-art tools when applied to the UK Biobank (UKB) data and scalable to cohorts with millions of individuals. We show by simulation that the fastGWA-GLMM test statistics of both common and rare variants are well calibrated under the null, even for traits with extreme case–control ratios. We applied fastGWA-GLMM to the UKB data of 456,348 individuals, 11,842,647 variants and 2,989 binary traits (full summary statistics available at http://fastgwa.info/ukbimpbin), and identified 259 rare variants associated with 75 traits, demonstrating the use of imputed genotype data in a large cohort to discover rare variants for binary complex traits.

## Data availability

The individual-level genotype and phenotype data are available through formal application to the UKB (http://www.ukbiobank.ac.uk). GWAS summary statistics for the 2,989 binary traits from our analysis of the UKB data are fully available at http://fastgwa.info/ukbimpbin and the GWAS Catalog (GCP ID: GCP000224). Source data are provided with this paper.

## Code availability

FastGWA-GLMM, fastGWA-BB and ACAT-V are integrated in the GCTA software package (v.1.93.3), available at https://yanglab.westlake.edu.cn/software/gcta. The source code of GCTA v.1.93.3 is available at https://doi.org/10.5281/zenodo.5226943, and the analysis code to produce the major results presented in the paper is available at https://doi.org/10.5281/zenodo.5501110.

## Acknowledgements

We thank T. Qi for helpful discussion about the gene-based test. We thank J. Sidorenko for assistance in preparation of the UK Biobank data, and Alibaba Cloud—Australia and New Zealand for hosting the online tool. We thank the University of Queensland Research Computing Centre and the Westlake University High-Performance Computing Center for assistance in computing. J.Y. was supported by the Australian Research Council (grant no. FT180100186), the Australian National Health and Medical Research Council (grant no. 1113400) and the Westlake Education Foundation (grant no. 101566022001). The present study makes use of data from the UKB (applications: 12505 and 66982). UKB was established by the Wellcome Trust medical charity, Medical Research Council, Department of Health, Scottish Government and the Northwest Regional Development Agency. It has also had funding from the Welsh Assembly Government, British Heart Foundation and Diabetes UK. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

## Author information

### Contributions

J.Y. conceived and supervised the study. J.Y., L.J. and Z.Z. designed the experiment and developed the methods. Z.Z. developed the software tools with input from L.J., H.F. and J.Y. L.J. and Z.Z. performed the simulations and data analyses under the assistance and guidance of J.Y. L.J. and J.Y. wrote the manuscript with the participation of Z.Z. All the authors reviewed and approved the final manuscript.

### Corresponding author

Correspondence to Jian Yang.

## Ethics declarations

### Competing interests

The authors declare no competing interests.

Peer review information: Nature Genetics thanks Bjarni Vilhjálmsson and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer review reports are available.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Extended data

### Extended Data Fig. 1 Runtime of fastGWA-GLMM for 8 traits with different prevalence levels.

The x-axis represents the sample size, and the y-axis represents the total runtime in hours. Different traits are shown in different colours. The data used in this test consisted of 11,842,647 variants. All tests were performed in the same computing environment: 80 GB memory and 8 CPUs (Intel Xeon Gold 6148). Each test was repeated five times and the runtimes were averaged.

### Extended Data Fig. 2 FPR for SAIGE, fastGWA-GLMM and REGENIE quantified using the null common variants in simulations.

Three methods, SAIGE, fastGWA-GLMM and REGENIE, are compared. The y-axis represents the FPR computed from the null common variants (that is, all the common variants on the even chromosomes), and the x-axis represents different levels of prevalence of the simulated binary phenotypes (prevalence $$= n_{case}/(n_{case} + n_{control})$$). FPR is evaluated at five different alpha levels (α = 0.05, 0.005, $$5 \times 10^{-4}$$, $$5 \times 10^{-5}$$ and $$5 \times 10^{-6}$$), as shown in panels a–e, respectively. The dashed lines indicate the expected FPRs (that is, the alpha levels). Each boxplot represents the distribution of FPR across 100 simulation replicates. The line inside each box indicates the median value, notches indicate the 95% confidence interval, the central box indicates the interquartile range (IQR), whiskers indicate data up to 1.5 times the IQR, and outliers are shown as separate dots. In all the analyses, we used a one-sided $$\chi _{\mathrm{d.f.} = 1}^2$$ statistic to test against the null hypothesis of no association.
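As a rough illustration of how the FPR in these panels can be computed, the sketch below converts χ² statistics with 1 d.f. to upper-tail P values and counts the fraction of null variants that pass each alpha level. The function names and the ten test statistics are hypothetical and chosen only for illustration; this is not the authors' analysis code.

```python
import math

# Alpha levels listed in the caption.
ALPHA_LEVELS = (0.05, 0.005, 5e-4, 5e-5, 5e-6)


def chi2_1df_pvalue(chi2_stat: float) -> float:
    """Upper-tail P value of a chi-squared statistic with 1 d.f.

    For 1 d.f., P(X > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(chi2_stat / 2.0))


def prevalence(n_case: int, n_control: int) -> float:
    """Prevalence as defined in the caption: n_case / (n_case + n_control)."""
    return n_case / (n_case + n_control)


def false_positive_rate(null_pvalues, alpha: float) -> float:
    """Fraction of null variants whose P value falls below alpha."""
    hits = sum(1 for p in null_pvalues if p < alpha)
    return hits / len(null_pvalues)


# Hypothetical chi-squared statistics for ten null variants.
null_chi2 = [0.02, 0.45, 1.10, 3.90, 0.30, 2.70, 0.08, 6.70, 0.95, 0.01]
null_p = [chi2_1df_pvalue(x) for x in null_chi2]
fpr_by_alpha = {a: false_positive_rate(null_p, a) for a in ALPHA_LEVELS}
```

With only ten variants the estimate is very coarse; in the simulations described here the FPR is computed over all null variants on the even chromosomes, once per replicate.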

### Extended Data Fig. 3 FPR for SAIGE, fastGWA-GLMM and REGENIE quantified using the rare null variants in simulations.

Three methods, SAIGE, fastGWA-GLMM and REGENIE, are compared. The y-axis represents the FPR computed from the null rare variants (that is, all the rare variants on the even chromosomes), and the x-axis represents different levels of prevalence of the simulated binary phenotypes (prevalence $$= n_{case}/(n_{case} + n_{control})$$). FPR is evaluated at five different alpha levels (α = 0.05, 0.005, $$5 \times 10^{-4}$$, $$5 \times 10^{-5}$$ and $$5 \times 10^{-6}$$), as shown in panels a–e, respectively. The dashed lines indicate the expected FPRs (that is, the alpha levels). Each boxplot represents the distribution of FPR across 100 simulation replicates. The line inside each box indicates the median value, notches indicate the 95% confidence interval, the central box indicates the interquartile range (IQR), whiskers indicate data up to 1.5 times the IQR, and outliers are shown as separate dots. In all the analyses, we used a one-sided $$\chi _{\mathrm{d.f.} = 1}^2$$ statistic to test against the null hypothesis of no association.
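The boxplot conventions used in these captions (median, notches, IQR box, 1.5 × IQR whiskers, outliers) can be made concrete with a small sketch. The quartile rule (inclusive interpolation) and the notch half-width of 1.57 × IQR/√n are assumptions borrowed from common plotting defaults, not details stated in the paper, and the sample values are invented.

```python
import math
import statistics


def boxplot_summary(values):
    """Summarise one boxplot: median, IQR, notch, whiskers and outliers."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1

    # Whiskers reach the most extreme points within 1.5 x IQR of the box.
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]

    # Assumed notch convention: median +/- 1.57 * IQR / sqrt(n),
    # an approximate 95% confidence interval for the median.
    half_notch = 1.57 * iqr / math.sqrt(len(data))

    return {
        "median": median,
        "iqr": iqr,
        "notch": (median - half_notch, median + half_notch),
        "whiskers": (inside[0], inside[-1]),
        "outliers": outliers,
    }


# Invented values standing in for one metric across simulation replicates.
summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
```

In the figures each box would summarise 100 replicate values rather than ten.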

### Extended Data Fig. 4 Comparison of power (as measured by the mean χ² value of the causal variants) between SAIGE, fastGWA-GLMM and REGENIE.

The y-axis represents the mean χ² value of the causal variants (10,000 common and 1,000 rare causal variants on the odd chromosomes), and the x-axis represents different levels of prevalence of the simulated binary phenotypes (prevalence $$= n_{case}/(n_{case} + n_{control})$$). In addition to being evaluated for all 11,000 variants together in panel a, the mean χ² value is evaluated separately for common (MAF ≥ 0.01) and rare (MAF < 0.01) causal variants, as shown in panels b and c, respectively. Each boxplot represents the distribution of mean χ² across 100 simulation replicates. The line inside each box indicates the median value, notches indicate the 95% confidence interval, the central box indicates the interquartile range (IQR), whiskers indicate data up to 1.5 times the IQR, and outliers are shown as separate dots. In all the analyses, we used a one-sided $$\chi _{\mathrm{d.f.} = 1}^2$$ statistic to test against the null hypothesis of no association.
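A minimal sketch of the power summary described in this caption: the mean χ² of the causal variants, computed overall and separately for common and rare variants using the MAF threshold of 0.01 given above. The function name and the five (MAF, χ²) pairs are hypothetical.

```python
from statistics import fmean

# Threshold from the caption: common is MAF >= 0.01, rare is MAF < 0.01.
MAF_THRESHOLD = 0.01


def mean_chi2_by_class(variants):
    """Mean chi-squared of causal variants: overall, common and rare.

    `variants` is an iterable of (maf, chi2) pairs. Each class must be
    non-empty, otherwise fmean raises StatisticsError.
    """
    common = [chi2 for maf, chi2 in variants if maf >= MAF_THRESHOLD]
    rare = [chi2 for maf, chi2 in variants if maf < MAF_THRESHOLD]
    return {
        "all": fmean(common + rare),
        "common": fmean(common),
        "rare": fmean(rare),
    }


# Invented causal variants as (MAF, chi-squared) pairs.
causal = [(0.25, 12.0), (0.10, 8.0), (0.01, 4.0), (0.005, 2.0), (0.001, 4.0)]
power_summary = mean_chi2_by_class(causal)
```

Note that a variant with MAF exactly 0.01 counts as common under this definition.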

### Extended Data Fig. 5 FPR for fastGWA-GLMM and other methods quantified using all the null variants in simulations.

The y-axis represents the FPR computed from the null variants (that is, all the variants on the even chromosomes), and the x-axis represents different levels of prevalence of the simulated binary phenotypes (prevalence $$= n_{case}/(n_{case} + n_{control})$$). FPR is evaluated at five different alpha (P value threshold) levels (α = 0.05, 0.005, $$5 \times 10^{-4}$$, $$5 \times 10^{-5}$$ and $$5 \times 10^{-6}$$), as shown in panels a–e, respectively. The dashed lines indicate the expected FPRs (that is, the alpha levels). Each boxplot represents the distribution of FPR across 100 simulation replicates. The line inside each box indicates the median value, notches indicate the 95% confidence interval, the central box indicates the interquartile range (IQR), whiskers indicate data up to 1.5 times the IQR, and outliers are shown as separate dots. In all the analyses, we used a one-sided $$\chi _{\mathrm{d.f.} = 1}^2$$ statistic to test against the null hypothesis of no association.

### Extended Data Fig. 6 FPR for fastGWA-GLMM and fastGWA-GLMM-Ped quantified using all the null variants in simulations.

fastGWA-GLMM-Ped: fastGWA-GLMM using the pedigree relatedness matrix. fastGWA-GLMM: fastGWA-GLMM using the sparse GRM. The y-axis represents the FPR computed from the null variants (that is, all the variants on the even chromosomes), and the x-axis represents different levels of prevalence of the simulated binary phenotypes (prevalence $$= n_{case}/(n_{case} + n_{control})$$). FPR is evaluated at five different alpha levels (α = 0.05, 0.005, $$5 \times 10^{-4}$$, $$5 \times 10^{-5}$$ and $$5 \times 10^{-6}$$), as shown in panels a–e, respectively. The dashed lines indicate the expected FPRs (that is, the alpha levels). Each boxplot represents the distribution of FPR across 100 simulation replicates. The line inside each box indicates the median value, notches indicate the 95% confidence interval, the central box indicates the interquartile range (IQR), whiskers indicate data up to 1.5 times the IQR, and outliers are shown as separate dots. In all the analyses, we used a one-sided $$\chi _{\mathrm{d.f.} = 1}^2$$ statistic to test against the null hypothesis of no association.

### Extended Data Fig. 7 Mean χ² value of the causal variants for fastGWA-GLMM and fastGWA-GLMM-Ped under different simulation scenarios.

fastGWA-GLMM-Ped: fastGWA-GLMM using the pedigree relatedness matrix. fastGWA-GLMM: fastGWA-GLMM using the sparse GRM. The y-axis represents the mean χ² value of the causal variants (10,000 common and 1,000 rare causal variants on the odd chromosomes), and the x-axis represents different levels of prevalence of the simulated binary phenotypes (prevalence $$= n_{case}/(n_{case} + n_{control})$$). In addition to being evaluated for all 11,000 variants together in panel a, the mean χ² value is evaluated separately for common (MAF ≥ 0.01) and rare (MAF < 0.01) causal variants, as shown in panels b and c, respectively. Each boxplot represents the distribution of mean χ² across 100 simulation replicates. The line inside each box indicates the median value, notches indicate the 95% confidence interval, the central box indicates the interquartile range (IQR), whiskers indicate data up to 1.5 times the IQR, and outliers are shown as separate dots. In all the analyses, we used a one-sided $$\chi _{\mathrm{d.f.} = 1}^2$$ statistic to test against the null hypothesis of no association.

### Extended Data Fig. 8 False positive rate (FPR) for ACAT-V, fastGWA-BB, and REGENIE-Burden under different simulation scenarios.

Three gene-based test methods are compared in this analysis, that is, ACAT-V (implemented in GCTA), fastGWA-BB and REGENIE-Burden. The y-axis represents the FPR computed from the null genes (that is, all the 1,224 genes on chromosome 1 under the null simulation scenarios), and “Prev” on the x-axis represents different levels of simulated prevalence of the binary trait. The prevalence is defined as $$n_{case}/(n_{case} + n_{control})$$. FPR is evaluated at five different alpha levels (α = 0.05, 0.01, 0.005, 0.001 and $$5 \times 10^{-4}$$), as shown in panels a–e, respectively. The dashed lines indicate the expected FPRs (that is, the alpha levels). Each boxplot represents the distribution of FPR across 100 simulation replicates. The line inside each box indicates the median value, notches indicate the 95% confidence interval, the central box indicates the interquartile range (IQR), whiskers indicate data up to 1.5 times the IQR, and outliers are shown as separate dots. In all the analyses, we used a one-sided $$\chi _{\mathrm{d.f.} = 1}^2$$ statistic to test against the null hypothesis of no association.

### Extended Data Fig. 9 Statistical power for ACAT-V, fastGWA-BB, and REGENIE-Burden under different simulation scenarios.

Three gene-based test methods are compared in this analysis, that is, ACAT-V (implemented in GCTA), fastGWA-BB and REGENIE-Burden. The y-axis represents the power, defined as the proportion of the 100 simulated causal genes on chromosome 1 with P values less than the significance threshold after Bonferroni correction (that is, 0.05/1,224 = $$4.1 \times 10^{-5}$$, where 1,224 is the number of genes used in the simulation), and “Prev” on the x-axis represents different levels of simulated prevalence of the binary trait. The prevalence is defined as $$n_{case}/(n_{case} + n_{control})$$. We varied the proportion of variants being causal in a gene (5%, 20% or 50%) and the directions of variant effects (random or consistent), as labelled in the title of each panel. Each boxplot represents the distribution of power across 100 simulation replicates. The line inside each box indicates the median value, notches indicate the 95% confidence interval, the central box indicates the interquartile range (IQR), whiskers indicate data up to 1.5 times the IQR, and outliers are shown as separate dots. In all the analyses, we used a one-sided $$\chi _{\mathrm{d.f.} = 1}^2$$ statistic to test against the null hypothesis of no association.
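The power definition in this caption amounts to a Bonferroni threshold followed by a simple proportion. The sketch below reproduces the threshold arithmetic (0.05 divided by 1,224 genes) and applies it to four invented P values; the function names are hypothetical.

```python
N_GENES = 1224   # number of genes used in the simulation (from the caption)
ALPHA = 0.05     # family-wise significance level (from the caption)


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


def power(causal_pvalues, threshold: float) -> float:
    """Proportion of causal genes with a P value below the threshold."""
    hits = sum(1 for p in causal_pvalues if p < threshold)
    return hits / len(causal_pvalues)


threshold = bonferroni_threshold(ALPHA, N_GENES)  # about 4.08e-5

# Invented P values for four causal genes; two fall below the threshold.
causal_p = [1e-8, 3e-5, 4.2e-5, 0.02]
estimated_power = power(causal_p, threshold)
```

The unrounded threshold is slightly below the 4.1 × 10⁻⁵ quoted in the caption, so a P value of 4.09 × 10⁻⁵ would not count as significant here.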

### Extended Data Fig. 10 Area under the curve (AUC) for ACAT-V, fastGWA-BB, and REGENIE-Burden under different simulation scenarios.

Three gene-based test methods are compared in this analysis, that is, ACAT-V (implemented in GCTA), fastGWA-BB and REGENIE-Burden. The y-axis represents the AUC (that is, the area under the receiver operating characteristic (ROC) curve), and “Prev” on the x-axis represents different levels of simulated prevalence of the binary trait. The prevalence is defined as $$n_{case}/(n_{case} + n_{control})$$. We varied the proportion of variants being causal in a gene (5%, 20% or 50%) and the directions of variant effects (random or consistent), as labelled in the title of each panel. Each boxplot represents the distribution of AUC across 100 simulation replicates. The line inside each box indicates the median value, notches indicate the 95% confidence interval, the central box indicates the interquartile range (IQR), whiskers indicate data up to 1.5 times the IQR, and outliers are shown as separate dots. In all the analyses, we used a one-sided $$\chi _{\mathrm{d.f.} = 1}^2$$ statistic to test against the null hypothesis of no association.
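One way to compute an AUC of this kind is the rank-based (Mann-Whitney) form: the probability that a randomly chosen causal gene receives a stronger association signal than a randomly chosen null gene. The sketch below scores genes by −log10(P), which is an assumption made for illustration; the paper's own AUC procedure is not described in this caption, and the P values are invented.

```python
import math


def auc(causal_scores, null_scores) -> float:
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    wins = 0.0
    for c in causal_scores:
        for n in null_scores:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(causal_scores) * len(null_scores))


def to_scores(pvalues):
    """Turn P values into scores where larger means stronger evidence."""
    return [-math.log10(p) for p in pvalues]


# Invented gene-level P values.
causal_p = [1e-6, 1e-3, 0.2]
null_p = [0.5, 0.05, 1e-4, 0.9]
auc_value = auc(to_scores(causal_p), to_scores(null_p))
```

The pairwise loop is quadratic in the number of genes, which is fine for roughly a thousand genes; a rank-sum implementation would be preferable at larger scale.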

## Supplementary information

### Supplementary Information

Supplementary Notes 1–14, Tables 1–11 and 13–14, Figs. 1–17 and References.

### Supplementary Table

Supplementary Table 12 Quasi-independent association signals identified by fastGWA-GLMM for the 2,989 binary traits in the UK Biobank.

## Source data

### Source Data Fig. 3

Statistical source data.

### Source Data Extended Data Fig. 2

Statistical source data.

### Source Data Extended Data Fig. 3

Statistical source data.

### Source Data Extended Data Fig. 4

Statistical source data.

### Source Data Extended Data Fig. 5

Statistical source data.

### Source Data Extended Data Fig. 6

Statistical source data.

### Source Data Extended Data Fig. 7

Statistical source data.

### Source Data Extended Data Fig. 8

Statistical source data.

### Source Data Extended Data Fig. 9

Statistical source data.

### Source Data Extended Data Fig. 10

Statistical source data.

## Rights and permissions

Jiang, L., Zheng, Z., Fang, H. et al. A generalized linear mixed model association tool for biobank-scale data. Nat Genet 53, 1616–1621 (2021). https://doi.org/10.1038/s41588-021-00954-4
