A generalized linear mixed model association tool for biobank-scale data

Jiang, Longda; Zheng, Zhili; Fang, Hailing; Yang, Jian

doi:10.1038/s41588-021-00954-4

Technical Report
Published: 04 November 2021

Longda Jiang^1,2,na1,
Zhili Zheng^1,na1,
Hailing Fang^2,3 &
Jian Yang^1,2,3 (ORCID: orcid.org/0000-0003-2001-2474)

Nature Genetics volume 53, pages 1616–1621 (2021)

Abstract

Compared with linear mixed model-based genome-wide association (GWA) methods, generalized linear mixed model (GLMM)-based methods have better statistical properties when applied to binary traits but are computationally much slower. In the present study, leveraging efficient sparse matrix-based algorithms, we developed a GLMM-based GWA tool, fastGWA-GLMM, that is severalfold to orders of magnitude faster than the state-of-the-art tools when applied to the UK Biobank (UKB) data and scalable to cohorts with millions of individuals. We show by simulation that the fastGWA-GLMM test statistics of both common and rare variants are well calibrated under the null, even for traits with extreme case–control ratios. We applied fastGWA-GLMM to the UKB data of 456,348 individuals, 11,842,647 variants and 2,989 binary traits (full summary statistics available at http://fastgwa.info/ukbimpbin), and identified 259 rare variants associated with 75 traits, demonstrating the use of imputed genotype data in a large cohort to discover rare variants for binary complex traits.

**Fig. 1: Comparison of runtime and memory usage between fastGWA-GLMM and SAIGE.**

**Fig. 2: Runtime and memory usage of fastGWA-GLMM in the pseudo-cohort of two million individuals.**

**Fig. 3: FPR computed from the null variants.**

Data availability

The individual-level genotype and phenotype data are available through formal application to the UKB (http://www.ukbiobank.ac.uk). GWAS summary statistics for the 2,989 binary traits from our analysis of the UKB data are fully available at http://fastgwa.info/ukbimpbin and the GWAS Catalog (GCP ID: GCP000224). Source data are provided with this paper.

Code availability

FastGWA-GLMM, fastGWA-BB and ACAT-V are integrated in the GCTA software package (v.1.93.3), available at https://yanglab.westlake.edu.cn/software/gcta. The source code of GCTA v.1.93.3 is available at https://doi.org/10.5281/zenodo.5226943, and the analysis code to produce the major results presented in the paper is available at https://doi.org/10.5281/zenodo.5501110.

Acknowledgements

We thank T. Qi for helpful discussion about the gene-based test. We thank J. Sidorenko for assistance in preparation of the UK Biobank data, and Alibaba Cloud—Australia and New Zealand for hosting the online tool. We thank the University of Queensland Research Computing Centre and the Westlake University High-Performance Computing Center for assistance in computing. J.Y. was supported by the Australian Research Council (grant no. FT180100186), the Australian National Health and Medical Research Council (grant no. 1113400) and the Westlake Education Foundation (grant no. 101566022001). The present study makes use of data from the UKB (applications: 12505 and 66982). UKB was established by the Wellcome Trust medical charity, Medical Research Council, Department of Health, Scottish Government and the Northwest Regional Development Agency. It has also had funding from the Welsh Assembly Government, British Heart Foundation and Diabetes UK. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author information

These authors contributed equally: Longda Jiang, Zhili Zheng.

Authors and Affiliations

Institute for Molecular Bioscience, University of Queensland, Brisbane, Queensland, Australia
Longda Jiang, Zhili Zheng & Jian Yang
School of Life Sciences, Westlake University, Hangzhou, China
Longda Jiang, Hailing Fang & Jian Yang
Westlake Laboratory of Life Sciences and Biomedicine, Hangzhou, China
Hailing Fang & Jian Yang

Contributions

J.Y. conceived and supervised the study. J.Y., L.J. and Z.Z. designed the experiment and developed the methods. Z.Z. developed the software tools with input from L.J., H.F. and J.Y. L.J. and Z.Z. performed the simulations and data analyses with the assistance and guidance of J.Y. L.J. and J.Y. wrote the manuscript with the participation of Z.Z. All the authors reviewed and approved the final manuscript.

Corresponding author

Correspondence to Jian Yang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Genetics thanks Bjarni Vilhjálmsson and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer review reports are available.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Runtime of fastGWA-GLMM for 8 traits with different prevalence levels.

The x-axis represents the sample size, and the y-axis represents the total runtime in hours. Different traits are labelled with different colours. The data used in this test consisted of 11,842,647 variants. All tests were performed in the same computing environment: 80 GB memory and 8 CPUs (Intel Xeon Gold 6148). Each test was repeated five times and the runtimes were averaged.

Extended Data Fig. 2 FPR for SAIGE, fastGWA-GLMM and REGENIE quantified using the null common variants in simulations.

Three methods, SAIGE, fastGWA-GLMM, and REGENIE, are compared. The y-axis represents the FPR computed from the null common variants (that is, all the common variants on the even chromosomes), and the x-axis represents different levels of prevalence of the simulated binary phenotypes (prevalence \(= n_{case}/(n_{case} + n_{control})\)). FPR is evaluated at five different alpha levels (α=0.05, 0.005, 5×10⁻⁴, 5×10⁻⁵, and 5×10⁻⁶), as shown in panels from a) to e), respectively. The dashed lines indicate the expected FPRs (that is, the alpha levels). Each boxplot represents the distribution of FPR across 100 simulation replicates. The line inside each box indicates the median value, notches indicate the 95% confidence interval, central box indicates the interquartile range (IQR), whiskers indicate data up to 1.5 times the IQR, and outliers are shown as separate dots. In all the analyses, we used a one-sided \(\chi _{\mathrm{d.f.} = 1}^2\) statistic to test against the null hypothesis of no association.
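As a rough illustration of the FPR quantity plotted in these panels, the sketch below computes, for a single simulation replicate, the proportion of null-variant P values that fall below each alpha level, along with the prevalence ratio defined in the caption. The function names and the toy P values are hypothetical and are not taken from the paper's analysis code.

```python
from typing import Dict, Sequence

# Alpha levels named in the caption (panels a to e).
ALPHAS = (0.05, 0.005, 5e-4, 5e-5, 5e-6)


def prevalence(n_case: int, n_control: int) -> float:
    """Prevalence as defined in the caption: n_case / (n_case + n_control)."""
    return n_case / (n_case + n_control)


def false_positive_rate(null_pvalues: Sequence[float], alpha: float) -> float:
    """Proportion of null variants whose P value is below alpha."""
    if not null_pvalues:
        raise ValueError("need at least one null P value")
    return sum(p < alpha for p in null_pvalues) / len(null_pvalues)


def fpr_table(null_pvalues: Sequence[float],
              alphas: Sequence[float] = ALPHAS) -> Dict[float, float]:
    """FPR at every alpha level for one simulation replicate."""
    return {a: false_positive_rate(null_pvalues, a) for a in alphas}


# Toy data (an assumption for illustration): evenly spaced P values
# standing in for the null variants of one replicate.
toy_pvalues = [(i + 0.5) / 1000 for i in range(1000)]
toy_fpr = fpr_table(toy_pvalues)
```

For a well-calibrated test, each entry of such a table is expected to sit close to its alpha level, which is what the dashed lines in the panels mark.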

Extended Data Fig. 3 FPR for SAIGE, fastGWA-GLMM and REGENIE quantified using the rare null variants in simulations.

Three methods, SAIGE, fastGWA-GLMM, and REGENIE, are compared. The y-axis represents the FPR computed from the null rare variants (that is, all the rare variants on the even chromosomes), and the x-axis represents different levels of prevalence of the simulated binary phenotypes (prevalence \(= n_{case}/(n_{case} + n_{control})\)). FPR is evaluated at five different alpha levels (α=0.05, 0.005, 5×10⁻⁴, 5×10⁻⁵, and 5×10⁻⁶), as shown in panels from a) to e), respectively. The dashed lines indicate the expected FPRs (that is, the alpha levels). Each boxplot represents the distribution of FPR across 100 simulation replicates. The line inside each box indicates the median value, notches indicate the 95% confidence interval, central box indicates the interquartile range (IQR), whiskers indicate data up to 1.5 times the IQR, and outliers are shown as separate dots. In all the analyses, we used a one-sided \(\chi _{\mathrm{d.f.} = 1}^2\) statistic to test against the null hypothesis of no association.

Extended Data Fig. 4 Comparison of power (as measured by the mean χ² value of the causal variants) between SAIGE, fastGWA-GLMM and REGENIE.

The y-axis represents the mean χ² value of the causal variants (10,000 common and 1,000 rare causal variants on the odd chromosomes), and the x-axis represents different levels of prevalence of the simulated binary phenotypes (prevalence \(= n_{case}/(n_{case} + n_{control})\)). Apart from being evaluated for the 11,000 variants altogether in panel a), the mean χ² value is also evaluated for common (MAF ≥ 0.01) and rare (MAF < 0.01) causal variants separately, as shown in panels b) and c), respectively. Each boxplot represents the distribution of mean χ² across 100 simulation replicates. The line inside each box indicates the median value, notches indicate the 95% confidence interval, central box indicates the interquartile range (IQR), whiskers indicate data up to 1.5 times the IQR, and outliers are shown as separate dots. In all the analyses, we used a one-sided \(\chi _{\mathrm{d.f.} = 1}^2\) statistic to test against the null hypothesis of no association.
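The two quantities in this caption, the mean χ² of the causal variants (overall and split at MAF 0.01) and the P value from a χ² statistic with one degree of freedom, can be sketched with the Python standard library alone. The sketch relies on the identity that the upper-tail probability of a χ² variable with 1 d.f. equals erfc(√(x/2)); the helper names and toy inputs are illustrative assumptions, not the authors' code.

```python
import math
from statistics import fmean
from typing import Dict, Iterable, Tuple


def chi2_1df_pvalue(chi2_stat: float) -> float:
    """Upper-tail P value of a chi-squared statistic with 1 d.f."""
    if chi2_stat < 0:
        raise ValueError("a chi-squared statistic cannot be negative")
    return math.erfc(math.sqrt(chi2_stat / 2.0))


def _mean_or_nan(values) -> float:
    """Mean of a list, or NaN when the list is empty."""
    return fmean(values) if values else float("nan")


def mean_chi2_by_maf(variants: Iterable[Tuple[float, float]],
                     maf_cutoff: float = 0.01) -> Dict[str, float]:
    """Mean chi-squared over (MAF, chi2) pairs: all, common and rare.

    Common variants have MAF >= maf_cutoff and rare variants have
    MAF < maf_cutoff, following the split described in the caption.
    """
    pairs = list(variants)
    common = [c for maf, c in pairs if maf >= maf_cutoff]
    rare = [c for maf, c in pairs if maf < maf_cutoff]
    return {
        "all": _mean_or_nan([c for _, c in pairs]),
        "common": _mean_or_nan(common),
        "rare": _mean_or_nan(rare),
    }


# Toy (MAF, chi2) pairs, invented purely for illustration.
toy = [(0.20, 4.0), (0.05, 2.0), (0.005, 1.0), (0.001, 3.0)]
toy_means = mean_chi2_by_maf(toy)
```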

Extended Data Fig. 5 FPR for fastGWA-GLMM and other methods quantified using all the null variants in simulations.

The y-axis represents the FPR computed from the null variants (that is, all the variants on the even chromosomes), and the x-axis represents different levels of prevalence of the simulated binary phenotypes (prevalence \(= n_{case}/(n_{case} + n_{control})\)). FPR is evaluated at five different alpha (P value threshold) levels (α=0.05, 0.005, 5×10⁻⁴, 5×10⁻⁵, and 5×10⁻⁶), as shown in panels from a) to e), respectively. The dashed lines indicate the expected FPRs (that is, the alpha levels). Each boxplot represents the distribution of FPR across 100 simulation replicates. The line inside each box indicates the median value, notches indicate the 95% confidence interval, central box indicates the interquartile range (IQR), whiskers indicate data up to 1.5 times the IQR, and outliers are shown as separate dots. In all the analyses, we used a one-sided \(\chi _{\mathrm{d.f.} = 1}^2\) statistic to test against the null hypothesis of no association.

Extended Data Fig. 6 FPR for fastGWA-GLMM and fastGWA-GLMM-Ped quantified using all the null variants in simulations.

FastGWA-GLMM-Ped: fastGWA-GLMM using the pedigree relatedness matrix. fastGWA-GLMM: fastGWA-GLMM using the sparse GRM. The y-axis represents the FPR computed from the null variants (that is, all the variants on the even chromosomes), and the x-axis represents different levels of prevalence of the simulated binary phenotypes (prevalence \(= n_{case}/(n_{case} + n_{control})\)). FPR is evaluated at five different alpha levels (α=0.05, 0.005, 5×10⁻⁴, 5×10⁻⁵, and 5×10⁻⁶), as shown in panels from a) to e), respectively. The dashed lines indicate the expected FPRs (that is, the alpha levels). Each boxplot represents the distribution of FPR across 100 simulation replicates. The line inside each box indicates the median value, notches indicate the 95% confidence interval, central box indicates the interquartile range (IQR), whiskers indicate data up to 1.5 times the IQR, and outliers are shown as separate dots. In all the analyses, we used a one-sided \(\chi _{\mathrm{d.f.} = 1}^2\) statistic to test against the null hypothesis of no association.

Extended Data Fig. 7 Mean χ² value of the causal variants for fastGWA-GLMM and fastGWA-GLMM-Ped under different simulation scenarios.

FastGWA-GLMM-Ped: fastGWA-GLMM using the pedigree relatedness matrix. fastGWA-GLMM: fastGWA-GLMM using the sparse GRM. The y-axis represents the mean χ² value of the causal variants (10,000 common and 1,000 rare causal variants on the odd chromosomes), and the x-axis represents different levels of prevalence of the simulated binary phenotypes (prevalence \(= n_{case}/(n_{case} + n_{control})\)). Apart from being evaluated for the 11,000 variants altogether in panel a), the mean χ² value is also evaluated for common (MAF ≥ 0.01) and rare (MAF < 0.01) causal variants separately, as shown in panels b) and c), respectively. Each boxplot represents the distribution of mean χ² across 100 simulation replicates. The line inside each box indicates the median value, notches indicate the 95% confidence interval, central box indicates the interquartile range (IQR), whiskers indicate data up to 1.5 times the IQR, and outliers are shown as separate dots. In all the analyses, we used a one-sided \(\chi _{\mathrm{d.f.} = 1}^2\) statistic to test against the null hypothesis of no association.

Extended Data Fig. 8 False positive rate (FPR) for ACAT-V, fastGWA-BB, and REGENIE-Burden under different simulation scenarios.

Three gene-based test methods are compared in this analysis, that is, ACAT-V (implemented in GCTA), fastGWA-BB, and REGENIE-Burden. The y-axis represents the FPR computed from the null genes (that is, all the 1,224 genes on chromosome 1 under the null simulation scenarios), and “Prev” on the x-axis represents different levels of simulated prevalence of the binary trait. The prevalence is defined as \(n_{case}/(n_{case} + n_{control})\). FPR is evaluated at five different alpha levels (α=0.05, 0.01, 0.005, 0.001, and 5×10⁻⁴), as shown in panels from a) to e), respectively. The dashed lines indicate the expected FPRs (that is, the alpha levels). Each boxplot represents the distribution of FPR across 100 simulation replicates. The line inside each box indicates the median value, notches indicate the 95% confidence interval, central box indicates the interquartile range (IQR), whiskers indicate data up to 1.5 times the IQR, and outliers are shown as separate dots. In all the analyses, we used a one-sided \(\chi _{\mathrm{d.f.} = 1}^2\) statistic to test against the null hypothesis of no association.

Extended Data Fig. 9 Statistical power for ACAT-V, fastGWA-BB, and REGENIE-Burden under different simulation scenarios.

Three gene-based test methods are compared in this analysis, that is, ACAT-V (implemented in GCTA), fastGWA-BB, and REGENIE-Burden. The y-axis represents the power, defined as the proportion of the 100 simulated causal genes on chromosome 1 with P values less than the significance threshold after Bonferroni correction (that is, 0.05/1,224 = 4.1×10⁻⁵, where 1,224 is the number of genes used in the simulation), and “Prev” on the x-axis represents different levels of simulated prevalence of the binary trait. The prevalence is defined as \(n_{case}/(n_{case} + n_{control})\). We varied the proportion of variants being causal in a gene (5%, 20%, or 50%) and the directions of variant effects (random or consistent), as labelled in the title of each panel. Each boxplot represents the distribution of power across 100 simulation replicates. The line inside each box indicates the median value, notches indicate the 95% confidence interval, central box indicates the interquartile range (IQR), whiskers indicate data up to 1.5 times the IQR, and outliers are shown as separate dots. In all the analyses, we used a one-sided \(\chi _{\mathrm{d.f.} = 1}^2\) statistic to test against the null hypothesis of no association.
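The power definition in this caption reduces to a short calculation: divide 0.05 by the 1,224 genes to obtain the Bonferroni threshold, then take the proportion of causal-gene P values below it. The following is a minimal sketch under that reading; the function names and the toy P values are assumptions made for illustration.

```python
from typing import Sequence

N_GENES = 1224          # number of genes used in the simulation (from the caption)
FAMILYWISE_ALPHA = 0.05


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold after Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def power(causal_pvalues: Sequence[float], threshold: float) -> float:
    """Proportion of causal genes with a P value below the threshold."""
    if not causal_pvalues:
        raise ValueError("need at least one causal-gene P value")
    return sum(p < threshold for p in causal_pvalues) / len(causal_pvalues)


# About 4.08e-5, which the caption reports rounded as 4.1e-5.
threshold = bonferroni_threshold(FAMILYWISE_ALPHA, N_GENES)

# Toy P values for four hypothetical causal genes (not real results).
toy_power = power([1e-6, 1e-3, 2e-5, 0.5], threshold)
```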

Extended Data Fig. 10 Area under the curve (AUC) for ACAT-V, fastGWA-BB, and REGENIE-Burden under different simulation scenarios.

Three gene-based test methods are compared in this analysis, that is, ACAT-V (implemented in GCTA), fastGWA-BB, and REGENIE-Burden. The y-axis represents the AUC (that is, the area under the receiver operating characteristic (ROC) curve), and “Prev” on the x-axis represents different levels of simulated prevalence of the binary trait. The prevalence is defined as \(n_{case}/(n_{case} + n_{control})\). We varied the proportion of variants being causal in a gene (5%, 20% or 50%) and the directions of variant effects (random or consistent), as labelled in the title of each panel. Each boxplot represents the distribution of AUC across 100 simulation replicates. The line inside each box indicates the median value, notches indicate the 95% confidence interval, central box indicates the interquartile range (IQR), whiskers indicate data up to 1.5 times the IQR, and outliers are shown as separate dots. In all the analyses, we used a one-sided \(\chi _{\mathrm{d.f.} = 1}^2\) statistic to test against the null hypothesis of no association.

About this article

Cite this article

Jiang, L., Zheng, Z., Fang, H. et al. A generalized linear mixed model association tool for biobank-scale data. Nat Genet 53, 1616–1621 (2021). https://doi.org/10.1038/s41588-021-00954-4

Received: 14 December 2020
Accepted: 22 September 2021
Published: 04 November 2021
Issue Date: November 2021
DOI: https://doi.org/10.1038/s41588-021-00954-4
