Scalable generalized linear mixed model for region-based association tests in large biobanks and cohorts

Zhou, Wei; Zhao, Zhangchen; Nielsen, Jonas B.; Fritsche, Lars G.; LeFaive, Jonathon; Gagliano Taliun, Sarah A.; Bi, Wenjian; Gabrielsen, Maiken E.; Daly, Mark J.; Neale, Benjamin M.; Hveem, Kristian; Abecasis, Goncalo R.; Willer, Cristen J.; Lee, Seunggeun

doi:10.1038/s41588-020-0621-6

Technical Report
Published: 18 May 2020

Scalable generalized linear mixed model for region-based association tests in large biobanks and cohorts

Nature Genetics volume 52, pages 634–639 (2020)Cite this article

9577 Accesses
85 Citations
17 Altmetric
Metrics details

Subjects

Abstract

With very large sample sizes, biobanks provide an exciting opportunity to identify genetic components of complex traits. To analyze rare variants, region-based multiple-variant aggregate tests are commonly used to increase power for association tests. However, because of the substantial computational cost, existing region-based tests cannot analyze hundreds of thousands of samples while accounting for confounders such as population stratification and sample relatedness. Here we propose a scalable generalized mixed-model region-based association test, SAIGE-GENE, that is applicable to exome-wide and genome-wide region-based analysis for hundreds of thousands of samples and can account for unbalanced case–control ratios for binary traits. Through extensive simulation studies and analysis of the HUNT study with 69,716 Norwegian samples and the UK Biobank data with 408,910 White British samples, we show that SAIGE-GENE can efficiently analyze large-sample data (N > 400,000) with type I error rates well controlled.

Access through your institution

Buy or subscribe

This is a preview of subscription content, access via your institution

Access options

Access through your institution

Buy this article

Purchase on Springer Link
Instant access to full article PDF

Buy now

Prices may be subject to local taxes which are calculated during checkout

**Fig. 1: Estimated and projected computational cost by sample size (N) for gene-based tests of 15,342 genes, each containing 50 rare variants.**

**Fig. 2: Quantile–quantile plots of exome-wide gene-based association results.**

Exome-wide analysis implicates rare protein-altering variants in human handedness

Article Open access 02 April 2024

Dick Schijven, Sourena Soheili-Nezhad, … Clyde Francks

Genome-wide association studies

Article 26 August 2021

Emil Uffelmann, Qin Qin Huang, … Danielle Posthuma

Tissue-specific enhancer–gene maps from multimodal single-cell data identify causal disease alleles

Article 09 April 2024

Saori Sakaue, Kathryn Weinand, … Soumya Raychaudhuri

Data availability

The summary statistics and quantile–quantile plots for 53 quantitative phenotypes and 10 binary phenotypes in the UK Biobank by SAIGE-GENE are available for public download at https://www.leelabsg.org/resources.

Code availability

SAIGE-GENE is implemented as an open-source R package available at https://github.com/weizhouUMICH/SAIGE/.

References

Taliun, D. et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Preprint at bioRxiv https://doi.org/10.1101/563866 (2019).
Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
Article CAS Google Scholar
Lee, S., Abecasis, G. R., Boehnke, M. & Lin, X. Rare-variant association analysis: study designs and statistical tests. Am. J. Hum. Genet. 95, 5–23 (2014).
Article CAS Google Scholar
Wu, M. C. et al. Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet. 89, 82–93 (2011).
Article CAS Google Scholar
Lee, S., Wu, M. C. & Lin, X. Optimal tests for rare variant effects in sequencing association studies. Biostatistics 13, 762–775 (2012).
Article Google Scholar
Chen, H. et al. Efficient variant set mixed model association tests for continuous and binary traits in large-scale whole-genome sequencing studies. Am. J. Hum. Genet. 104, 260–274 (2019).
Article CAS Google Scholar
Kang, H. M. et al. Variance component model to account for sample structure in genome-wide association studies. Nat. Genet. 42, 348–354 (2010).
Article CAS Google Scholar
Natarajan, P. et al. Deep-coverage whole genome sequences and blood lipids among 16,324 individuals. Nat. Commun. 9, 3391 (2018).
Article Google Scholar
Zhou, W. et al. Efficiently controlling for case–control imbalance and sample relatedness in large-scale genetic association studies. Nat. Genet. 50, 1335–1341 (2018).
Article CAS Google Scholar
Dey, R., Schmidt, E. M., Abecasis, G. R. & Lee, S. A fast and accurate algorithm to test for binary phenotypes and its application to PheWAS. Am. J. Hum. Genet. 101, 37–49 (2017).
Article CAS Google Scholar
Kuonen, D. Saddlepoint approximations for distributions of quadratic forms in normal variables. Biometrika 4, 7 (1999).
Google Scholar
Daniels, H. E. Saddlepoint approximations in statistics. Ann. Math. Stat. 25, 631–650 (1954).
Article Google Scholar
Lee, S., Fuchsberger, C., Kim, S. & Scott, L. An efficient resampling method for calibrating single and gene-based rare variant association analysis in case–control studies. Biostatistics 17, 1–15 (2016).
Article CAS Google Scholar
Zhao, Z. et al. UK Biobank whole-exome sequence binary phenome analysis with robust region-based rare-variant test. Am. J. Hum. Genet. 106, 3–12 (2020).
Article CAS Google Scholar
Krokstad, S. et al. Cohort profile: the HUNT study, Norway. Int. J. Epidemiol. 42, 968–977 (2013).
Article CAS Google Scholar
Langhammer, A., Krokstad, S., Romundstad, P., Heggland, J. & Holmen, J. The HUNT study: participation is associated with survival and depends on socioeconomic status, diseases and symptoms. BMC Med. Res. Method. 12, 143 (2012).
Article Google Scholar
Loh, P. R. et al. Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat. Genet. 47, 284–290 (2015).
Article CAS Google Scholar
Svishcheva, G. R., Axenovich, T. I., Belonogova, N. M., van Duijn, C. M. & Aulchenko, Y. S. Rapid variance components-based method for whole-genome association analysis. Nat. Genet. 44, 1166–1170 (2012).
Article CAS Google Scholar
Liu, D. J. et al. Meta-analysis of gene-level tests for rare variant association. Nat. Genet. 46, 200–204 (2014).
Article CAS Google Scholar
Yang, J., Zaitlen, N. A., Goddard, M. E., Visscher, P. M. & Price, A. L. Advantages and pitfalls in the application of mixed-model association methods. Nat. Genet. 46, 100–106 (2014).
Article Google Scholar
Willer, C. J. et al. Discovery and refinement of loci associated with lipid levels. Nat. Genet. 45, 1274–1283 (2013).
Article CAS Google Scholar
Willer, C. J. et al. Newly identified loci that influence lipid concentrations and risk of coronary artery disease. Nat. Genet. 40, 161–169 (2008).
Article CAS Google Scholar
Holm, H. et al. Several common variants modulate heart rate, PR interval and QRS duration. Nat. Genet. 42, 117–122 (2010).
Article CAS Google Scholar
Eijgelsheim, M. et al. Genome-wide association analysis identifies multiple loci related to resting heart rate. Hum. Mol. Genet. 19, 3885–3894 (2010).
Article CAS Google Scholar
Eppinga, R. N. et al. Identification of genomic loci associated with resting heart rate and shared genetic predictors with all-cause mortality. Nat. Genet. 48, 1557–1563 (2016).
Article CAS Google Scholar
Arking, D. E. et al. Genetic association study of QT interval highlights role for calcium signaling pathways in myocardial repolarization. Nat. Genet. 46, 826–836 (2014).
Article CAS Google Scholar
Swoap, S. J., Weinshenker, D., Palmiter, R. D. & Garber, G. Dbh ^–/– mice are hypotensive, have altered circadian rhythms, and have abnormal responses to dieting and stress. Am. J. Physiol. Regul. Integr. Comp. Physiol. 286, R108–R113 (2004).
Article CAS Google Scholar
Marouli, E. et al. Rare and low-frequency coding variants alter human adult height. Nature 542, 186–190 (2017).
Article CAS Google Scholar
Turalba, A. V. & Chen, T. C. Clinical and genetic characteristics of primary juvenile-onset open-angle glaucoma (JOAG). Semin. Ophthalmol. 23, 19–25 (2008).
Article Google Scholar
Breslow, N. E. & Clayton, D. G. Approximate inference in generalized linear mixed models. J. Am. Stat. Assoc. 88, 9–25 (1993).
Google Scholar
Chen, H. et al. Control for population structure and relatedness for binary traits in genetic association studies via logistic mixed models. Am. J. Hum. Genet. 98, 653–666 (2016).
Article CAS Google Scholar
Lee, S. H. & van der Werf, J. H. An efficient variance component approach implementing an average information REML suitable for combined LD and linkage mapping with a general complex pedigree. Genet. Sel. Evol. 38, 25–43 (2006).
Article CAS Google Scholar
Gilmour, A. R., Thompson, R. & Cullis, B. R. Average information REML: an efficient algorithm for variance parameter estimation in linear mixed models. Biometrics 51, 1440–1450 (1995).
Article Google Scholar
Lee, S., Teslovich, T. M., Boehnke, M. & Lin, X. General framework for meta-analysis of rare variants in sequencing association studies. Am. J. Hum. Genet. 93, 42–53 (2013).
Article CAS Google Scholar
Davis, T. A. Direct Methods for Sparse Linear Systems (Fundamentals of Algorithms 2) (Society for Industrial and Applied Mathematics, 2006).
Abecasis, G. R., Cherny, S. S., Cookson, W. O. & Cardon, L. R. Merlin—rapid analysis of dense genetic maps using sparse gene flow trees. Nat. Genet. 30, 97–101 (2002).
Article CAS Google Scholar
de Villemereuil, P., Schielzeth, H., Nakagawa, S. & Morrissey, M. General methods for evolutionary quantitative genetic inference from generalized mixed models. Genetics 204, 1281–1294 (2016).
Article Google Scholar
Das, S. et al. Next-generation genotype imputation service and methods. Nat. Genet. 48, 1284–1287 (2016).
Article CAS Google Scholar
McCarthy, S. et al. A reference panel of 64,976 haplotypes for genotype imputation. Nat. Genet. 48, 1279–1283 (2016).
Article CAS Google Scholar

Download references

Acknowledgements

This research has been conducted using the UK Biobank Resource under application number 45227. The Nord-Trøndelag Health Study (the HUNT study) is a collaboration between the HUNT Research Centre (Faculty of Medicine, Norwegian University of Science and Technology (NTNU)), the Nord-Trøndelag County Council, the Central Norway Health Authority and the Norwegian Institute of Public Health. The K.G. Jebsen Center for Genetic Epidemiology is financed by Stiftelsen Kristian Gerhard Jebsen, the Faculty of Medicine and Health Sciences at the Norwegian University of Science and Technology (NTNU) and the Central Norway Regional Health Authority. S.L. and W.B. were supported by National Institutes of Health grant R01HG008773. W.Z. was supported by the National Human Genome Research Institute of the National Institutes of Health under award number T32HG010464.

Author information

These authors contributed equally: Wei Zhou, Zhangchen Zhao.

Authors and Affiliations

Center for Statistical Genetics, University of Michigan School of Public Health, Ann Arbor, MI, USA
Wei Zhou, Zhangchen Zhao, Lars G. Fritsche, Jonathon LeFaive, Sarah A. Gagliano Taliun, Wenjian Bi, Goncalo R. Abecasis & Seunggeun Lee
Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
Wei Zhou, Mark J. Daly & Benjamin M. Neale
Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA
Wei Zhou, Mark J. Daly & Benjamin M. Neale
Stanley Center for Psychiatric Research, Broad Institute of Harvard and MIT, Cambridge, MA, USA
Wei Zhou, Mark J. Daly & Benjamin M. Neale
Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, MI, USA
Zhangchen Zhao, Lars G. Fritsche, Jonathon LeFaive, Sarah A. Gagliano Taliun, Wenjian Bi, Goncalo R. Abecasis & Seunggeun Lee
Division of Cardiology, Department of Internal Medicine, University of Michigan Medical School, Ann Arbor, MI, USA
Jonas B. Nielsen & Cristen J. Willer
K.G. Jebsen Center for Genetic Epidemiology, Department of Public Health and Nursing, Norwegian University of Science and Technology (NTNU), Trondheim, Norway
Maiken E. Gabrielsen & Kristian Hveem
Institute for Molecular Medicine Finland, Helsinki Institute of Life Sciences, University of Helsinki, Helsinki, Finland
Mark J. Daly
HUNT Research Centre, Department of Public Health and Nursing, Norwegian University of Science and Technology (NTNU), Levanger, Norway
Kristian Hveem
Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
Cristen J. Willer
Department of Human Genetics, University of Michigan Medical School, Ann Arbor, MI, USA
Cristen J. Willer
Graduate School of Data Science, Seoul National University, Seoul, Republic of Korea
Seunggeun Lee

Authors

Wei Zhou
View author publications
You can also search for this author in PubMed Google Scholar
Zhangchen Zhao
View author publications
You can also search for this author in PubMed Google Scholar
Jonas B. Nielsen
View author publications
You can also search for this author in PubMed Google Scholar
Lars G. Fritsche
View author publications
You can also search for this author in PubMed Google Scholar
Jonathon LeFaive
View author publications
You can also search for this author in PubMed Google Scholar
Sarah A. Gagliano Taliun
View author publications
You can also search for this author in PubMed Google Scholar
Wenjian Bi
View author publications
You can also search for this author in PubMed Google Scholar
Maiken E. Gabrielsen
View author publications
You can also search for this author in PubMed Google Scholar
Mark J. Daly
View author publications
You can also search for this author in PubMed Google Scholar
Benjamin M. Neale
View author publications
You can also search for this author in PubMed Google Scholar
Kristian Hveem
View author publications
You can also search for this author in PubMed Google Scholar
Goncalo R. Abecasis
View author publications
You can also search for this author in PubMed Google Scholar
Cristen J. Willer
View author publications
You can also search for this author in PubMed Google Scholar
Seunggeun Lee
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

W.Z., Z.Z. and S.L. designed experiments. W.Z., Z.Z. and S.L. performed experiments. W.Z. implemented the software with input from W.B. and J.L. J.B.N., L.G.F. and S.A.G.T. constructed phenotypes for UK Biobank data. M.E.G. and K.H. provided data for the HUNT study. W.Z., Z.Z., C.J.W., S.L. and G.R.A. analyzed UK Biobank data. B.M.N. and M.J.D. provided helpful advice. W.Z., Z.Z. and S.L. wrote the manuscript with input from S.A.G.T. and M.E.G.

Corresponding authors

Correspondence to Wei Zhou or Seunggeun Lee.

Ethics declarations

Competing interests

G.R.A. is an employee of Regeneron Pharmaceuticals. He owns stock and stock options for Regeneron Pharmaceuticals. B.M.N. is a member of the Deep Genomics Scientific Advisory Board, has received travel expenses from Illumina and also serves as a consultant for Avanir and Trigeminal solutions.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Workflow of SAIGE-GENE.

SAIGE-GENE consists of two steps: (1) Fitting the null generalized linear mixed model (GLMM) to estimate variance components and other model parameters; (2) Testing for association between each genetic variant set, such as a gene or a region, and the phenotype.

Extended Data Fig. 2 Plots of the variance ratio of the score statistics by MAC for rare variants with and without the full GRM for sample relatedness (left) and with the full GRM and a sparse GRM for closely related samples (right).

a, Genotypes were simulated for 500 families and 5,000 independent individuals based on the pedigree structure shown in Supplementary Fig. 1 and the null model was fitted for the simulated quantitative trait with h² = 0.2. The sparse GRM was constructed using a coefficient of relatedness cutoff 0.2. b, 20,000 samples with White British ancestry were randomly selected from the UK Biobank and the null model was fitted for the automated read pulse rate. The sparse GRM was constructed using a coefficient of relatedness cutoff 0.125. c, 20,000 samples were randomly selected form the HUNT study and the null model was fitted for HDL. The sparse GRM was constructed using a coefficient of relatedness cutoff 0.125.

Extended Data Fig. 3 Scatter plots of association P-values from SAIGE-GENE versus SMMAT and EmmaX-SKAT for the Burden, SKAT, and SKAT-O tests based on simulation data on the -log₁₀ scale.

1,000,000 genes were tested with 1,000 families, each having 10 members, as shown in the Supplementary Fig. 1. The Pearson’s correlation coefficients r² > 0.99 for -log₁₀(P-values) between SAIGE and SMMAT and between SAIGE and EmmaX-SKAT. a, h² = 0.2. b, h² = 0.4.

Extended Data Fig. 4 Scatter plots of association P-values from SAIGE-GENE versus SMMAT and EmmaX-SKAT for the Burden, SKAT, and SKAT-O tests based on real data analysis on the -log₁₀ scale.

a,b, 12,000 genes were tested for automated read pulse rate on 20,000 randomly selected white British samples in the HRC-imputed UK Biobank (a) and for HDL on 20,000 randomly selected samples in HUNT (b). Missense and stop-gain variants with MAF ≤ 1% were included. The Pearson’s correlation coefficients r² > 0.99 for -log₁₀(P-values) between SAIGE and SMMAT and between SAIGE and EmmaX-SKAT.

Extended Data Fig. 5 Scatter plots of association P-values on the -log10 scale from SAIGE-GENE with two sample relatedness cutoffs for the sparse GRM, 0.125 and 0.2. 15,338 genes were tested for automated read pulse rate in white British samples in the HRC-imputed UK Biobank (N = 385,365).

N, sample size. Missense and stop-gain variants with MAF ≤ 1% were included. a, Burden. b, SKAT. c, SKAT-O.

Extended Data Fig. 6 Quantile-quantile plots of association P-values for 10 million variant sets from the simulation study for phenotypes with various case-control ratios (N = 100,000).

a, Case:Control = 1:9. b, Case:Control = 1:19. c, Case:Control = 1:99. N, sample size.

Extended Data Fig. 7 Empirical computation time.

a,b, Step 1 for fitting a null mixed model (a) and Step 2 for association tests (b), respectively, by sample sizes (N) for gene-based tests for 15,342 genes, each containing 50 rare variants. Benchmarking was performed on randomly sub-sampled UK Biobank data with 408,144 White British participants for waist-to-hip ratio. The reported run time was median of five runs with samples randomly selected from the full sample set using different sampling seeds. The reported computation time for EmmaX-SKAT and SMMAT was projected when N > 20,000. As the number of tested markers varies by sample sizes, the computation time was projected for 50 markers per gene for plotting. Numerical data are provided in Supplementary Table 1.

Extended Data Fig. 8 Log-log plot of the estimated run time as a function of number of markers per gene.

Benchmarking was performed on randomly sub-sampled 400,000 UK Biobank data with 408,144 white British participants for waist-to-hip ratio on 15,342 genes. The plotted run time was median of five runs with samples randomly selected from the full sample set using different sampling seeds. The computation time for other different number of markers per gene was projected based on the benchmarked time.

Extended Data Fig. 9 Log-log plots of the estimated run time and memory usage as a function of sample size (N) for genome-wide tests for 286,000 chunks.

a, Run time. b, Memory usage. Each chunk contains 50 variants on average, given that there are 14.3 million markers in the HRC-imputed UK Biobank with MAF ≤ 1% and imputation info score ≥ 0.8. Numerical data are provided in Supplementary Table 1. Benchmarking was performed on randomly sub-sampled UK Biobank data with 408,144 white British participants for waist-to-hip ratio. The plotted run time and memory were medians of five runs with samples randomly selected from the full sample set using different sampling seeds.

Extended Data Fig. 10 Log-log plots of the estimated run time for as a function of sample size (N) for SAIGE-GENE with and without using the robust adjustment.

a, Exome-wide gene-based tests for 15,871 genes. b, Genome-wide tests for 286,000 chunks. Each gene or chunk contains 50 variants on average. Benchmarking was performed on randomly sub-sampled UK Biobank data with 402,163 white British participants tested for glaucoma (PheCode: 365, 4,462 cases and 397,701 controls). The case-control ratio remained the same in subsampled data sets. The reported run time was median of five runs with samples randomly selected from the full sample set using different sampling seeds. As the number of tested markers varies by sample sizes, the computation time was projected for 50 markers per gene for plotting. Numerical data are provided in Supplementary Table 2.

Supplementary information

Supplementary Information

Supplementary Note, Figs. 1–3 and Tables 1–16

Reporting Summary

Rights and permissions

Reprints and permissions

About this article

Cite this article

Zhou, W., Zhao, Z., Nielsen, J.B. et al. Scalable generalized linear mixed model for region-based association tests in large biobanks and cohorts. Nat Genet 52, 634–639 (2020). https://doi.org/10.1038/s41588-020-0621-6

Download citation

Received: 20 March 2019
Accepted: 31 March 2020
Published: 18 May 2020
Issue Date: June 2020
DOI: https://doi.org/10.1038/s41588-020-0621-6

This article is cited by

The clinical relevance of a polygenic risk score for type 2 diabetes mellitus in the Korean population
- Na Yeon Kim
- Haekyung Lee
- Seunggeun Lee
Scientific Reports (2024)
The Born in Guangzhou Cohort Study enables generational genetic discoveries
- Shujia Huang
- Siyang Liu
- Xiu Qiu
Nature (2024)
GWAS for systemic sclerosis identifies six novel susceptibility loci including one in the Fcγ receptor region
- Yuki Ishikawa
- Nao Tanaka
- Chikashi Terao
Nature Communications (2024)
Body mass index stratified meta-analysis of genome-wide association studies of polycystic ovary syndrome in women of European ancestry
- Kharis Burns
- Benjamin H. Mullin
- Bronwyn G. A. Stuckey
BMC Genomics (2024)
Insights into the genetics of menopausal vasomotor symptoms: genome-wide analyses of routinely-collected primary care health records
- Katherine S. Ruth
- Robin N. Beaumont
- Anna Murray
BMC Medical Genomics (2023)