# A resource-efficient tool for mixed model association analysis of large-scale data

## Abstract

The genome-wide association study (GWAS) has been widely used as an experimental design to detect associations between genetic variants and a phenotype. Two major confounding factors, population stratification and relatedness, could potentially lead to inflated GWAS test statistics and hence to spurious associations. Mixed linear model (MLM)-based approaches can be used to account for sample structure. However, genome-wide association (GWA) analyses in biobank samples such as the UK Biobank (UKB) often exceed the capability of most existing MLM-based tools especially if the number of traits is large. Here, we develop an MLM-based tool (fastGWA) that controls for population stratification by principal components and for relatedness by a sparse genetic relationship matrix for GWA analyses of biobank-scale data. We demonstrate by extensive simulations that fastGWA is reliable, robust and highly resource-efficient. We then apply fastGWA to 2,173 traits on array-genotyped and imputed samples from 456,422 individuals and to 2,048 traits on whole-exome-sequenced samples from 46,191 individuals in the UKB.

## Data availability

The individual-level genotype and phenotype data are available through formal application to the UK Biobank (http://www.ukbiobank.ac.uk). All the summary-level statistics are available at our data portal (http://cnsgenomics.com/software/gcta/#DataResource). Source data for Extended Data Figs. 1–3 are available online.

## Code availability

fastGWA is available at http://cnsgenomics.com/software/gcta/#fastGWA. The fastGWA online tool was built on the code modified from the PheWeb project (https://github.com/statgen/pheweb/).

## Acknowledgements

We thank H. Wang and J. Sidorenko for assistance in data preparation, A. McRae for organizing computing resources, P.-R. Loh for constructive comments on the manuscript, L. Yengo for helpful discussion, the Neale Lab for making the data processing pipelines publicly available, and Alibaba Cloud Australia and New Zealand for hosting the online tool. This research was supported by the Australian Research Council (DP160101343, DP160101056, FT180100186, and FL180100072), the Australian National Health and Medical Research Council (1078037, 1078901, 1113400, and 1107258), and the Sylvia & Charles Viertel Charitable Foundation. This study makes use of data from the UK Biobank (project ID: 12514). A full list of acknowledgements relating to this data set can be found in the Supplementary Note.

## Author information

### Contributions

J.Y. conceived and supervised the study. J.Y., L.J., and Z.Z. designed the experiment. Z.Z. developed the software tools. L.J. and Z.Z. performed the simulations and data analyses with the assistance and guidance of J.Y., P.M.V., T.Q., N.R.W., and K.E.K. P.M.V., N.R.W., and J.Y. contributed resources and funding. L.J. and J.Y. wrote the manuscript with the participation of all authors. All authors reviewed and approved the final manuscript.

### Corresponding author

Correspondence to Jian Yang.

## Ethics declarations

### Competing interests

The authors declare no competing interests.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Extended data

### Extended Data Fig. 1 Comparison between fastGWA-REML and AI-REML.

The phenotypes were simulated based on real genotypes of 100,000 individuals from the UKB with $$V_g = 0.4$$ (see part 5 of the Supplementary Note for details of the simulation method and data). Plotted are the $$\hat \sigma _g^2$$ values estimated by fastGWA-REML against those estimated by AI-REML in GCTA. Each dot represents one simulation replicate (100 simulations in total). Pearson’s correlation coefficient of $$\hat \sigma _g^2$$ between the two methods is > 0.9999.

### Extended Data Fig. 2 Comparison between the approximate and exact fastGWA tests.

We selected four quantitative traits from the UKB for comparison: height (HT, $$n_{\mathrm{HT}}$$ = 455,332), forced expiratory volume in 1 second (FEV, $$n_{\mathrm{FEV}}$$ = 415,931), pulse rate (PR, $$n_{\mathrm{PR}}$$ = 149,082), and educational attainment (EA, $$n_{\mathrm{EA}}$$ = 304,998) (see Supplementary Table 4 for more information about the traits). Plotted are the estimated variant effects (a) or χ² statistics (b) of 8,531,416 variants computed by the exact fastGWA method (fastGWA-Exact) against those computed by the fastGWA test using the GRAMMAR-GAMMA approximation (see part 2 of the Supplementary Note for details). Pearson’s correlation coefficients of the estimated variant effect or χ² statistic between the two methods are > 0.9999 for all four traits.

### Extended Data Fig. 3 The first and second principal components (PC1 and PC2) of all of the UKB participants of European ancestry (n = 456,422) compared to their self-reported ethnicity.

The red dots represent those individuals who self-reported as ‘British’, the green dots represent those who self-reported as ‘Irish’, and the purple dots represent those who self-reported as ‘other-white background’.

### Extended Data Fig. 4 Comparison of $$\hat \sigma _g^2$$ estimated by fastGWA-REML to that estimated by BOLT-REML (used in BOLT-LMM) at different degrees of relatedness in simulations.

The x-axis represents different degrees of relatedness: (0, 0) represents no common environmental effect; (1st, 0.1$$V_p$$) or (1st, 0.2$$V_p$$) represents common environmental effects explaining 10% or 20% of the phenotypic variance ($$V_p$$) among 1st degree relatives; (≥2nd, 0.1$$V_p$$) or (≥2nd, 0.2$$V_p$$) represents common environmental effects explaining 10% or 20% of $$V_p$$ among all pairs of 1st and 2nd degree relatives; and (≥2nd, Gradient) represents common environmental effects explaining 20% of $$V_p$$ among 1st degree relatives and 10% of $$V_p$$ among 2nd degree relatives. The y-axis represents the value of $$\hat \sigma _g^2$$. The black dashed line represents the true simulation parameter ($$h^2 = 0.4$$). Each boxplot represents the distribution of $$\hat \sigma _g^2$$ across 100 simulation replicates. The line inside each box indicates the median value, notches indicate the 95% confidence interval of the median, the central box indicates the interquartile range (IQR), and whiskers indicate data up to 1.5 times the IQR. We also show the Haseman–Elston (HE) regression estimate of $$\sigma _g^2$$ in the fastGWA model, with a gray bar to indicate its expected value computed using the approximation theory presented in part 9 of the Supplementary Note.
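The boxplot elements described in this caption (median, IQR box, whiskers at 1.5 times the IQR) can be illustrated with a minimal Python sketch. The function name `boxplot_summary`, the inclusive quartile method and the toy values are assumptions made for illustration; the plotting software used for the figure may compute quartiles slightly differently, and the notch interval is omitted because the caption does not state its formula.

```python
import statistics


def boxplot_summary(values):
    """Summarise one box: median, quartiles, IQR, whisker ends, outliers."""
    data = sorted(values)
    # Quartiles by linear interpolation over the observed range (assumed method).
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme observations inside the fences.
    whisker_low = min(x for x in data if x >= low_fence)
    whisker_high = max(x for x in data if x <= high_fence)
    outliers = [x for x in data if x < low_fence or x > high_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": whisker_low,
        "whisker_high": whisker_high,
        "outliers": outliers,
    }


# Toy variance-component estimates from six hypothetical replicates.
summary = boxplot_summary([0.38, 0.39, 0.40, 0.41, 0.42, 0.60])
print(summary)
```

In this toy input the value 0.60 lies beyond the upper fence, so it would be drawn as a separate dot rather than covered by the whisker.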

### Extended Data Fig. 5 Comparison of false positive rate (FPR) among different association methods.

We used the simulated data as presented in Figs. 1 and 2 to compute the FPR of each association method across different simulation scenarios with different levels of common environmental effects. Each boxplot represents the distribution of FPR across 100 simulation replicates. The line inside each box indicates the median value, notches indicate the 95% confidence interval of the median, the central box indicates the interquartile range (IQR), whiskers indicate data up to 1.5 times the IQR and outliers are shown as separate dots. In each simulation replicate, the P value of each variant was calculated based on the reported effect estimate and s.e. using a $$\chi _{df = 1}^2$$ test.
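As a rough illustration of the last step in this caption, the following Python sketch converts a reported effect estimate and s.e. into a χ² statistic with one degree of freedom and its P value, then counts the fraction of null variants that pass a significance threshold. The function names and the threshold `alpha = 0.05` are assumptions for illustration; the caption does not state the threshold used for the FPR.

```python
import math


def chi2_1df_p_value(beta, se):
    """Return (chi2, p) for a 1-d.f. chi-squared test of beta / se."""
    z = beta / se
    chi2 = z * z
    # For 1 d.f., P(X > chi2) equals the two-sided normal tail: erfc(|z| / sqrt(2)).
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p


def false_positive_rate(null_p_values, alpha=0.05):
    """Fraction of null variants with P below alpha (alpha is an assumed value)."""
    hits = sum(1 for p in null_p_values if p < alpha)
    return hits / len(null_p_values)


chi2, p = chi2_1df_p_value(beta=0.0196, se=0.01)
print(f"chi2 = {chi2:.4f}, P = {p:.4f}")
```

Using `math.erfc` avoids a dependency on a statistics library, because the upper tail of a χ² distribution with one degree of freedom has this closed form.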

### Extended Data Fig. 6 Genomic inflation and power of fastGWA with the sparse GRM thresholded at different genetic relatedness cut-off values.

This simulation was performed based on real genotypes from the UKB (see simulation settings in part 5 of the Supplementary Note). We constructed different sparse GRMs by setting off-diagonal elements below a certain threshold (varying from 0.03 to 0.10) to 0 and performed fastGWA analyses using these sparse GRMs. Each boxplot represents the distribution of estimates (that is, median λ, or mean χ²) across 100 simulation replicates. The line inside each box indicates the median value, notches indicate the 95% confidence interval of the median, the central box indicates the interquartile range (IQR), and whiskers indicate data up to 1.5 times the IQR.
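The thresholding step described here can be sketched in a few lines of Python. This is only an illustration on a small dense matrix, not the fastGWA implementation: the function name `sparsify_grm`, the dictionary storage and the default threshold of 0.05 are assumptions, and a biobank-scale GRM would not be held in memory as a dense list of lists.

```python
def sparsify_grm(grm, threshold=0.05):
    """Keep the diagonal and off-diagonal entries >= threshold.

    `grm` is a symmetric matrix given as a list of lists. The result stores
    only the lower triangle as {(i, j): value}; every omitted pair is
    treated as having relatedness 0.
    """
    sparse = {}
    for i in range(len(grm)):
        for j in range(i + 1):
            value = grm[i][j]
            if i == j or value >= threshold:
                sparse[(i, j)] = value
    return sparse


# Toy GRM: individuals 0 and 1 are close relatives, individual 2 is unrelated.
toy_grm = [
    [1.00, 0.50, 0.01],
    [0.50, 1.02, 0.04],
    [0.01, 0.04, 0.98],
]
print(sparsify_grm(toy_grm, threshold=0.05))
```

Only the pair (1, 0) survives among the off-diagonal entries, so the number of stored elements grows with the number of related pairs rather than with the square of the sample size.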

### Extended Data Fig. 7 Comparison of genomic inflation and power between fastGWA, fastGWA-LOCO, and fastGWA-Ped.

Shown are the results from the analyses of a simulated data set based on the simulation strategy described in part 5 of the Supplementary Note (with $$\sigma _g^2 = 0.4V_p$$, $$\sigma _c^2 = 0.1V_p$$ or $$0.2V_p$$ for all 1st and 2nd degree relatives, and $$\sigma _c^2 = 0$$ for all unrelated individuals). We did not observe any increase in power when applying the leave-one-chromosome-out (LOCO) scheme to fastGWA. This is because fastGWA uses a sparse GRM to estimate pedigree relatedness, which models the phenotypic covariance between close relatives due to genetic and/or common environmental effects, and the pedigree relatedness estimated using all autosomes is similar to that estimated using 21 chromosomes under the LOCO scheme. Each boxplot represents the distribution of estimates (that is, median λ, or mean χ²) across 100 simulation replicates. The line inside each box indicates the median value, notches indicate the 95% confidence interval of the median, the central box indicates the interquartile range (IQR), and whiskers indicate data up to 1.5 times the IQR.

### Extended Data Fig. 8 Comparison of genomic inflation between BOLT-LMM (estimating the variance components only once using all variants) and BOLT-LMM_fine-tuning (re-estimating the variance components when a chromosome is left out).

The simulation setting was the same as the (0, 0) scenario in Fig. 1. The median λ was computed at the null variants. Each boxplot represents the distribution of median λ across 100 simulation replicates. The line inside each box indicates the median value, notches indicate the 95% confidence interval of the median, the central box indicates the interquartile range (IQR), and whiskers indicate data up to 1.5 times the IQR.
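For readers unfamiliar with the statistic, the sketch below computes a median-based genomic inflation factor from a set of χ² statistics at null variants. It assumes the usual genomic-control definition (observed median divided by the median of a χ² distribution with one degree of freedom, about 0.455); the caption does not spell out the formula, and the function names and toy values are illustrative.

```python
import statistics

# Median of a chi-squared distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(chi2_stats):
    """Median-based lambda: observed median chi2 over its expected null value."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


def mean_chi2(chi2_stats):
    """Mean chi2 statistic across variants."""
    return statistics.fmean(chi2_stats)


# Toy chi2 statistics at hypothetical null variants.
null_chi2 = [0.02, 0.15, 0.46, 1.10, 2.70]
print(f"lambda = {genomic_inflation(null_chi2):.3f}")
print(f"mean chi2 = {mean_chi2(null_chi2):.3f}")
```

A value close to 1 at null variants indicates well-calibrated test statistics, whereas values above 1 suggest inflation.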

### Extended Data Fig. 9 Genomic inflation of BOLT-LMM-Mix using LD score based on different LD window sizes and references.

a, Results from simulations based on the simulated genotype data (part 5 of the Supplementary Note) using the same setting as in the (0, 0) case in Fig. 1. The LD scores were computed from the sample using three window sizes: 1 Mb (BOLT-LMM-Mix_wind-1Mb), 10 Mb (BOLT-LMM-Mix_wind-10Mb), and 20 Mb (BOLT-LMM-Mix_wind-20Mb). b, Results from simulations based on real genotypes (part 5 of the Supplementary Note) using the same settings as in the (0, 0) and (≥2nd, 0.1Vp) cases in Fig. 1. Two sets of LD scores were tested: LD scores computed from the sample using a window size of 1 Mb (BOLT-LMM-Mix_UKB-LDsc) and LD scores obtained from the BOLT-LMM website (BOLT-LMM-Mix_provided-LDsc). Each boxplot represents the distribution of estimates (that is, median λ, or mean χ²) across 100 simulation replicates. The line inside each box indicates the median value, notches indicate the 95% confidence interval of the median, the central box indicates the interquartile range (IQR), and whiskers indicate data up to 1.5 times the IQR.

### Extended Data Fig. 10 Comparison between the reported genetic relatedness and the SNP-derived genetic relatedness of the UKB participants.

The y-axis represents the SNP-derived genetic relatedness computed with GCTA using 565,631 common variants on HapMap3 (175,708 individual pairs with estimated genetic relatedness ≥ 0.05). The x-axis represents the expected genetic relatedness based on the pedigree information provided by the UKB (monozygotic twin, 1; parent-offspring/full sib, 0.5; second degree relatives, 0.25; third degree relatives, 0.125; and unlabelled pair, ‘none’). Each circle represents one pair of relatives, the dashed diagonal line represents y = x, and the red horizontal lines represent the mean value of each relatedness group.

## Supplementary information

### Supplementary Information

Supplementary Figures 1–10, Notes 1–11 and Tables 1–8

## Source data

### Source Data Extended Data Fig. 1

The statistical source data to generate Figure 1.

### Source Data Extended Data Fig. 2

The statistical source data to generate Figure 2.

### Source Data Extended Data Fig. 3

The statistical source data to generate Figure 3.

## Cite this article

Jiang, L., Zheng, Z., Qi, T. et al. A resource-efficient tool for mixed model association analysis of large-scale data. Nat Genet 51, 1749–1755 (2019). https://doi.org/10.1038/s41588-019-0530-8
