Abstract
Although genome-wide association studies (GWASs) have identified numerous loci associated with complex traits, imprecise modeling of the genetic relatedness within study samples may cause substantial inflation of test statistics and possibly spurious associations. Variance component approaches, such as efficient mixed-model association (EMMA), can correct for a wide range of sample structures by explicitly accounting for pairwise relatedness between individuals, using high-density markers to model the phenotype distribution; but such approaches are computationally impractical. We report here a variance component approach implemented in publicly available software, EMMA eXpedited (EMMAX), that reduces the computational time for analyzing large GWAS data sets from years to hours. We apply this method to two human GWAS data sets, performing association analysis for ten quantitative traits from the Northern Finland Birth Cohort and seven common diseases from the Wellcome Trust Case Control Consortium. We find that EMMAX outperforms both principal component analysis and genomic control in correcting for sample structure.
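The general idea of a variance component association test, and of the genomic control inflation factor it is compared against, can be illustrated with a minimal NumPy sketch. This is not the EMMAX implementation: the function names (`marker_relatedness`, `snp_wald_chi2`, `genomic_inflation`), the standardized-marker relatedness estimate, the simulated data, and the hand-supplied variance ratio `h2` are all assumptions made for illustration. In practice the variance components would be estimated from the data (for example by restricted maximum likelihood) rather than fixed by hand.

```python
import numpy as np

def marker_relatedness(G):
    """Pairwise relatedness from an (individuals x markers) 0/1/2 genotype matrix.

    Assumption: relatedness is estimated as the average product of
    standardized genotypes, one common marker-based estimator.
    """
    p = G.mean(axis=0) / 2.0                      # allele frequency per marker
    keep = (p > 0.0) & (p < 1.0)                  # drop monomorphic markers
    Z = (G[:, keep] - 2.0 * p[keep]) / np.sqrt(2.0 * p[keep] * (1.0 - p[keep]))
    return Z @ Z.T / Z.shape[1]

def snp_wald_chi2(y, snp, K, h2):
    """Wald chi-square (1 df) for one SNP under y = X b + u + e.

    Var(y) is taken proportional to V = h2 * K + (1 - h2) * I, where h2 is an
    ASSUMED, user-supplied variance ratio (not estimated here).
    """
    n = y.shape[0]
    V = h2 * K + (1.0 - h2) * np.eye(n)
    L = np.linalg.cholesky(V)                     # V = L L^T
    X = np.column_stack([np.ones(n), snp])        # intercept + SNP dosage
    Xw = np.linalg.solve(L, X)                    # whiten design matrix
    yw = np.linalg.solve(L, y)                    # whiten phenotype
    beta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)  # GLS via OLS on whitened data
    resid = yw - Xw @ beta
    sigma2 = resid @ resid / (n - X.shape[1])
    cov = sigma2 * np.linalg.inv(Xw.T @ Xw)
    return float(beta[1] ** 2 / cov[1, 1])

def genomic_inflation(chi2_stats):
    """Genomic control lambda: median statistic over the chi-square(1) median."""
    return float(np.median(chi2_stats) / 0.4549364)

# Simulated toy data (hypothetical; unrelated individuals, null phenotype).
rng = np.random.default_rng(0)
G = rng.binomial(2, 0.3, size=(50, 200)).astype(float)
y = rng.normal(size=50)

K = marker_relatedness(G)
stats = np.array([snp_wald_chi2(y, G[:, j], K, h2=0.5) for j in range(20)])
lam = genomic_inflation(stats)
print(f"lambda over {stats.size} SNPs: {lam:.3f}")
```

An inflation factor well above 1 across genome-wide statistics is the usual sign of uncorrected sample structure. The sketch tests only 20 markers on simulated data, so the value it prints is illustrative and should not be read as a result.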
Acknowledgements
We thank the NFBC66 team for access to phenotype and genotype data used in the analyses presented here. The genotype data were generated at the Broad Institute with support from National Heart, Lung, and Blood Institute grant 6R01HL087679-03. We thank D. Clayton for reading through the manuscript and for providing important suggestions. We acknowledge the WTCCC for allowing us to use their data set. H.M.K., N.A.Z., J.H.S. and E.E. are supported by National Science Foundation grants 0513612, 0731455 and 0729049, and National Institutes of Health (NIH) grants 1K25HL080079 and U01-DA024417. N.A.Z. is supported by the Microsoft Research Fellowship. H.M.K. is supported by the Samsung Scholarship, National Human Genome Research Institute grant HG00521401, National Institute for Mental Health grant NH084698 and GlaxoSmithKline. C.S. is partially supported by NIH grants GM053275-14, HL087679-01, P30 1MH083268, 5PL1NS062410-03, 5UL1DE019580-03 and 5RL1MH083268-03. N.B.F. and S.K.S. are supported by NIH grants HL087679-03, 5PL1NS062410-03, 5UL1DE019580-03 and 5RL1MH083268-03. This research was supported in part by the University of California, Los Angeles subcontract of contract N01-ES-45530 from the National Toxicology Program and National Institute of Environmental Health Sciences to Perlegen Sciences.
Author information
Contributions
H.M.K., J.H.S., C.S. and E.E. designed the methods and experiments; H.M.K., J.H.S., S.K.S., S.-y.K., N.B.F., C.S. and E.E. jointly analyzed the NFBC66 data set; H.M.K., J.H.S., N.A.Z., C.S. and E.E. jointly analyzed the WTCCC data set; H.M.K., J.H.S., S.K.S., N.B.F., C.S. and E.E. wrote the manuscript; all authors contributed their critical reviews of the manuscript during its preparation.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Supplementary information
Supplementary Text and Figures
Supplementary Tables 1–3, Supplementary Figures 1–6 and Supplementary Note (PDF 2666 kb)
Cite this article
Kang, H., Sul, J., Service, S. et al. Variance component model to account for sample structure in genome-wide association studies. Nat Genet 42, 348–354 (2010). https://doi.org/10.1038/ng.548
DOI: https://doi.org/10.1038/ng.548