Technical Report

Variance component model to account for sample structure in genome-wide association studies

Nature Genetics volume 42, pages 348–354 (2010)

Abstract

Although genome-wide association studies (GWASs) have identified numerous loci associated with complex traits, imprecise modeling of the genetic relatedness within study samples may cause substantial inflation of test statistics and possibly spurious associations. Variance component approaches, such as efficient mixed-model association (EMMA), can correct for a wide range of sample structures by explicitly accounting for pairwise relatedness between individuals, using high-density markers to model the phenotype distribution; but such approaches are computationally impractical. We report here a variance component approach implemented in publicly available software, EMMA eXpedited (EMMAX), that reduces the computational time for analyzing large GWAS data sets from years to hours. We apply this method to two human GWAS data sets, performing association analysis for ten quantitative traits from the Northern Finland Birth Cohort and seven common diseases from the Wellcome Trust Case Control Consortium. We find that EMMAX outperforms both principal component analysis and genomic control in correcting for sample structure.
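The idea summarized in the abstract, modeling the phenotype covariance with a marker-based relatedness matrix and then testing each SNP against that covariance, can be illustrated with a small sketch. This is not the EMMAX software or its exact algorithm. It is a toy example under stated assumptions: the data are simulated (two subpopulations with a phenotype shift and no causal SNP), the variance ratio is estimated once under the null model by a simple maximum-likelihood grid search rather than REML, and every name (`kinship`, `estimate_h2`, `gls_scan`, `lam_naive`, `lam_mixed`) is hypothetical.

```python
import numpy as np

def kinship(genotypes):
    """Marker-based relatedness matrix from an (individuals x SNPs) 0/1/2 matrix."""
    z = genotypes - genotypes.mean(axis=0)
    sd = z.std(axis=0)
    z = z[:, sd > 0] / sd[sd > 0]        # drop monomorphic SNPs, standardize the rest
    return z @ z.T / z.shape[1]

def estimate_h2(y, evals, evecs, grid=np.linspace(0.0, 0.99, 100)):
    """Grid-search ML estimate of h2 in V ~ h2*K + (1-h2)*I, intercept-only null model."""
    n = len(y)
    yt, ot = evecs.T @ y, evecs.T @ np.ones(n)   # rotate so that V becomes diagonal
    best_ll, best_h2 = -np.inf, 0.0
    for h2 in grid:
        d = h2 * evals + (1.0 - h2)              # diagonal of V, up to a scale factor
        mu = (ot / d) @ yt / ((ot / d) @ ot)     # GLS estimate of the mean
        r = yt - mu * ot
        s2 = (r / d) @ r / n                     # profiled-out scale
        ll = -0.5 * (n * np.log(s2) + np.log(d).sum())
        if ll > best_ll:
            best_ll, best_h2 = ll, h2
    return best_h2

def gls_scan(y, genotypes, evals, evecs, h2):
    """Per-SNP squared t statistic (about chi-square, 1 df) with V held fixed."""
    n = len(y)
    w = evecs / np.sqrt(h2 * evals + (1.0 - h2))  # whitening matrix: w.T @ V @ w = I
    yw, cw, gw = w.T @ y, w.T @ np.ones(n), w.T @ genotypes
    stats = np.empty(genotypes.shape[1])
    for j in range(gw.shape[1]):
        x = np.column_stack([cw, gw[:, j]])       # intercept + one SNP
        beta, *_ = np.linalg.lstsq(x, yw, rcond=None)
        resid = yw - x @ beta
        se2 = (resid @ resid / (n - 2)) * np.linalg.inv(x.T @ x)[1, 1]
        stats[j] = beta[1] ** 2 / se2
    return stats

# Simulated data (an assumption of this sketch, not data from the paper).
rng = np.random.default_rng(0)
n_per_pop, m = 100, 500
p = rng.uniform(0.3, 0.7, size=m)
geno = np.vstack([rng.binomial(2, p + 0.1, size=(n_per_pop, m)),
                  rng.binomial(2, p - 0.1, size=(n_per_pop, m))])
pop = np.repeat([0.0, 1.0], n_per_pop)
y = 1.0 * pop + rng.normal(size=2 * n_per_pop)    # no SNP has a true effect

evals, evecs = np.linalg.eigh(kinship(geno))
evals = np.clip(evals, 0.0, None)                 # guard against tiny negative round-off
h2 = estimate_h2(y, evals, evecs)                 # estimated once, reused for every SNP

CHI2_1DF_MEDIAN = 0.4549                          # median of a chi-square with 1 df
lam_naive = np.median(gls_scan(y, geno, evals, evecs, 0.0)) / CHI2_1DF_MEDIAN
lam_mixed = np.median(gls_scan(y, geno, evals, evecs, h2)) / CHI2_1DF_MEDIAN
print(f"h2 = {h2:.2f}, lambda naive = {lam_naive:.2f}, lambda mixed = {lam_mixed:.2f}")
```

Setting `h2` to zero reduces the scan to ordinary least squares, so the two inflation factors compare an uncorrected test with a variance-component test on the same data. Estimating the variance ratio once and reusing it for every SNP is the kind of shortcut that keeps the per-marker cost low. Whether that matches the approximation used by EMMAX is not stated in the abstract.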


Acknowledgements

We thank the NFBC66 team for access to phenotype and genotype data used in the analyses presented here. The genotype data were generated at the Broad Institute with support from National Heart, Lung, and Blood Institute grant 6R01HL087679-03. We thank D. Clayton for reading through the manuscript and for providing important suggestions. We acknowledge the WTCCC for allowing us to use their data set. H.M.K., N.A.Z., J.H.S. and E.E. are supported by National Science Foundation grants 0513612, 0731455 and 0729049, and National Institutes of Health (NIH) grants 1K25HL080079 and U01-DA024417. N.A.Z. is supported by the Microsoft Research Fellowship. H.M.K. is supported by the Samsung Scholarship, National Human Genome Research Institute grant HG00521401, National Institute for Mental Health grant NH084698 and GlaxoSmithKline. C.S. is partially supported by NIH grants GM053275-14, HL087679-01, P30 1MH083268, 5PL1NS062410-03, 5UL1DE019580-03 and 5RL1MH083268-03. N.B.F. and S.K.S. are supported by NIH grants HL087679-03, 5PL1NS062410-03, 5UL1DE019580-03 and 5RL1MH083268-03. This research was supported in part by the University of California, Los Angeles subcontract of contract N01-ES-45530 from the National Toxicology Program and National Institute of Environmental Health Sciences to Perlegen Sciences.

Author information

Author notes

    • Hyun Min Kang
    • Jae Hoon Sul

    These authors contributed equally to this work.

Affiliations

  1. Center for Statistical Genetics, Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA.

    • Hyun Min Kang

  2. Center for Computational Medicine and Bioinformatics, The University of Michigan Medical School, Ann Arbor, Michigan, USA.

    • Hyun Min Kang

  3. Computer Science Department, University of California, Los Angeles, California, USA.

    • Jae Hoon Sul
    • Eleazar Eskin

  4. Center for Neurobehavioral Genetics, University of California, Los Angeles, California, USA.

    • Susan K Service
    • Sit-yee Kong
    • Nelson B Freimer

  5. Department of Epidemiology and Biostatistics, Harvard School of Public Health, Boston, Massachusetts, USA.

    • Noah A Zaitlen

  6. Department of Health Research and Policy, Stanford University School of Medicine, Stanford, California, USA.

    • Chiara Sabatti

  7. Department of Human Genetics, University of California, Los Angeles, California, USA.

    • Eleazar Eskin

Authors

  1. Hyun Min Kang

  2. Jae Hoon Sul

  3. Susan K Service

  4. Noah A Zaitlen

  5. Sit-yee Kong

  6. Nelson B Freimer

  7. Chiara Sabatti

  8. Eleazar Eskin

Contributions

H.M.K., J.H.S., C.S. and E.E. designed the methods and experiments; H.M.K., J.H.S., S.K.S., S.-y.K., N.B.F., C.S. and E.E. jointly analyzed the NFBC66 data set; H.M.K., J.H.S., N.A.Z., C.S. and E.E. jointly analyzed the WTCCC data set; H.M.K., J.H.S., S.K.S., N.B.F., C.S. and E.E. wrote the manuscript; all authors contributed their critical reviews of the manuscript during its preparation.

Competing interests

The authors declare no competing financial interests.

Corresponding authors

Correspondence to Chiara Sabatti or Eleazar Eskin.

Supplementary information

PDF files

  1. Supplementary Text and Figures

    Supplementary Tables 1–3, Supplementary Figures 1–6 and Supplementary Note

About this article

DOI

https://doi.org/10.1038/ng.548
