Abstract
Population structure causes genome-wide linkage disequilibrium between unlinked loci, leading to statistical confounding in genome-wide association studies. Mixed models have been shown to handle the confounding effects of a diffuse background of large numbers of loci of small effect well, but they do not always account for loci of larger effect. Here we propose a multi-locus mixed model as a general method for mapping complex traits in structured populations. Simulations suggest that our method outperforms existing methods in terms of power as well as false discovery rate. We apply our method to human and Arabidopsis thaliana data, identifying new associations and evidence for allelic heterogeneity. We also show how a priori knowledge from an A. thaliana linkage mapping study can be integrated into our method using a Bayesian approach. Our implementation is computationally efficient, making the analysis of large data sets (n > 10,000) practicable.
This is a preview of subscription content, access via your institution
Access options
Subscribe to this journal
Receive 12 print issues and online access
$209.00 per year
only $17.42 per issue
Buy this article
- Purchase on Springer Link
- Instant access to full article PDF
Prices may be subject to local taxes which are calculated during checkout
Similar content being viewed by others
References
Cardon, L.R. & Palmer, L.J. Population stratification and spurious allelic association. Lancet 361, 598–604 (2003).
Marchini, J., Cardon, L.R., Phillips, M.S. & Donnelly, P. The effects of human population structure on large genetic association studies. Nat. Genet. 36, 512–517 (2004).
Devlin, B. & Roeder, K. Genomic control for association studies. Biometrics 55, 997–1004 (1999).
Pritchard, J.K., Stephens, M., Rosenberg, N.A. & Donnelly, P. Association mapping in structured populations. Am. J. Hum. Genet. 67, 170–181 (2000).
Price, A.L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006).
Yu, J. et al. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat. Genet. 38, 203–208 (2006).
Zhao, K. et al. An Arabidopsis example of association mapping in structured samples. PLoS Genet. 3, e4 (2007).
Henderson, C.R. Application of Linear Models in Animal Breeding (University of Guelph, Guelph, Canada, 1984).
Fisher, R.A. The correlation between relatives on the supposition of Mendelian inheritance. Trans. R. Soc. Edinb. 52, 399–433 (1918).
Kang, H.M. et al. Efficient control of population structure in model organism association mapping. Genetics 178, 1709–1723 (2008).
Kang, H.M. et al. Variance component model to account for sample structure in genome-wide association studies. Nat. Genet. 42, 348–354 (2010).
Atwell, S. et al. Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature 465, 627–631 (2010).
Aulchenko, Y.S., de Koning, D.J. & Haley, C. Genomewide rapid association using mixed model and regression: a fast and simple method for genomewide pedigree-based quantitative trait loci association analysis. Genetics 177, 577–585 (2007).
Zhang, Z. et al. Mixed linear model approach adapted for genome-wide association studies. Nat. Genet. 42, 355–360 (2010).
Yang, J. et al. Genomic inflation factors under polygenic inheritance. Eur. J. Hum. Genet. 19, 807–812 (2011).
Jansen, R.C. Interval mapping of multiple quantitative trait loci. Genetics 135, 205–211 (1993).
Zeng, Z.B. Precision mapping of quantitative trait loci. Genetics 136, 1457–1468 (1994).
Platt, A., Vilhjalmsson, B.J. & Nordborg, M. Conditions under which genome-wide association studies will be positively misleading. Genetics 186, 1045–1052 (2010).
Allen, A.S., Satten, G.A., Bray, S.L., Dudbridge, F. & Epstein, M.P. Fast and robust association tests for untyped SNPs in case-control studies. Hum. Hered. 70, 167–176 (2010).
Dickson, S.P., Wang, K., Krantz, I., Hakonarson, H. & Goldstein, D.B. Rare variants create synthetic genome-wide associations. PLoS Biol. 8, e1000294 (2010).
Cordell, H.J. & Clayton, D.G. A unified stepwise regression procedure for evaluating the relative effects of polymorphisms within a gene using case/control or family data: application to HLA in type 1 diabetes. Am. J. Hum. Genet. 70, 124–141 (2002).
Hoggart, C.J., Whittaker, J.C., De Iorio, M. & Balding, D.J. Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies. PLoS Genet. 4, e1000130 (2008).
Malo, N., Libiger, O. & Schork, N.J. Accommodating linkage disequilibrium in genetic-association analyses via ridge regression. Am. J. Hum. Genet. 82, 375–385 (2008).
Croiseau, P. & Cordell, H.J. Analysis of North American Rheumatoid Arthritis Consortium data using a penalized logistic regression approach. BMC Proc. 3, S61 (2009).
Cho, S. et al. Joint identification of multiple genetic variants via elastic-net variable selection in a genome-wide association analysis. Ann. Hum. Genet. 74, 416–428 (2010).
Wang, D., Eskridge, K.M. & Crossa, J. Identifying QTLs and epistasis in structured plant populations using adaptive mixed LASSO. J. Agric. Biol. Environ. Stat. 16, 170–184 (2011).
Ayers, K.L. & Cordell, H.J. SNP selection in genome-wide and candidate gene studies via penalized logistic regression. Genet. Epidemiol. 34, 879–891 (2010).
Horton, M.W. et al. Genome-wide patterns of genetic variation in worldwide Arabidopsis thaliana accessions from the RegMap panel. Nat. Genet. 44, 212–216 (2012).
Chen, J.H. & Chen, Z.H. Extended Bayesian information criteria for model selection with large model spaces. Biometrika 95, 759–771 (2008).
Astle, W. & Balding, D.J. Population structure and cryptic relatedness in genetic association studies. Stat. Sci. 24, 451–471 (2009).
Sabatti, C. et al. Genome-wide association analysis of metabolic traits in a birth cohort from a founder population. Nat. Genet. 41, 35–46 (2009).
Kathiresan, S. et al. Common variants at 30 loci contribute to polygenic dyslipidemia. Nat. Genet. 41, 56–65 (2009).
Teslovich, T.M. et al. Biological, clinical and population relevance of 95 loci for blood lipids. Nature 466, 707–713 (2010).
Baxter, I. et al. A coastal cline in sodium accumulation in Arabidopsis thaliana is driven by natural variation of the sodium transporter AtHKT1;1. PLoS Genet. 6, e1001193 (2010).
1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073 (2010).
Tibshirani, R. Regression shrinkage and selection via the Lasso. J. R. Stat. Soc., B 58, 267–288 (1996).
Valdar, W., Holmes, C.C., Mott, R. & Flint, J. Mapping in structured populations by resample model averaging. Genetics 182, 1263–1277 (2009).
Tian, F. et al. Genome-wide association study of leaf architecture in the maize nested association mapping population. Nat. Genet. 43, 159–162 (2011).
Stephens, M. & Balding, D.J. Bayesian statistical methods for genetic association studies. Nat. Rev. Genet. 10, 681–690 (2009).
Servin, B. & Stephens, M. Imputation-based analysis of association studies: candidate regions and quantitative traits. PLoS Genet. 3, e114 (2007).
Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning (Springer, New York, 2009).
Kass, R.E. & Raftery, A.E. Bayes Factors. J. Am. Stat. Assoc. 90, 773–795 (1995).
Acknowledgements
We acknowledge the NFBC1966 Study investigators for allowing us to use their phenotype and genotype data in our study. The NFBC1966 Study is conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with the Broad Institute, the University of California, Los Angeles (UCLA), the University of Oulu and the National Institute for Health and Welfare in Finland. This manuscript was not prepared in collaboration with the investigators from the NFBC1966 Study and does not necessarily reflect the opinions or views of these investigators or those at the collaborating institutes. We thank N.B. Freimer and S.K. Service for their help in pre-processing the NFBC1966 data. We would also like to thank P. Forai for excellent information technology and cluster support at GMI, the INRA MIGALE bioinformatics platform for additional computational resources and D.V. Conti, D.J. Balding and S. Srivastava for useful discussions on the topic. Finally, we would like to thank the anonymous reviewers for their helpful comments on the manuscript. This work was supported by grants from the Ecologie des For≖ts, Prairies et milieux Aquatiques (EFPA) department of INRA to V.S. and Deutsche Forschungsgemeinschaft (DFG) to A.K. and by grants from the US National Institutes of Health (P50 HG002790) and the European Union Framework Programme 7 (TransPLANT, grant agreement 283496) to M.N., as well as by the Austrian Academy of Sciences through GMI.
Author information
Authors and Affiliations
Contributions
All authors contributed to designing the study. V.S. and B.J.V. ran the simulations and analyzed the data. V.S., B.J.V. and M.N. wrote the manuscript with input from A.P., A.K., Ü.S. and Q.L.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Supplementary information
Supplementary Text and Figures
Supplementary Table 1, Supplementary Figures 1–11 and Supplementary Note (PDF 1167 kb)
Rights and permissions
About this article
Cite this article
Segura, V., Vilhjálmsson, B., Platt, A. et al. An efficient multi-locus mixed-model approach for genome-wide association studies in structured populations. Nat Genet 44, 825–830 (2012). https://doi.org/10.1038/ng.2314
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1038/ng.2314
This article is cited by
-
Prediction accuracy of genomic estimated breeding values for fruit traits in cultivated tomato (Solanum lycopersicum L.)
BMC Plant Biology (2024)
-
Next-Gen GWAS: full 2D epistatic interaction maps retrieve part of missing heritability and improve phenotypic prediction
Genome Biology (2024)
-
Superior haplotypes of key drought-responsive genes reveal opportunities for the development of climate-resilient rice varieties
Communications Biology (2024)
-
Multi-locus genome-wide association study and genomic prediction for flowering time in chrysanthemum
Planta (2024)
-
Multi-locus genome-wide association studies reveal the dynamic genetic architecture of flowering time in chrysanthemum
Plant Cell Reports (2024)