Mixed models can correct for population structure for genomic regions under selection

Sul, Jae Hoon; Eskin, Eleazar

doi:10.1038/nrg2813-c1

Correspondence
Published: 26 February 2013

Nature Reviews Genetics volume 14, page 300 (2013)

An article in this journal by Price et al. (New approaches to population stratification in genome-wide association studies. Nature Reviews Genetics 11, 459–463 (2010))¹ showed through simulations that mixed models² may be susceptible to spurious associations at markers with unusual allele frequency differences between populations, such as markers in regions under selection. The authors attributed these spurious associations, or inflation of test statistics, to the fact that mixed models treat population structure as a random effect, although it is a fixed effect.

After investigating this problem further, we found that modelling population structure as a random effect is not the cause of the inflation; rather, it is the choice of kinship matrix that determines the performance of mixed models. The kinship matrix defines pairwise genetic relatedness among individuals and is usually estimated from all genotyped markers. Because most markers in the simulations carried out by Price et al.¹ have small allele frequency differences between the two populations, a kinship matrix estimated from all markers does not effectively capture the population structure. However, when the kinship matrix is computed from the first principal component or from only the unusually differentiated markers (UDMs), we find that mixed models achieve almost the same performance as EIGENSTRAT³, a method that incorporates population structure as a fixed effect (Table 1). A kinship matrix that captures the same information as the first principal component vector can be obtained by computing the outer product of the vector with itself: ww^T, where w is the first principal component vector. Using this matrix in mixed models is closely related to including the first principal component as a covariate in EIGENSTRAT. As for the kinship matrix generated from UDMs, UDMs are not known in advance, so one may try to detect them using methods that identify markers under selection, one of which was developed by our group. This method, called spatial ancestry analysis (SPA)⁴, correctly identifies the UDMs in the simulations, and mixed models with kinship estimated from the markers detected by SPA show almost the same inflation as those with kinship estimated from the true UDMs.
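The outer-product construction described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the toy genotypes, the population sizes and the names `geno`, `K_all` and `K_pc1` are hypothetical, and the first principal component is taken here as the top eigenvector of a conventional all-marker kinship matrix, which may differ in detail from the procedure used for Table 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: two populations of 50 individuals, 500 markers,
# with small allele frequency differences between the populations.
n_per_pop, n_markers = 50, 500
freq_a = rng.uniform(0.1, 0.9, n_markers)
freq_b = np.clip(freq_a + rng.normal(0, 0.05, n_markers), 0.05, 0.95)
geno = np.vstack([
    rng.binomial(2, freq_a, (n_per_pop, n_markers)),
    rng.binomial(2, freq_b, (n_per_pop, n_markers)),
]).astype(float)

# Standardize each marker, dropping any monomorphic columns.
col_sd = geno.std(axis=0)
keep = col_sd > 0
Z = (geno[:, keep] - geno[:, keep].mean(axis=0)) / col_sd[keep]

# Conventional kinship matrix estimated from all markers.
K_all = Z @ Z.T / Z.shape[1]

# First principal component of the individuals: the eigenvector of K_all
# with the largest eigenvalue (eigh returns eigenvalues in ascending order).
eigvals, eigvecs = np.linalg.eigh(K_all)
w = eigvecs[:, -1]

# Rank-one kinship matrix w w^T built from the first principal component.
K_pc1 = np.outer(w, w)
```

Because `K_pc1` has rank one, a random effect with this covariance varies only along `w`, which is one way to see why it behaves much like a covariate on the first principal component.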

Table 1 Genomic control inflation factor of mixed models with various kinship matrices

Although a kinship matrix estimated from UDMs is effective in capturing broad differences among individuals, it may not capture finer sample structure, such as family structure. One way to solve this problem is to include an additional kinship matrix estimated from markers other than UDMs. The mixed model then has two kinship matrices: one computed from the UDMs and the other computed from the remaining markers. This should remove inflation caused by population structure as well as by other sample structure. We applied this approach to the simulations and found that it removes the inflation on UDMs (Table 1).
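As a rough sketch of the two-kinship idea, the phenotype covariance can be written as a weighted sum of the two kinship matrices plus independent noise. Everything below is an assumption made for illustration: the `kinship` helper, the toy genotypes, the `is_udm` mask (which in practice would come from a selection scan such as SPA) and the variance components, which would normally be estimated from the data (for example by restricted maximum likelihood) rather than fixed by hand.

```python
import numpy as np

def kinship(genotypes):
    """Marker-based kinship: mean outer product of standardized genotypes."""
    sd = genotypes.std(axis=0)
    poly = sd > 0  # drop monomorphic markers
    z = (genotypes[:, poly] - genotypes[:, poly].mean(axis=0)) / sd[poly]
    return z @ z.T / z.shape[1]

rng = np.random.default_rng(1)
n, m = 60, 400
genotypes = rng.binomial(2, rng.uniform(0.1, 0.9, m), (n, m)).astype(float)

# Hypothetical mask marking the first 20 markers as UDMs.
is_udm = np.zeros(m, dtype=bool)
is_udm[:20] = True

K_udm = kinship(genotypes[:, is_udm])    # broad population structure
K_rest = kinship(genotypes[:, ~is_udm])  # finer structure, e.g. families

# Placeholder variance components, chosen arbitrarily for this sketch.
sigma2_udm, sigma2_rest, sigma2_e = 0.3, 0.2, 0.5

# Phenotype covariance implied by a mixed model with two random effects.
V = sigma2_udm * K_udm + sigma2_rest * K_rest + sigma2_e * np.eye(n)
```

Each association test would then use `V` as the covariance of the phenotype when estimating the fixed effect of the tested marker.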

We also investigated whether UDMs cause inflation of test statistics in real genome-wide association studies, using data from the 1966 Northern Finland Birth Cohort (NFBC66)⁵ and the Wellcome Trust Case Control Consortium (WTCCC)⁶. We observed inflation for a few phenotypes, but in no case was the inflation statistically significant (data not shown).
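Inflation of this kind is commonly summarized by the genomic control inflation factor, the quantity reported in Table 1: the median of the observed association statistics divided by the median expected under the null. A minimal sketch follows, assuming 1-degree-of-freedom chi-square statistics; the function name and the simulated statistics are hypothetical and are not data from this study.

```python
import numpy as np

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def genomic_control_lambda(chi2_stats):
    """Ratio of the observed median statistic to its expected null median."""
    return float(np.median(chi2_stats) / CHI2_1DF_MEDIAN)

rng = np.random.default_rng(2)

# Simulated null statistics: squared standard normal draws are chi-square
# distributed with 1 degree of freedom.
null_stats = rng.standard_normal(100_000) ** 2
lam_null = genomic_control_lambda(null_stats)

# Scaling every statistic by 1.2 mimics uniform inflation.
lam_inflated = genomic_control_lambda(1.2 * null_stats)
```

A value close to 1 indicates no inflation, and values above 1 indicate that the statistics are larger than expected under the null.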

In summary, mixed models are equivalent to methods that treat population structure as a fixed effect when an appropriate kinship matrix is used. Mixed models can easily be extended to correct for inflation caused by UDMs, although we did not identify a case in which the phenomenon reported by Price et al.¹ occurs in practice.

References

1. Price, A. L., Zaitlen, N. A., Reich, D. & Patterson, N. New approaches to population stratification in genome-wide association studies. Nature Rev. Genet. 11, 459–463 (2010).
2. Kang, H. M. et al. Variance component model to account for sample structure in genome-wide association studies. Nature Genet. 42, 348–354 (2010).
3. Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genet. 38, 904–909 (2006).
4. Yang, W.-Y. Y., Novembre, J., Eskin, E. & Halperin, E. A model-based approach for analysis of spatial structure in genetic data. Nature Genet. 44, 725–731 (2012).
5. Sabatti, C. et al. Genome-wide association analysis of metabolic traits in a birth cohort from a founder population. Nature Genet. 41, 35–46 (2008).
6. The Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661–678 (2007).


Acknowledgements

J.H.S. and E.E. are supported by US National Science Foundation grants 0513612, 0731455, 0729049, 0916676 and 1065276, and US National Institutes of Health grants K25-HL080079, U01-DA024417, P01-HL30568 and P01-HL28481.

Author information

Authors and Affiliations

Jae Hoon Sul is at the Computer Science Department, University of California, Los Angeles, California 90095, USA.
Eleazar Eskin is at the Department of Human Genetics, University of California, Los Angeles, California 90095, USA.

Corresponding author

Correspondence to Eleazar Eskin.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

About this article

Cite this article

Sul, J., Eskin, E. Mixed models can correct for population structure for genomic regions under selection. Nat Rev Genet 14, 300 (2013). https://doi.org/10.1038/nrg2813-c1

Published: 26 February 2013
Issue Date: April 2013
DOI: https://doi.org/10.1038/nrg2813-c1
