An article in this journal by Price et al. (New approaches to population stratification in genome-wide association studies. Nature Reviews Genetics 11, 459–463 (2010))¹ showed by simulation that mixed models² may be susceptible to spurious associations at markers with unusual allele frequency differences between populations, such as markers in regions under selection. They attributed these spurious associations, or the inflation of test statistics, to the fact that mixed models treat population structure as a random effect although it is a fixed effect.
After investigating this problem further, we found that modelling population structure as a random effect is not the cause of the inflation, and that it is the kinship matrix that determines the performance of mixed models. The kinship matrix defines pairwise genetic relatedness among individuals and is usually estimated from all genotyped markers. Because most markers in the simulations carried out by Price et al.¹ have small allele frequency differences between the two populations, a kinship matrix estimated from all markers does not effectively capture the population structure. However, when the kinship matrix is computed from the first principal component or from only the unusually differentiated markers (UDMs), we find that mixed models achieve almost the same performance as EIGENSTRAT³, a method that incorporates population structure as a fixed effect (Table 1). A kinship matrix that captures the same information as the first principal component vector can be obtained by computing the outer product of that vector with itself, wwᵀ, where w is the first principal component vector. Using this matrix in mixed models is closely related to including the first principal component as a covariate in EIGENSTRAT. As for the kinship matrix generated from UDMs, because UDMs are not known in advance, one may try to detect them with methods that identify markers under selection, one of which was developed by our group. This method, called spatial ancestry analysis (SPA)⁴, correctly identifies the UDMs in the simulations, and mixed models with kinship estimated from the markers detected by SPA show almost the same inflation as mixed models with kinship estimated from the true UDMs.
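The outer-product construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used for Table 1: the function names, the toy two-population genotypes, and the choice of estimating kinship from per-marker standardized genotypes are all assumptions made here, since the text does not specify an estimator.

```python
import numpy as np

def standardize(genotypes):
    """Center and scale each marker (column) of an individuals-by-markers matrix."""
    g = np.asarray(genotypes, dtype=float)
    centered = g - g.mean(axis=0)
    sd = g.std(axis=0)
    sd[sd == 0] = 1.0  # monomorphic markers stay at zero instead of dividing by zero
    return centered / sd

def kinship_all_markers(genotypes):
    """Kinship estimated from all markers (assumed estimator: Z Z^T / m)."""
    z = standardize(genotypes)
    return z @ z.T / z.shape[1]

def kinship_from_first_pc(genotypes):
    """Rank-one kinship w w^T, where w is the first principal component vector."""
    k = kinship_all_markers(genotypes)
    eigenvalues, eigenvectors = np.linalg.eigh(k)  # eigenvalues in ascending order
    w = eigenvectors[:, -1]                        # eigenvector of the largest eigenvalue
    return np.outer(w, w), w

# Hypothetical toy data: two populations with modest allele frequency differences.
rng = np.random.default_rng(0)
n_per_pop, n_markers = 50, 500
freq_a = rng.uniform(0.1, 0.9, n_markers)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.1, n_markers), 0.05, 0.95)
geno = np.vstack([
    rng.binomial(2, freq_a, (n_per_pop, n_markers)),
    rng.binomial(2, freq_b, (n_per_pop, n_markers)),
])

k_pc, w = kinship_from_first_pc(geno)
```

The resulting `k_pc` has rank one, so it carries only the information in `w`; this is why using it as the kinship matrix is closely related to adding the first principal component as a covariate.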
Although using kinship matrices estimated from UDMs is effective in capturing broad differences among individuals, it may not capture finer sample structure, such as family structure. One solution is to include an additional kinship matrix estimated from markers other than the UDMs. The mixed model then contains two kinship matrices: one computed from the UDMs and the other computed from the remaining markers. This would effectively remove inflation caused by population structure and by other sample structure. We applied this approach to the simulations and found that it removes inflation at UDMs (Table 1).
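One way to write the two-kinship model described above is as a variance component model with two random effects. The notation here is an assumption introduced for illustration (the text gives no formula): y is the phenotype vector, X holds the tested marker and any covariates, and K_UDM and K_rest are the kinship matrices computed from the UDMs and from the remaining markers.

```latex
\begin{aligned}
y &= X\beta + u_{\mathrm{UDM}} + u_{\mathrm{rest}} + \varepsilon, \\
u_{\mathrm{UDM}} &\sim \mathcal{N}\!\left(0,\ \sigma^2_{\mathrm{UDM}} K_{\mathrm{UDM}}\right), \qquad
u_{\mathrm{rest}} \sim \mathcal{N}\!\left(0,\ \sigma^2_{\mathrm{rest}} K_{\mathrm{rest}}\right), \qquad
\varepsilon \sim \mathcal{N}\!\left(0,\ \sigma^2_{e} I\right), \\
\operatorname{Var}(y) &= \sigma^2_{\mathrm{UDM}} K_{\mathrm{UDM}} + \sigma^2_{\mathrm{rest}} K_{\mathrm{rest}} + \sigma^2_{e} I .
\end{aligned}
```

Under this sketch, the first component absorbs the broad population structure and the second absorbs finer structure such as relatedness within families.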
We also investigated whether UDMs cause inflation of test statistics in real genome-wide association studies (from the 1966 Northern Finland Birth Cohort (NFBC66)⁵ and the Wellcome Trust Case Control Consortium (WTCCC)⁶). We observed inflation for a few phenotypes, but in no case was it statistically significant (data not shown).
In summary, mixed models are equivalent to methods that treat population structure as a fixed effect when the appropriate kinship matrix is applied. Mixed models can easily be extended to correct for inflation caused by UDMs, although we did not identify a case in which the phenomenon reported by Price et al.¹ occurs in practice.
1. Price, A. L., Zaitlen, N. A., Reich, D. & Patterson, N. New approaches to population stratification in genome-wide association studies. Nature Rev. Genet. 11, 459–463 (2010).
2. Kang, H. M. et al. Variance component model to account for sample structure in genome-wide association studies. Nature Genet. 42, 348–354 (2010).
3. Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genet. 38, 904–909 (2006).
4. Yang, W.-Y. Y., Novembre, J., Eskin, E. & Halperin, E. A model-based approach for analysis of spatial structure in genetic data. Nature Genet. 44, 725–731 (2012).
5. Sabatti, C. et al. Genome-wide association analysis of metabolic traits in a birth cohort from a founder population. Nature Genet. 41, 35–46 (2008).
6. Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661–678 (2007).
J.H.S. and E.E. are supported by US National Science Foundation grants 0513612, 0731455, 0729049, 0916676 and 1065276, and US National Institutes of Health grants K25-HL080079, U01-DA024417, P01-HL30568 and P01-HL28481.
The authors declare no competing financial interests.
Sul, J., Eskin, E. Mixed models can correct for population structure for genomic regions under selection. Nat Rev Genet 14, 300 (2013). https://doi.org/10.1038/nrg2813-c1