An article in this journal by Price et al. (New approaches to population stratification in genome-wide association studies. Nature Reviews Genetics 11, 459–463 (2010))1 showed by simulation that mixed models2 may be susceptible to spurious associations at markers with unusual allele frequency differences between populations, such as markers in regions under selection. They attributed these spurious associations, or the inflation of test statistics, to the fact that mixed models treat population structure as a random effect, although it is a fixed effect.

After investigating this problem further, we found that modelling population structure as a random effect is not the cause of the inflation; rather, it is the kinship matrix that determines the performance of mixed models. The kinship matrix defines pairwise genetic relatedness among individuals and is usually estimated from all genotyped markers. Because most markers in the simulations carried out by Price et al.1 have small allele frequency differences between the two populations, a kinship matrix estimated from all markers does not effectively capture the population structure. However, when the kinship matrix is computed from the first principal component or from only the unusually differentiated markers (UDMs), we find that mixed models achieve almost the same performance as EIGENSTRAT3, which is a method that incorporates population structure as a fixed effect (Table 1). A kinship matrix that captures the same information as the first principal component vector can be obtained by computing the outer product of the vector with itself, ww^T, where w is the first principal component vector. Using this matrix in mixed models is closely related to including the first principal component as a covariate in EIGENSTRAT. As for the kinship matrix generated from UDMs, UDMs are not known in advance, so one may try to detect them using methods that identify markers under selection, one of which was developed by our group. This method, called spatial ancestry analysis (SPA)4, correctly identifies the UDMs in the simulations, and mixed models with a kinship matrix estimated from the markers detected by SPA show almost the same inflation as those with a kinship matrix estimated from the true UDMs.
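The rank-one kinship matrix ww^T described above can be illustrated with a minimal sketch. This is not the code used for Table 1. The function names, the toy two-population genotypes, and the choice of a standardized-genotype kinship estimator (XX^T divided by the number of markers) are all assumptions made for illustration.

```python
import numpy as np

def standardize(genotypes):
    """Center and scale each marker column (rows: individuals, columns: markers)."""
    g = np.asarray(genotypes, dtype=float)
    centered = g - g.mean(axis=0)
    sd = g.std(axis=0)
    sd[sd == 0] = 1.0  # monomorphic markers stay as all-zero columns
    return centered / sd

def kinship_all_markers(genotypes):
    """Assumed estimator: K = X X^T / m for standardized genotypes X."""
    x = standardize(genotypes)
    return x @ x.T / x.shape[1]

def kinship_first_pc(genotypes):
    """Return the rank-one matrix w w^T and w, the first principal component."""
    k = kinship_all_markers(genotypes)
    eigvals, eigvecs = np.linalg.eigh(k)  # eigenvalues in ascending order
    w = eigvecs[:, -1]                    # eigenvector of the largest eigenvalue
    return np.outer(w, w), w

# Hypothetical toy data: two populations with slightly different allele frequencies.
rng = np.random.default_rng(0)
n_per_pop, n_markers = 20, 500
freq_a = rng.uniform(0.1, 0.9, n_markers)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.1, n_markers), 0.05, 0.95)
geno = np.vstack([
    rng.binomial(2, freq_a, (n_per_pop, n_markers)),
    rng.binomial(2, freq_b, (n_per_pop, n_markers)),
])

k_pc, w = kinship_first_pc(geno)
```

Because w has unit length, ww^T has rank one and trace one, so it carries only the information in the first principal component and none of the finer relatedness that the all-marker matrix contains.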

Table 1 Genomic control inflation factor of mixed models with various kinship matrices

Although a kinship matrix estimated from UDMs is effective in capturing broad differences among individuals, it may not capture finer sample structure, such as family structure. One approach to solving this problem is to include an additional kinship matrix estimated from markers other than UDMs. The mixed model then has two kinship matrices: one computed from the UDMs and the other computed from the remaining markers. This would effectively remove inflation caused by population structure and by other sample structure. We applied this approach to the simulations and found that it removes inflation on UDMs (Table 1).
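The two-kinship model can be sketched as follows, under stated assumptions. The phenotype covariance is taken to be a weighted sum of the UDM kinship matrix, the kinship matrix from the remaining markers, and an identity noise term. The variance components are fixed, hypothetical values here, whereas in practice they would be estimated (for example by restricted maximum likelihood). All names and the toy data are illustrative, and a marker is tested with a generalized least squares Wald statistic.

```python
import numpy as np

def kinship(genotypes):
    """Assumed estimator: K = X X^T / m for standardized genotypes X."""
    g = np.asarray(genotypes, dtype=float)
    x = g - g.mean(axis=0)
    sd = g.std(axis=0)
    sd[sd == 0] = 1.0  # avoid dividing by zero for monomorphic markers
    x = x / sd
    return x @ x.T / x.shape[1]

def two_kinship_covariance(genotypes, is_udm, var_udm, var_rest, var_noise):
    """V = var_udm * K_udm + var_rest * K_rest + var_noise * I."""
    genotypes = np.asarray(genotypes)
    is_udm = np.asarray(is_udm, dtype=bool)
    k_udm = kinship(genotypes[:, is_udm])
    k_rest = kinship(genotypes[:, ~is_udm])
    n = genotypes.shape[0]
    return var_udm * k_udm + var_rest * k_rest + var_noise * np.eye(n)

def gls_marker_test(y, marker, v):
    """GLS effect estimate and 1-d.f. Wald chi-square for one marker."""
    design = np.column_stack([np.ones(len(y)), marker])
    v_inv_design = np.linalg.solve(v, design)   # V^{-1} X
    xtvx = design.T @ v_inv_design              # X^T V^{-1} X
    beta = np.linalg.solve(xtvx, v_inv_design.T @ y)
    cov_beta = np.linalg.inv(xtvx)
    return beta[1], beta[1] ** 2 / cov_beta[1, 1]

# Hypothetical toy data: the first 10 markers are strongly differentiated (UDMs).
rng = np.random.default_rng(1)
n_per_pop, n_markers, n_udm = 20, 300, 10
freq_a = rng.uniform(0.2, 0.8, n_markers)
freq_b = freq_a.copy()
freq_a[:n_udm], freq_b[:n_udm] = 0.2, 0.8
geno = np.vstack([
    rng.binomial(2, freq_a, (n_per_pop, n_markers)),
    rng.binomial(2, freq_b, (n_per_pop, n_markers)),
])
is_udm = np.arange(n_markers) < n_udm
y = rng.normal(size=2 * n_per_pop)  # phenotype with no true genetic effect

# The variance components below are arbitrary placeholders, not estimates.
v = two_kinship_covariance(geno, is_udm, var_udm=0.3, var_rest=0.3, var_noise=0.4)
effect, chi2 = gls_marker_test(y, geno[:, 0], v)
```

Keeping the two kinship matrices separate lets each receive its own variance component, so the UDM term can absorb population-level differences while the second term accounts for finer relatedness.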

We also investigated whether UDMs cause inflation of test statistics in real genome-wide association studies, using data from the 1966 Northern Finland Birth Cohort (NFBC66)5 and the Wellcome Trust Case Control Consortium (WTCCC)6. We observed inflation for a few phenotypes, but in no case was the inflation statistically significant (data not shown).

In summary, mixed models are equivalent to methods that treat population structure as a fixed effect when the appropriate kinship matrix is applied. Mixed models can easily be extended to correct for inflation caused by UDMs, although we did not find a case in which the phenomenon reported by Price et al.1 occurs in practice.