Mixed models can correct for population structure for genomic regions under selection

An article in this journal by Price et al. (New approaches to population stratification in genome-wide association studies. Nature Reviews Genetics 11, 459–463 (2010))1 showed by simulation that mixed models2 may be susceptible to spurious associations at markers with unusual allele frequency differences between populations, such as markers in regions under selection. They stated that these spurious associations, or inflation of test statistics, arise because mixed models treat population structure as a random effect, although it is a fixed effect.

After investigating this problem further, we found that modelling population structure as a random effect is not the cause of the inflation; rather, it is the choice of kinship matrix that determines the performance of mixed models. The kinship matrix defines pairwise genetic relatedness among individuals and is usually estimated from all genotyped markers. Because most markers in the simulations carried out by Price et al.1 have small allele frequency differences between the two populations, a kinship matrix estimated from all markers does not effectively capture the population structure. However, when the kinship matrix is computed from the first principal component or from only the unusually differentiated markers (UDMs), we find that mixed models achieve almost the same performance as EIGENSTRAT3, a method that incorporates population structure as a fixed effect (Table 1). A kinship matrix that captures the same information as the first principal component vector can be obtained by computing the outer product of the vector with itself: ww^T, where w is the first principal component vector. Use of this matrix in mixed models is closely related to including the first principal component as a covariate in EIGENSTRAT. As for the kinship matrix generated from UDMs, because UDMs are not known in advance, one may try to detect them using methods that identify markers under selection, one of which was developed by our group. This method, called spatial ancestry analysis (SPA)4, correctly identifies the UDMs in the simulations, and mixed models with kinship estimated from the markers detected by SPA show almost the same inflation as mixed models with kinship estimated from the true UDMs.
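The three kinship matrices discussed above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy simulation (two populations, 2,000 markers, 20 of them strongly differentiated) does not reproduce the simulations of Price et al., the standardized-genotype kinship estimator is one common choice rather than necessarily the one used here, and the helper names `kinship` and `structure_gap` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 2 populations of 50 individuals, 2000 markers.
n_per_pop, n_markers, n_udm = 50, 2000, 20
pop = np.repeat([0, 1], n_per_pop)

# Most markers differ only slightly in allele frequency between populations.
base = rng.uniform(0.1, 0.9, n_markers)
shifted = np.clip(base + rng.normal(0.0, 0.02, n_markers), 0.05, 0.95)
freq = np.vstack([base, shifted])
# The first n_udm markers are unusually differentiated (assumed values).
freq[0, :n_udm], freq[1, :n_udm] = 0.2, 0.8

# Genotypes coded 0/1/2, shape (n_individuals, n_markers).
genotypes = rng.binomial(2, freq[pop])


def kinship(G):
    """Average outer product of per-marker standardized genotypes."""
    Z = G - G.mean(axis=0)
    sd = Z.std(axis=0)
    Z = Z[:, sd > 0] / sd[sd > 0]  # drop monomorphic markers
    return Z @ Z.T / Z.shape[1]


def structure_gap(K, labels):
    """Mean within-population kinship minus mean between-population kinship."""
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    return K[same & off_diag].mean() - K[~same].mean()


K_all = kinship(genotypes)             # kinship from all markers
K_udm = kinship(genotypes[:, :n_udm])  # kinship from the UDMs only

# First principal component vector w: top eigenvector of K_all
# (eigh returns eigenvalues in ascending order).
_, eigvecs = np.linalg.eigh(K_all)
w = eigvecs[:, -1]
K_pc1 = np.outer(w, w)                 # rank-one kinship w w^T

print("gap, all markers:", structure_gap(K_all, pop))
print("gap, UDMs only:  ", structure_gap(K_udm, pop))
```

In this toy setting the UDM-based kinship separates the two populations far more sharply than the all-marker kinship, which is the qualitative point made above. The sketch stops at constructing the matrices and does not fit a mixed model.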

Table 1 Genomic control inflation factor of mixed models with various kinship matrices
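The genomic control inflation factor named in this caption is conventionally the median of the observed association statistics divided by the median expected under the null, which is about 0.455 for a chi-square statistic with one degree of freedom. The short Python sketch below shows that convention only; whether the values in Table 1 were computed in exactly this way is an assumption, and the function name and simulated statistics are hypothetical.

```python
import random
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control_lambda(chi2_stats):
    """Median observed 1-df chi-square statistic over its null median."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN


# Hypothetical null statistics: squares of standard normal draws.
random.seed(0)
null_stats = [random.gauss(0.0, 1.0) ** 2 for _ in range(20000)]

lambda_null = genomic_control_lambda(null_stats)
# Scaling every statistic by 1.5 scales the inflation factor by 1.5.
lambda_inflated = genomic_control_lambda([1.5 * s for s in null_stats])

print(lambda_null, lambda_inflated)
```

A value close to 1 indicates no inflation, and values above 1 indicate that test statistics are larger than expected under the null.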

Although using a kinship matrix computed from UDMs is effective in capturing broad differences among individuals, it may not capture finer sample structure, such as family structure. One approach to solving this problem is to include an additional kinship matrix estimated from markers other than UDMs. The mixed model then contains two kinship matrices: one computed from the UDMs and the other computed from the remaining markers. This would effectively remove inflation caused by both population structure and other sample structure. We applied this approach to the simulations and found that it removes inflation at UDMs (Table 1).
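One way to write the two-kinship-matrix model described above is as a mixed model with two variance components. The notation is an assumption introduced here for illustration: y is the phenotype vector, X the fixed-effect covariates including the tested marker, K_UDM and K_rest the kinship matrices computed from the UDMs and from the remaining markers, and the sigma terms the corresponding variance components.

```latex
\begin{aligned}
y &= X\beta + u_{1} + u_{2} + \varepsilon, \\
u_{1} &\sim \mathcal{N}\!\left(0,\ \sigma_{1}^{2} K_{\mathrm{UDM}}\right), \quad
u_{2} \sim \mathcal{N}\!\left(0,\ \sigma_{2}^{2} K_{\mathrm{rest}}\right), \quad
\varepsilon \sim \mathcal{N}\!\left(0,\ \sigma_{e}^{2} I\right), \\
\operatorname{Var}(y) &= \sigma_{1}^{2} K_{\mathrm{UDM}} + \sigma_{2}^{2} K_{\mathrm{rest}} + \sigma_{e}^{2} I .
\end{aligned}
```

Under this formulation the first component can absorb the broad population structure and the second the finer sample structure. Setting the first variance component to zero recovers the usual single-kinship mixed model.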

We also investigated whether UDMs cause inflation of test statistics in real genome-wide association studies (the 1966 Northern Finland Birth Cohort (NFBC66)5 and the Wellcome Trust Case Control Consortium (WTCCC)6). We observed inflation for a few phenotypes, but in no case was the inflation statistically significant (data not shown).

In summary, mixed models are equivalent to methods that consider population structure as a fixed effect when the appropriate kinship matrix is applied. Mixed models can easily be extended to correct for inflation caused by UDMs, although our results failed to identify a case in which the phenomenon reported in Price et al. occurs in practice.

References

  1. Price, A. L., Zaitlen, N. A., Reich, D. & Patterson, N. New approaches to population stratification in genome-wide association studies. Nature Rev. Genet. 11, 459–463 (2010).

  2. Kang, H. M. et al. Variance component model to account for sample structure in genome-wide association studies. Nature Genet. 42, 348–354 (2010).

  3. Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genet. 38, 904–909 (2006).

  4. Yang, W.-Y. Y., Novembre, J., Eskin, E. & Halperin, E. A model-based approach for analysis of spatial structure in genetic data. Nature Genet. 44, 725–731 (2012).

  5. Sabatti, C. et al. Genome-wide association analysis of metabolic traits in a birth cohort from a founder population. Nature Genet. 41, 35–46 (2008).

  6. Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661–678 (2007).

Acknowledgements

J.H.S. and E.E. are supported by US National Science Foundation grants 0513612, 0731455, 0729049, 0916676 and 1065276, and US National Institutes of Health grants K25-HL080079, U01-DA024417, P01-HL30568 and P01-HL28481.

Author information

Corresponding author

Correspondence to Eleazar Eskin.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

About this article

Cite this article

Sul, J., Eskin, E. Mixed models can correct for population structure for genomic regions under selection. Nat Rev Genet 14, 300 (2013). https://doi.org/10.1038/nrg2813-c1
