Detect and adjust for population stratification in population-based association study using genomic control markers: an application of Affymetrix Genechip® Human Mapping 10K array

Hao, Ke; Li, Cheng; Rosenow, Carsten; Wong, Wing H

doi:10.1038/sj.ejhg.5201273

Article
Published: 15 September 2004

European Journal of Human Genetics volume 12, pages 1001–1006 (2004)

Abstract

Population-based association design is often compromised by false or nonreplicable findings, partially due to population stratification. Genomic control (GC) approaches were proposed to detect and adjust for this confounder. To date, the performance of this strategy has not been extensively evaluated on real data. More than 10 000 single-nucleotide polymorphisms (SNPs) were genotyped on subjects from four populations (an Asian, an African-American and two Caucasian populations) using the GeneChip^® Mapping 10 K array. On these data, we tested the performance of two GC approaches in different scenarios, including various numbers of GC markers and different degrees of population stratification. In the scenario of substantial population stratification, both GC approaches are sensitive using only 20–50 random SNPs, and the mixed subjects can be separated into homogeneous subgroups. In the scenario of moderate stratification, both GC approaches have poor sensitivity. However, the bias in the association test can still be corrected even when no statistically significant population stratification is detected. We conducted extensive benchmark analyses of GC approaches using SNPs across the whole human genome. We found that the GC method can cluster subjects into homogeneous subgroups if there is a substantial difference in genetic background. The inflation factor, estimated from GC markers, can effectively adjust for the confounding effect of population stratification regardless of its extent. We also suggest that as few as 50 random SNPs with heterozygosity >40% should be sufficient as genomic controls.

Introduction

In theory, for an equivalent sample size, association tests are far more powerful than pedigree-based linkage studies in searching for genomic regions underlying human diseases.¹ The basic idea behind an association test is that the disease alleles are more frequent in ascertained cases than in controls. Markers physically close to the disease loci will also be detected because of linkage disequilibrium (LD). However, the application of this approach is compromised by false or nonreplicable findings,² partially due to population stratification, which causes unlinked markers to show association with the phenotype.^{3, 4} Recent population admixture can also bias association tests; an example is the spurious association between immunoglobulin haplotype Gm^3,5,13,14 and NIDDM in the Gila River Indian Community.⁵ The study was confounded by the subjects' degree of Caucasian genetic heritage.⁶

To overcome this serious danger, a correction strategy has been proposed.^{6, 7} It requires genotyping additional unlinked markers, often called ‘genomic control (GC) markers’, as the cost of detecting and correcting possible confounders. Under the assumption of no association between GC markers and phenotype and no population stratification, the χ² statistic of the association test between the ith GC marker and case–control status, denoted as Y_i², follows a χ² distribution with one degree of freedom under an additive genetic model. The sum of the χ² statistics of n GC markers, denoted as Y_n², then follows a χ² distribution with n degrees of freedom, which gives a straightforward test for the presence of population stratification. Furthermore, we assume the test statistic is inflated by a factor λ, so that Y_n²/λ∼χ_n². If we assume λ is constant for all loci, we can then use it to adjust for population stratification. One robust way to estimate the inflation factor is:

λ̂ = median{Y_i²: i = 1, …, n}/0.456

where 0.456 is simply the median of the χ² distribution with one degree of freedom.⁸ We denote this method as the combined χ² approach in this paper. An alternative method, proposed by Pritchard et al,⁹ tackles this problem in two steps. First, the GC markers are used to separate the study subjects into genetically homogeneous subgroups; second, the association tests are conducted within each subgroup.
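
The combined χ² approach described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the per-marker statistics are assumed to be already computed, and the tail probability uses a closed form that is exact only for an even number of markers (as with n = 20 or 50).

```python
import math
from statistics import median

CHI2_1DF_MEDIAN = 0.456  # median of the 1-df chi-square, as quoted in the text


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square with even df.

    Uses the exact closed form exp(-x/2) * sum_{k < df/2} (x/2)^k / k!,
    which holds only when df is even.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2.0
    term = 1.0   # (x/2)^k / k! for k = 0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


def stratification_test(marker_stats):
    """Sum the per-marker statistics into Y_n^2 and return (Y_n^2, P-value)."""
    y_n = sum(marker_stats)
    return y_n, chi2_sf_even_df(y_n, len(marker_stats))


def inflation_factor(marker_stats):
    """Median-based estimate of lambda from the GC marker statistics."""
    return median(marker_stats) / CHI2_1DF_MEDIAN
```

A small P-value from `stratification_test` suggests stratification among the pooled subjects, while `inflation_factor` gives the value by which later association statistics would be divided.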

To date, the performance of the GC approaches has not been examined extensively in real genotype data. Previous studies were based on simulated data or a small number of GC markers. The Affymetrix Mapping 10 K array has recently become available, offering the ability to genotype more than 10 000 single-nucleotide polymorphisms (SNPs) across the human genome in a timely manner.¹⁰ Using this technology, we evaluated the current genomic control approaches in testing and controlling for population stratification.

Methods

Study subjects

Four groups of subjects were used in the current study: (1) 20 Asians, (2) 42 African-Americans, (3) 42 Caucasians collected by the Coriell Institute and (4) 54 Caucasian subjects collected from Utah, USA. The DNA samples of groups (1)–(3) were purchased from the Coriell Institute, and the group (4) samples were collected by the Centre d'Etude du Polymorphisme Humain (CEPH) research laboratory. All subjects were unrelated individuals and remained anonymous to the authors.

Genotyping

A total of 250 ng of genomic DNA from each subject was digested with XbaI at 37°C for 4 h. The DNA fragments were ligated to a universal adaptor and then PCR-amplified with a common primer. The amplicon was cleaved into shorter fragments by partial DNase I digestion and labeled with biotinylated ddATP using terminal deoxynucleotidyl transferase. The labeled DNA was injected into the microarray cartridge and incubated overnight. The hybridized microarray was washed and stained following a three-step protocol, and was scanned according to the manufacturer's directions (Affymetrix). Finally, the genotype was determined using automated scoring software (Affymetrix). The detailed genotyping procedure has been described previously.¹⁰ This data set has been made available to the public at http://www.affymetrix.com/support/developer/resource_center/index.affx?terms=no.

Statistical analysis

Only autosomal markers were used in the analysis. First, we compared the allele frequencies and heterozygosities of the genotyped SNPs among populations. Second, we evaluated the performance of the genomic control method in detecting population stratification through an iterative procedure. We pooled genotypes from different groups together (for example, Asians and African-Americans) and attempted to detect this mixture using the GC approach. In each iteration, we randomly selected n=20 or 50 SNPs from the data set, calculated Y_n² in the combined χ² approach, and tested for population stratification. Overall, 10 000 iterations were carried out, and we summarized the power as Pr(P<0.05), the proportion of iterations in which the test P-value fell below 0.05. Furthermore, we assessed the degree of bias that would arise in association tests if the underlying population stratification were ignored. Armitage's trend test for the additive model was used.⁸ We randomly assigned a fraction (0, 25, 50, 75 or 100%) of each ethnic group to be disease affected, pooled two groups together, and tested for disease–SNP association. This simulation procedure was repeated 10 000 times, and we recorded the rejection rate at the 0.05 level. Upon observing substantial population stratification, we also estimated the inflation factor (λ) and calculated the rejection rate again while controlling for population stratification.
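
The null simulation described above can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names, the 0/1/2 genotype coding, the choice of one random SNP per iteration and the default of 1000 iterations are all assumptions made for the example. The trend statistic is computed as N·r², where r is the correlation between genotype score and case status, which is one standard form of Armitage's trend test for the additive model.

```python
import math
import random


def trend_statistic(genotypes, status):
    """Armitage trend statistic (additive coding 0/1/2) computed as N * r^2."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(status) / n
    s_gg = sum((g - mean_g) ** 2 for g in genotypes)
    s_yy = sum((y - mean_y) ** 2 for y in status)
    if s_gg == 0 or s_yy == 0:
        return 0.0  # monomorphic marker or no cases/controls: no evidence
    s_gy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, status))
    return n * s_gy * s_gy / (s_gg * s_yy)


def p_value_1df(stat):
    """Upper-tail probability of a 1-df chi-square statistic."""
    return math.erfc(math.sqrt(stat / 2.0))


def assign_status(group_size, fraction_affected, rng):
    """Randomly label a fixed fraction of one group as affected (1)."""
    n_cases = round(group_size * fraction_affected)
    status = [1] * n_cases + [0] * (group_size - n_cases)
    rng.shuffle(status)
    return status


def null_rejection_rate(snps_a, snps_b, frac_a, frac_b,
                        n_iter=1000, alpha=0.05, seed=0):
    """Rejection rate when status depends on group membership only.

    snps_a[j] and snps_b[j] hold the genotypes of SNP j in groups A and B.
    """
    rng = random.Random(seed)
    size_a, size_b = len(snps_a[0]), len(snps_b[0])
    rejections = 0
    for _ in range(n_iter):
        status = (assign_status(size_a, frac_a, rng)
                  + assign_status(size_b, frac_b, rng))
        j = rng.randrange(len(snps_a))   # one random SNP per iteration
        genotypes = snps_a[j] + snps_b[j]
        if p_value_1df(trend_statistic(genotypes, status)) < alpha:
            rejections += 1
    return rejections / n_iter
```

With equal affected fractions in both groups the design is matched on ethnicity; unequal fractions introduce the confounding that the rejection rate is meant to expose.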

We also applied Pritchard's approach, a model-based clustering method that uses unlinked SNPs to infer population structure and assign individuals to clusters.⁹ The method is implemented in the software STRUCTURE (version 2), which was downloaded from http://pritch.bsd.uchicago.edu. We evaluated this method by pooling two ethnic groups together and running STRUCTURE to detect the population structure using 50 or 500 GC SNPs.

Results

A total of 158 unrelated individuals from four ethnic groups were genotyped on 10 043 SNP markers by the array, with an overall call rate of 96.4%. These SNPs are fairly polymorphic in our study samples, and the average heterozygosity (>40%) and allele frequency (>20%) of the SNPs were similar across all four ethnic groups (Table 1). Using the combined χ² approach, we found that 10–20 SNPs were sufficient to detect population stratification in the Asian-Caucasian, Asian-African American and Caucasian-African American mixtures at the nominal 0.05 level (Table 2). The power to reject the null hypothesis (no population stratification) was over 95% in these cases using only 10 genomic control SNPs. However, when mixing the Caucasian subjects collected by the Coriell Institute with those collected by CEPH, we had limited power to detect the stratification even using 50 random SNPs (Table 2). One possible interpretation is that there was no significant population stratification between these two groups of Caucasian subjects, so that association tests could be conducted without adjustment. However, we observed substantial bias in the test when any two ethnic groups were mixed (Table 3). Even when mixing the two groups of Caucasian subjects, the rejection rate could exceed 20% under the null hypothesis (Table 3). It should be noted that in the 0.5/0.5 situation of Table 3, we simulated case–control studies matched on ethnicity. When the sample size is small to moderate, using the asymptotic χ² distribution tends to yield overestimated P-values and results in a conservative test.¹¹ The asymptotic P-value is accurate only when the sample size becomes large.¹¹ As a consequence, in the 0.5/0.5 column of Table 3, the rejection rates were slightly less than the α level, except when mixing the two Caucasian groups, where population stratification was less severe.
Upon observing strong bias in marker–disease association testing when population stratification was ignored, we used the estimated inflation factor (λ) to adjust the association tests and obtained the correct rejection rate (Table 4). In addition, we simulated situations of mixing two ethnic groups (eg Asian and African-American) in which one group was matched in cases and controls but the other group was mismatched. In this case, we also observed an elevated false-positive rate caused by population stratification, which could be appropriately adjusted for using the χ² genomic control method.
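
As an illustration of the adjustment step, the sketch below divides a candidate SNP's 1-df statistic by the median-based λ estimated from the GC markers and recomputes the P-value. The function name and argument layout are hypothetical. The floor λ ≥ 1 follows the constraint the authors mention in the Discussion, and the constant 0.456 is the median quoted in the Introduction.

```python
import math
from statistics import median


def gc_adjusted_p_value(stat, gc_stats, chi2_median=0.456):
    """Return (adjusted P-value, lambda) for one candidate SNP.

    stat     : 1-df association statistic of the candidate SNP
    gc_stats : 1-df statistics of the genomic control markers
    """
    # Median-based inflation factor, floored at 1 so that the
    # adjustment never makes a test more liberal.
    lam = max(1.0, median(gc_stats) / chi2_median)
    adjusted = stat / lam
    # Upper tail of a 1-df chi-square via the complementary error function.
    return math.erfc(math.sqrt(adjusted / 2.0)), lam
```

When the GC markers show no inflation, λ stays at 1 and the P-value is unchanged; with inflated GC statistics the same candidate statistic yields a larger, more conservative P-value.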

Table 1 Mean heterozygosity and allele frequency of the SNPs among study subjects

Table 2 Power of testing for population stratification at 0.05 level^*

Table 3 Rejection rate of association test under the null hypothesis at 0.05 level^a

Table 4 Controlling for population stratification with 50 unlinked markers^*

In the cases of Asian-Caucasian, Asian-African American and Caucasian-African American pooling, the STRUCTURE software could easily separate the two groups using 50 random SNPs. In contrast, when we combined the two groups of Caucasian samples, the software failed to detect the population stratification even using 500 SNPs. Because the computation time of this method is fairly long, we carried out only 20 runs using different sets of 500 random SNPs, and in none of the 20 runs were the subpopulations detected.

Discussion

In this paper, we evaluated two different strategies for detecting and controlling for population stratification using real data. We surveyed the cases of (1) pooling genetically distant populations, such as Asians and Caucasians, and (2) pooling genetically similar populations, such as two groups of Caucasian samples. In case (1), both strategies were able to detect the stratification with a small number of GC SNPs. In case (2), however, the sensitivity of both strategies was low even with hundreds of SNPs. The inflation factor (λ) in the combined χ² approach can correctly adjust for the confounding effect even when the population stratification is statistically nonsignificant. In contrast, no adjustment can be attempted in Pritchard's approach when subpopulation structure is not detected.

In the past decade, we have witnessed the rise of family-based designs as an alternative to population-based studies.⁸ The motivation is that family-based designs are protected from population stratification by their nature.¹² This protection comes at a cost: (1) family-based samples are more difficult to collect, and (2) for the same number of genotypes, family-based tests are less powerful than their population-based counterparts.¹³ Furthermore, the genomic control approaches have made population-based studies at least comparable to family-based tests.

We can generally consider two scenarios. (1) In the situation of substantial population stratification, Pritchard's method appears to be the most attractive. It can cluster subjects into homogeneous groups, and tests can be conducted within these groups. We should note that, in this case, the gene–disease association could differ considerably among ethnic groups in both magnitude and direction because of their distinct genetic backgrounds. A family-based approach can only detect the average genetic effect, and could therefore miss the association if its directions are opposite among subpopulations. (2) In the situation of subtle population stratification, such as two groups of Caucasian subjects collected from different geographical regions, Pritchard's method showed limited sensitivity in detecting the stratification. Fortunately, the combined χ² approach can still appropriately adjust for the confounding effect using the estimated λ. Our results suggest that the adjustment is fairly accurate across various degrees of population stratification. Simulation studies have also shown that, in this case, a population-based study with GC adjustment is statistically more powerful than family-based tests.⁸

The degree of population stratification varies among genetic markers. Some markers carry similar allele frequencies across populations, whereas others are ethnic specific. When the two underlying populations are not separable (ie, the scenario of mixing two Caucasian samples), we have to estimate λ from a number of random SNPs and apply it as a constant. By this means, we controlled the overall false-positive rate across all SNPs. However, for each individual SNP, its degree of population stratification could be under- or overestimated. The advantage of the χ² approach is that it can adjust for population stratification even when the underlying populations are not separable. Its drawback is that, when the underlying populations can be separated, it is not the most efficient way to adjust for the stratification; the STRUCTURE strategy is more reasonable in that scenario. The disadvantage of the STRUCTURE strategy is that, when the underlying populations are not separable, it simply cannot provide any adjustment.

It should be noted that, in the situation of subtle population stratification, both the combined χ² approach and Pritchard's method showed limited power to detect it. However, we may still suffer from severe bias if the association test is performed without adjustment. In this study, we found that the inflation factor can address this potential danger, even when no significant subpopulations are detected. Hence, adjusting the test using λ is recommended in a population-based association study if GC data are available. In addition, we set λ≥1, which partly explains why, in Table 4a, the corrected rejection rate under the null hypothesis is slightly less than 5%.

‘How many GC markers should we use?’ is always an intriguing question, and different suggestions have been made.^{7, 8, 9} The answer depends strongly on the population structure, the sample size and the heterozygosity of the genomic control markers, and these parameters vary greatly across studies. Thus, no simple cutoff can be suggested. In this study, we investigated the impact of the allele frequencies of genomic control SNPs on the power to detect stratification. We found that frequent SNPs provide much greater power than infrequent SNPs. In our data, SNPs with a minor allele frequency of less than 15% offer nearly no power. Here, 50 SNPs with an average heterozygosity of around 40% provided great power to detect population stratification and to make an appropriate adjustment. As demonstrated in Table 4b, the variance of λ is fairly small, which means that whichever 50 SNPs on the genome we choose as GC markers, they will lead to similar estimates of λ. With the rapid progress of biotechnology, the cost of SNP genotyping has been reduced greatly. Genotyping a set of 50 SNPs in a population-based study is no longer a major financial or technological hurdle (in comparison to collecting family samples); moreover, GC methods will provide valuable and often necessary adjustments. The GC approach can also be applied to family-based tests. When the direction of association is opposite among subpopulations, family-based tests may lead to a mistaken null finding. Since GC markers can separate the sample into genetically homogeneous subgroups, conducting family-based tests within these groups is arguably more appropriate and powerful.
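
One way to act on this marker-selection guidance is sketched below: estimate each SNP's allele frequency, keep SNPs whose heterozygosity exceeds a threshold, and draw a random set of 50. This is an illustrative sketch only. The function names, the 0/1/2 genotype coding with `None` for missing calls, and the use of expected heterozygosity 2p(1−p) (rather than observed heterozygote counts) are assumptions, not details taken from the paper.

```python
import random


def allele_frequency(genotypes):
    """Frequency of the counted allele; genotypes are 0/1/2, None = no call."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    return sum(called) / (2 * len(called))


def expected_heterozygosity(p):
    """Expected heterozygosity 2p(1-p) under Hardy-Weinberg proportions."""
    return 2.0 * p * (1.0 - p)


def pick_gc_markers(snp_genotypes, n_markers=50, min_het=0.40, seed=0):
    """Randomly choose n_markers SNP names with heterozygosity above min_het.

    snp_genotypes maps a SNP name to the list of genotypes across subjects.
    """
    eligible = []
    for name in sorted(snp_genotypes):
        p = allele_frequency(snp_genotypes[name])
        if p is not None and expected_heterozygosity(p) > min_het:
            eligible.append(name)
    if len(eligible) < n_markers:
        raise ValueError("not enough SNPs pass the heterozygosity filter")
    return random.Random(seed).sample(eligible, n_markers)
```

Under this definition a heterozygosity above 40% corresponds to a minor allele frequency of roughly 28% or more, so the filter also excludes the low-frequency SNPs that the text reports as uninformative.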

In another setting, where we would like to examine whether a group of individuals belongs to a certain population (eg Asian), using known ethnic-specific markers would be more powerful and efficient than using random markers. To date, a number of these ethnic-specific markers have been characterized in many populations.¹⁴

After our initial submission to the European Journal of Human Genetics, two papers on this topic were published in Nature Genetics.^{11, 15} Here, we take the opportunity of manuscript revision to compare the designs and results of these studies. Freedman et al¹⁵ utilized multiple populations of moderate-to-large sample size, but typed only a few dozen markers on each sample. Their results were similar to ours: subtle population stratification is not detectable with adequate power using χ² methods, yet it still increases the likelihood of false positives. Unfortunately, Freedman et al examined neither the use of λ in adjusting association tests nor the STRUCTURE strategy in detail. Similar to our study, Marchini et al¹¹ typed a large number of SNPs on small-to-moderate sample sizes. Using a Bayesian method,^{11, 15} they found substantial stratification among Asian, White and Black subjects, and a much smaller difference between Chinese and Japanese subjects. Furthermore, Marchini et al¹¹ simulated large cohorts and investigated the impact of sample size on association tests in the context of population stratification. Their results agree well with our findings.

References

1. Risch N, Merikangas K: The future of genetic studies of complex human diseases. Science 1996; 273: 1516–1517.
2. Weiss ST, Silverman EK, Palmer LJ: Case–control association studies in pharmacogenetics. Pharmacogenom J 2001; 1: 157–158.
3. Lander ES, Schork NJ: Genetic dissection of complex traits. Science 1994; 265: 2037–2048.
4. Ewens WJ, Spielman RS: The transmission/disequilibrium test: history, subdivision, and admixture. Am J Hum Genet 1995; 57: 455–464.
5. Knowler WC, Williams RC, Pettitt DJ, Steinberg AG: Gm3;5,13,14 and type 2 diabetes mellitus: an association in American Indians with genetic admixture. Am J Hum Genet 1988; 43: 520–526.
6. Thomas DC, Witte JS: Point: population stratification: a problem for case–control studies of candidate-gene associations? Cancer Epidemiol Biomarkers Prev 2002; 11: 505–512.
7. Devlin B, Roeder K: Genomic control for association studies. Biometrics 1999; 55: 997–1004.
8. Bacanu SA, Devlin B, Roeder K: The power of genomic control. Am J Hum Genet 2000; 66: 1933–1944.
9. Pritchard JK, Stephens M, Donnelly P: Inference of population structure using multilocus genotype data. Genetics 2000; 155: 945–959.
10. Kennedy GC, Matsuzaki H, Dong S et al: Large-scale genotyping of complex DNA. Nat Biotechnol 2003; 21: 1233–1237.
11. Marchini J, Cardon LR, Phillips MS, Donnelly P: The effects of human population structure on large genetic association studies. Nat Genet 2004; 36: 512–517.
12. Spielman RS, McGinnis RE, Ewens WJ: Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet 1993; 52: 506–516.
13. Morton NE, Collins A: Tests and estimates of allelic association in complex inheritance. Proc Natl Acad Sci USA 1998; 95: 11389–11393.
14. Rosenberg NA, Pritchard JK, Weber JL et al: Genetic structure of human populations. Science 2002; 298: 2381–2385.
15. Freedman ML, Reich D, Penney KL et al: Assessing the impact of population stratification on genetic association studies. Nat Genet 2004; 36: 388–393.

Acknowledgements

We would like to thank Dr Rui Mei for providing the data set and running the arrays. We thank Dr Xin Xu and Dr Tianhua Niu for carefully reading the manuscript and providing insightful comments. This work is partially supported by NIH Grant 1R01HG02341.

Author information

Authors and Affiliations

Department of Biostatistics, Harvard School of Public Health, Boston, MA, USA
Ke Hao, Cheng Li & Wing H Wong
Department of Biostatistics, Dana Farber Cancer Institute, Boston, MA, USA
Cheng Li
Genomics Collaboration, Affymetrix, Santa Clara, CA, USA
Carsten Rosenow
Department of Statistics, Harvard University, Cambridge, MA, USA
Wing H Wong

Corresponding author

Correspondence to Wing H Wong.

About this article

Cite this article

Hao, K., Li, C., Rosenow, C. et al. Detect and adjust for population stratification in population-based association study using genomic control markers: an application of Affymetrix Genechip^® Human Mapping 10K array. Eur J Hum Genet 12, 1001–1006 (2004). https://doi.org/10.1038/sj.ejhg.5201273

Received: 18 February 2004
Revised: 11 June 2004
Accepted: 22 July 2004
Published: 15 September 2004
Issue Date: 01 December 2004
DOI: https://doi.org/10.1038/sj.ejhg.5201273
