Assessing the impact of population stratification on genetic association studies

Freedman, Matthew L; Reich, David; Penney, Kathryn L; McDonald, Gavin J; Mignault, Andre A; Patterson, Nick; Gabriel, Stacey B; Topol, Eric J; Smoller, Jordan W; Pato, Carlos N; Pato, Michele T; Petryshen, Tracey L; Kolonel, Laurence N; Lander, Eric S; Sklar, Pamela; Henderson, Brian; Hirschhorn, Joel N; Altshuler, David

doi:10.1038/ng1333

Letter
Published: 28 March 2004

Nature Genetics volume 36, pages 388–393 (2004)

Abstract

Population stratification refers to differences in allele frequencies between cases and controls due to systematic differences in ancestry rather than association of genes with disease. It has been proposed that false positive associations due to stratification can be controlled by genotyping a few dozen unlinked genetic markers. To assess stratification empirically, we analyzed data from 11 case-control and case-cohort association studies. We did not detect statistically significant evidence for stratification but did observe that assessments based on a few dozen markers lack power to rule out moderate levels of stratification that could cause false positive associations in studies designed to detect modest genetic risk factors. After increasing the number of markers and samples in a case-cohort study (the design most immune to stratification), we found that stratification was in fact present. Our results suggest that modest amounts of stratification can exist even in well designed studies.

Main

There has been much debate^1,2,3,4 but limited data^5,6,7 about the impact of population stratification on case-control association studies. Systematic differences in the ancestry of cases and controls are one source of false positive associations^8,9, but the fraction of published associations that is attributable to stratification is unknown¹⁰. It has been argued that the effects of stratification can be eliminated simply by carefully matching cases and controls according to self-reported ancestry and geographical origin². Recently, empirical methods to detect stratification based on genotypes at unlinked markers have been described¹¹. The largest application of such methods involved genotyping 44 unlinked markers in four case-control studies⁵. Stratification was detected in one study, although the signal was no longer apparent after more stringent matching of cases and controls based on the birthplaces of the individuals' grandparents. This has been interpreted as evidence that stratification may be less of a concern than originally anticipated.

We assessed stratification empirically by analyzing data from 24–48 unlinked single-nucleotide polymorphisms (SNPs) in 11 association studies spanning a range of disease states and self-reported ancestries and three different epidemiological designs. These studies included seven ongoing studies in our laboratory and reanalysis of data from the four studies previously reported⁵. We assessed stratification first by testing for statistically significant evidence of differentiation between cases and controls using the method of Pritchard and Rosenberg¹¹ and second by estimating the magnitude of stratification consistent with the data using Genomic Control^12,13.

None of the 11 studies showed significant evidence of stratification after correcting for multiple hypothesis testing, consistent with previous studies^5,6 (Table 1). Comparing cases and controls from different studies with the same self-reported ancestry (European American), we found no significant evidence of stratification in nine pairwise comparisons using 33–43 SNPs (Supplementary Table 1 online).

Table 1 Assessment of population stratification in 11 epidemiological studies

We next applied the method of Genomic Control^12,13 to estimate quantitatively the amount of stratification consistent with the data for each of the 11 studies. Genomic Control is conceptually simple: the method examines the distribution of association statistics (χ²) between unlinked genetic variants typed in cases and controls. The statistic at a candidate allele being tested for association can then be compared with the genome-wide distribution of statistics for markers that are probably unrelated to disease to assess whether the candidate allele stands out. In the absence of stratification, association between unlinked genetic variants and disease should follow a χ² distribution with 1 degree of freedom^12,13. In the presence of stratification, the distribution of association statistics should be inflated by a value termed λ, which becomes larger with increasing sample size (Fig. 1).

**Figure 1: The effect of stratification on association studies.**

We estimated stratification for each of the 11 data sets and report the inflation of association statistics that would be expected in a study of 1,000 cases and 1,000 controls, called λ₁₀₀₀. (It is simple to extrapolate from λ₁₀₀₀ to the inflation factor due to stratification for any sample size¹².) Consistent with the fact that the 11 data sets showed no significant evidence for stratification, the confidence intervals for λ₁₀₀₀ overlapped 1 in every study. Nevertheless, we found that the confidence intervals were sufficiently broad that substantial levels of stratification could not be excluded. For example, the 95th percentile upper bound on λ₁₀₀₀ in the studies averaged 7.9 (Table 1 and Fig. 2).

Figure 2: Likelihood surfaces for stratification for the 11 studies, assuming 1,000 cases and 1,000 controls (we provide results for λ₁₀₀₀, but likelihood surfaces for other numbers of cases and controls could be obtained simply by rescaling the axis using the equation in Methods).

We increased power to detect stratification by increasing the number of SNPs and samples examined in one of the 11 studies that initially showed no significant evidence for stratification, the African American prostate cancer study (P < 0.75). For the follow-up, we approximately quadrupled the number of markers and increased the sample size by a factor of 5–6 (474 prostate cancer cases and 476 cohort controls). The new markers consisted of a collection of missense SNPs, which we treated as being in the same class as the noncoding SNPs, because, within the limits of our resolution (Table 2), they showed the same levels of population differentiation (with sufficient power, such differences can probably be detected¹⁴). The new markers also included a second set of SNPs chosen for their large allele frequency differences between west Africans and Europeans¹⁵, which makes them particularly powerful for detecting stratification¹⁶.

Table 2 Comparison of levels of stratification in missense versus randomly chosen SNPs

In this expanded data set we found significant evidence of stratification (P < 0.0001). When we restricted the analysis to 469 cases and 268 controls in whom all markers were successfully typed, the result was still significant (P < 0.01; Table 1). We then removed from the analysis 40 cases and 48 controls who reported that either they or their parents had some non-African American ancestry^2,5, because a small number of individuals with misclassified ancestry might disproportionately affect the result. The evidence for stratification was stronger in this subset (P < 0.005; Table 1). Notably, the Genomic Control estimate of stratification (removing SNPs that had been specifically chosen to have large differences in frequency across populations¹⁷) was λ₁₀₀₀ = 1.5, with a 95th percentile upper bound of 3.34. This indicates that an observation of χ² = 19.5, expected only once by chance in a scan of 100,000 SNPs, would instead be seen 31 times (effective χ² = 19.5/1.5 = 13) due to this level of stratification. At the 95% upper confidence limit of our estimate (λ₁₀₀₀ = 3.34), 1,568 false positives would be expected due to stratification.
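The arithmetic behind these false-positive counts can be checked with a short calculation. The following is a minimal sketch, assuming that stratification simply divides the nominal statistic by λ and that the result is referred to a χ² distribution with 1 degree of freedom; the function names are illustrative and do not come from the study.

```python
import math

def chi2_1df_sf(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

def expected_hits(nominal_chi2, inflation, n_snps):
    """Expected number of unassociated SNPs whose inflated statistic reaches nominal_chi2.

    Assumption: under stratification the statistic is distributed as
    inflation * chi-square(1 df), so the effective threshold is nominal_chi2 / inflation.
    """
    effective = nominal_chi2 / inflation
    return n_snps * chi2_1df_sf(effective)

if __name__ == "__main__":
    # No inflation, the point estimate, and the 95th percentile upper bound.
    for lam in (1.0, 1.5, 3.34):
        print(lam, round(expected_hits(19.5, lam, 100_000), 1))
```

Run as a script, the three printed values should land close to the 1, 31 and 1,568 quoted in the text.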

The observation of population stratification in African Americans with prostate cancer is not entirely unexpected. People of west African descent are thought to have a higher genetic risk for prostate cancer than those of European descent¹⁸, and hence African Americans with prostate cancer, who are known to have ancestry from both populations¹⁵, might be expected to have more African ancestry, on average, than controls. Population stratification was also observed in a separate study of African Americans with prostate cancer⁹. Our analysis strengthens this result, in that our sample was prospectively collected in a population-based cohort¹⁹, considered to be the optimal epidemiological design to minimize systematic differences between cases and controls (as opposed to the case-control design).

We also followed up with a study of European Americans with prostate cancer (approximately doubling the number of SNPs to 79 and quadrupling the number of samples to 391 cases and 456 cohort controls). In this study, we did not find statistically significant evidence for stratification (P < 0.10). The 95th percentile upper bound on stratification from Genomic Control, however, was similar to that in the study of African Americans (λ₁₀₀₀ = 3.03; Fig. 2). Much more data will be needed from many studies before it is possible to assess whether matching cases and controls solely on the basis of their self-reported ancestry, in a population such as European Americans without recent mixture, is adequate to take into account population stratification.

Our data indicate that genotyping a few dozen markers cannot rule out modest levels of population stratification that could generate false positives in an association study designed to detect alleles of weak effect—even in the setting of a prospectively collected cohort study. Stratification is probably most problematic in populations whose ancestors recently mixed due to intercontinental migrations and for diseases that have different prevalence rates across these ancestral populations^11,13 (such as hypertension, obesity, diabetes and autoimmunity). Because the importance of stratification grows with sample size^12,13, however, it seems possible that, even for diseases whose incidence rates are not currently known to vary across populations, stratification could exist. Thus, our study argues that stratification cannot be excluded based on either first principles or published empirical data. We suggest instead that investigators continue to monitor for stratification. In addition to presenting nominal P values, investigators should also report the range of values consistent with the Genomic Control estimate of stratification in the samples based on genotyping unlinked markers. Alternatively, investigators could present a P value corrected for the full range of possible values of λ₁₀₀₀, using the full Bayesian approach to Genomic Control¹².

Our data show that stratification cannot be excluded as a possibility in real case-control studies, but that there is no need to abandon case-control and case-cohort studies in favor of family-based designs (such as transmission disequilibrium tests). Two powerful approaches are available to detect and correct for stratification²⁰. The first clusters samples based on multilocus genotypes (e.g., STRUCTURE²¹) to identify individuals with different ancestries. This provides a way to adjust for ancestry as a covariate in the association analysis^7,21. Genomic Control, on the other hand, makes a quantitative estimate of the degree of stratification and uses it to adjust for any stratification that might be present. The two methods are not mutually exclusive: STRUCTURE can be used first to identify and eliminate samples that contribute unduly to stratification, and a smaller Genomic Control correction can then be made in the final study.

How many SNPs need to be used in an assessment of stratification? This question must be viewed in relation to the magnitude of genetic effects under study. Given a substantial magnitude of effect and a highly significant P value, only a few dozen markers probably need to be genotyped to rule out gross stratification as an explanation for the positive association^11,22 (Table 3). In contrast, if the results point to more modest influences on disease, such as the risk due to variation in CTLA4 on autoimmune thyroid disease and type 1 diabetes²³, it may be necessary to genotype a larger number of markers to rule out modest amounts of stratification. Genotyping more than 340 markers can bring the conservative 95th percentile upper bound on the level of stratification to within 10% of the true value (Table 3). Fortunately, as the number of SNPs tested in association studies grows larger (to survey the genome for risk-associated alleles of increasingly modest effect), the bounds on the estimate of stratification should become increasingly precise with no additional effort, as all the markers in a study can be used to assess and adjust for stratification^12,13.

Table 3 Number of SNPs necessary to ensure an association is not due to stratification

Methods

Clinical samples.

We obtained all samples for the new data collections with permission of the principal investigators and with approval of the Institutional Review Boards of the Massachusetts General Hospital, the Cleveland Clinic, SUNY/Upstate Medical University, the University of Hawaii and the University of Southern California. Informed consent was obtained from all subjects by the institutions responsible for the collections. Citations provide additional detail on the ascertainment of cases and controls.

GeneQuest coronary artery disease study²⁴.

We randomly selected 83 cases and 80 controls, all European Americans. The cases were from Cleveland, and controls were identified by random digit phone dialing in Atlanta, Georgia, USA.

Multiethnic Cohort prostate cancer study.

The Multiethnic Cohort¹⁹ is an ongoing study (n = 215,251) focusing on the effects of diet, genes and environment on the risk of cancer. The cohort samples include four main ethnic groups in Los Angeles and Hawaii. For European Americans, we randomly selected 110 incident cases and 97 cohort controls; for African Americans, we selected 90 incident cases and 69 cohort controls; for Japanese Americans, we selected 121 incident cases and 106 cohort controls; and for Hispanic Americans, we selected 142 incident cases and 124 cohort controls. We followed up the study of African American prostate cancer samples by genotyping all the missense and ancestry-informative SNPs described below. We genotyped an expanded sample of 469 African American incident cases and 268 cohort controls from the cohort for all the SNPs and genotyped an additional 5 cases and 208 cohort controls for 31 of the SNPs that had high allele frequency differences across populations before the DNA for these samples ran out.

Bipolar disorder in European Americans.

We obtained 93 DNA samples from Massachusetts General Hospital from individuals with diagnoses of bipolar disorder 1 or bipolar disorder 2. As controls, we used GeneQuest samples. (Both this and the coronary artery disease study are examples where cases are matched to controls using only self-reported ancestry.)

Schizophrenia.

We obtained samples from 149 cases diagnosed with schizophrenia according to the criteria of the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition, and 152 matched controls as part of a study of schizophrenia in the Portuguese population. Samples were descended from continental Portugal (83% of cases, 87% of controls), the Azore islands (13% of cases, 3% of controls) or the Madeira islands (3% of cases, 10% of controls). Some individuals were from Fall River, Rhode Island, but in each of these cases, all four grandparents were from the Azore islands.

For comparison of noncoding to missense SNPs, we studied 50 African American, 88 European American and 42 Asian American population samples. These were identical to those previously studied²⁵, except that the 88 European American samples were replaced by the parents of the 44 samples resequenced previously²⁶.

Choice of markers.

The physical and genetic map positions, along with flanking sequences, of all SNPs used in this study are available from the authors on request.

We obtained noncoding SNPs (67) from the SNP Consortium website. They were identified by comparing a single sequencing read from a diverse panel of individuals with the publicly available genome sequence²⁷. The SNPs were evenly spaced throughout the autosomes, each at least 20 Mb from the others. In practice, only 34–48 of these SNPs genotyped successfully and were of high enough frequency in any study to use in our analysis (the expected number of reference and variant alleles based on the allele frequency and sample size was ≥5).

We identified missense SNPs (100) from a database of SNPs in coding regions of genes²⁸, obtained as part of an effort to catalog SNPs in genes of interest for disease. We used only genes that were not designated in the database or in a published meta-analysis¹⁰ as having any relationship with prostate cancer, coronary artery disease, asthma or atopy. We excluded from the study those SNPs with a minor allele frequency <10% in a multiethnic screening panel. SNPs were chosen to be at least 1 Mb away from each other and from all the noncoding SNPs.

We obtained ancestry-informative SNPs (101) with high allele frequency differences comparing European and African Americans by combining data from ref. 15 with unpublished data from our own laboratory. These SNPs were all chosen to be at least 20 Mb from each other. The average frequency difference comparing west Africans and Europeans was 67%.

Genotyping.

The genotypes collected for this study are available from the authors to the extent that is consistent with the informed consent provided by the study participants. We used matrix-assisted laser desorption ionization–time of flight mass spectrometry (MALDI-TOF)²⁹ with 5 ng of DNA per multiplex genotyping reaction to genotype most SNPs in this study. The PCR protocol is described elsewhere²⁵. Error rates with the Sequenom MassARRAY system have been estimated to be ∼0.4% at our laboratory²⁵, although the discrepancy rate in the present data set suggests a rate closer to 0.25% (215 conflicts out of 42,766 genotypes, each done at least in duplicate).

Elimination of poorly performing SNPs.

We removed all SNPs from our analysis that showed Hardy-Weinberg P values of <0.01 in at least two of the three diversity samples (CEPH, East Asian and African American). We also excluded SNPs from the analysis if the combined Hardy-Weinberg P value, over all populations excluding African Americans and Hispanic Americans, was <0.01. To calculate the P value, we summed the χ² values for the Hardy-Weinberg test over all n populations for which the statistic could be calculated and assessed significance using a χ² distribution with n degrees of freedom. (We excluded African Americans and Hispanic Americans from the Hardy-Weinberg assessment because different levels of population mixture across individuals in these groups can produce a deficiency of heterozygotes, even with accurate genotyping.) We also excluded from analysis SNPs for those studies in which the genotyping success rates were <75% in either cases or controls²⁵. We also eliminated from analysis SNPs that showed discrepancy rates of >3% in duplicate genotypes.
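The combined Hardy-Weinberg filter can be sketched as follows. This is a minimal illustration, assuming a standard Pearson χ² comparison of observed and expected genotype counts within each population (the text does not specify the exact form of the test); the function names and the layout of the input counts are hypothetical.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df.

    Uses the recurrence Q(df + 2) = Q(df) + (x/2)^(df/2) * exp(-x/2) / Gamma(df/2 + 1),
    starting from the closed forms for df = 1 and df = 2.
    """
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2:
        q, d = math.erfc(math.sqrt(half)), 1
    else:
        q, d = math.exp(-half), 2
    while d < df:
        q += math.exp((d / 2.0) * math.log(half) - half - math.lgamma(d / 2.0 + 1.0))
        d += 2
    return q

def hwe_chi2(n_aa, n_ab, n_bb):
    """Pearson chi-square for Hardy-Weinberg proportions, or None if it cannot be calculated."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return None
    p = (2 * n_aa + n_ab) / (2.0 * n)
    q = 1.0 - p
    expected = (n * p * p, 2.0 * n * p * q, n * q * q)
    if min(expected) == 0:
        return None  # monomorphic sample: the statistic is undefined
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def combined_hwe_p(populations):
    """Sum per-population statistics and refer the total to chi-square with n degrees of freedom."""
    stats = [s for s in (hwe_chi2(*counts) for counts in populations) if s is not None]
    if not stats:
        return None
    return chi2_sf(sum(stats), len(stats))
```

Under the rule stated above, a SNP would be dropped when `combined_hwe_p` returns a value below 0.01 for the populations retained in the assessment.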

Detection of population stratification.

We calculated χ² association statistics for all k SNPs in a study, including only those for which the expected number of allele counts (based on the combined frequency in the two population samples) was at least 5. We then summed the values and assessed significance using a χ² distribution with k degrees of freedom¹¹.
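The summed-statistic test can be sketched as follows. This is a minimal illustration, assuming each SNP is summarized as a 2×2 table of allele counts (two alleles in cases, two in controls) and tested with a Pearson χ² with 1 degree of freedom; the function names and the input layout are assumptions made for the example.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df (recurrence on df)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2:
        q, d = math.erfc(math.sqrt(half)), 1
    else:
        q, d = math.exp(-half), 2
    while d < df:
        q += math.exp((d / 2.0) * math.log(half) - half - math.lgamma(d / 2.0 + 1.0))
        d += 2
    return q

def allele_chi2(case_a, case_b, ctrl_a, ctrl_b, min_expected=5.0):
    """Pearson chi-square (1 df) for a 2x2 allele-count table.

    Returns None when any expected cell count, computed from the combined
    allele frequency, falls below min_expected.
    """
    n = case_a + case_b + ctrl_a + ctrl_b
    rows = (case_a + case_b, ctrl_a + ctrl_b)
    cols = (case_a + ctrl_a, case_b + ctrl_b)
    if n == 0 or min(r * c / n for r in rows for c in cols) < min_expected:
        return None
    diff = case_a * ctrl_b - case_b * ctrl_a
    return n * diff ** 2 / (rows[0] * rows[1] * cols[0] * cols[1])

def stratification_test(tables):
    """Sum the per-SNP statistics over the k usable SNPs and test against chi-square with k df."""
    stats = [s for s in (allele_chi2(*t) for t in tables) if s is not None]
    total = sum(stats)
    return total, len(stats), chi2_sf(total, len(stats))
```

For example, `stratification_test([(60, 40, 40, 60), (50, 50, 50, 50)])` sums the statistics of two made-up SNPs and returns the total, the number of SNPs that passed the expected-count filter, and the P value.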

Quantitative assessment of population stratification.

For each SNP in each study for which at least 40% of the cases and controls had been successfully genotyped, we calculated a χ² value, provided that the expected number of allele counts (based on the combined frequency in the two population samples) was at least 5.

We carried out a likelihood analysis to estimate the level of stratification consistent with the data in each study. Defining c_j as the association statistic observed at marker j genotyped in n_j cases and m_j controls and f as the density of the χ² distribution with 1 degree of freedom, the likelihood of a given inflation factor due to stratification is simply

$$L_j(\lambda_{n_j,m_j}) = \frac{1}{\lambda_{n_j,m_j}}\, f\!\left(\frac{c_j}{\lambda_{n_j,m_j}}\right) \qquad (1)$$

a consequence of the fact that the χ² distribution scales with the inflation factor¹². The likelihood at all K markers is then

$$L = \prod_{j=1}^{K} L_j(\lambda_{n_j,m_j}) \qquad (2)$$

To estimate a likelihood distribution for the level of stratification, we define a reference sample size (we use n_ref = 1,000 cases and m_ref = 1,000 controls). We then use an equation derived in ref. 12 and confirmed by simulation as in ref. 13 to relate this to the inflation factor applicable to n_j cases and m_j controls. The inflation factor should be different from marker to marker because it scales with sample size:

$$\lambda_{n_j,m_j} = 1 + \left(\lambda_{n_{\mathrm{ref}},m_{\mathrm{ref}}} - 1\right) \frac{1/n_{\mathrm{ref}} + 1/m_{\mathrm{ref}}}{1/n_j + 1/m_j} \qquad (3)$$

In this paper we abbreviate λ_1000,1000 as λ₁₀₀₀.

Substituting equation 3 into equation 2 allows us to obtain a likelihood distribution for λ₁₀₀₀. The maximum likelihood estimate for λ₁₀₀₀ is simply the value for which L is maximized, with the requirement that λ₁₀₀₀ ≥ 1. We obtained the likelihood surfaces shown in Figure 2 by plotting the values of L for different λ₁₀₀₀, normalizing by the maximum likelihood (set equal to 1 in Fig. 2). We obtained the upper bound on λ₁₀₀₀ by picking the value such that the likelihood ratio 2log₁₀(L_max/L) = 2.7; that is, the point for which the likelihood was 4.5% of the maximum, corresponding roughly to a P < 0.05 cutoff (one-sided test).
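The likelihood procedure can be sketched as follows. This is a minimal illustration under two stated assumptions: that the per-marker likelihood is (1/λ_j) f(c_j/λ_j), with f the χ² density with 1 degree of freedom, and that λ_j − 1 scales with sample size in proportion to 1/(1/n_j + 1/m_j). The grid search, the grid limits and all function names are choices made for the example, not details taken from the study.

```python
import math

def chi2_1df_pdf(x):
    """Density of the chi-square distribution with 1 degree of freedom (requires x > 0)."""
    return math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi * x)

def rescale(lam_ref, n, m, n_ref=1000, m_ref=1000):
    """Inflation factor for n cases and m controls implied by lam_ref (assumed scaling)."""
    return 1.0 + (lam_ref - 1.0) * (1.0 / n_ref + 1.0 / m_ref) / (1.0 / n + 1.0 / m)

def log_likelihood(lam_ref, markers):
    """Sum over markers of log[(1/lam_j) * f(c_j / lam_j)].

    markers is a list of (c_j, n_j, m_j) tuples with c_j > 0.
    """
    total = 0.0
    for c, n, m in markers:
        lam = rescale(lam_ref, n, m)
        total += math.log(chi2_1df_pdf(c / lam) / lam)
    return total

def estimate_lambda(markers, lam_max=20.0, step=0.01):
    """Grid search for the maximum-likelihood lam_ref >= 1 and a one-sided upper bound."""
    n_steps = int(round((lam_max - 1.0) / step))
    grid = [1.0 + i * step for i in range(n_steps + 1)]
    ll = [log_likelihood(lam, markers) for lam in grid]
    best = max(range(len(grid)), key=ll.__getitem__)
    # 2*log10(Lmax/L) = 2.7 is a drop of 1.35 in log10 units, converted to natural logs.
    max_drop = 1.35 * math.log(10.0)
    upper = grid[best]
    for lam, value in zip(grid[best:], ll[best:]):
        if ll[best] - value > max_drop:
            break
        upper = lam
    return grid[best], upper
```

With only a handful of markers the likelihood surface is very flat, so the returned upper bound may equal `lam_max`; in that case the grid would need to be widened before the bound is meaningful.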

To test for a difference in the distribution of χ² values between missense and noncoding SNPs (Table 2), we compared the random African American, European American and Asian American population samples in our study. For each SNP for which at least 70% of both sample sets had been successfully genotyped, we randomly dropped samples until we had the same number at all sites. We then calculated χ² values and used a Mann-Whitney U test to assess whether the empirical distributions of statistics at missense and noncoding SNPs were distinguishable.

URL.

The SNP Consortium website is available at http://snp.cshl.org.

Note: Supplementary information is available on the Nature Genetics website.

References

1. Thomas, D.C. & Witte, J.S. Point: population stratification: a problem for case-control studies of candidate-gene associations? Cancer Epidemiol. Biomarkers Prev. 11, 505–512 (2002).
2. Wacholder, S., Rothman, N. & Caporaso, N. Counterpoint: bias from population stratification is not a major threat to the validity of conclusions from epidemiological studies of common polymorphisms and cancer. Cancer Epidemiol. Biomarkers Prev. 11, 513–520 (2002).
3. Ziv, E. & Burchard, E.G. Human population structure and genetic association studies. Pharmacogenomics 4, 431–441 (2003).
4. Cardon, L.R. & Palmer, L.J. Population stratification and spurious allelic association. Lancet 361, 598–604 (2003).
5. Ardlie, K.G., Lunetta, K.L. & Seielstad, M. Testing for population subdivision and association in four case-control studies. Am. J. Hum. Genet. 71, 304–311 (2002).
6. Schork, N.J. et al. The future of genetic case-control studies. Adv. Genet. 42, 191–212 (2001).
7. Hoggart, C.J. et al. Control of confounding of genetic associations in stratified populations. Am. J. Hum. Genet. 72, 1492–1504 (2003).
8. Knowler, W.C., Williams, R.C., Pettitt, D.J. & Steinberg, A.G. Gm3;5,13,14 and type 2 diabetes mellitus: an association in American Indians with genetic admixture. Am. J. Hum. Genet. 43, 520–526 (1988).
9. Kittles, R.A. et al. CYP3A4-V and prostate cancer in African Americans: causal or confounding association because of population stratification? Hum. Genet. 110, 553–560 (2002).
10. Lohmueller, K.E., Pearce, C.L., Pike, M., Lander, E.S. & Hirschhorn, J.N. Meta-analysis of genetic association studies supports a contribution of common variants to susceptibility to common disease. Nat. Genet. 33, 177–182 (2003).
11. Pritchard, J.K. & Rosenberg, N.A. Use of unlinked genetic markers to detect population stratification in association studies. Am. J. Hum. Genet. 65, 220–228 (1999).
12. Devlin, B. & Roeder, K. Genomic control for association studies. Biometrics 55, 997–1004 (1999).
13. Reich, D.E. & Goldstein, D.B. Detecting association in a case-control study while correcting for population stratification. Genet. Epidemiol. 20, 4–16 (2001).
14. Akey, J.M., Zhang, G., Zhang, K., Jin, L. & Shriver, M.D. Interrogating a high-density SNP map for signatures of natural selection. Genome Res. 12, 1805–1814 (2002).
15. Parra, E.J. et al. Estimating African American admixture proportions by use of population-specific alleles. Am. J. Hum. Genet. 63, 1839–1851 (1998).
16. Pfaff, C.L., Kittles, R.A. & Shriver, M.D. Adjusting for population structure in admixed populations. Genet. Epidemiol. 22, 196–201 (2002).
17. Reich, D.E. & Goldstein, D.B. Response to Pfaff et al.: Adjusting for population structure in admixed populations. Genet. Epidemiol. 22, 196–201 (2002).
18. Bunker, C.H. et al. High prevalence of screening-detected prostate cancer among Afro-Caribbeans: the Tobago prostate cancer survey. Cancer Epidemiol. Biomarkers Prev. 11, 726–729 (2002).
19. Kolonel, L.N. et al. A multiethnic cohort in Hawaii and Los Angeles: baseline characteristics. Am. J. Epidemiol. 151, 346–357 (2000).
20. Pritchard, J.K. & Donnelly, P. Case-control studies of association in structured or admixed populations. Theor. Popul. Biol. 60, 227–237 (2001).
21. Pritchard, J.K., Stephens, M. & Donnelly, P. Inference of population structure using multilocus genotype data. Genetics 155, 945–959 (2000).
22. Siddiqui, A. et al. Association of multidrug resistance in epilepsy with a polymorphism in the drug-transporter gene ABCB1. N. Engl. J. Med. 348, 1442–1448 (2003).
23. Ueda, H. et al. Association of the T-cell regulatory gene CTLA4 with susceptibility to autoimmune disease. Nature 423, 506–511 (2003).
24. Topol, E.J. et al. Single nucleotide polymorphisms in multiple novel thrombospondin genes may be associated with familial premature myocardial infarction. Circulation 104, 2641–2644 (2001).
25. Gabriel, S.B. et al. The structure of haplotype blocks in the human genome. Science 296, 2225–2229 (2002).
26. Reich, D.E. et al. Linkage disequilibrium in the human genome. Nature 411, 199–204 (2001).
27. Sachidanandam, R. et al. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409, 928–933 (2001).
28. Cargill, M. et al. Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nat. Genet. 22, 231–238 (1999).
29. Tang, K. et al. Chip-based genotyping by mass spectrometry. Proc. Natl. Acad. Sci. USA 96, 10016–10020 (1999).

Acknowledgements

We thank A. Villapakkam for assistance in genotyping and data checking and K. Ardlie, D. Goldstein and C. Haiman for discussions. M.L.F. is supported by a Department of Defense Health Disparity training grant and a Postdoctoral Fellowship for Physicians from the Howard Hughes Medical Institute. D.A. is a Clinical Scholar in Translational Research from the Burroughs Wellcome Fund, as well as a Charles E. Culpeper Medical Scholar. J.N.H. and D.R. are recipients of Career Development Awards from the Burroughs Wellcome Fund. T.L.P. is supported by a Canadian Institutes of Health Research Postdoctoral Fellowship and is a NARSAD Young Investigator. This work was supported in part by a grant from the Functional Genomics Program at the Whitehead Institute/MIT Center for Genome Research (supported by Affymetrix, Millennium Pharmaceuticals and Bristol Myers Squibb) and by a grant from the US National Institutes of Health to B.H. and L.K.

Author information

Matthew L Freedman and David Reich: These authors contributed equally to this work.

Authors and Affiliations

Departments of Medicine and Molecular Biology, Massachusetts General Hospital, 55 Fruit Street, Boston, 02114, Massachusetts, USA
Matthew L Freedman, Kathryn L Penney & David Altshuler
Department of Hematology-Oncology, Massachusetts General Hospital, 55 Fruit Street, Boston, 02114, Massachusetts, USA
Matthew L Freedman & Kathryn L Penney
Program in Medical and Population Genetics, Broad Institute, One Kendall Square, Building 300, Cambridge, 02139, Massachusetts, USA
Matthew L Freedman, David Reich, Kathryn L Penney, Gavin J McDonald, Nick Patterson, Stacey B Gabriel, Tracey L Petryshen, Eric S Lander, Pamela Sklar, Joel N Hirschhorn & David Altshuler
Department of Genetics, Harvard Medical School, New Research Building, 77 Avenue Louis Pasteur, Boston, 02115, Massachusetts, USA
David Reich, Gavin J McDonald, Andre A Mignault, Joel N Hirschhorn & David Altshuler
Department of Cardiovascular Medicine, Cleveland Clinic Foundation, Cleveland, Ohio, USA
Eric J Topol
Department of Psychiatry, Harvard Medical School, Boston, Massachusetts, USA
Jordan W Smoller & Pamela Sklar
Psychiatric and Neurodevelopmental Genetics Unit, Massachusetts General Hospital, 149 13th Street, Charlestown, Massachusetts, USA
Jordan W Smoller & Pamela Sklar
Veterans Administration, Syracuse, New York, USA
Carlos N Pato & Michele T Pato
Center for Psychiatric and Molecular Genetics, SUNY/Upstate Medical University, Syracuse, New York, USA
Carlos N Pato & Michele T Pato
Cancer Etiology Program, Cancer Research Center of Hawaii, University of Hawaii, Honolulu, Hawaii, USA
Laurence N Kolonel
Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
Eric S Lander
Department of Preventative Medicine, Keck School of Medicine, University of Southern California, Los Angeles, California, USA
Brian Henderson
Divisions of Genetics and Endocrinology, Children's Hospital and Department of Pediatrics, Harvard Medical School, Boston, Massachusetts, USA
Joel N Hirschhorn
Diabetes Unit, Massachusetts General Hospital, 55 Fruit Street, Boston, 02114, Massachusetts, USA
David Altshuler

Corresponding author

Correspondence to David Reich.

Ethics declarations

Competing interests

D.A. is a paid consultant to Genomics Collaborative, which provided the previously published data set (Am. J. Hum. Genet. 71, 304–311; 2002) that was reanalyzed in this paper.

Supplementary information

Supplementary Table 1 (PDF 5 kb)

About this article

Cite this article

Freedman, M., Reich, D., Penney, K. et al. Assessing the impact of population stratification on genetic association studies. Nat Genet 36, 388–393 (2004). https://doi.org/10.1038/ng1333

Received: 15 October 2003
Accepted: 23 February 2004
Published: 28 March 2004
Issue Date: 01 April 2004
DOI: https://doi.org/10.1038/ng1333
