A tutorial on statistical methods for population association studies

Balding, David J.

doi:10.1038/nrg1916

Review Article
Published: 01 October 2006

Nature Reviews Genetics volume 7, pages 781–791 (2006)

Key Points

Although population association studies are not new, there remain many areas of disagreement over appropriate statistical analyses. This article provides an overview of statistical methods, including areas of controversy and ongoing developments. It does not consider family-based association studies, nor linkage or admixture studies.
I first cover analyses that are preliminary to association testing: testing for Hardy–Weinberg equilibrium; imputing missing genotype data; inferring haplotype from genotype data; measures of linkage disequilibrium and estimates of recombination rates; and choosing tag SNPs.
Among tests of association, I cover case–control, quantitative and ordered phenotypes, and analyses that are based on single SNPs, multiple SNPs and haplotypes. There is a discussion of issues that are relevant to genome-wide association studies.
I discuss Genomic Control and other approaches to the problem of population stratification.
I give particular attention to the problem of multiple testing, and discuss both frequentist and Bayesian approaches to addressing the problem.
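
As an illustration of the Hardy–Weinberg equilibrium testing listed among the preliminary analyses, the following is a minimal sketch of a Pearson chi-square goodness-of-fit test on genotype counts at a biallelic SNP. The function name `hwe_chi_square` and the example counts are assumptions made for illustration and do not come from the article; for small counts an exact test is often preferred over this approximation.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Pearson chi-square test of Hardy-Weinberg proportions at a biallelic SNP.

    Arguments are the observed counts of the three genotypes (AA, AB, BB).
    Returns (chi-square statistic, P value on 1 degree of freedom).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * n_aa + n_ab) / (2 * n)  # estimated frequency of allele A
    q = 1.0 - p
    observed = (n_aa, n_ab, n_bb)
    # Genotype counts expected under Hardy-Weinberg proportions
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 d.f.:
    # P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    # Hypothetical counts with a deficit of heterozygotes
    print(hwe_chi_square(40, 20, 40))
```

With the hypothetical counts above the estimated allele frequency is 0.5, so the expected counts are 25, 50 and 25, and the deficit of heterozygotes produces a large statistic.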

Abstract

Although genetic association studies have been with us for many years, even for the simplest analyses there is little consensus on the most appropriate statistical procedures. Here I give an overview of statistical approaches to population association studies, including preliminary analyses (Hardy–Weinberg equilibrium testing, inference of phase and missing data, and SNP tagging), and single-SNP and multipoint tests for association. My goal is to outline the key methods with a brief discussion of problems (population structure and multiple testing), avenues for solutions and some ongoing developments.
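
One widely used response to the population-structure problem mentioned above is Genomic Control. The sketch below shows a common formulation, assumed here rather than taken from the article: an inflation factor λ is estimated as the median of the observed 1-degree-of-freedom test statistics divided by the median of the χ² distribution with 1 degree of freedom (about 0.455), and each statistic is divided by λ. The function name and the convention of not letting λ fall below 1 are illustrative assumptions.

```python
import statistics

# Median of the chi-square distribution with 1 degree of freedom
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control(chi2_stats, floor_at_one=True):
    """Return (lambda, adjusted statistics) for 1-d.f. chi-square statistics.

    lambda is the observed median divided by the theoretical null median.
    If floor_at_one is True, lambda is not allowed to fall below 1, so the
    statistics are never inflated by the correction (an assumed convention).
    """
    if not chi2_stats:
        raise ValueError("no test statistics supplied")
    lam = statistics.median(chi2_stats) / CHI2_1DF_MEDIAN
    if floor_at_one:
        lam = max(1.0, lam)
    return lam, [x / lam for x in chi2_stats]


if __name__ == "__main__":
    # Hypothetical statistics whose median is twice the null median
    lam, adjusted = genomic_control([0.2, 0.9098728, 3.0])
    print(lam, adjusted)
```

The median is used rather than the mean so that a small number of truly associated SNPs with very large statistics has little influence on the estimate of λ.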

**Figure 1: Log quantile–quantile (QQ) P-value plot for 3,478 single-SNP tests of association.**

**Figure 2: Armitage test of single-SNP association with case–control outcome.**

**Figure 3: Linear regression test of single-SNP associations with continuous outcomes.**
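
The Armitage test named in Figure 2 can be sketched for a single SNP as follows. This is a minimal illustration of the standard Cochran–Armitage trend statistic with additive genotype scores (0, 1, 2), referred to a χ² distribution with 1 degree of freedom; the function name, the default scores and the example counts are assumptions for illustration, not the article's own code or data.

```python
import math


def armitage_trend_test(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend test for a 2 x 3 case-control genotype table.

    cases and controls are genotype counts ordered by number of copies of
    the allele of interest (0, 1, 2). Returns (chi-square, P value, 1 d.f.).
    """
    totals = [r + s for r, s in zip(cases, controls)]
    n_cases, n_controls = sum(cases), sum(controls)
    n = n_cases + n_controls
    sum_wr = sum(w * r for w, r in zip(weights, cases))
    sum_wn = sum(w * t for w, t in zip(weights, totals))
    sum_w2n = sum(w * w * t for w, t in zip(weights, totals))
    # Score-weighted difference between observed and expected case counts
    trend = n * sum_wr - n_cases * sum_wn
    denom = n_cases * n_controls * (n * sum_w2n - sum_wn ** 2)
    if denom == 0:
        raise ValueError("test undefined: no variation in genotype or status")
    chi2 = n * trend ** 2 / denom
    # P(X > x) for chi-square with 1 d.f. is erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    # Hypothetical counts: risk-allele copies are more common among cases
    print(armitage_trend_test(cases=(10, 20, 30), controls=(30, 20, 10)))
```

The statistic equals the sample size multiplied by the squared correlation between genotype score and case status, which is why it tests for a linear trend in risk across the three genotypes.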

Acknowledgements

I thank W. Astle and E. Waldron for help with drawing figures, and W. Astle, L. Cardon, A. Lewin, D. Lunn, A. Morris, D. Schaid, J. Whittaker and D. Zabaneh for discussions and comments on drafts of the manuscript. The author is supported in part by the UK Medical Research Council.

Author information

Authors and Affiliations

Department of Epidemiology and Public Health, Imperial College, St Mary's Campus, Norfolk Place, London, W2 1PG, UK
David J. Balding

Ethics declarations

Competing interests

The author declares no competing financial interests.

Supplementary information

Supplementary information S1 (box) (PDF 70 kb)

Glossary

Haplotype: A combination of alleles at different loci on the same chromosome.
Population stratification: Refers to a situation in which the population of interest includes subgroups of individuals that are on average more related to each other than to other members of the wider population.
Multiple-testing problem: Refers to the problem that arises when many null hypotheses are tested; some significant results are likely even if all the null hypotheses are true.
Hardy–Weinberg equilibrium: Holds at a locus in a population when the two alleles within an individual are not statistically associated.
Significance level: Usually denoted α, and chosen by the researcher to be the greatest probability of type-1 error that is tolerated for a statistical test. It is conventional to choose α = 5% for the overall analysis, which might consist of many tests each with a much lower significance level.
Test statistic: A numerical summary of the data that is used to measure support for the null hypothesis. Either the test statistic has a known probability distribution (such as χ²) under the null hypothesis, or its null distribution is approximated computationally.
Common-disease common-variant hypothesis: The hypothesis that many genetic variants that underlie complex diseases are common, and therefore susceptible to detection using current population association study designs. An alternative possibility is that genetic contributions to complex diseases arise from many variants, all of which are rare.
Effective population size: The size of a theoretical population that best approximates a given natural population under an assumed model. Human effective population size is often taken to mean the size of a constant-size, panmictic population of breeding adults that generates the same level of polymorphism under neutrality as observed in an actual human population.
Maximum-likelihood estimate: The value of an unknown parameter that maximizes the probability of the observed data under the assumed statistical model.
Phase: The information that is needed to determine the two haplotypes that underlie a multi-locus genotype within a chromosomal segment.
Regression models: A class of statistical models that relate an outcome variable to one or more explanatory variables. The goal might be to predict further values of the outcome variable given the explanatory variables, or to identify a minimal set of explanatory variables with good predictive power.
Prospective study design: Studies in which individuals are followed forward in time and disease events are recorded as they arise. DNA and biomarker samples, and data on environmental exposures and lifestyle factors, are usually obtained at the start of the study.
Retrospective study design: Studies in which individuals are identified for inclusion in the study on the basis of their disease state. Data on previous environmental exposures and lifestyle factors are then recorded, and samples for DNA and biomarker studies might be obtained.
Time to event: Refers to data in which the time to an event of interest is recorded, such as the time from the start of the study to disease onset, if any. This is potentially more informative than simply recording case or control status at the end of the study.
Linkage disequilibrium: The statistical association, within gametes in a population, of the alleles at two loci. Although linkage disequilibrium can be due to linkage, it can also arise at unlinked loci; for example, because of selection or non-random mating.
Type-1 error: The rejection of a true null hypothesis; for example, concluding that Hardy–Weinberg equilibrium (HWE) does not hold when in fact it does. By contrast, the power of a test is the probability of correctly rejecting a false null hypothesis.
Degrees of freedom: This term is used in different senses both within statistics and in other fields. It can often be interpreted as the number of values that can be defined arbitrarily in the specification of a system; for example, the number of coefficients in a regression model. It is often sufficient to regard degrees of freedom as a parameter that is used to define particular probability distributions.
Bayesian: A statistical school of thought that, in contrast to the frequentist school, holds that inferences about any unknown parameter or hypothesis should be encapsulated in a probability distribution, given the observed data. Bayes' theorem is a celebrated result in probability theory that allows one to compute the posterior distribution for an unknown from the observed data and its assumed prior distribution.
Likelihood-ratio test: A statistical test that is based on the ratio of likelihoods under alternative and null hypotheses. If the null hypothesis is a special case of the alternative hypothesis, then the likelihood-ratio statistic typically has a χ² distribution with degrees of freedom equal to the number of additional parameters under the alternative hypothesis.
Multinomial: Describes a variable with a finite number, say k, of possible outcomes; in the cases k = 2 and k = 3, the terms binomial and trinomial are also used.
Principal-components analysis: A statistical technique for summarizing many variables with minimal loss of information: the first principal component is the linear combination of the observed variables with the greatest variance; subsequent components maximize the variance subject to being uncorrelated with the preceding components.
Stepwise selection procedure: Describes a class of statistical procedures that identify from a large set of variables (such as SNPs) a subset that provides a good fit to a chosen statistical model (for example, a regression model that predicts case–control status) by successively including or discarding terms from the model.
Shrinkage methods: In this approach a prior distribution for regression coefficients is concentrated at zero, so that in the absence of a strong signal of association, the corresponding regression coefficient is 'shrunk' to zero. This mitigates the effects of too many variables (degrees of freedom) in the statistical model.
Frequentist: A name for the school of statistical thought in which support for a hypothesis or parameter value is assessed using the probability of the observed data (or more 'extreme' datasets) given the hypothesis or value. Usually contrasted with Bayesian.
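
The glossary entries on the significance level and the multiple-testing problem can be made concrete with a small worked sketch. It computes per-test significance levels that keep the overall type-1 error at α = 5% across 3,478 tests (the number of single-SNP tests in Figure 1), using the Bonferroni and Šidák corrections. Both corrections are standard, but their use here is an illustrative assumption; the Šidák formula and the family-wise error calculation assume independent tests, which SNPs in linkage disequilibrium are not.

```python
import math


def bonferroni_level(alpha, n_tests):
    """Per-test level that bounds the overall type-1 error by alpha."""
    return alpha / n_tests


def sidak_level(alpha, n_tests):
    """Per-test level giving overall type-1 error alpha for independent tests.

    Equals 1 - (1 - alpha) ** (1 / n_tests), computed stably for large n_tests.
    """
    return -math.expm1(math.log1p(-alpha) / n_tests)


def familywise_error(per_test_level, n_tests):
    """Probability of at least one false positive among independent tests
    when every null hypothesis is true."""
    return -math.expm1(n_tests * math.log1p(-per_test_level))


if __name__ == "__main__":
    alpha, n_tests = 0.05, 3478
    print("uncorrected family-wise error:", familywise_error(alpha, n_tests))
    print("Bonferroni per-test level:", bonferroni_level(alpha, n_tests))
    print("Sidak per-test level:", sidak_level(alpha, n_tests))
```

Testing each SNP at the uncorrected 5% level makes at least one false positive almost certain, whereas either correction requires a per-test level of roughly 1.4 × 10⁻⁵ to 1.5 × 10⁻⁵.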

About this article

Cite this article

Balding, D. A tutorial on statistical methods for population association studies. Nat Rev Genet 7, 781–791 (2006). https://doi.org/10.1038/nrg1916

Issue Date: 01 October 2006
DOI: https://doi.org/10.1038/nrg1916
