A tutorial on statistical methods for population association studies

Key Points

  • Although population association studies are not new, there remain many areas of disagreement over appropriate statistical analyses. This article provides an overview of statistical methods, including areas of controversy and ongoing developments. It does not consider family-based association studies, nor linkage or admixture studies.

  • I first cover analyses that are preliminary to association testing: testing for Hardy–Weinberg equilibrium; imputing missing genotype data; inferring haplotypes from genotype data; measuring linkage disequilibrium and estimating recombination rates; and choosing tag SNPs.

  • Among tests of association, I cover case–control, quantitative and ordered phenotypes, and analyses that are based on single SNPs, multiple SNPs and haplotypes. I also discuss issues that are relevant to genome-wide association studies.

  • I discuss Genomic Control and other approaches to the problem of population stratification.

  • I give particular attention to the problem of multiple testing, and discuss both frequentist and Bayesian approaches to addressing the problem.


Although genetic association studies have been with us for many years, even for the simplest analyses there is little consensus on the most appropriate statistical procedures. Here I give an overview of statistical approaches to population association studies, including preliminary analyses (Hardy–Weinberg equilibrium testing, inference of phase and missing data, and SNP tagging), and single-SNP and multipoint tests for association. My goal is to outline the key methods with a brief discussion of problems (population structure and multiple testing), avenues for solutions and some ongoing developments.

Figure 1: Log quantile–quantile (QQ) P-value plot for 3,478 single-SNP tests of association.
Figure 2: Armitage test of single-SNP association with case–control outcome.
Figure 3: Linear regression test of single-SNP associations with continuous outcomes.



Acknowledgements

I thank W. Astle and E. Waldron for help with drawing figures, and W. Astle, L. Cardon, A. Lewin, D. Lunn, A. Morris, D. Schaid, J. Whittaker and D. Zabaneh for discussions and comments on drafts of the manuscript. The author is supported in part by the UK Medical Research Council.

Ethics declarations

Competing interests

The author declares no competing financial interests.

Related links


European Bioinformatics Institute

Genetic Analysis Software (includes almost all freely available software for statistical genetic analyses, regularly updated)

Genetic Power Calculator (a useful tool that calculates the power of many simple study designs)

International HapMap Project

Nature Reviews Genetics audio supplement

R genetics package

Wellcome Trust Case Control Consortium (a large, genome-wide association study for eight distinct diseases with common set of controls)



Glossary

Haplotype

A combination of alleles at different loci on the same chromosome.

Population stratification

Refers to a situation in which the population of interest includes subgroups of individuals that are on average more related to each other than to other members of the wider population.

Multiple-testing problem

Refers to the problem that arises when many null hypotheses are tested; some significant results are likely even if all the null hypotheses are true.

Hardy–Weinberg equilibrium

Holds at a locus in a population when the two alleles within an individual are not statistically associated.
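As a rough illustration of how this definition is checked in practice, the sketch below runs a Pearson χ² goodness-of-fit test at a single biallelic SNP. The function name and the example counts are hypothetical, and the χ² approximation is only one option: it can be unreliable when an allele is rare, in which case an exact test is often preferred.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Pearson chi-square test of HWE at one biallelic SNP.

    Arguments are the observed counts of the three genotypes
    (AA, AB, BB). Returns (chi2, p_value), with 1 degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    # Maximum-likelihood estimate of the frequency of allele A.
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        raise ValueError("SNP is monomorphic; the test is undefined")
    # Genotype counts expected if the two alleles within an
    # individual are statistically independent.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical genotype counts, for illustration only.
print(hwe_chi_square(298, 489, 213))
```

A small P-value suggests departure from HWE at the SNP, which in practice is often taken as a prompt to check genotyping quality.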

Significance level

Usually denoted α, and chosen by the researcher to be the greatest probability of type-1 error that is tolerated for a statistical test. It is conventional to choose α = 5% for the overall analysis, which might consist of many tests each with a much lower significance level.
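To see why the per-test level must be much lower than the overall level, the following sketch uses a Bonferroni correction, which is one simple choice and is assumed here purely for illustration. The function names are hypothetical, and the independence assumption in `familywise_error` will not hold exactly for SNPs in linkage disequilibrium.

```python
def bonferroni_level(overall_alpha, n_tests):
    """Per-test significance level under a Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return overall_alpha / n_tests


def familywise_error(per_test_alpha, n_tests):
    """Probability of at least one type-1 error across n_tests
    independent tests when every null hypothesis is true."""
    return 1.0 - (1.0 - per_test_alpha) ** n_tests


n_tests = 3478  # number of single-SNP tests quoted in the Figure 1 caption
per_test = bonferroni_level(0.05, n_tests)
# With the corrected level, the overall error rate stays at or below 5%.
print(per_test, familywise_error(per_test, n_tests))
# Testing every SNP at 5% instead makes a false positive almost certain.
print(familywise_error(0.05, n_tests))
```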

Test statistic

A numerical summary of the data that is used to measure support for the null hypothesis. Either the test statistic has a known probability distribution (such as χ²) under the null hypothesis, or its null distribution is approximated computationally.
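One concrete instance of a test statistic with an approximate χ² null distribution is the Armitage trend test named in the Figure 2 caption. The sketch below uses the standard Cochran–Armitage formula with additive scores 0, 1, 2; the function name, argument layout and example counts are assumptions made for illustration.

```python
import math


def armitage_trend_test(cases, controls, scores=(0, 1, 2)):
    """Cochran-Armitage trend test for a 2 x 3 genotype table.

    cases and controls hold genotype counts ordered by the number of
    copies of the tested allele (0, 1, 2). Returns (chi2, p_value),
    where chi2 has 1 degree of freedom under the null hypothesis.
    """
    totals = [r + s for r, s in zip(cases, controls)]
    n_cases, n_controls = sum(cases), sum(controls)
    n = n_cases + n_controls
    if n == 0:
        raise ValueError("empty table")
    # Numerator: score-weighted difference between cases and controls.
    t = sum(x * (r * n_controls - s * n_cases)
            for x, r, s in zip(scores, cases, controls))
    # Null variance of t, conditional on the table margins.
    sum_sq = sum(x * x * m * (n - m) for x, m in zip(scores, totals))
    cross = sum(scores[i] * scores[j] * totals[i] * totals[j]
                for i in range(len(scores))
                for j in range(i + 1, len(scores)))
    var_t = n_cases * n_controls / n * (sum_sq - 2 * cross)
    if var_t == 0:
        raise ValueError("degenerate table: the statistic is undefined")
    chi2 = t * t / var_t
    # Upper tail of a chi-square with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts of genotypes with 0, 1 and 2 copies of the allele.
print(armitage_trend_test([10, 20, 30], [30, 20, 10]))
```

The statistic equals N times the squared correlation between genotype score and case status, which is why it behaves like a linear regression of disease status on allele count.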

Common-disease common-variant hypothesis

The hypothesis that many genetic variants that underlie complex diseases are common, and therefore susceptible to detection using current population association study designs. An alternative possibility is that genetic contributions to complex diseases arise from many variants, all of which are rare.

Effective population size

The size of a theoretical population that best approximates a given natural population under an assumed model. Human effective population size is often taken to mean the size of a constant-size, panmictic population of breeding adults that generates the same level of polymorphism under neutrality as observed in an actual human population.

Maximum-likelihood estimate

The value of an unknown parameter that maximizes the probability of the observed data under the assumed statistical model.


Phase

The information that is needed to determine the two haplotypes that underlie a multi-locus genotype within a chromosomal segment.

Regression models

A class of statistical models that relate an outcome variable to one or more explanatory variables. The goal might be to predict further values of the outcome variable given the explanatory variables, or to identify a minimal set of explanatory variables with good predictive power.

Prospective study design

Studies in which individuals are followed forward in time and disease events are recorded as they arise. DNA and biomarker samples, and data on environmental exposures and lifestyle factors, are usually obtained at the start of the study.

Retrospective study design

Studies in which individuals are identified for inclusion in the study on the basis of their disease state. Data on previous environmental exposures and lifestyle factors are then recorded, and samples for DNA and biomarker studies might be obtained.

Time to event

Refers to data in which the time to an event of interest is recorded, such as the time from the start of the study to disease onset, if any. This is potentially more informative than simply recording case or control status at the end of the study.

Linkage disequilibrium

The statistical association, within gametes in a population, of the alleles at two loci. Although linkage disequilibrium can be due to linkage, it can also arise at unlinked loci; for example, because of selection or non-random mating.
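The definition above can be made concrete with the usual two-locus summaries D, D′ and r². These particular measures, the function name and the example frequencies are not given in this glossary entry and are assumptions of the sketch, which takes known haplotype frequencies for loci with alleles A/a and B/b.

```python
def ld_measures(f_AB, f_Ab, f_aB, f_ab):
    """Return (D, D_prime, r_squared) from two-locus haplotype frequencies."""
    total = f_AB + f_Ab + f_aB + f_ab
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    p_A = f_AB + f_Ab  # frequency of allele A at the first locus
    p_B = f_AB + f_aB  # frequency of allele B at the second locus
    # D: excess of the AB haplotype over its expectation when the
    # alleles at the two loci are independent within gametes.
    d = f_AB - p_A * p_B
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    r_squared = d * d / denom
    # D' rescales |D| by its largest possible value given the
    # allele frequencies at the two loci.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    return d, d_prime, r_squared


# Hypothetical haplotype frequencies, for illustration only.
print(ld_measures(0.4, 0.1, 0.1, 0.4))
```

In real data the haplotype frequencies are usually not observed directly and must first be estimated from unphased genotypes.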

Type-1 error

The rejection of a true null hypothesis; for example, concluding that HWE does not hold when in fact it does. By contrast, the power of a test is the probability of correctly rejecting a false null hypothesis.

Degrees of freedom

This term is used in different senses both within statistics and in other fields. It can often be interpreted as the number of values that can be defined arbitrarily in the specification of a system; for example, the number of coefficients in a regression model. It is often sufficient to regard degrees of freedom as a parameter that is used to define particular probability distributions.


Bayesian

A statistical school of thought that, in contrast to the frequentist school, holds that inferences about any unknown parameter or hypothesis should be encapsulated in a probability distribution, given the observed data. Bayes' theorem is a celebrated result in probability theory that allows one to compute the posterior distribution for an unknown from the observed data and its assumed prior distribution.

Likelihood-ratio test

A statistical test that is based on the ratio of likelihoods under alternative and null hypotheses. If the null hypothesis is a special case of the alternative hypothesis, then the likelihood-ratio statistic typically has a χ² distribution with degrees of freedom equal to the number of additional parameters under the alternative hypothesis.
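A minimal worked case, assuming a toy comparison of allele counts between cases and controls: the null model has one shared allele frequency, the alternative has two, so the statistic is referred to a χ² distribution with one degree of freedom. The binomial model, function names and counts are chosen here for brevity and are not a method taken from the article.

```python
import math


def _xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def binomial_loglik(k, n, p):
    """Log-likelihood of k successes in n trials (constant term dropped)."""
    return _xlogy(k, p) + _xlogy(n - k, 1.0 - p)


def allele_frequency_lrt(case_alt, case_total, control_alt, control_total):
    """Likelihood-ratio test of equal allele frequency in cases and controls.

    Returns (statistic, p_value); the alternative adds one parameter,
    so the reference distribution is chi-square with 1 df.
    """
    p_case = case_alt / case_total
    p_control = control_alt / control_total
    p_pooled = (case_alt + control_alt) / (case_total + control_total)
    # Maximized log-likelihood with separate frequencies (alternative).
    ll_alt = (binomial_loglik(case_alt, case_total, p_case)
              + binomial_loglik(control_alt, control_total, p_control))
    # Maximized log-likelihood with one shared frequency (null).
    ll_null = (binomial_loglik(case_alt, case_total, p_pooled)
               + binomial_loglik(control_alt, control_total, p_pooled))
    stat = 2.0 * (ll_alt - ll_null)
    # Guard against tiny negative values caused by rounding.
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return stat, p_value


# Hypothetical allele counts: 60 of 100 in cases, 40 of 100 in controls.
print(allele_frequency_lrt(60, 100, 40, 100))
```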


Multinomial

Describes a variable with a finite number, say k, of possible outcomes; in the cases k = 2 and k = 3, the terms binomial and trinomial are also used.

Principal-components analysis

A statistical technique for summarizing many variables with minimal loss of information: the first principal component is the linear combination of the observed variables with the greatest variance; subsequent components maximize the variance subject to being uncorrelated with the preceding components.

Stepwise selection procedure

Describes a class of statistical procedures that identify from a large set of variables (such as SNPs) a subset that provides a good fit to a chosen statistical model (for example, a regression model that predicts case–control status) by successively including or discarding terms from the model.

Shrinkage methods

In this approach a prior distribution for regression coefficients is concentrated at zero, so that in the absence of a strong signal of association, the corresponding regression coefficient is 'shrunk' to zero. This mitigates the effects of too many variables (degrees of freedom) in the statistical model.


Frequentist

A name for the school of statistical thought in which support for a hypothesis or parameter value is assessed using the probability of the observed data (or more 'extreme' datasets) given the hypothesis or value. Usually contrasted with Bayesian.

About this article

Cite this article

Balding, D. A tutorial on statistical methods for population association studies. Nat Rev Genet 7, 781–791 (2006).
