# A tutorial on statistical methods for population association studies

## Key Points

• Although population association studies are not new, there remain many areas of disagreement over appropriate statistical analyses. This article provides an overview of statistical methods, including areas of controversy and ongoing developments. It does not consider family-based association studies, nor linkage or admixture studies.

• I first cover analyses that are preliminary to association testing: testing for Hardy–Weinberg equilibrium; imputing missing genotype data; inferring haplotype from genotype data; measures of linkage disequilibrium and estimates of recombination rates; and choosing tag SNPs.

• Among tests of association, I cover case–control, quantitative and ordered phenotypes, and analyses that are based on single SNPs, multiple SNPs and haplotypes. There is a discussion of issues that are relevant to genome-wide association studies.

• I discuss Genomic Control and other approaches to the problem of population stratification.

• I give particular attention to the problem of multiple testing, and discuss both frequentist and Bayesian approaches to addressing the problem.

## Abstract

Although genetic association studies have been with us for many years, even for the simplest analyses there is little consensus on the most appropriate statistical procedures. Here I give an overview of statistical approaches to population association studies, including preliminary analyses (Hardy–Weinberg equilibrium testing, inference of phase and missing data, and SNP tagging), and single-SNP and multipoint tests for association. My goal is to outline the key methods with a brief discussion of problems (population structure and multiple testing), avenues for solutions and some ongoing developments.

## Acknowledgements

I thank W. Astle and E. Waldron for help with drawing figures, and W. Astle, L. Cardon, A. Lewin, D. Lunn, A. Morris, D. Schaid, J. Whittaker and D. Zabaneh for discussions and comments on drafts of the manuscript. The author is supported in part by the UK Medical Research Council.

## Ethics declarations

### Competing interests

The author declares no competing financial interests.

## Glossary

Haplotype

A combination of alleles at different loci on the same chromosome.

Population stratification

Refers to a situation in which the population of interest includes subgroups of individuals that are on average more related to each other than to other members of the wider population.

Multiple-testing problem

Refers to the problem that arises when many null hypotheses are tested; some significant results are likely even if all the null hypotheses are true.

Hardy–Weinberg equilibrium

Holds at a locus in a population when the two alleles within an individual are not statistically associated.
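As an illustration of how this definition is commonly turned into a test, the sketch below compares observed genotype counts at a biallelic locus with the counts expected when the two alleles within an individual are independent. It uses a Pearson χ² statistic with one degree of freedom; the function name `hwe_chi_square` and the example counts are hypothetical, and an exact test may be preferable when genotype counts are small.

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Pearson chi-square test of HWE at a biallelic locus (1 degree of freedom).

    Assumes both alleles are observed, so that all expected counts are positive.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # estimated frequency of allele A
    q = 1.0 - p
    # Genotype counts expected if the two alleles in an individual are independent.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square distribution with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts: a sample in exact HWE proportions, and one with a
# marked deficit of heterozygotes.
print(hwe_chi_square(25, 50, 25))
print(hwe_chi_square(40, 20, 40))
```

A deficit of heterozygotes, as in the second call, gives a large statistic and a small p-value.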

Significance level

Usually denoted α, and chosen by the researcher to be the greatest probability of type-1 error that is tolerated for a statistical test. It is conventional to choose α = 5% for the overall analysis, which might consist of many tests each with a much lower significance level.
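The relationship between an overall significance level and the much lower per-test level can be sketched numerically. The example below assumes independent tests, which is a simplification because tests at nearby SNPs are typically correlated; the function names are hypothetical. It shows the probability of at least one type-1 error when every test uses the same level, and two standard per-test thresholds (Bonferroni and Šidák) that keep the overall level at α.

```python
def familywise_error_rate(alpha_per_test, n_tests):
    """Probability of at least one type-1 error among n independent tests."""
    return 1.0 - (1.0 - alpha_per_test) ** n_tests

def bonferroni_threshold(alpha_overall, n_tests):
    """Per-test level that bounds the overall level, whatever the dependence."""
    return alpha_overall / n_tests

def sidak_threshold(alpha_overall, n_tests):
    """Per-test level giving exactly alpha_overall under independence."""
    return 1.0 - (1.0 - alpha_overall) ** (1.0 / n_tests)

# With 100 independent tests each at the 5% level, a false positive is almost certain.
print(familywise_error_rate(0.05, 100))

# Per-test thresholds for an overall 5% level across one million tests.
print(bonferroni_threshold(0.05, 1_000_000))
print(sidak_threshold(0.05, 1_000_000))
```

The Bonferroni threshold is slightly more conservative than the Šidák threshold, but the two are nearly identical when the number of tests is large.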

Test statistic

A numerical summary of the data that is used to measure support for the null hypothesis. Either the test statistic has a known probability distribution (such as χ²) under the null hypothesis, or its null distribution is approximated computationally.

Common-disease common-variant hypothesis

The hypothesis that many genetic variants that underlie complex diseases are common, and therefore susceptible to detection using current population association study designs. An alternative possibility is that genetic contributions to complex diseases arise from many variants, all of which are rare.

Effective population size

The size of a theoretical population that best approximates a given natural population under an assumed model. Human effective population size is often taken to mean the size of a constant-size, panmictic population of breeding adults that generates the same level of polymorphism under neutrality as observed in an actual human population.

Maximum-likelihood estimate

The value of an unknown parameter that maximizes the probability of the observed data under the assumed statistical model.

Phase

The information that is needed to determine the two haplotypes that underlie a multi-locus genotype within a chromosomal segment.

Regression models

A class of statistical models that relate an outcome variable to one or more explanatory variables. The goal might be to predict further values of the outcome variable given the explanatory variables, or to identify a minimal set of explanatory variables with good predictive power.

Prospective study design

Studies in which individuals are followed forward in time and disease events are recorded as they arise. DNA and biomarker samples, and data on environmental exposures and lifestyle factors, are usually obtained at the start of the study.

Retrospective study design

Studies in which individuals are identified for inclusion in the study on the basis of their disease state. Data on previous environmental exposures and lifestyle factors are then recorded, and samples for DNA and biomarker studies might be obtained.

Time to event

Refers to data in which the time to an event of interest is recorded, such as the time from the start of the study to disease onset, if any. This is potentially more informative than simply recording case or control status at the end of the study.

Linkage disequilibrium

The statistical association, within gametes in a population, of the alleles at two loci. Although linkage disequilibrium can be due to linkage, it can also arise at unlinked loci; for example, because of selection or non-random mating.
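The association between alleles at two loci is usually quantified from haplotype and allele frequencies. The following is a minimal sketch of three widely used measures (D, D′ and r²) for two biallelic loci; the function name `ld_measures` and the example frequencies are hypothetical, and both loci are assumed to be polymorphic so that the denominators are non-zero.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1 and B at locus 2
    p_a:  frequency of allele A at locus 1
    p_b:  frequency of allele B at locus 2
    """
    # D is the excess of the observed haplotype frequency over its
    # expectation if the alleles at the two loci were independent.
    d = p_ab - p_a * p_b
    # D' rescales D by the largest magnitude it could take given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    # r^2 is the squared correlation between the alleles at the two loci.
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2

# Allele B occurs only on haplotypes that carry A, so D' = 1 although r^2 < 1.
print(ld_measures(p_ab=0.2, p_a=0.5, p_b=0.2))
```

The example illustrates that D′ and r² can differ substantially for the same pair of loci when the allele frequencies are unequal.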

Type-1 error

The rejection of a true null hypothesis; for example, concluding that HWE does not hold when in fact it does. By contrast, the power of a test is the probability of correctly rejecting a false null hypothesis.

Degrees of freedom

This term is used in different senses both within statistics and in other fields. It can often be interpreted as the number of values that can be defined arbitrarily in the specification of a system; for example, the number of coefficients in a regression model. It is often sufficient to regard degrees of freedom as a parameter that is used to define particular probability distributions.

Bayesian

A statistical school of thought that, in contrast to the frequentist school, holds that inferences about any unknown parameter or hypothesis should be encapsulated in a probability distribution, given the observed data. Bayes' theorem is a celebrated result in probability theory that allows one to compute the posterior distribution for an unknown from the observed data and its assumed prior distribution.

Likelihood-ratio test

A statistical test that is based on the ratio of likelihoods under alternative and null hypotheses. If the null hypothesis is a special case of the alternative hypothesis, then the likelihood-ratio statistic typically has a χ² distribution with degrees of freedom equal to the number of additional parameters under the alternative hypothesis.
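A simple nested comparison makes the definition concrete. In the sketch below, the null hypothesis is that an allele has the same frequency in cases and controls, and the alternative allows two separate frequencies, so there is one additional parameter and the statistic is referred to a χ² distribution with one degree of freedom. Treating allele counts as independent binomial draws is an assumption of this illustration (it relies on HWE in the combined sample); the function names and counts are hypothetical.

```python
import math

def binom_loglik(k, n, p):
    """Binomial log-likelihood up to an additive constant; 0 * log(0) is taken as 0."""
    ll = 0.0
    if k > 0:
        ll += k * math.log(p)
    if n - k > 0:
        ll += (n - k) * math.log(1.0 - p)
    return ll

def allele_frequency_lrt(k_cases, n_cases, k_controls, n_controls):
    """Likelihood-ratio test of equal allele frequency in cases and controls.

    k_* is the number of copies of the allele, n_* the number of chromosomes.
    """
    # Null hypothesis: one shared frequency, estimated by maximum likelihood.
    p_null = (k_cases + k_controls) / (n_cases + n_controls)
    ll_null = (binom_loglik(k_cases, n_cases, p_null)
               + binom_loglik(k_controls, n_controls, p_null))
    # Alternative hypothesis: a separate maximum-likelihood estimate per group.
    ll_alt = (binom_loglik(k_cases, n_cases, k_cases / n_cases)
              + binom_loglik(k_controls, n_controls, k_controls / n_controls))
    stat = max(0.0, 2.0 * (ll_alt - ll_null))
    # One extra parameter under the alternative, so 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical counts: 120 of 200 case chromosomes and 80 of 200 control
# chromosomes carry the allele.
print(allele_frequency_lrt(120, 200, 80, 200))
```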

Multinomial

Describes a variable with a finite number, say k, of possible outcomes; in the cases k = 2 and k = 3, the terms binomial and trinomial are also used.

Principal-components analysis

A statistical technique for summarizing many variables with minimal loss of information: the first principal component is the linear combination of the observed variables with the greatest variance; subsequent components maximize the variance subject to being uncorrelated with the preceding components.

Stepwise selection procedure

Describes a class of statistical procedures that identify from a large set of variables (such as SNPs) a subset that provides a good fit to a chosen statistical model (for example, a regression model that predicts case–control status) by successively including or discarding terms from the model.

Shrinkage methods

In this approach a prior distribution for regression coefficients is concentrated at zero, so that in the absence of a strong signal of association, the corresponding regression coefficient is 'shrunk' to zero. This mitigates the effects of too many variables (degrees of freedom) in the statistical model.

Frequentist

A name for the school of statistical thought in which support for a hypothesis or parameter value is assessed using the probability of the observed data (or more 'extreme' datasets) given the hypothesis or value. Usually contrasted with Bayesian.

## Cite this article

Balding, D. A tutorial on statistical methods for population association studies. Nat Rev Genet 7, 781–791 (2006). https://doi.org/10.1038/nrg1916
