Key Points
- p-values are commonly used to summarize evidence for association between a genetic variant and a phenotype, but they have an important limitation: they cannot quantify how confident one should be that a given SNP is truly associated with the phenotype.
- Bayesian methods provide an alternative approach to assessing associations. We show that Bayesian analyses are not too difficult and can be rewarding; for example, unlike p-values, a Bayesian probability of association is comparable across SNPs and across studies.
- For a Bayesian analysis of single-SNP association in a case–control study, we discuss genetic models that can form an alternative to the null hypothesis of no association, as well as effect-size distributions for the parameters of these models. An alternative Bayesian analysis derives a posterior distribution for effect size, without reference to a null hypothesis.
- We give an example of a multi-SNP Bayesian analysis for fine-scale mapping and discuss Bayesian approaches to multiple testing and meta-analysis.
- We suggest broad guidelines for editors and reviewers of Bayesian analyses.
Abstract
Bayesian statistical methods have recently made great inroads into many areas of science, and this advance is now extending to the assessment of association between genetic variants and disease or other phenotypes. We review these methods, focusing on single-SNP tests in genome-wide association studies. We discuss the advantages of the Bayesian approach over classical (frequentist) approaches in this setting and provide a tutorial on basic analysis steps, including practical guidelines for appropriate prior specification. We demonstrate the use of Bayesian methods for fine mapping in candidate regions, discuss meta-analyses and provide guidance for refereeing manuscripts that contain Bayesian analyses.
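The contrast between a p-value and a Bayesian probability of association can be made concrete with a small sketch. It assumes the standard odds form of Bayes' theorem (posterior odds = Bayes factor × prior odds). The function name, the prior probability of 1 in 10,000 and the Bayes factor values are illustrative assumptions, not figures taken from this Review.

```python
def posterior_probability_of_association(bayes_factor: float, prior_prob: float) -> float:
    """Return P(association | data) from a Bayes factor and a prior probability.

    bayes_factor: P(data | association) / P(data | no association).
    prior_prob:   assumed prior probability that the SNP is associated.
    """
    if not 0.0 < prior_prob < 1.0:
        raise ValueError("prior_prob must lie strictly between 0 and 1")
    if bayes_factor < 0.0:
        raise ValueError("bayes_factor must be non-negative")
    prior_odds = prior_prob / (1.0 - prior_prob)
    posterior_odds = bayes_factor * prior_odds  # Bayes' theorem in odds form
    return posterior_odds / (1.0 + posterior_odds)


if __name__ == "__main__":
    # Illustrative values only: a prior of 1 in 10,000 and three Bayes factors.
    for bf in (1.0, 1e4, 1e6):
        ppa = posterior_probability_of_association(bf, prior_prob=1e-4)
        print(f"BF = {bf:g}: posterior probability of association = {ppa:.4f}")
```

Under these assumptions the same Bayes factor yields a different posterior probability whenever the prior changes, which is one reason the prior should be stated explicitly in any reported analysis.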
Acknowledgements
We thank C. Hoggart for providing R code to compute the normal-exponential-gamma probability density function and J. Wakefield for helpful discussions and critical reading of an early draft. We thank R. Krauss for access to the CRP genotype and phenotype data that we analysed here. We are also grateful to W. Astle, A. Ramasamy, L. Bottolo, L. Coin, P. O'Reilly and H. Eleftherohorinou for discussions. The authors' work is supported in part by National Institutes of Health grants HL084689 (to M.S.) and EP/C533542 (to D.J.B.).
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Supplementary information
Supplementary information S1 (box)
(PDF 670 kb)
Glossary
- Frequentist
  A statistical school of thought in which inferences about unknowns are justified not with reference to probabilities for the inferred value, but on the basis of measures of performance under imaginary repetitions of the procedure that was used to make the inference.
- Population association
  Also known as true association. An association between a SNP and a phenotype that is present in the population from which a sample is taken. A population association can arise owing to population structure, but for simplicity we assume here that this possibility has been eliminated (for example, by covariate adjustment) and hence that population associations are caused by a functional SNP, either directly or through linkage disequilibrium.
- p-value
  The probability, if the null hypothesis were true, that an imaginary future repetition of the study would generate stronger evidence for association than that actually observed. A p-value is conventionally interpreted as measuring the strength of evidence for association, but there is no simple relationship between a p-value and the probability that the association is genuine.
- Power
  For a given population association, the power of a statistical test is the probability that the null hypothesis is rejected under imaginary repetitions of the study.
- Bayesian
  A statistical school of thought that holds that inferences about any unknown parameter or hypothesis should be encapsulated in a probability distribution, given the observed data. Computing this posterior probability distribution usually proceeds by specifying a prior distribution that summarizes knowledge about the unknown before the observed data are considered, and then using Bayes' theorem to transform the prior distribution into a posterior distribution.
- Meta-analysis
  The combination of the results of multiple scientific studies that address the same, or similar, hypotheses.
- Posterior probability of association
  The probability that a SNP is truly associated with a phenotype. The posterior probability of association depends on modelling assumptions that should be made explicit in a careful analysis.
- Likelihood ratio
  The ratio of the probabilities of the observed data for two different values of the unknown parameter(s) under a given statistical model.
- Odds
  The probability of the occurrence of a particular event (for example, the onset of disease) divided by the probability of the event not occurring. It is often mathematically convenient to transform a probability, which must lie between zero and one, to odds, which can take any positive value.
- Bonferroni correction
  When multiple hypotheses are tested, the Bonferroni correction to the overall desired significance level (α) is obtained by dividing it by the number of tests (k), so that each hypothesis is rejected if p-value < α/k.
- False discovery rate
  For a sequence of hypothesis tests, the false discovery rate is the proportion of times H0 is true among those tests for which H0 is rejected.
- Odds ratio
  The odds ratio comparing, for example, two genotypes is the odds for individuals with the first genotype divided by the odds for individuals with the second genotype.
- Logistic regression
  A regression model for binary outcomes (such as case and control) in which the logarithm of the odds is related linearly to one or more predictors, such as SNP minor allele count(s).
- Laplace approximation
  A method for approximating the integral of a (possibly multidimensional) probability density based on replacing that density by a Gaussian probability density with the same mean and variance–covariance matrix.
- Maximum-likelihood estimate
  The maximum-likelihood estimate of an unknown parameter in a statistical model is the value of the parameter that maximizes the probability under the model of the observed data.
- Statin
  A class of drugs that is used to lower cholesterol levels in people with, or at risk of, cardiovascular disease.
- Genotype imputation method
  A method for estimating ('imputing') the unobserved genotypes of study subjects, both for individuals with missing or unreliable genotypes at a genotyped SNP and for all individuals at an ungenotyped SNP.
- Hardy–Weinberg equilibrium
  This holds at a given locus in a given population when the two alleles of individuals in the population are mutually independent.
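Several of the glossary definitions above are simple arithmetic, and a short sketch may help fix them. It covers odds, the odds ratio between two genotype groups in a case–control table, and the Bonferroni rule of rejecting when the p-value is below α/k. All function names, counts and p-values are hypothetical and chosen only for illustration.

```python
from typing import List


def odds(probability: float) -> float:
    """Probability of an event divided by the probability of it not occurring."""
    if not 0.0 <= probability < 1.0:
        raise ValueError("probability must be in [0, 1)")
    return probability / (1.0 - probability)


def probability_from_odds(o: float) -> float:
    """Inverse transform: map odds (any non-negative value) back to a probability."""
    if o < 0.0:
        raise ValueError("odds must be non-negative")
    return o / (1.0 + o)


def odds_ratio(cases_g1: int, controls_g1: int, cases_g2: int, controls_g2: int) -> float:
    """Odds of being a case with genotype 1 divided by the odds with genotype 2."""
    if min(controls_g1, cases_g2, controls_g2) <= 0:
        raise ValueError("counts in the denominators must be positive")
    return (cases_g1 / controls_g1) / (cases_g2 / controls_g2)


def bonferroni_reject(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Reject each null hypothesis whose p-value is below alpha / k, with k tests."""
    k = len(p_values)
    if k == 0:
        return []
    threshold = alpha / k
    return [p < threshold for p in p_values]


if __name__ == "__main__":
    print(odds(0.2))                                 # odds for a probability of 0.2
    print(odds_ratio(30, 70, 10, 90))                # hypothetical genotype counts
    print(bonferroni_reject([0.001, 0.02, 0.04]))    # three hypothetical tests
```

The Bonferroni threshold shrinks as the number of tests grows, so a p-value that is rejected in a small candidate-gene study may not be rejected in a genome-wide scan.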
About this article
Cite this article
Stephens, M., Balding, D. Bayesian statistical methods for genetic association studies. Nat Rev Genet 10, 681–690 (2009). https://doi.org/10.1038/nrg2615