
Basic statistical analysis in genetic case-control studies


This protocol describes how to perform basic statistical analysis in a population-based genetic association case-control study. The steps described involve the (i) appropriate selection of measures of association and relevance of disease models; (ii) appropriate selection of tests of association; (iii) visualization and interpretation of results; (iv) consideration of appropriate methods to control for multiple testing; and (v) replication strategies. Assuming no previous experience with software such as PLINK, R or Haploview, we describe how to use these popular tools for handling single-nucleotide polymorphism data in order to carry out tests of association and visualize and interpret results. This protocol assumes that data quality assessment and control has been performed, as described in a previous protocol, so that samples and markers deemed to have the potential to introduce bias to the study have been identified and removed. Study design, marker selection and quality control of case-control studies have also been discussed in earlier protocols. The protocol should take 1 h to complete.
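As a rough illustration of steps (i), (ii) and (iv), the sketch below computes an allelic chi-square test of association (1 degree of freedom) with its odds ratio for a single SNP, and then applies a Bonferroni correction across several P values. It is not taken from the protocol, which uses PLINK, R and Haploview; the function names `allelic_test` and `bonferroni` and all allele counts are hypothetical and chosen only for illustration.

```python
import math


def allelic_test(case_a, case_b, ctrl_a, ctrl_b):
    """Allelic chi-square test (1 d.f.) and odds ratio for one SNP.

    Arguments are allele counts (not genotype counts): alleles A and B
    in cases, then alleles A and B in controls. All counts are assumed
    to be positive.
    """
    table = [[case_a, case_b], [ctrl_a, ctrl_b]]
    n = case_a + case_b + ctrl_a + ctrl_b
    row_totals = [case_a + case_b, ctrl_a + ctrl_b]
    col_totals = [case_a + ctrl_a, case_b + ctrl_b]

    # Pearson chi-square: sum over cells of (observed - expected)^2 / expected
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected

    # Upper tail of a chi-square distribution with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2))

    # Odds of carrying allele A in cases relative to controls
    odds_ratio = (case_a * ctrl_b) / (case_b * ctrl_a)
    return chi2, p_value, odds_ratio


def bonferroni(p_values, alpha=0.05):
    """Flag tests that stay significant at family-wise level alpha."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Hypothetical allele counts for one SNP: 60 A / 40 B in cases, 40 A / 60 B in controls
chi2, p_value, odds_ratio = allelic_test(60, 40, 40, 60)
print(f"chi2 = {chi2:.2f}, P = {p_value:.4f}, OR = {odds_ratio:.2f}")

# Hypothetical P values from three SNPs
print(bonferroni([0.001, 0.02, 0.04]))
```

Bonferroni is used here only because it is the simplest of the multiple-testing corrections; it is conservative when SNPs are in linkage disequilibrium, so treat the threshold as a lower bound on what a less stringent method would allow.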


Figure 1: LD plot.
Figure 2: Quantile-quantile plots.
Figure 3: Manhattan plot.




G.M.C. is funded by the Wellcome Trust. F.H.P. is funded by the Wellcome Trust. C.A.A. is funded by the Wellcome Trust (WT91745/Z/10/Z). A.P.M. is supported by a Wellcome Trust Senior Research Fellowship. K.T.Z. is supported by a Wellcome Trust Research Career Development Fellowship.

Author information




G.M.C. wrote the first draft of the manuscript, wrote scripts and performed analyses. G.M.C., C.A.A., A.P.M. and K.T.Z. revised the manuscript and designed the protocol. L.R.C. conceived the protocol.

Corresponding author

Correspondence to Geraldine M Clarke.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Data 1

Example genome-wide association (GWA) data. (ZIP 201229 kb)

Supplementary Data 2

Example candidate gene (CG) data. (ZIP 66 kb)


About this article

Cite this article

Clarke, G., Anderson, C., Pettersson, F. et al. Basic statistical analysis in genetic case-control studies. Nat Protoc 6, 121–133 (2011).
