Data quality control in genetic case-control association studies


This protocol details the steps for data quality assessment and control that are typically carried out during case-control association studies. The steps described involve the identification and removal of DNA samples and markers that introduce bias. These critical steps are paramount to the success of a case-control study and are necessary before statistically testing for association. We describe how to use PLINK, a tool for handling SNP data, to perform assessments of failure rate per individual and per SNP and to assess the degree of relatedness between individuals. We also detail other quality-control procedures, including the use of SMARTPCA software for the identification of ancestral outliers. These platforms were selected because they are user-friendly, widely used and computationally efficient. Steps needed to detect and establish a disease association using case-control data are not discussed here. Issues concerning study design and marker selection in case-control studies have been discussed in our earlier protocols. This protocol, which is routinely used in our labs, should take approximately 8 h to complete.
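The per-individual and per-SNP failure rates mentioned above, together with the per-individual heterozygosity rate plotted in Figure 1, can be illustrated with a minimal Python sketch. This is not the protocol's PLINK workflow. The genotype coding (0, 1 or 2 copies of the minor allele, with `None` for a failed call), the toy data, the function names and both thresholds are assumptions chosen purely for illustration.

```python
from typing import List, Optional

# Assumed coding: minor-allele count (0, 1, 2); None marks a failed genotype call.
Genotype = Optional[int]
Matrix = List[List[Genotype]]  # rows = individuals, columns = SNPs


def missing_rate_per_individual(genos: Matrix) -> List[float]:
    """Fraction of SNPs without a genotype call, one value per individual."""
    return [sum(g is None for g in row) / len(row) for row in genos]


def missing_rate_per_snp(genos: Matrix) -> List[float]:
    """Fraction of individuals without a genotype call, one value per SNP."""
    n_individuals = len(genos)
    n_snps = len(genos[0])
    return [
        sum(row[j] is None for row in genos) / n_individuals
        for j in range(n_snps)
    ]


def heterozygosity_per_individual(genos: Matrix) -> List[float]:
    """Fraction of successfully called genotypes that are heterozygous."""
    rates = []
    for row in genos:
        called = [g for g in row if g is not None]
        if called:
            rates.append(sum(g == 1 for g in called) / len(called))
        else:
            rates.append(float("nan"))  # no calls at all for this individual
    return rates


def flag_above(rates: List[float], max_rate: float) -> List[int]:
    """Indices whose rate exceeds max_rate (the threshold is the caller's choice)."""
    return [i for i, r in enumerate(rates) if r > max_rate]


# Toy data: 4 individuals x 5 SNPs (hypothetical values).
genotypes: Matrix = [
    [0, 1, 2, 0, 1],
    [0, None, None, None, 1],
    [1, 1, 0, 2, 0],
    [0, 0, None, 1, 2],
]

individual_missing = missing_rate_per_individual(genotypes)
snp_missing = missing_rate_per_snp(genotypes)
individual_het = heterozygosity_per_individual(genotypes)

# Illustrative thresholds only; they are not taken from the protocol.
failed_individuals = flag_above(individual_missing, max_rate=0.5)
failed_snps = flag_above(snp_missing, max_rate=0.4)

print("Per-individual missing rate:", individual_missing)
print("Per-SNP missing rate:", snp_missing)
print("Per-individual heterozygosity:", individual_het)
print("Individuals flagged:", failed_individuals)
print("SNPs flagged:", failed_snps)
```

In a real study these summaries would be computed over hundreds of thousands of markers with dedicated software such as PLINK, and exclusion thresholds would be chosen by inspecting the empirical distributions rather than fixed in advance.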


Figure 1: Genotype failure rate versus heterozygosity across all individuals in the study.
Figure 2: Ancestry clustering based on genome-wide association data.
Figure 3: Histogram of missing data rate across all individuals passing 'per-individual' quality control.




C.A.A. was funded by the Wellcome Trust (WT91745/Z/10/Z). A.P.M. was supported by a Wellcome Trust Senior Research Fellowship. K.T.Z. was supported by a Wellcome Trust Research Career Development Fellowship.

Author information

Authors and Affiliations



C.A.A. wrote the first draft of the article. C.A.A. wrote scripts and performed analyses. C.A.A., F.H.P., G.M.C., A.P.M. and K.T.Z. revised the article. C.A.A., L.R.C., A.P.M. and K.T.Z. designed the protocol.

Corresponding authors

Correspondence to Carl A Anderson or Krina T Zondervan.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Data

Simulated dataset for use with the protocol. It contains the following files: high-LD-regions.txt, imiss-vs-het.Rscript, pca-populations.txt, plot-IBD.Rscript, plot-pca-results.Rscript and raw-GWA-data.ped (file size ~2.5 GB uncompressed). (ZIP 451047 kb)


About this article

Cite this article

Anderson, C., Pettersson, F., Clarke, G. et al. Data quality control in genetic case-control association studies. Nat Protoc 5, 1564–1573 (2010).
