Data quality control in genetic case-control association studies


This protocol details the steps for data quality assessment and control that are typically carried out during case-control association studies. The steps described involve the identification and removal of DNA samples and markers that introduce bias. These steps are critical to the success of a case-control study and must be completed before statistically testing for association. We describe how to use PLINK, a tool for handling SNP data, to perform assessments of failure rate per individual and per SNP and to assess the degree of relatedness between individuals. We also detail other quality-control procedures, including the use of SMARTPCA software for the identification of ancestral outliers. These platforms were selected because they are user-friendly, widely used and computationally efficient. Steps needed to detect and establish a disease association using case-control data are not discussed here. Issues concerning study design and marker selection in case-control studies have been discussed in our earlier protocols. This protocol, which is routinely used in our labs, should take approximately 8 h to complete.
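As a rough illustration of the per-individual and per-SNP failure-rate assessments mentioned above, the following Python sketch computes missing-genotype rates and a simple heterozygosity rate from a small genotype matrix. It is not PLINK and is not part of the protocol itself: the genotype coding (0, 1 or 2 copies of an allele, `None` for a failed call), the function names and the toy data are all assumptions made for illustration only.

```python
from typing import List, Optional, Tuple

# Assumed coding: 0, 1 or 2 copies of an allele; None marks a failed genotype call.
Genotype = Optional[int]


def per_individual_stats(genotypes: List[List[Genotype]]) -> List[Tuple[float, float]]:
    """Return (missing rate, heterozygosity rate) for each individual (one row each)."""
    stats = []
    for row in genotypes:
        called = [g for g in row if g is not None]
        missing_rate = 1 - len(called) / len(row)
        # Heterozygosity is taken over successfully called genotypes only.
        het_rate = sum(g == 1 for g in called) / len(called) if called else float("nan")
        stats.append((missing_rate, het_rate))
    return stats


def per_snp_missing_rate(genotypes: List[List[Genotype]]) -> List[float]:
    """Return the fraction of individuals with a failed call at each SNP (column)."""
    n_individuals = len(genotypes)
    n_snps = len(genotypes[0])
    return [
        sum(row[j] is None for row in genotypes) / n_individuals
        for j in range(n_snps)
    ]


# Hypothetical toy data: 3 individuals (rows) by 4 SNPs (columns).
toy = [
    [0, 1, 2, 1],
    [None, 1, None, 0],
    [2, 2, None, 1],
]

ind_stats = per_individual_stats(toy)
snp_missing = per_snp_missing_rate(toy)

for i, (miss, het) in enumerate(ind_stats):
    print(f"individual {i}: missing={miss:.2f} het={het:.2f}")
print("per-SNP missing rates:", [round(m, 2) for m in snp_missing])
```

In practice, samples or markers whose rates fall outside chosen thresholds would be flagged for removal; the thresholds themselves depend on the study and are not specified here.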

Figure 1: Genotype failure rate versus heterozygosity across all individuals in the study.
Figure 2: Ancestry clustering based on genome-wide association data.
Figure 3: Histogram of missing data rate across all individuals passing 'per-individual' quality control.


C.A.A. was funded by the Wellcome Trust (WT91745/Z/10/Z). A.P.M. was supported by a Wellcome Trust Senior Research Fellowship. K.T.Z. was supported by a Wellcome Trust Research Career Development Fellowship.

Author information

C.A.A. wrote the first draft of the article. C.A.A. wrote scripts and performed analyses. C.A.A., F.H.P., G.M.C., A.P.M. and K.T.Z. revised the article. C.A.A., L.R.C., A.P.M. and K.T.Z. designed the protocol.

Corresponding authors

Correspondence to Carl A Anderson or Krina T Zondervan.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Data

Simulated dataset for use with the protocol. It contains the following files: high-LD-regions.txt, imiss-vs-het.Rscript, pca-populations.txt, plot-IBD.Rscript, plot-pca-results.Rscript and raw-GWA-data.ped (file size ~2.5 GB uncompressed). (ZIP 451047 kb)

