Data quality control in genetic case-control association studies

Anderson, Carl A; Pettersson, Fredrik H; Clarke, Geraldine M; Cardon, Lon R; Morris, Andrew P; Zondervan, Krina T

doi:10.1038/nprot.2010.116

Protocol
Published: 26 August 2010

Nature Protocols volume 5, pages 1564–1573 (2010)

Abstract

This protocol details the steps for data quality assessment and control that are typically carried out during case-control association studies. The steps described involve the identification and removal of DNA samples and markers that introduce bias. These critical steps are paramount to the success of a case-control study and are necessary before statistically testing for association. We describe how to use PLINK, a tool for handling SNP data, to perform assessments of failure rate per individual and per SNP and to assess the degree of relatedness between individuals. We also detail other quality-control procedures, including the use of SMARTPCA software for the identification of ancestral outliers. These platforms were selected because they are user-friendly, widely used and computationally efficient. Steps needed to detect and establish a disease association using case-control data are not discussed here. Issues concerning study design and marker selection in case-control studies have been discussed in our earlier protocols. This protocol, which is routinely used in our labs, should take approximately 8 h to complete.
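As a rough illustration of the per-individual assessment mentioned above, the sketch below computes a genotype failure (missing) rate and a heterozygosity rate for each sample and flags possible outliers. It is not the protocol's PLINK workflow: the function name `per_individual_qc`, the 0/1/2/`None` genotype encoding, and the default thresholds (`max_missing`, `het_sd`) are all assumptions chosen for the example.

```python
from statistics import mean, stdev


def per_individual_qc(genotypes, max_missing=0.03, het_sd=3.0):
    """Flag samples with a high failure rate or outlying heterozygosity.

    genotypes: dict mapping sample ID -> list of calls, where each call is
    0, 1 or 2 (copies of one allele) or None for a failed genotype.
    max_missing and het_sd are illustrative defaults, not protocol values.
    """
    stats = {}
    for sample, calls in genotypes.items():
        called = [g for g in calls if g is not None]
        missing_rate = 1 - len(called) / len(calls)
        # Heterozygosity rate among successfully called genotypes only.
        het_rate = (
            sum(1 for g in called if g == 1) / len(called) if called else None
        )
        stats[sample] = (missing_rate, het_rate)

    het_values = [h for _, h in stats.values() if h is not None]
    mu = mean(het_values)
    # stdev needs at least two values; fall back to 0 for tiny inputs.
    sd = stdev(het_values) if len(het_values) > 1 else 0.0

    flagged = set()
    for sample, (miss, het) in stats.items():
        too_missing = miss > max_missing
        het_outlier = het is None or abs(het - mu) > het_sd * sd
        if too_missing or het_outlier:
            flagged.add(sample)
    return stats, flagged


if __name__ == "__main__":
    toy = {
        "s1": [0, 1, 2, 1],
        "s2": [0, 1, 2, 1],
        "s3": [None, None, 1, 0],  # half of the genotypes failed
    }
    stats, flagged = per_individual_qc(toy)
    print(stats)
    print(sorted(flagged))
```

Plotting the two values in each `stats` entry against each other gives the kind of view described by the caption of Figure 1. In practice the cut-offs would be chosen after inspecting that distribution rather than fixed in advance.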

**Figure 1: Genotype failure rate versus heterozygosity across all individuals in the study.**

**Figure 2: Ancestry clustering based on genome-wide association data.**

**Figure 3: Histogram of missing data rate across all individuals passing 'per-individual' quality control.**

Acknowledgements

C.A.A. was funded by the Wellcome Trust (WT91745/Z/10/Z). A.P.M. was supported by a Wellcome Trust Senior Research Fellowship. K.T.Z. was supported by a Wellcome Trust Research Career Development Fellowship.

Author information

Authors and Affiliations

Genetic and Genomic Epidemiology Unit, Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
Carl A Anderson, Fredrik H Pettersson, Geraldine M Clarke, Andrew P Morris & Krina T Zondervan
Statistical Genetics, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
Carl A Anderson
GlaxoSmithKline, King of Prussia, Pennsylvania, USA
Lon R Cardon

Contributions

C.A.A. wrote the first draft of the article. C.A.A. wrote scripts and performed analyses. C.A.A., F.H.P., G.M.C., A.P.M. and K.T.Z. revised the article. C.A.A., L.R.C., A.P.M. and K.T.Z. designed the protocol.

Corresponding authors

Correspondence to Carl A Anderson or Krina T Zondervan.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Data

Simulated dataset for use with the protocol (ZIP, 451047 kb), containing the following files:
- hapmap3r2_CEU.CHB.JPT.YRI.founders.no-at-cg-snps.bed
- hapmap3r2_CEU.CHB.JPT.YRI.founders.no-at-cg-snps.bim
- hapmap3r2_CEU.CHB.JPT.YRI.founders.no-at-cg-snps.fam
- hapmap3r2_CEU.CHB.JPT.YRI.no-at-cg-snps.txt
- high-LD-regions.txt
- imiss-vs-het.Rscript
- pca-populations.txt
- plot-IBD.Rscript
- plot-pca-results.Rscript
- raw-GWA-data.map
- raw-GWA-data.ped (file size ~2.5GB uncompressed)
- raw-GWA-data.prune.in
- run-diffmiss-qc.pl
- run-IBD-QC.pl

About this article

Cite this article

Anderson, C., Pettersson, F., Clarke, G. et al. Data quality control in genetic case-control association studies. Nat Protoc 5, 1564–1573 (2010). https://doi.org/10.1038/nprot.2010.116

Issue Date: September 2010
