Genotype imputation for genome-wide association studies

Key Points

  • We review the statistical methods available for carrying out genotype imputation and compare their properties and performance.

  • We also review the downstream uses of imputation, including boosting the power of genome-wide association studies, fine-mapping and allowing comparisons between studies.

  • Several factors influence imputation accuracy, such as reference panel and study sample combination, sample size, genotyping chip and allele frequency.

  • Both Bayesian and frequentist methods can be used to impute SNP genotypes to test for association.

  • We review and compare the information metrics that are commonly used when carrying out quality control of imputed genotype data.

Abstract

In the past few years genome-wide association (GWA) studies have uncovered a large number of convincingly replicated associations for many complex human diseases. Genotype imputation has been used widely in the analysis of GWA studies to boost power, fine-map associations and facilitate the combination of results across studies using meta-analysis. This Review describes the details of several different statistical methods for imputing genotypes, illustrates and discusses the factors that influence imputation performance, and reviews methods that can be used to assess imputation performance and test association at imputed SNPs.

Figure 1: Post-imputation information measures.

Acknowledgements

B.N.H. was funded by a National Science Foundation Graduate Research Fellowship and the Overseas Research Students Awards Scheme. J.M. acknowledges support from the Medical Research Council.

Author information

Corresponding author

Correspondence to Jonathan Marchini.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary information S1

Alignment of reference and study datasets (PDF 82 kb)

Supplementary information S2

HMM-based methods (XLS 36 kb)

Supplementary information S3

Imputation information measures (PDF 146 kb)

Supplementary information S4

(PDF 262 kb)

Supplementary information S5

(PDF 648 kb)

Supplementary information S6

(PDF 257 kb)

Supplementary information S7

Testing for association at imputed SNPs (PDF 176 kb)

Supplementary information S8

(PDF 82 kb)

Supplementary information S9

(PDF 79 kb)

Supplementary information S10

(PDF 74 kb)

Supplementary information S11

(PDF 156 kb)

Related links

FURTHER INFORMATION

1000 Genomes Project

BEAGLE

fastPHASE and BIMBAM

GEDI

IMPUTE v1 and v2, SNPTEST and HAPGEN

MACH, MACH2DAT and MACH2QTL

PLINK

ProbABEL

SNPMSTAT

TUNA

UNPHASED

Glossary

Hidden Markov model

A class of statistical model that can be used to relate an observed process across the genome to an underlying, unobserved process of interest. Such models have been used to estimate population structure and admixture, for genotype imputation and for multiple testing.

Linkage disequilibrium
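As an illustration of how an observed sequence is related to an unobserved one, the following is a minimal sketch of the standard forward algorithm for a discrete hidden Markov model. It is a generic textbook recursion, not the specific model used by any imputation program; the function name and the toy parameters are assumptions made for this example.

```python
def forward_likelihood(obs, init, trans, emit):
    """Probability of an observed sequence under a discrete HMM.

    obs   : list of observed symbol indices
    init  : init[s] = P(first hidden state is s)
    trans : trans[r][s] = P(next state is s | current state is r)
    emit  : emit[s][o] = P(observing symbol o | hidden state s)
    All names are illustrative; they do not come from any imputation package.
    """
    if not obs:
        raise ValueError("obs must contain at least one observation")
    n_states = len(init)
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n_states)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[r] * trans[r][s] for r in range(n_states)) * emit[s][o]
            for s in range(n_states)
        ]
    # Summing over the final hidden state gives the sequence likelihood.
    return sum(alpha)


if __name__ == "__main__":
    # Toy two-state model with made-up parameters.
    init = [0.5, 0.5]
    trans = [[0.9, 0.1], [0.1, 0.9]]
    emit = [[0.8, 0.2], [0.2, 0.8]]
    print(forward_likelihood([0, 0, 1], init, trans, emit))
```

In an imputation setting the hidden states would typically index reference haplotypes and the observations would be the typed alleles, but that mapping is only sketched here, not implemented.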

The statistical association within gametes in a population of the alleles at two loci. Although linkage disequilibrium can be due to linkage, it can also arise at unlinked loci — for example, because of selection or non-random mating.
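The definition above can be made concrete with two commonly used summaries of linkage disequilibrium between a pair of biallelic loci: the coefficient D and the squared correlation r². The sketch below assumes haplotype and allele frequencies are already known; the function name and argument names are hypothetical.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, r2) for two biallelic loci.

    p_ab : frequency of haplotypes carrying allele A at locus 1 and allele B at locus 2
    p_a  : frequency of allele A at locus 1
    p_b  : frequency of allele B at locus 2
    """
    # D measures how far the haplotype frequency departs from independence.
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    # r2 is undefined when either locus is monomorphic.
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, r2


if __name__ == "__main__":
    print(ld_stats(0.5, 0.5, 0.5))   # alleles always travel together
    print(ld_stats(0.25, 0.5, 0.5))  # alleles are independent
```

When the two alleles always occur on the same haplotype, r² equals 1; when they are independent, D and r² are both 0.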

Expectation-maximization algorithm

A method for finding maximum-likelihood estimates of parameters in statistical models, in which the model depends on unobserved latent variables. It is an iterative method which alternates between performing an expectation (E) step and a maximization (M) step.
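A small worked case may help: estimating two-SNP haplotype frequencies from unphased genotypes, where the unobserved latent variable is the phase of individuals who are heterozygous at both SNPs. The sketch below is a minimal illustration under random mating, with hypothetical function and variable names; it is not the implementation used by any of the programs discussed in this Review.

```python
def em_haplotype_freqs(genotypes, n_iter=50):
    """Estimate two-SNP haplotype frequencies by EM.

    genotypes : list of (g1, g2), each the count (0, 1 or 2) of allele 1
                at SNP 1 and SNP 2 for one individual.
    Returns a dict mapping haplotype (a1, a2) to its estimated frequency.
    """
    if not genotypes:
        raise ValueError("genotypes must not be empty")
    freqs = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
    for _ in range(n_iter):
        counts = dict.fromkeys(freqs, 0.0)
        for g1, g2 in genotypes:
            if g1 == 1 and g2 == 1:
                # E step: a double heterozygote is either 00/11 or 01/10;
                # weight the two phasings by the current frequencies.
                cis = freqs[(0, 0)] * freqs[(1, 1)]
                trans = freqs[(0, 1)] * freqs[(1, 0)]
                total = cis + trans
                w = cis / total if total > 0 else 0.5
                counts[(0, 0)] += w
                counts[(1, 1)] += w
                counts[(0, 1)] += 1 - w
                counts[(1, 0)] += 1 - w
            else:
                # Phase is unambiguous when at most one SNP is heterozygous.
                h1 = (1 if g1 == 2 else 0, 1 if g2 == 2 else 0)
                h2 = (1 if g1 >= 1 else 0, 1 if g2 >= 1 else 0)
                counts[h1] += 1
                counts[h2] += 1
        # M step: new frequencies are the expected haplotype proportions.
        n_haps = 2 * len(genotypes)
        freqs = {h: c / n_haps for h, c in counts.items()}
    return freqs


if __name__ == "__main__":
    data = [(0, 0)] * 4 + [(2, 2)] * 4 + [(1, 1)] * 2
    print(em_haplotype_freqs(data))
```

In the toy data, the homozygous individuals carry only the 00 and 11 haplotypes, so the iterations progressively assign the double heterozygotes to the 00/11 phasing.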

Identical by state

Two or more alleles are identical by state if they are identical. Alleles which are identical by state may or may not be identical by descent owing to the possibility of multiple mutation events.

Identical by descent

Two or more alleles are identical by descent if they are identical copies of the same ancestral allele.

Best-guess genotype

Most imputation methods provide a probabilistic prediction of the missing genotypes. The best-guess genotype is the genotype with the largest predicted probability.
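The following minimal sketch shows how a best-guess genotype could be extracted from a triple of predicted genotype probabilities. The function name, the genotype coding (0, 1, 2 copies of an allele) and the optional calling threshold are assumptions made for illustration, not part of any specific imputation program.

```python
def best_guess_genotype(probs, threshold=None):
    """Return (genotype, probability) for the most probable genotype.

    probs     : sequence of three probabilities for genotypes 0, 1 and 2
    threshold : optional hypothetical cut-off; if the largest probability
                falls below it, the genotype is reported as None (no call)
    """
    best = max(range(3), key=lambda g: probs[g])
    if threshold is not None and probs[best] < threshold:
        return None, probs[best]
    return best, probs[best]


if __name__ == "__main__":
    print(best_guess_genotype((0.1, 0.7, 0.2)))
    print(best_guess_genotype((0.1, 0.7, 0.2), threshold=0.9))
```

Reducing the probabilities to a single call discards the uncertainty in the prediction, which is why the largest probability is returned alongside the genotype here.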

Calibration

The probabilities of events predicted by a probability model are said to be well calibrated if they accurately estimate the proportion of times the events occur. For imputation, a method is well calibrated if genotypes that are predicted with probability p are correct 100p% of the time.
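One simple way to inspect calibration, assuming a set of masked genotypes whose true values are known, is to group predictions by their stated probability and compare each group's mean predicted probability with the fraction that were correct. The sketch below is illustrative only; the function name, equal-width binning and input layout are assumptions.

```python
def calibration_table(pred_probs, is_correct, n_bins=5):
    """Compare predicted probabilities with observed accuracy.

    pred_probs : predicted probability attached to each imputed genotype
    is_correct : whether each imputed genotype matched the true genotype
    Returns a list of (mean predicted probability, observed accuracy, count)
    for every non-empty bin, in order of increasing probability.
    """
    sums = [0.0] * n_bins
    hits = [0] * n_bins
    counts = [0] * n_bins
    for p, ok in zip(pred_probs, is_correct):
        # Equal-width bins on [0, 1]; p == 1.0 goes in the last bin.
        b = min(int(p * n_bins), n_bins - 1)
        sums[b] += p
        hits[b] += 1 if ok else 0
        counts[b] += 1
    return [
        (sums[b] / counts[b], hits[b] / counts[b], counts[b])
        for b in range(n_bins)
        if counts[b] > 0
    ]


if __name__ == "__main__":
    probs = [0.9, 0.9, 0.9, 0.9, 0.5, 0.5]
    truth = [True, True, True, False, True, False]
    for mean_p, accuracy, n in calibration_table(probs, truth):
        print(f"predicted {mean_p:.2f}  observed {accuracy:.2f}  n={n}")
```

For a well-calibrated method the two columns should agree closely in every bin, up to sampling noise in bins with few genotypes.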

Proportional hazards model

A class of survival models in statistics. Survival models relate the time that passes before some event occurs to one or more covariates that may influence that quantity. In a proportional hazards model, the unique effect of a unit increase in a covariate is multiplicative with respect to the hazard rate.

Bayesian

A statistical school of thought in which the posterior probability distribution for any unknown parameter or hypothesis given the observed data is used to carry out inference. Bayes theorem is used to construct the posterior distribution using the observed data and a prior distribution, often allowing the incorporation of useful knowledge into the analysis.

Frequentist

A name for the school of statistical thought in which support for a hypothesis or parameter value is assessed using the probability of the observed data (or more extreme data sets) given the hypothesis or value. These theories are usually contrasted with Bayesian models.

About this article

Cite this article

Marchini, J., Howie, B. Genotype imputation for genome-wide association studies. Nat Rev Genet 11, 499–511 (2010). https://doi.org/10.1038/nrg2796

  • DOI: https://doi.org/10.1038/nrg2796
