The 1000 Genomes Project and disease-specific sequencing efforts are producing large collections of haplotypes that can be used as reference panels for genotype imputation in genome-wide association studies (GWAS). However, imputing from large reference panels with existing methods imposes a high computational burden. We introduce a strategy called 'pre-phasing' that maintains the accuracy of leading methods while reducing computational costs. We first statistically estimate the haplotypes for each individual within the GWAS sample (pre-phasing) and then impute missing genotypes into these estimated haplotypes. This reduces the computational cost because (i) the GWAS samples must be phased only once, whereas standard methods would implicitly repeat phasing with each reference panel update, and (ii) it is much faster to match a phased GWAS haplotype to one reference haplotype than to match two unphased GWAS genotypes to a pair of reference haplotypes. We implemented our approach in the MaCH and IMPUTE2 frameworks, and we tested it on data sets from the Wellcome Trust Case Control Consortium 2 (WTCCC2), the Genetic Association Information Network (GAIN), the Women's Health Initiative (WHI) and the 1000 Genomes Project. This strategy will be particularly valuable for repeated imputation as reference panels evolve.
- International HapMap Consortium. The International HapMap Project. Nature 426, 789–796 (2003).
- Integrating common and rare genetic variation in diverse human populations. Nature 467, 52–58 (2010). et al.
- 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073 (2010).
- Genotype imputation for genome-wide association studies. Nat. Rev. Genet. 11, 499–511 (2010). &
- Genotype imputation. Annu. Rev. Genomics Hum. Genet. 10, 387–406 (2009). , , &
- A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet. 5, e1000529 (2009). , &
- In silico method for inferring genotypes in pedigrees. Nat. Genet. 38, 1002–1004 (2006). , , &
- Family-based association tests for genomewide association scans. Am. J. Hum. Genet. 81, 913–926 (2007). &
- Detection of sharing by descent, long-range phasing and haplotype imputation. Nat. Genet. 40, 1068–1075 (2008). et al.
- Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661–678 (2007).
- A new multipoint method for genome-wide association studies by imputation of genotypes. Nat. Genet. 39, 906–913 (2007). , , , &
- MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet. Epidemiol. 34, 816–834 (2010). , , , &
- Isolates and their potential use in complex gene mapping efforts. Curr. Opin. Genet. Dev. 14, 316–323 (2004). &
- Use of population isolates for mapping complex traits. Nat. Rev. Genet. 1, 182–190 (2000). , &
- A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants. Science 316, 1341–1345 (2007). et al.
- A comparison of phasing algorithms for trios and unrelated individuals. Am. J. Hum. Genet. 78, 437–450 (2006). et al.
- A linear complexity phasing method for thousands of genomes. Nat. Methods 9, 179–181 (2012). , &
- New models of collaboration in genome-wide association studies: the Genetic Association Information Network. Nat. Genet. 39, 1045–1051 (2007). et al.
- Women's Health Initiative Study Group. Design of the Women's Health Initiative clinical trial and observational study. The Women's Health Initiative Study Group. Control. Clin. Trials 19, 61–109 (1998).
- Handling marker-marker linkage disequilibrium: pedigree analysis with clustered markers. Am. J. Hum. Genet. 77, 754–767 (2005). &
- Genome-wide scan reveals association of psoriasis with IL-23 and NF-κB pathways. Nat. Genet. 41, 199–204 (2009). et al.
- A comparison of bayesian methods for haplotype reconstruction from population genotype data. Am. J. Hum. Genet. 73, 1162–1169 (2003). &
- A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet. 78, 629–644 (2006). &
- A maximization technique occurring in statistical analysis of probabilistic functions of Markov chains. Ann. Math. Statist. 41, 164–171 (1970). , , &
- A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am. J. Hum. Genet. 84, 210–223 (2009). &
- Supplementary Text and Figures (479K)
Supplementary Note, Supplementary Table 1 and Supplementary Figures 1–4