Reference-based phasing using the Haplotype Reference Consortium panel

Journal name: Nature Genetics
Volume: 48
Pages: 1443–1448
DOI: 10.1038/ng.3679

Abstract

Haplotype phasing is a fundamental problem in medical and population genetics. Phasing is generally performed via statistical phasing in a genotyped cohort, an approach that can yield high accuracy in very large cohorts but attains lower accuracy in smaller cohorts. Here we instead explore the paradigm of reference-based phasing. We introduce a new phasing algorithm, Eagle2, that attains high accuracy across a broad range of cohort sizes by efficiently leveraging information from large external reference panels (such as the Haplotype Reference Consortium; HRC) using a new data structure based on the positional Burrows-Wheeler transform. We demonstrate that Eagle2 attains a ~20× speedup and ~10% increase in accuracy compared to reference-based phasing using SHAPEIT2. On European-ancestry samples, Eagle2 with the HRC panel achieves >2× the accuracy of 1000 Genomes–based phasing. Eagle2 is open source and freely available for HRC-based phasing via the Sanger Imputation Service and the Michigan Imputation Server.
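To give some intuition for the positional Burrows-Wheeler transform mentioned above, the following is a minimal sketch of the basic positional prefix array construction described by Durbin (ref. 20). It is not the Eagle2 data structure itself: the function name `pbwt_prefix_arrays` and the toy haplotypes are illustrative assumptions, and binary (0/1) alleles are assumed.

```python
from typing import List


def pbwt_prefix_arrays(haps: List[List[int]]) -> List[List[int]]:
    """Return positional prefix arrays a_0 .. a_N for binary haplotypes.

    haps[i][k] is the allele (0 or 1) of haplotype i at site k.
    arrays[k] lists haplotype indices sorted by their reversed prefixes
    ending just before site k, so haplotypes sharing a long match that
    ends at site k - 1 sit next to each other.
    """
    n_haps = len(haps)
    n_sites = len(haps[0]) if haps else 0
    order = list(range(n_haps))  # a_0: arbitrary initial order
    arrays = [order[:]]
    for k in range(n_sites):
        zeros: List[int] = []
        ones: List[int] = []
        # Stable partition by the allele at site k: one linear pass per site.
        for idx in order:
            (ones if haps[idx][k] else zeros).append(idx)
        order = zeros + ones
        arrays.append(order[:])
    return arrays


# Toy example (hypothetical data): haplotypes 0 and 3 are identical.
toy_haps = [
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
]
final_order = pbwt_prefix_arrays(toy_haps)[-1]
```

Each site is handled in time linear in the number of haplotypes, which is what makes this kind of ordering practical for large reference panels. In the toy example, the two identical haplotypes end up adjacent in the final ordering.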

At a glance

Figures

  1. Schematic of the Eagle2 core phasing algorithm.

    (a) Given diploid genotypes from a target sample along with a haploid reference set of conditioning haplotypes, our algorithm proceeds in two steps. First, we use the positional Burrows-Wheeler transform (PBWT; ref. 20) to generate a 'hedge' of haplotype prefix trees rooted at markers spaced across the chromosome. These trees encode haplotype prefix frequencies, represented here by branch thicknesses. (b) Second, we explore a small set of high-probability diplotypes (i.e., complementary pairs of phased haplotypes), estimating diplotype probabilities under a haplotype-copying model by summing over possible recombination points. For each possible choice of recombination points, the HapHedge data structure allows rapid lookup of haplotype segment frequencies. This illustration is meant to provide intuition for the overall approach; our optimized software implementation first 'condenses' reference haplotypes based on the target genotypes. Details are provided in Supplementary Figure 1 and the Supplementary Note.

  2. Running time and accuracy of reference-based phasing in UK Biobank benchmarks.

    (a,b) We benchmarked Eagle2 and other available methods by phasing UK Biobank trio children using a reference panel generated from Nref = 15,000, 30,000, 50,000 or 100,000 other UK Biobank samples. CPU time per target genome on a 2.27 GHz Intel Xeon L5640 processor (a). We analyzed 174,595 markers on chromosomes 1, 5, 10, 15 and 20, representing ~25% of the genome, and scaled up running times by a factor of 4; see Supplementary Table 3. Mean switch error rate over 70 European-ancestry trios (b; error bars, s.e.m.). (c,d) CPU time and mean switch error rate as a function of the number of conditioning haplotypes used by SHAPEIT2 and Eagle2 (relative to the default values of K = 100 and 10,000, respectively) for the Nref = 30,000 benchmark. Eagle1 does not have such a parameter, so we displayed its performance as a horizontal line. Numeric data and additional benchmarks varying the number of conditioning haplotypes used with Nref = 15,000, 50,000 and 100,000 are provided in Supplementary Table 2.

  3. Accuracy of reference-based phasing in GERA benchmarks.

    We phased trio parents in each GERA sub-cohort using a reference panel generated from all other nonfamilial samples in the same sub-cohort. We ran each method with default parameter settings on all 22 autosomes and computed aggregate mean switch error rates; error bars, s.e.m. Standard errors for the European-ancestry sub-cohort are over 400 parent samples. Standard errors for the other three sub-cohorts are over 25 SNP blocks. Numeric data and additional benchmarks varying the number of conditioning haplotypes used by each method are provided in Supplementary Table 4.

  4. Accuracy of reference-based phasing using the 1000 Genomes and HRC panels.

    We phased 32 trio children from the 1000 Genomes CEU population using either the 1000 Genomes Phase 3 reference panel or the Haplotype Reference Consortium panel (excluding trios in either case). We analyzed chromosome 1, and to emulate a typical use case, we restricted the data to 31,853 markers (genotyped on 23andMe chips). We plotted mean switch error rates; error bars, s.e.m. over samples. Numeric data and additional benchmarks on other 1000 Genomes populations are provided in Supplementary Table 5.

  5. Running time and accuracy of cohort-based phasing in the UK Biobank cohort.

    (a,b) We benchmarked Eagle2 and other available phasing methods on N = 5,000, 15,000, 50,000 and 150,000 UK Biobank samples (including trio children and excluding trio parents). Total wall clock time for genome-wide phasing on a 16-core 2.60 GHz Intel Xeon E5-2650 v2 processor (a). We analyzed a total of 174,595 markers on chromosomes 1, 5, 10, 15 and 20, representing ~25% of the genome, and scaled up running times by a factor of 4; see Supplementary Table 8 for per-chromosome data. SHAPEIT2 was unable to complete the N = 50,000 chromosome 1 and chromosome 5 analyses, and was unable to complete any of the N = 150,000 analyses in 5 d, our run-time limit for single compute jobs. Mean switch error rate over 70 European-ancestry trios (b; error bars, s.e.m.). Numeric data and additional benchmarks varying the number of conditioning haplotypes used by Eagle2 are provided in Supplementary Table 7.
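The figures above report mean switch error rates with s.e.m. error bars. As a rough illustration of how such a metric can be computed, here is a minimal sketch under simplifying assumptions: each sample is reduced to its heterozygous sites in genomic order, and the inputs give the allele carried by the first haplotype at each of those sites. The function names are hypothetical, and the paper's exact evaluation protocol (for example, how trio-derived truth is obtained) is not reproduced here.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def switch_error_rate(true_hap: Sequence[int], inferred_hap: Sequence[int]) -> float:
    """Fraction of consecutive heterozygous-site pairs whose relative phase is wrong.

    Both inputs give the allele on haplotype 1 at each heterozygous site.
    """
    if len(true_hap) != len(inferred_hap):
        raise ValueError("haplotypes must cover the same heterozygous sites")
    if len(true_hap) < 2:
        raise ValueError("need at least two heterozygous sites")
    # 0 where the inferred haplotype agrees with the truth, 1 where it is flipped.
    flipped = [int(t != i) for t, i in zip(true_hap, inferred_hap)]
    # A switch error occurs whenever the orientation changes between neighbours.
    switches = sum(a != b for a, b in zip(flipped, flipped[1:]))
    return switches / (len(flipped) - 1)


def mean_and_sem(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard error of the mean over per-sample values."""
    return mean(values), stdev(values) / sqrt(len(values))


# Toy data (hypothetical): two switches among four consecutive site pairs.
rate = switch_error_rate([0, 1, 1, 0, 1], [0, 1, 0, 1, 1])
```

Note that a haplotype that is flipped at every site scores zero switch errors, because only the relative phase between neighbouring heterozygous sites matters.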

References

  1. Tewhey, R., Bansal, V., Torkamani, A., Topol, E.J. & Schork, N.J. The importance of phase information for human genomics. Nat. Rev. Genet. 12, 215–223 (2011).
  2. Browning, S.R. & Browning, B.L. Haplotype phasing: existing methods and new developments. Nat. Rev. Genet. 12, 703–714 (2011).
  3. Stephens, M., Smith, N.J. & Donnelly, P. A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet. 68, 978–989 (2001).
  4. Halperin, E. & Eskin, E. Haplotype reconstruction from genotype data using Imperfect Phylogeny. Bioinformatics 20, 1842–1849 (2004).
  5. Stephens, M. & Scheet, P. Accounting for decay of linkage disequilibrium in haplotype inference and missing-data imputation. Am. J. Hum. Genet. 76, 449–462 (2005).
  6. Scheet, P. & Stephens, M. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet. 78, 629–644 (2006).
  7. Browning, S.R. & Browning, B.L. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet. 81, 1084–1097 (2007).
  8. Kong, A. et al. Detection of sharing by descent, long-range phasing and haplotype imputation. Nat. Genet. 40, 1068–1075 (2008).
  9. Browning, B.L. & Browning, S.R. A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am. J. Hum. Genet. 84, 210–223 (2009).
  10. Delaneau, O., Marchini, J. & Zagury, J.-F. A linear complexity phasing method for thousands of genomes. Nat. Methods 9, 179–181 (2011).
  11. Williams, A.L., Patterson, N., Glessner, J., Hakonarson, H. & Reich, D. Phasing of many thousands of genotyped samples. Am. J. Hum. Genet. 91, 238–251 (2012).
  12. Delaneau, O., Zagury, J.-F. & Marchini, J. Improved whole-chromosome phasing for disease and population genetic studies. Nat. Methods 10, 5–6 (2013).
  13. Loh, P.-R., Palamara, P.F. & Price, A.L. Fast and accurate long-range phasing in a UK Biobank cohort. Nat. Genet. 48, 811–816 (2016).
  14. O'Connell, J. et al. Haplotype estimation for biobank-scale data sets. Nat. Genet. 48, 817–820 (2016).
  15. Snyder, M.W., Adey, A., Kitzman, J.O. & Shendure, J. Haplotype-resolved genome sequencing: experimental methods and applications. Nat. Rev. Genet. 16, 344–358 (2015).
  16. van de Geijn, B., McVicker, G., Gilad, Y. & Pritchard, J.K. WASP: allele-specific software for robust molecular quantitative trait locus discovery. Nat. Methods 12, 1061–1063 (2015).
  17. Kumasaka, N., Knights, A.J. & Gaffney, D.J. Fine-mapping cellular QTLs with RASQUAL and ATAC–seq. Nat. Genet. 48, 206–213 (2016).
  18. Das, S. et al. Next-generation genotype imputation service and methods. Nat. Genet. http://dx.doi.org/10.1038/ng.3656 (published online 29 August 2016).
  19. McCarthy, S. et al. A reference panel of 64,976 haplotypes for genotype imputation. Nat. Genet. http://dx.doi.org/10.1038/ng.3643 (published online 22 August 2016).
  20. Durbin, R. Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT). Bioinformatics 30, 1266–1272 (2014).
  21. Li, N. & Stephens, M. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics 165, 2213–2233 (2003).
  22. Sudlow, C. et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015).
  23. Kvale, M.N. et al. Genotyping informatics and quality control for 100,000 subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) cohort. Genetics 200, 1051–1060 (2015).
  24. Banda, Y. et al. Characterizing race/ethnicity and genetic ancestry for 100,000 subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) cohort. Genetics 200, 1285–1295 (2015).
  25. Browning, B.L. & Browning, S.R. Genotype imputation with millions of reference samples. Am. J. Hum. Genet. 98, 116–126 (2016).
  26. 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).
  27. Howie, B., Fuchsberger, C., Stephens, M., Marchini, J. & Abecasis, G.R. Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat. Genet. 44, 955–959 (2012).
  28. He, D., Han, B. & Eskin, E. Hap-seq: an optimal algorithm for haplotype phasing with imputation using sequencing data. J. Comput. Biol. 20, 80–92 (2013).
  29. Delaneau, O., Howie, B., Cox, A.J., Zagury, J.-F. & Marchini, J. Haplotype estimation using sequencing reads. Am. J. Hum. Genet. 93, 687–696 (2013).
  30. Sharp, K., Kretzschmar, W., Delaneau, O. & Marchini, J. Phasing for medical sequencing using rare variants and large haplotype reference panels. Bioinformatics 32, 1974–1980 (2016).
  31. Chang, C.C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 7 (2015).

Author information

Affiliations

  1. Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA.

    • Po-Ru Loh,
    • Pier Francesco Palamara,
    • Hilary K Finucane &
    • Alkes L Price
  2. Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, Massachusetts, USA.

    • Po-Ru Loh,
    • Pier Francesco Palamara &
    • Alkes L Price
  3. Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, UK.

    • Petr Danecek,
    • Shane McCarthy &
    • Richard Durbin
  4. Center for Biomedicine, European Academy of Bozen/Bolzano (EURAC), affiliated with the University of Lübeck, Bolzano, Italy.

    • Christian Fuchsberger
  5. Center for Statistical Genetics, Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA.

    • Christian Fuchsberger &
    • Goncalo R Abecasis
  6. Department of Computer Science, Harvard University, Cambridge, Massachusetts, USA.

    • Yakir A Reshef
  7. Department of Mathematics, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA.

    • Hilary K Finucane
  8. Division of Genetic Epidemiology, Department of Medical Genetics, Molecular and Clinical Pharmacology, Medical University of Innsbruck, Innsbruck, Austria.

    • Sebastian Schoenherr &
    • Lukas Forer
  9. Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA.

    • Alkes L Price

Contributions

P.-R.L. and A.L.P. designed the study. P.-R.L., P.F.P., Y.A.R., and H.K.F. developed the algorithm. P.-R.L. wrote the software. P.-R.L. and P.D. performed experiments. P.D. and S.M. incorporated the software into the Sanger Imputation Service. C.F., S.S., and L.F. incorporated the software into the Michigan Imputation Server. All authors analyzed data and wrote the paper.

Competing financial interests

The authors declare no competing financial interests.


Supplementary information

PDF files

  1. Supplementary Text and Figures (2,015 KB)

    Supplementary Figures 1 and 2, Supplementary Tables 1–13 and Supplementary Note.
