Reference-based phasing using the Haplotype Reference Consortium panel

Loh, Po-Ru; Danecek, Petr; Palamara, Pier Francesco; Fuchsberger, Christian; Reshef, Yakir A; Finucane, Hilary K; Schoenherr, Sebastian; Forer, Lukas; McCarthy, Shane; Abecasis, Goncalo R; Durbin, Richard; Price, Alkes L

doi:10.1038/ng.3679

Technical Report
Published: 03 October 2016

Po-Ru Loh^1,2,
Petr Danecek^3,
Pier Francesco Palamara^1,2,
Christian Fuchsberger^4,5,
Yakir A Reshef^6,
Hilary K Finucane^1,7,
Sebastian Schoenherr^8,
Lukas Forer^8,
Shane McCarthy^3 (ORCID: orcid.org/0000-0002-2715-4187),
Goncalo R Abecasis^5,
Richard Durbin^3 (ORCID: orcid.org/0000-0002-9130-1006) &
Alkes L Price^1,2,9

Nature Genetics volume 48, pages 1443–1448 (2016)

Abstract

Haplotype phasing is a fundamental problem in medical and population genetics. Phasing is generally performed via statistical phasing in a genotyped cohort, an approach that can yield high accuracy in very large cohorts but attains lower accuracy in smaller cohorts. Here we instead explore the paradigm of reference-based phasing. We introduce a new phasing algorithm, Eagle2, that attains high accuracy across a broad range of cohort sizes by efficiently leveraging information from large external reference panels (such as the Haplotype Reference Consortium; HRC) using a new data structure based on the positional Burrows-Wheeler transform. We demonstrate that Eagle2 attains a ∼20× speedup and ∼10% increase in accuracy compared to reference-based phasing using SHAPEIT2. On European-ancestry samples, Eagle2 with the HRC panel achieves >2× the accuracy of 1000 Genomes–based phasing. Eagle2 is open source and freely available for HRC-based phasing via the Sanger Imputation Service and the Michigan Imputation Server.
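
The abstract names the positional Burrows-Wheeler transform (PBWT) as the basis of the Eagle2 data structure but does not describe it here. As background only, the following is a minimal sketch of the basic PBWT ordering step: haplotypes are kept sorted by their reversed prefixes, so that haplotypes sharing a long match ending at the current site end up adjacent. This illustrates the generic PBWT idea under simplifying assumptions (binary alleles, no missing data); it is not Eagle2's data structure, and the function name `pbwt_prefix_arrays` is hypothetical.

```python
from typing import List


def pbwt_prefix_arrays(haps: List[List[int]]) -> List[List[int]]:
    """Return the positional prefix arrays a_0..a_N for binary haplotypes.

    arrays[k] lists haplotype indices sorted by their first k alleles read
    right to left (reversed-prefix order). Ties keep their earlier order.
    """
    n_sites = len(haps[0]) if haps else 0
    order = list(range(len(haps)))  # a_0: original order, no sites seen yet
    arrays = [order]
    for k in range(n_sites):
        # Stable partition of the current order by the allele at site k:
        # all 0-allele haplotypes first, then all 1-allele haplotypes.
        zeros = [h for h in order if haps[h][k] == 0]
        ones = [h for h in order if haps[h][k] == 1]
        order = zeros + ones
        arrays.append(order)
    return arrays


# Toy panel (illustrative data): haplotypes 0 and 3 are identical.
panel = [
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
]
final_order = pbwt_prefix_arrays(panel)[-1]
print(final_order)  # [0, 3, 1, 2]: the identical haplotypes 0 and 3 are adjacent
```

Each site is processed with a single linear pass over the haplotypes, which is why PBWT-style structures scale to large reference panels.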

**Figure 1: Schematic of the Eagle2 core phasing algorithm.**

**Figure 2: Running time and accuracy of reference-based phasing in UK Biobank benchmarks.**

**Figure 3: Accuracy of reference-based phasing in GERA benchmarks.**

**Figure 4: Accuracy of reference-based phasing using the 1000 Genomes and HRC panels.**

**Figure 5: Running time and accuracy of cohort-based phasing in the UK Biobank cohort.**

Acknowledgements

We are grateful to S. Linderman, N. Patterson, L. O'Connor, A. Gusev, and B. van de Geijn for helpful discussions. This research was conducted using the UK Biobank Resource. P.L., P.P., and A.L.P. were supported by US National Institutes of Health grants R01 HG006399 and R01 MH101244 and fellowship F32 HG007805. P.D., S.M., R.D., and the Sanger Institute HRC server were supported by Wellcome Trust grant WT098051. C.F., G.R.A., and the Michigan Imputation Server were supported by the Austrian Science Fund (FWF) grant J-3401 and US National Institutes of Health grants HG007022 and HL117626. H.K.F. was supported by the Fannie and John Hertz Foundation. Computational analyses were performed on the Orchestra High Performance Compute Cluster at Harvard Medical School, which is partially supported by grant NCRR 1S10RR028832-01, and on the Lisa Genetic Cluster Computer hosted by SURFsara and financially supported by the Netherlands Scientific Organization (NWO 480-05-003, principal investigator D. Posthuma) along with a supplement from the Dutch Brain Foundation and the VU University Amsterdam.

Author information

Authors and Affiliations

Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA
Po-Ru Loh, Pier Francesco Palamara, Hilary K Finucane & Alkes L Price
Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, Massachusetts, USA
Po-Ru Loh, Pier Francesco Palamara & Alkes L Price
Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, UK
Petr Danecek, Shane McCarthy & Richard Durbin
Center for Biomedicine, European Academy of Bozen/Bolzano (EURAC), affiliated with the University of Lübeck, Bolzano, Italy
Christian Fuchsberger
Department of Biostatistics, Center for Statistical Genetics, University of Michigan, Ann Arbor, Michigan, USA
Christian Fuchsberger & Goncalo R Abecasis
Department of Computer Science, Harvard University, Cambridge, Massachusetts, USA
Yakir A Reshef
Department of Mathematics, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
Hilary K Finucane
Division of Genetic Epidemiology, Department of Medical Genetics, Molecular and Clinical Pharmacology, Medical University of Innsbruck, Innsbruck, Austria
Sebastian Schoenherr & Lukas Forer
Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA
Alkes L Price

Contributions

P.-R.L. and A.L.P. designed the study. P.-R.L., P.F.P., Y.A.R., and H.K.F. developed the algorithm. P.-R.L. wrote the software. P.-R.L. and P.D. performed experiments. P.D. and S.M. incorporated the software into the Sanger Imputation Service. C.F., S.S., and L.F. incorporated the software into the Michigan Imputation Server. All authors analyzed data and wrote the paper.

Corresponding authors

Correspondence to Po-Ru Loh or Alkes L Price.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1 and 2, Supplementary Tables 1–13 and Supplementary Note. (PDF 1968 kb)

About this article

Cite this article

Loh, PR., Danecek, P., Palamara, P. et al. Reference-based phasing using the Haplotype Reference Consortium panel. Nat Genet 48, 1443–1448 (2016). https://doi.org/10.1038/ng.3679

Received: 06 May 2016
Accepted: 29 August 2016
Published: 03 October 2016
Issue Date: November 2016
DOI: https://doi.org/10.1038/ng.3679
