Abstract
Haplotype phasing is a fundamental problem in medical and population genetics. Phasing is generally performed via statistical phasing in a genotyped cohort, an approach that can yield high accuracy in very large cohorts but attains lower accuracy in smaller cohorts. Here we instead explore the paradigm of reference-based phasing. We introduce a new phasing algorithm, Eagle2, that attains high accuracy across a broad range of cohort sizes by efficiently leveraging information from large external reference panels (such as the Haplotype Reference Consortium; HRC) using a new data structure based on the positional Burrows-Wheeler transform. We demonstrate that Eagle2 attains a ∼20× speedup and ∼10% increase in accuracy compared to reference-based phasing using SHAPEIT2. On European-ancestry samples, Eagle2 with the HRC panel achieves >2× the accuracy of 1000 Genomes–based phasing. Eagle2 is open source and freely available for HRC-based phasing via the Sanger Imputation Service and the Michigan Imputation Server.
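To make the positional Burrows-Wheeler transform mentioned above more concrete, the following is a minimal Python sketch of the generic positional prefix and divergence arrays on which that transform is built. It is not Eagle2's data structure, which the abstract describes only as being based on the transform. The function name `pbwt_prefix_arrays` and the toy input are hypothetical, and haplotypes are assumed to be biallelic, coded 0/1, and of equal length.

```python
from typing import List, Sequence, Tuple


def pbwt_prefix_arrays(
    haplotypes: Sequence[Sequence[int]],
) -> Tuple[List[List[int]], List[List[int]]]:
    """Return (orders, divergences) for biallelic haplotypes coded 0/1.

    orders[k] lists haplotype indices sorted by their reversed prefix
    (sites k-1, k-2, ..., 0), with ties kept in input order.
    divergences[k][i] is the first site from which orders[k][i] and
    orders[k][i-1] agree up to (but excluding) site k.
    """
    n_hap = len(haplotypes)
    n_sites = len(haplotypes[0]) if n_hap else 0
    order = list(range(n_hap))   # before any site is seen: input order
    diverge = [0] * n_hap        # empty prefixes trivially match from 0
    orders, divergences = [order], [diverge]
    for k in range(n_sites):
        zeros, ones = [], []
        div_zeros, div_ones = [], []
        p = q = k + 1            # k + 1 means "no match ending at site k yet"
        for idx, d in zip(order, diverge):
            # Track the latest match start seen since the last 0 / last 1.
            p = max(p, d)
            q = max(q, d)
            if haplotypes[idx][k] == 0:
                zeros.append(idx)
                div_zeros.append(p)
                p = 0
            else:
                ones.append(idx)
                div_ones.append(q)
                q = 0
        # Stable partition: all 0-allele haplotypes first, then 1-allele ones.
        order = zeros + ones
        diverge = div_zeros + div_ones
        orders.append(order)
        divergences.append(diverge)
    return orders, divergences


# Toy panel of four haplotypes over three sites (illustrative data only).
toy_panel = [[0, 1, 0], [1, 1, 0], [0, 1, 1], [0, 0, 0]]
toy_orders, toy_divergences = pbwt_prefix_arrays(toy_panel)
```

Because each site only applies a stable partition to the previous ordering, haplotypes that share a long match ending at a given site end up adjacent in the ordering for that site. This adjacency is what makes searching a large reference panel for long matches efficient.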
Acknowledgements
We are grateful to S. Linderman, N. Patterson, L. O'Connor, A. Gusev, and B. van de Geijn for helpful discussions. This research was conducted using the UK Biobank Resource. P.-R.L., P.F.P., and A.L.P. were supported by US National Institutes of Health grants R01 HG006399 and R01 MH101244 and fellowship F32 HG007805. P.D., S.M., R.D., and the Sanger Institute HRC server were supported by Wellcome Trust grant WT098051. C.F., G.R.A., and the Michigan Imputation Server were supported by the Austrian Science Fund (FWF) grant J-3401 and US National Institutes of Health grants HG007022 and HL117626. H.K.F. was supported by the Fannie and John Hertz Foundation. Computational analyses were performed on the Orchestra High Performance Compute Cluster at Harvard Medical School, which is partially supported by grant NCRR 1S10RR028832-01, and on the Lisa Genetic Cluster Computer hosted by SURFsara and financially supported by the Netherlands Scientific Organization (NWO 480-05-003, principal investigator D. Posthuma) along with a supplement from the Dutch Brain Foundation and the VU University Amsterdam.
Author information
Affiliations
Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA.
- Po-Ru Loh, Pier Francesco Palamara, Hilary K Finucane & Alkes L Price
Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, Massachusetts, USA.
- Po-Ru Loh, Pier Francesco Palamara & Alkes L Price
Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, UK.
- Petr Danecek, Shane McCarthy & Richard Durbin
Center for Biomedicine, European Academy of Bozen/Bolzano (EURAC), affiliated with the University of Lübeck, Bolzano, Italy.
- Christian Fuchsberger
Center for Statistical Genetics, Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA.
- Christian Fuchsberger & Goncalo R Abecasis
Department of Computer Science, Harvard University, Cambridge, Massachusetts, USA.
- Yakir A Reshef
Department of Mathematics, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA.
- Hilary K Finucane
Division of Genetic Epidemiology, Department of Medical Genetics, Molecular and Clinical Pharmacology, Medical University of Innsbruck, Innsbruck, Austria.
- Sebastian Schoenherr & Lukas Forer
Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA.
- Alkes L Price
Contributions
P.-R.L. and A.L.P. designed the study. P.-R.L., P.F.P., Y.A.R., and H.K.F. developed the algorithm. P.-R.L. wrote the software. P.-R.L. and P.D. performed experiments. P.D. and S.M. incorporated the software into the Sanger Imputation Service. C.F., S.S., and L.F. incorporated the software into the Michigan Imputation Server. All authors analyzed data and wrote the paper.
Competing interests
The authors declare no competing financial interests.
Corresponding authors
Correspondence to Po-Ru Loh or Alkes L Price.
Supplementary information
PDF files
- 1. Supplementary Text and Figures: Supplementary Figures 1 and 2, Supplementary Tables 1–13 and Supplementary Note.