Haplotype phasing is a fundamental problem in medical and population genetics. Phasing is generally performed via statistical phasing in a genotyped cohort, an approach that can yield high accuracy in very large cohorts but attains lower accuracy in smaller cohorts. Here we instead explore the paradigm of reference-based phasing. We introduce a new phasing algorithm, Eagle2, that attains high accuracy across a broad range of cohort sizes by efficiently leveraging information from large external reference panels (such as the Haplotype Reference Consortium; HRC) using a new data structure based on the positional Burrows-Wheeler transform. We demonstrate that Eagle2 attains a ~20× speedup and ~10% increase in accuracy compared to reference-based phasing using SHAPEIT2. On European-ancestry samples, Eagle2 with the HRC panel achieves >2× the accuracy of 1000 Genomes–based phasing. Eagle2 is open source and freely available for HRC-based phasing via the Sanger Imputation Service and the Michigan Imputation Server.
Supplementary information: Supplementary Figures 1 and 2, Supplementary Tables 1–13 and Supplementary Note.