Fast and accurate long-range phasing in a UK Biobank cohort

Loh, Po-Ru; Palamara, Pier Francesco; Price, Alkes L

doi:10.1038/ng.3571

Technical Report
Published: 06 June 2016

Po-Ru Loh^1,2,
Pier Francesco Palamara^1,2 &
Alkes L Price^1,2,3

Nature Genetics volume 48, pages 811–816 (2016)

Abstract

Recent work has leveraged the extensive genotyping of the Icelandic population to perform long-range phasing (LRP), enabling accurate imputation and association analysis of rare variants in target samples typed on genotyping arrays. Here we develop a fast and accurate LRP method, Eagle, that extends this paradigm to populations with much smaller proportions of genotyped samples by harnessing long (>4-cM) identical-by-descent (IBD) tracts shared among distantly related individuals. We applied Eagle to N ≈ 150,000 samples (0.2% of the British population) from the UK Biobank, and we determined that it is 1–2 orders of magnitude faster than existing methods while achieving similar or better phasing accuracy (switch error rate ≈ 0.3%, corresponding to perfect phase in a majority of 10-Mb segments). We also observed that, when used within an imputation pipeline, Eagle prephasing improved downstream imputation accuracy in comparison to prephasing in batches using existing methods, as necessary to achieve comparable computational cost.
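
The switch error rate quoted in the abstract can be illustrated with a minimal sketch. It assumes the usual definition (the fraction of adjacent heterozygous-site pairs whose relative phase disagrees with the truth); the function name and toy data are hypothetical and are not taken from the Eagle software.

```python
def switch_error_rate(true_hap, inferred_hap, genotypes):
    """Fraction of adjacent heterozygous-site pairs phased inconsistently.

    true_hap, inferred_hap: 0/1 alleles on one haplotype of the same individual.
    genotypes: 0/1/2 allele counts; only heterozygous sites (1) carry phase.
    """
    het = [i for i, g in enumerate(genotypes) if g == 1]
    if len(het) < 2:
        return 0.0
    # At each heterozygous site, does the inferred haplotype match the truth?
    agree = [true_hap[i] == inferred_hap[i] for i in het]
    # A switch error is a flip in agreement between adjacent heterozygous sites.
    switches = sum(a != b for a, b in zip(agree, agree[1:]))
    return switches / (len(het) - 1)


# Toy data (hypothetical): five heterozygous sites, one phase switch.
genotypes = [1, 2, 1, 1, 0, 1, 1]
true_hap = [0, 1, 1, 0, 0, 1, 0]
inferred_hap = [0, 1, 1, 1, 0, 0, 1]
rate = switch_error_rate(true_hap, inferred_hap, genotypes)
```

Under this definition, a rate of about 0.3% means roughly one switch per few hundred heterozygous sites, so many long segments can contain no switch at all.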

**Figure 1: Eagle algorithm and example phase calls after each step.**

**Figure 2: Computational cost and accuracy of phasing methods.**

References

Browning, S.R. & Browning, B.L. Haplotype phasing: existing methods and new developments. Nat. Rev. Genet. 12, 703–714 (2011).
Marchini, J., Howie, B., Myers, S., McVean, G. & Donnelly, P. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat. Genet. 39, 906–913 (2007).
Marchini, J. & Howie, B. Genotype imputation for genome-wide association studies. Nat. Rev. Genet. 11, 499–511 (2010).
Li, Y., Willer, C.J., Ding, J., Scheet, P. & Abecasis, G.R. MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet. Epidemiol. 34, 816–834 (2010).
Howie, B., Fuchsberger, C., Stephens, M., Marchini, J. & Abecasis, G.R. Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat. Genet. 44, 955–959 (2012).
Stephens, M. & Scheet, P. Accounting for decay of linkage disequilibrium in haplotype inference and missing-data imputation. Am. J. Hum. Genet. 76, 449–462 (2005).
Scheet, P. & Stephens, M. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet. 78, 629–644 (2006).
Browning, S.R. & Browning, B.L. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet. 81, 1084–1097 (2007).
Browning, B.L. & Browning, S.R. A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am. J. Hum. Genet. 84, 210–223 (2009).
Delaneau, O., Marchini, J. & Zagury, J.-F. A linear complexity phasing method for thousands of genomes. Nat. Methods 9, 179–181 (2012).
Williams, A.L., Patterson, N., Glessner, J., Hakonarson, H. & Reich, D. Phasing of many thousands of genotyped samples. Am. J. Hum. Genet. 91, 238–251 (2012).
Delaneau, O., Zagury, J.-F. & Marchini, J. Improved whole-chromosome phasing for disease and population genetic studies. Nat. Methods 10, 5–6 (2013).
Kong, A. et al. Detection of sharing by descent, long-range phasing and haplotype imputation. Nat. Genet. 40, 1068–1075 (2008).
Stefansson, H. et al. Common variants conferring risk of schizophrenia. Nature 460, 744–747 (2009).
Kong, A. et al. Parental origin of sequence variants associated with complex diseases. Nature 462, 868–874 (2009).
Kong, A. et al. Fine-scale recombination rate differences between sexes, populations and individuals. Nature 467, 1099–1103 (2010).
Thorleifsson, G. et al. Common variants near CAV1 and CAV2 are associated with primary open-angle glaucoma. Nat. Genet. 42, 906–909 (2010).
Holm, H. et al. A rare variant in MYH6 is associated with high risk of sick sinus syndrome. Nat. Genet. 43, 316–320 (2011).
Rafnar, T. et al. Mutations in BRIP1 confer high risk of ovarian cancer. Nat. Genet. 43, 1104–1107 (2011).
Gudmundsson, J. et al. Discovery of common variants associated with low TSH levels and thyroid cancer risk. Nat. Genet. 44, 319–322 (2012).
Gudmundsson, J. et al. A study based on whole-genome sequencing yields a rare variant at 8q24 associated with prostate cancer. Nat. Genet. 44, 1326–1329 (2012).
Helgason, H. et al. A rare nonsynonymous sequence variant in C3 is associated with high risk of age-related macular degeneration. Nat. Genet. 45, 1371–1374 (2013).
Kong, A. et al. Common and low-frequency variants associated with genome-wide recombination rate. Nat. Genet. 46, 11–16 (2014).
Steinthorsdottir, V. et al. Identification of low-frequency and rare sequence variants associated with elevated or reduced risk of type 2 diabetes. Nat. Genet. 46, 294–298 (2014).
Gudbjartsson, D.F. et al. Large-scale whole-genome sequencing of the Icelandic population. Nat. Genet. 47, 435–444 (2015).
Steinberg, S. et al. Loss-of-function variants in ABCA7 confer risk of Alzheimer's disease. Nat. Genet. 47, 445–447 (2015).
Helgason, H. et al. Loss-of-function variants in ATM confer risk of gastric cancer. Nat. Genet. 47, 906–910 (2015).
Palin, K., Campbell, H., Wright, A.F., Wilson, J.F. & Durbin, R. Identity-by-descent-based phasing and imputation in founder populations using graphical models. Genet. Epidemiol. 35, 853–860 (2011).
O'Connell, J. et al. A general approach for haplotype phasing across the full spectrum of relatedness. PLoS Genet. 10, e1004234 (2014).
Sudlow, C. et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015).
Gusev, A. et al. Whole population, genome-wide mapping of hidden relatedness. Genome Res. 19, 318–326 (2009).
Browning, B.L. & Browning, S.R. A fast, powerful method for detecting identity by descent. Am. J. Hum. Genet. 88, 173–182 (2011).
Browning, B.L. & Browning, S.R. Improving the accuracy and efficiency of identity-by-descent detection in population data. Genetics 194, 459–471 (2013).
Banda, Y. et al. Characterizing race/ethnicity and genetic ancestry for 100,000 subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) cohort. Genetics 200, 1285–1295 (2015).
Galinsky, K.J. et al. Fast principal-component analysis reveals convergent evolution of ADH1B in Europe and East Asia. Am. J. Hum. Genet. 98, 456–472 (2016).
O'Connell, J. et al. Haplotype estimation for biobank-scale data sets. Nat. Genet. http://dx.doi.org/10.1038/ng.3583 (2016).
Delaneau, O., Howie, B., Cox, A.J., Zagury, J.-F. & Marchini, J. Haplotype estimation using sequencing reads. Am. J. Hum. Genet. 93, 687–696 (2013).
Durbin, R. Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT). Bioinformatics 30, 1266–1272 (2014).
Browning, B.L. & Browning, S.R. Genotype imputation with millions of reference samples. Am. J. Hum. Genet. 98, 116–126 (2016).
Chen, C.-Y. et al. Improved ancestry inference using weights from external reference panels. Bioinformatics 29, 1399–1406 (2013).
Henn, B.M. et al. Cryptic distant relatives are common in both isolated and cosmopolitan genetic samples. PLoS One 7, e34267 (2012).
Huang, L., Bercovici, S., Rodriguez, J.M. & Batzoglou, S. An effective filter for IBD detection in large data sets. PLoS One 9, e92713 (2014).
Rodriguez, J.M., Bercovici, S., Huang, L., Frostig, R. & Batzoglou, S. Parente2: a fast and accurate method for detecting identity by descent. Genome Res. 25, 280–289 (2015).
Bulik-Sullivan, B.K. et al. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 47, 291–295 (2015).
Indyk, P. & Motwani, R. Approximate nearest neighbors: towards removing the curse of dimensionality. in Proc. 30th Ann. ACM Symposium Theory Computing 604–613 (ACM, 1998).
Gionis, A., Indyk, P. & Motwani, R. Similarity search in high dimensions via hashing. in Proc. 25th VLDB Conf. vol. 99, 518–529 (Morgan Kaufmann Publishers, 1999).
Li, N. & Stephens, M. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics 165, 2213–2233 (2003).
Chang, C.C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 7 (2015).
Kvale, M.N. et al. Genotyping informatics and quality control for 100,000 subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) cohort. Genetics 200, 1051–1060 (2015).

Acknowledgements

We are grateful to G. Bhatia, S. Gusev, M. Lipson, B. Pasaniuc, N. Patterson, and N. Zaitlen for helpful discussions. This research was conducted using the UK Biobank Resource and was supported by US National Institutes of Health grants R01 HG006399 and R01 MH101244 and US National Institutes of Health fellowship F32 HG007805. Computational analyses were performed on the Orchestra High-Performance Compute Cluster at Harvard Medical School, which is partially supported by grant NCRR 1S10RR028832-01, and on the Lisa Genetic Cluster Computer hosted by SURFsara and financially supported by the Netherlands Scientific Organization (NWO 480-05-003, principal investigator D. Posthuma) along with a supplement from the Dutch Brain Foundation and VU University Amsterdam.

Author information

Authors and Affiliations

Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA
Po-Ru Loh, Pier Francesco Palamara & Alkes L Price
Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, Massachusetts, USA
Po-Ru Loh, Pier Francesco Palamara & Alkes L Price
Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA
Alkes L Price

Contributions

P.-R.L. and P.F.P. designed the algorithm. P.-R.L. implemented the algorithm and performed experiments. P.-R.L. and A.L.P. analyzed data and wrote the manuscript.

Corresponding authors

Correspondence to Po-Ru Loh or Alkes L Price.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Comparison of in-sample imputation and standard GWAS imputation.

Standard GWAS imputation differs from in-sample imputation in three ways. First, GWAS imputation usually involves imputing sequence data from a reference panel into a (genotyped but not sequenced) target sample, which typically requires phasing the sequenced reference (possibly using read information^36), phasing the target sample (possibly using the phased reference), and imputing reference data into the target sample; in contrast, in-sample imputation involves only one sample, serving as both target and reference, that is simultaneously phased and imputed. Second, GWAS imputation pipelines produce probabilistic allele ‘dosage’ estimates, whereas phasing methods produce hard calls at missing genotypes (thus achieving suboptimal imputation R²). Third, typical GWAS impute sequenced SNPs into target samples that are fully typed at a set of ascertained array SNPs, whereas phasing methods impute missing data at ascertained SNPs. (The latter task may be slightly harder than the former, as genotyping arrays are sometimes optimized to minimize redundancy among ascertained SNPs; thus, the LD between a typical ascertained SNP and its closest ascertained proxy may be lower than the LD between a typical sequenced SNP and its closest ascertained proxy. On the other hand, the fact that rare variants on genotyping arrays are typically enriched in densely typed fine-mapping regions may make in-sample imputation easier.) For all of these reasons, different algorithms are typically used for phasing versus GWAS imputation (e.g., SHAPEIT^10,12 versus IMPUTE^2,55 and MaCH^4 versus minimac^5,56).
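
The distinction between dosages and hard calls can be made concrete with a small sketch. It assumes imputation R² is the squared Pearson correlation between true and imputed genotypes; the posterior probabilities are invented toy values and the helper names are hypothetical, so this is an illustration and not output from any of the methods named above.

```python
def pearson_r2(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def dosage(posterior):
    """Expected allele count from genotype probabilities (p0, p1, p2)."""
    return posterior[1] + 2.0 * posterior[2]


def hard_call(posterior):
    """Most probable genotype (0, 1, or 2); ties go to the lower genotype."""
    return max(range(3), key=lambda g: posterior[g])


# Toy data (hypothetical): true genotypes and imputation posteriors.
truth = [0, 0, 1, 1, 2, 0]
posteriors = [
    (0.90, 0.10, 0.00),
    (0.60, 0.40, 0.00),
    (0.55, 0.45, 0.00),  # true heterozygote that a hard call rounds to 0
    (0.20, 0.70, 0.10),
    (0.00, 0.40, 0.60),
    (0.80, 0.20, 0.00),
]

r2_dosage = pearson_r2(truth, [dosage(p) for p in posteriors])
r2_hard = pearson_r2(truth, [hard_call(p) for p in posteriors])
```

In this toy data the dosage retains the uncertainty at the third sample, where the hard call discards it, so `r2_dosage` comes out higher than `r2_hard`. That ordering is a property of these invented numbers and is not guaranteed for every data set.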

Supplementary Figure 2 In-sample imputation accuracy of Eagle and SHAPEIT2.

We randomly masked 2% of the genotypes in all N = 150,000 UK Biobank samples and phased the first 40 cM of chromosome 10 using Eagle (on the full cohort) and SHAPEIT2 (on all samples at once with either K = 100 (default) or 200 states as well as in N = 50,000 and 15,000 batches), imputing all masked genotypes in the process. (a) Accuracy of the imputed genotypes on the subset of 120,000 UK samples curated by UK Biobank for GWAS (~80% of all samples), stratified by MAF in those samples. (b) Accuracy of the imputed genotypes on subsets of samples defined by self-reported ancestry, stratified by MAF in those samples. The five largest ancestry groups in the data set were British (137,178 samples), Irish (3,977), “any other white background” (4,760), Indian (1,324), and Caribbean (1,028). The British and Irish results were nearly identical (Supplementary Table 11), so we did not plot Irish results to improve readability. For the ancestry groups with <5,000 samples, we plotted results only for MAF bins corresponding to an expected minor allele count of ≥2 among masked samples. Error bars, s.e.m. Numerical data are provided in Supplementary Tables 9 and 11.
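
The plotting rule for the smaller ancestry groups implies a minimum MAF per group, which the sketch below works out. It assumes one reading of the caption: the expected minor allele count among masked genotypes is 2 × MAF × (masking fraction × group size). The caption does not spell out this formula, so the function name and the resulting thresholds are illustrative only.

```python
MASK_FRACTION = 0.02  # 2% of genotypes masked, as stated in the caption

# Group sizes from the caption for the ancestry groups with <5,000 samples.
GROUP_SIZES = {
    "Irish": 3977,
    "any other white background": 4760,
    "Indian": 1324,
    "Caribbean": 1028,
}


def min_maf_for_expected_count(n_samples, mask_fraction=MASK_FRACTION, min_count=2.0):
    """Smallest MAF whose expected minor allele count among masked genotypes
    reaches min_count, assuming count = 2 * MAF * mask_fraction * n_samples."""
    expected_masked_genotypes = mask_fraction * n_samples
    return min_count / (2.0 * expected_masked_genotypes)


thresholds = {name: min_maf_for_expected_count(n) for name, n in GROUP_SIZES.items()}
for name, maf in thresholds.items():
    print(f"{name}: MAF >= {maf:.4f}")
```

Under this assumption the smallest groups are only plotted for MAF above a few percent, while groups near 5,000 samples reach down to about 1%.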

Supplementary information

Supplementary Text and Figures

Supplementary Note, Supplementary Figures 1 and 2, and Supplementary Tables 1–15. (PDF 1702 kb)

About this article

Cite this article

Loh, PR., Palamara, P. & Price, A. Fast and accurate long-range phasing in a UK Biobank cohort. Nat Genet 48, 811–816 (2016). https://doi.org/10.1038/ng.3571

Received: 03 September 2015
Accepted: 22 April 2016
Published: 06 June 2016
Issue Date: July 2016
DOI: https://doi.org/10.1038/ng.3571
