Technical Report

Fast and accurate long-range phasing in a UK Biobank cohort

Nature Genetics volume 48, pages 811–816 (2016)

Abstract

Recent work has leveraged the extensive genotyping of the Icelandic population to perform long-range phasing (LRP), enabling accurate imputation and association analysis of rare variants in target samples typed on genotyping arrays. Here we develop a fast and accurate LRP method, Eagle, that extends this paradigm to populations with much smaller proportions of genotyped samples by harnessing long (>4-cM) identical-by-descent (IBD) tracts shared among distantly related individuals. We applied Eagle to N ≈ 150,000 samples (0.2% of the British population) from the UK Biobank, and we determined that it is 1–2 orders of magnitude faster than existing methods while achieving similar or better phasing accuracy (switch error rate ≈ 0.3%, corresponding to perfect phase in a majority of 10-Mb segments). We also observed that, when used within an imputation pipeline, Eagle prephasing improved downstream imputation accuracy in comparison to prephasing in batches using existing methods, as necessary to achieve comparable computational cost.
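The switch error rate quoted above is conventionally measured over an individual's heterozygous sites: it is the fraction of consecutive heterozygous-site pairs whose relative phase in the inferred haplotypes disagrees with the truth. The following is a minimal Python sketch of that conventional definition, not code from Eagle; the function name `switch_error_rate` and the 0/1 input encoding are assumptions made for illustration, and the paper's own evaluation procedure may differ in detail.

```python
def switch_error_rate(true_hap, inferred_hap):
    """Fraction of adjacent heterozygous-site pairs phased inconsistently.

    Both arguments list the allele (0 or 1) carried on ONE haplotype at the
    individual's heterozygous sites, in genomic order. Because the sites are
    heterozygous, the other haplotype is the complement, so a single
    haplotype fully describes the phase.
    (Hypothetical helper for illustration; not part of any published tool.)
    """
    if len(true_hap) != len(inferred_hap):
        raise ValueError("haplotypes must cover the same heterozygous sites")
    if len(true_hap) < 2:
        return 0.0  # no adjacent pairs, so no switch can occur

    # True where the inferred haplotype carries the same allele as the truth.
    agrees = [t == i for t, i in zip(true_hap, inferred_hap)]

    # A switch error occurs wherever agreement flips between adjacent sites.
    switches = sum(1 for a, b in zip(agrees, agrees[1:]) if a != b)
    return switches / (len(agrees) - 1)


# One flip between the second and third sites, out of five adjacent pairs.
truth = [0, 1, 1, 0, 1, 0]
inferred = [0, 1, 0, 1, 0, 1]
rate = switch_error_rate(truth, inferred)
```

Under this definition, labeling the two haplotypes in the opposite order (the full complement of the truth) yields no switches, since only relative phase between neighboring sites is scored.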

Acknowledgements

We are grateful to G. Bhatia, S. Gusev, M. Lipson, B. Pasaniuc, N. Patterson, and N. Zaitlen for helpful discussions. This research was conducted using the UK Biobank Resource and was supported by US National Institutes of Health grants R01 HG006399 and R01 MH101244 and US National Institutes of Health fellowship F32 HG007805. Computational analyses were performed on the Orchestra High-Performance Compute Cluster at Harvard Medical School, which is partially supported by grant NCRR 1S10RR028832-01, and on the Lisa Genetic Cluster Computer hosted by SURFsara and financially supported by the Netherlands Scientific Organization (NWO 480-05-003, principal investigator D. Posthuma) along with a supplement from the Dutch Brain Foundation and VU University Amsterdam.

Author information

Affiliations

  1. Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA.

    • Po-Ru Loh
    • Pier Francesco Palamara
    • Alkes L Price
  2. Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, Massachusetts, USA.

    • Po-Ru Loh
    • Pier Francesco Palamara
    • Alkes L Price
  3. Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA.

    • Alkes L Price

Contributions

P.-R.L. and P.F.P. designed the algorithm. P.-R.L. implemented the algorithm and performed experiments. P.-R.L. and A.L.P. analyzed data and wrote the manuscript.

Competing interests

The authors declare no competing financial interests.

Corresponding authors

Correspondence to Po-Ru Loh or Alkes L Price.

Supplementary information

PDF files

  1. Supplementary Text and Figures

    Supplementary Note, Supplementary Figures 1 and 2, and Supplementary Tables 1–15.

About this article

DOI

https://doi.org/10.1038/ng.3571
