  • Technical Report

Haplotype estimation for biobank-scale data sets

Abstract

The UK Biobank (UKB) has recently released genotypes on 152,328 individuals together with extensive phenotypic and lifestyle information. We present a new phasing method, SHAPEIT3, that can handle such biobank-scale data sets and results in switch error rates as low as 0.3%. The method exhibits O(N log N) scaling with sample size N, enabling fast and accurate phasing of even larger cohorts.

Figure 1: Performance on the UK BiLEVE chromosome 20 data set.

Acknowledgements

UK BiLEVE was funded by a Medical Research Council (MRC) strategic award to M.T., I.H., D. Strachan and L.W. (MC_PC_12010). We would like to thank the UK BiLEVE study investigators, array design team, analysis group and steering group. This research has been conducted using the UK Biobank Resource. This article presents independent research funded partially by the National Institute for Health Research (NIHR). The views expressed are those of the authors and not necessarily those of the NHS, the NIHR or the UK Department of Health. J.M. acknowledges support from the ERC (grant 617306).

Author information

Contributions

J.O'C. and J.M. designed the experiments. J.O'C. derived the clustering algorithm and implemented it. O.D. refactored the HMM code. J.O'C. and K.S. worked on the IBD HMM and analyzed the data sets. J.O'C., K.S. and J.M. carried out the experiments. N.S., L.W., I.H. and M.T. provided the UK BiLEVE data set. J.M. and O.D. supervised the research. J.O'C., K.S., J.-F.Z., O.D. and J.M. wrote the manuscript.

Corresponding author

Correspondence to Jonathan Marchini.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 SHAPEIT3 performance by varying the cluster size parameter.

SHAPEIT3 running time (left) and SER (right) against increasing values of the cluster size parameter (M). SHAPEIT2 performance is denoted by the horizontal red line (no M parameter). These results are for the UK BiLEVE test set (N = 49,074) for the region at chr. 20: 10–20 Mb (2,518 SNPs). Computation was performed on an Intel Xeon E5-2690 CPU using four threads. Increasing M improves accuracy at the cost of increased computation time.

Supplementary Figure 2 Number of perfect matches as iteration and sample size vary.

The average number of 2-Mb windows (of 31 windows) per individual where a Hamming distance of zero was found with a conditioning haplotype, plotted against the iteration number of the SHAPEIT3 algorithm. Lines are colored by sample size. The number of perfect-match windows increases markedly with sample size. When a perfect match is found, the SHAPEIT3 algorithm skips HMM calculations in that window. This suggests that, as sample sizes increase, less time needs to be spent on inference per individual, owing to longer haplotype matches. The dip at iteration 16 occurs after the pruning step of the algorithm: because the graph structure has changed, the HMM step must be recalculated regardless of matching haplotypes, so all haplotypes are resampled after this step.
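The perfect-match check described in this caption can be illustrated with a minimal sketch. It is not the SHAPEIT3 implementation: the function names are hypothetical, haplotypes are assumed to be lists of 0/1 alleles, and windows are defined by a fixed number of sites rather than by 2 Mb of physical distance.

```python
def hamming(a, b):
    """Number of positions at which two equal-length allele sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def perfect_match_windows(target, panel, window_size):
    """For each window, report whether any conditioning haplotype in `panel`
    matches `target` exactly (Hamming distance zero) within that window.

    Assumption: windows are consecutive blocks of `window_size` sites.
    """
    flags = []
    for start in range(0, len(target), window_size):
        stop = start + window_size
        segment = target[start:stop]
        flags.append(any(hamming(segment, hap[start:stop]) == 0 for hap in panel))
    return flags


# Toy data: one target haplotype and two conditioning haplotypes, six sites.
target = [0, 1, 1, 0, 1, 0]
panel = [
    [0, 1, 1, 1, 1, 1],
    [1, 1, 0, 0, 1, 1],
]
flags = perfect_match_windows(target, panel, window_size=3)
print(flags)  # [True, False]
```

In this toy example only the first window has an exact match, so under the behaviour described in the caption only the second window would still require HMM calculations.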

Supplementary Figure 3 Estimated haplotypes in trio children phased using SHAPEIT3 on the full UKB data set of 152,112 samples.

Each line represents the phase of a single individual. Changes in color between dark and light blue represent switch errors. Switch errors were determined by comparing the estimated phase to that inferred when using the trio parents.
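As an illustration of how switch errors can be counted against a trio-derived truth, the sketch below uses the common definition of switch error rate: the fraction of consecutive heterozygous site pairs at which the estimated phase flips relative to the true phase. The function name and input layout are assumptions made for this example and are not taken from the article.

```python
def switch_error_rate(est_hap, true_hap, het_sites):
    """Switch error rate over the heterozygous sites of one individual.

    est_hap, true_hap: alleles (0/1) carried on the first haplotype at each
    site, for the estimated and the true (for example, trio-derived) phase.
    het_sites: indices of the heterozygous sites, in genomic order.
    """
    # At a heterozygous site the estimate either agrees with the true first
    # haplotype or with the true second haplotype.
    same = [est_hap[i] == true_hap[i] for i in het_sites]
    if len(same) < 2:
        return 0.0
    # A switch is a change of orientation between consecutive het sites.
    switches = sum(a != b for a, b in zip(same, same[1:]))
    return switches / (len(same) - 1)


true_hap = [0, 1, 0, 1, 1, 0]
est_hap = [0, 1, 1, 0, 0, 0]
ser = switch_error_rate(est_hap, true_hap, het_sites=range(6))
print(ser)  # 0.4 (two switches among five consecutive pairs)
```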

Supplementary Figure 4 Estimated haplotypes in trio children phased using SHAPEIT3 together with an unrelated subset of 10,000 UKB samples.

Each line represents the phase of a single individual. Changes in color between dark and light blue represent switch errors. Switch errors were determined by comparing the estimated phase to that inferred when using the trio parents.

Supplementary Figure 5 Estimated haplotypes in trio children phased using SHAPEIT3 together with an unrelated subset of 1,000 UKB samples.

Each line represents the phase of a single individual. Changes in color between dark and light blue represent switch errors. Switch errors were determined by comparing the estimated phase to that inferred when using the trio parents.

Supplementary Figure 6 Example of IBD sharing of siblings inferred by the HMM before and after filtering.

Black circles denote the IBS state at each marker across chromosome 20. The colored bars represent the IBD state inferred by our HMM before (top) and after (bottom) filtering out stretches of IBD less than 10 cM in length. The filtered state sequence provides a more conservative estimate of IBD1 sharing and is closer to the IBD patterns we would expect to see in a sibling pair.
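The length filter described here can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes a per-marker state sequence in which 0 means no IBD and any non-zero value is an IBD state, assumes marker positions are given in cM, and assumes that a filtered stretch is simply reset to state 0. The function name is hypothetical.

```python
def filter_short_ibd(states, cm_pos, min_cm=10.0):
    """Reset IBD stretches spanning less than `min_cm` centimorgans to 0.

    states: inferred state per marker (0 = no IBD, non-zero = IBD).
    cm_pos: genetic position of each marker in cM, in increasing order.
    """
    filtered = list(states)
    i, n = 0, len(states)
    while i < n:
        # Extend j to the last marker of the run of identical states.
        j = i
        while j + 1 < n and states[j + 1] == states[i]:
            j += 1
        # Length is measured from the first to the last marker of the run.
        if states[i] != 0 and cm_pos[j] - cm_pos[i] < min_cm:
            for k in range(i, j + 1):
                filtered[k] = 0
        i = j + 1
    return filtered


states = [0, 1, 1, 0, 1, 1, 1, 0]
cm_pos = [0, 2, 5, 8, 12, 20, 30, 33]
filtered = filter_short_ibd(states, cm_pos)
print(filtered)  # [0, 0, 0, 0, 1, 1, 1, 0]
```

The first stretch spans 3 cM and is removed; the second spans 18 cM and is kept.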

Supplementary Figure 7 Comparison of switch error rates for HAPI-UR 3× and SHAPEIT3.

Methods were run on N = 49,074 UK BiLEVE samples in total. Accuracy was compared for each of the 384 phased individuals. Left, prior to filtering out short IBD sequences; right, after filtering out short IBD sequences. Dashed red lines indicate an error rate of 5%. After filtering, the rates are slightly lower for both methods and have lower variance. This suggests that filtering did indeed remove some stretches that induced incorrect phasing in the validation haplotypes.

Supplementary Figure 8 Effect of IBD filtering on switch error rate.

Variation in the SER achieved by SHAPEIT3 applied to all N = 49,074 UK BiLEVE samples. Switch error is measured on phased haplotypes derived from the 384 likely sibling pairs. Left, prior to filtering out short IBD sequences; right, after filtering out short IBD sequences. Horizontal and vertical dashed red lines represent the median error rate and median yield, respectively. In both cases, there is a small preponderance of individuals with SERs above the median at lower yields relative to the number above the median at higher yields (top left quadrant versus top right quadrant). This decreases after filtering from 110:72 to 96:86.
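The quadrant comparison in this caption amounts to splitting individuals at the median SER and the median yield and counting those in the two upper quadrants. The sketch below shows one way to do this on toy data. The function name is hypothetical, the values are made up for illustration, and individuals lying exactly on a median are excluded, which is an assumption since the caption does not say how ties are handled.

```python
from statistics import median


def upper_quadrant_counts(ser, yields):
    """Count individuals with SER above the median, split by whether their
    yield is below (top left) or above (top right) the median yield."""
    ser_med = median(ser)
    yield_med = median(yields)
    pairs = list(zip(ser, yields))
    top_left = sum(s > ser_med and y < yield_med for s, y in pairs)
    top_right = sum(s > ser_med and y > yield_med for s, y in pairs)
    return top_left, top_right


# Toy values only; not data from the study.
ser = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06]
yields = [0.9, 0.8, 0.2, 0.3, 0.7, 0.6]
counts = upper_quadrant_counts(ser, yields)
print(counts)  # (2, 1)
```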

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–8, Supplementary Tables 1–4 and Supplementary Note. (PDF 2340 kb)

About this article

Cite this article

O'Connell, J., Sharp, K., Shrine, N. et al. Haplotype estimation for biobank-scale data sets. Nat Genet 48, 817–820 (2016). https://doi.org/10.1038/ng.3583

  • DOI: https://doi.org/10.1038/ng.3583
