Haplotype estimation for biobank-scale data sets

O'Connell, Jared; Sharp, Kevin; Shrine, Nick; Wain, Louise; Hall, Ian; Tobin, Martin; Zagury, Jean-Francois; Delaneau, Olivier; Marchini, Jonathan

doi:10.1038/ng.3583

Technical Report
Published: 06 June 2016

Jared O'Connell (1, 2; equal contribution),
Kevin Sharp (2; equal contribution),
Nick Shrine (3),
Louise Wain (3),
Ian Hall (4),
Martin Tobin (3; ORCID: orcid.org/0000-0002-3596-7874),
Jean-Francois Zagury (5),
Olivier Delaneau (6; joint supervision) &
Jonathan Marchini (1, 2; joint supervision; ORCID: orcid.org/0000-0003-0610-8322)

Nature Genetics volume 48, pages 817–820 (2016)

Abstract

The UK Biobank (UKB) has recently released genotypes on 152,328 individuals together with extensive phenotypic and lifestyle information. We present a new phasing method, SHAPEIT3, that can handle such biobank-scale data sets and results in switch error rates as low as ∼0.3%. The method exhibits O(NlogN) scaling with sample size N, enabling fast and accurate phasing of even larger cohorts.

**Figure 1: Performance on the UK BiLEVE chromosome 20 data set.**

Acknowledgements

UK BiLEVE was funded by a Medical Research Council (MRC) strategic award to M.T., I.H., D. Strachan and L.W. (MC_PC_12010). We would like to thank the UK BiLEVE study investigators, array design team, analysis group and steering group. This research has been conducted using the UK Biobank Resource. This article presents independent research funded partially by the National Institute for Health Research (NIHR). The views expressed are those of the authors and not necessarily those of the NHS, the NIHR or the UK Department of Health. J.M. acknowledges support from the ERC (grant 617306).

Author information

Jared O'Connell and Kevin Sharp: These authors contributed equally to this work.
Olivier Delaneau and Jonathan Marchini: These authors jointly supervised this work.

Authors and Affiliations

Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
Jared O'Connell & Jonathan Marchini
Department of Statistics, University of Oxford, Oxford, UK
Jared O'Connell, Kevin Sharp & Jonathan Marchini
Department of Health Sciences, University of Leicester, Leicester, UK
Nick Shrine, Louise Wain & Martin Tobin
School of Medicine, University of Nottingham, Nottingham, UK
Ian Hall
Chaire de Bioinformatique, Laboratoire Génomique, Bioinformatique et Applications (Equipe d'Accueil 4627), Conservatoire National des Arts et Métiers, Paris, France
Jean-Francois Zagury
Département de Génétique et Développement, University of Geneva, Geneva, Switzerland
Olivier Delaneau

Contributions

J.O'C. and J.M. designed the experiments. J.O'C. derived the clustering algorithm and implemented it. O.D. refactored the HMM code. J.O'C. and K.S. worked on the IBD HMM and analyzed the data sets. J.O'C., K.S. and J.M. carried out the experiments. N.S., L.W., I.H. and M.T. provided the UK BiLEVE data set. J.M. and O.D. supervised the research. J.O'C., K.S., J.-F.Z., O.D. and J.M. wrote the manuscript.

Corresponding author

Correspondence to Jonathan Marchini.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 SHAPEIT3 performance by varying the cluster size parameter.

SHAPEIT3 running time (left) and SER (right) against increasing values of the cluster size parameter (M). SHAPEIT2 performance is denoted by the horizontal red line (no M parameter). These results are for the UK BiLEVE test set (N = 49,074) for the region at chr. 20: 10–20 Mb (2,518 SNPs). Computation was performed on an Intel Xeon E5-2690 CPU using four threads. Increasing M improves accuracy at the cost of increased computation time.

Supplementary Figure 2 Number of perfect matches as iteration and sample size vary.

The average number of 2-Mb windows (of 31 windows) per individual where a Hamming distance of zero was found with a conditioning haplotype, plotted against the iteration number of the SHAPEIT3 algorithm. Lines are colored by sample size. The number of perfect-match windows increases markedly with sample size. When a perfect match is found, the SHAPEIT3 algorithm skips HMM calculations in that window. This suggests that, as sample sizes increase, less time needs to be spent on inference per individual, owing to longer haplotype matches. The dip at iteration 16 occurs after the pruning step of the algorithm, where the HMM step needs to be recalculated regardless of matching haplotypes because the graph structure has changed; we resample all haplotypes after this step.
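The perfect-match count described in this legend can be illustrated with a minimal sketch. It is not the SHAPEIT3 implementation: the function names, the 0/1 allele encoding, and the use of marker-index ranges as a stand-in for 2-Mb windows are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two equal-length allele sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def count_perfect_match_windows(
    target: Sequence[int],
    conditioning: List[Sequence[int]],
    window_bounds: List[Tuple[int, int]],
) -> int:
    """Count windows in which some conditioning haplotype matches the target exactly.

    window_bounds holds (start, end) marker indices, end exclusive. This is a
    hypothetical simplification of windows defined by physical position.
    """
    count = 0
    for start, end in window_bounds:
        segment = target[start:end]
        # A window is a perfect match if any conditioning haplotype has
        # Hamming distance zero to the target within that window.
        if any(hamming(segment, h[start:end]) == 0 for h in conditioning):
            count += 1
    return count


# Toy example with two windows of three markers each.
target = [0, 1, 1, 0, 1, 0]
conditioning = [[0, 1, 1, 1, 1, 1], [1, 1, 0, 0, 1, 0]]
n_perfect = count_perfect_match_windows(target, conditioning, [(0, 3), (3, 6)])
```

In this toy input each window is matched exactly by a different conditioning haplotype, so both windows count, even though no single conditioning haplotype matches the target across all six markers.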

Supplementary Figure 3 Estimated haplotypes in trio children phased using SHAPEIT3 on the full UKB data set of 152,112 samples.

Each line represents the phase of a single individual. Changes in color between dark and light blue represent switch errors. Switch errors were determined by comparing the estimated phase to that inferred when using the trio parents.

Supplementary Figure 4 Estimated haplotypes in trio children phased using SHAPEIT3 together with an unrelated subset of 10,000 UKB samples.

Each line represents the phase of a single individual. Changes in color between dark and light blue represent switch errors. Switch errors were determined by comparing the estimated phase to that inferred when using the trio parents.

Supplementary Figure 5 Estimated haplotypes in trio children phased using SHAPEIT3 together with an unrelated subset of 1,000 UKB samples.

Each line represents the phase of a single individual. Changes in color between dark and light blue represent switch errors. Switch errors were determined by comparing the estimated phase to that inferred when using the trio parents.
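The switch-error comparison used in Supplementary Figures 3–5 can be sketched as follows. This is a minimal illustration assuming the standard definition of switch error rate (phase flips between consecutive heterozygous sites, divided by the number of such consecutive pairs); the function name and the 0/1 allele encoding are hypothetical, and the estimated haplotypes are assumed to carry the same genotypes as the truth.

```python
from typing import Sequence


def switch_error_rate(
    true_h1: Sequence[int],
    true_h2: Sequence[int],
    est_h1: Sequence[int],
    est_h2: Sequence[int],
) -> float:
    """Fraction of consecutive heterozygous-site pairs whose relative phase is wrong."""
    # Only heterozygous sites carry phase information.
    het = [i for i in range(len(true_h1)) if true_h1[i] != true_h2[i]]
    if len(het) < 2:
        return 0.0  # assumption: no switch is possible with fewer than two het sites
    # Orientation at each het site: does the first estimated haplotype
    # carry the allele of the first true haplotype?
    orient = [est_h1[i] == true_h1[i] for i in het]
    # A switch error is a change of orientation between neighbouring het sites.
    switches = sum(orient[k] != orient[k - 1] for k in range(1, len(orient)))
    return switches / (len(orient) - 1)


# Truth (for example, phase inferred from trio parents) versus an estimate
# with a single switch after the second site.
truth_a, truth_b = [0, 1, 0, 1, 0], [1, 0, 1, 0, 1]
est_a, est_b = [0, 1, 1, 0, 1], [1, 0, 0, 1, 0]
ser = switch_error_rate(truth_a, truth_b, est_a, est_b)
```

Swapping the two estimated haplotypes wholesale leaves the rate unchanged, since only changes in orientation along the chromosome are counted.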

Supplementary Figure 6 Example of IBD sharing of siblings inferred by the HMM before and after filtering.

Black circles denote the IBS state at each marker across chromosome 20. The colored bars represent the IBD state inferred by our HMM before (top) and after (bottom) filtering out stretches of IBD less than 10 cM in length. The filtered state sequence provides a more conservative estimate of IBD1 sharing and is closer to the IBD patterns we would expect to see in a sibling pair.
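The length filter described here can be illustrated with a short sketch. It assumes a per-marker state sequence in which 0 means no IBD and any nonzero value is an IBD state, that segment length is the genetic distance between the first and last marker of a run, and that filtered runs are reset to the non-IBD state. None of these details are stated in the legend, and the function name is hypothetical.

```python
from typing import List, Sequence


def filter_short_ibd(
    states: Sequence[int],
    cm_positions: Sequence[float],
    min_length_cm: float = 10.0,
) -> List[int]:
    """Reset runs of IBD shorter than min_length_cm to the non-IBD state (0)."""
    filtered = list(states)
    n = len(states)
    i = 0
    while i < n:
        if states[i] == 0:
            i += 1
            continue
        # Extend the run while the same IBD state continues.
        j = i
        while j + 1 < n and states[j + 1] == states[i]:
            j += 1
        # Genetic length of the run, first marker to last marker.
        if cm_positions[j] - cm_positions[i] < min_length_cm:
            for k in range(i, j + 1):
                filtered[k] = 0
        i = j + 1
    return filtered


# A 2-cM run is dropped; a 15-cM run is kept.
states = [1, 1, 0, 1, 1, 1, 1]
cm = [0.0, 2.0, 5.0, 10.0, 15.0, 20.0, 25.0]
kept = filter_short_ibd(states, cm)
```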

Supplementary Figure 7 Comparison of switch error rates for HAPI-UR 3× and SHAPEIT3.

Methods were run on N = 49,074 UK BiLEVE samples in total. Accuracy was compared for each of the 384 phased individuals. Left, prior to filtering out short IBD sequences; right, after filtering out short IBD sequences. Dashed red lines indicate an error rate of 5%. After filtering, the rates are slightly lower for both methods and have lower variance. This suggests that filtering did indeed remove some stretches that induced incorrect phasing in the validation haplotypes.

Supplementary Figure 8 Effect of IBD filtering on switch error rate.

Variation in the SER achieved by SHAPEIT3 applied to all N = 49,074 UK BiLEVE samples. Switch error is measured on phased haplotypes derived from the 384 likely sibling pairs. Left, prior to filtering out short IBD sequences; right, after filtering out short IBD sequences. Horizontal and vertical dashed red lines represent the median error rate and median yield, respectively. In both cases, there is a small preponderance of individuals with SERs above the median at lower yields relative to the number above the median at higher yields (top left quadrant versus top right quadrant). This decreases after filtering from 110:72 to 96:86.
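The quadrant comparison in this legend (top left versus top right, relative to the two medians) can be sketched as below. The function name and the handling of ties (a yield exactly at the median is counted on the right) are assumptions, and the toy values are illustrative only, not data from the study.

```python
from statistics import median
from typing import Sequence, Tuple


def upper_quadrant_counts(
    yields: Sequence[float], sers: Sequence[float]
) -> Tuple[int, int]:
    """Count individuals with above-median SER at below-median versus other yields.

    Returns (top_left, top_right).
    """
    med_yield = median(yields)
    med_ser = median(sers)
    top_left = sum(
        1 for y, s in zip(yields, sers) if s > med_ser and y < med_yield
    )
    top_right = sum(
        1 for y, s in zip(yields, sers) if s > med_ser and y >= med_yield
    )
    return top_left, top_right


# Toy values: high SER concentrated at low yield.
toy_yields = [1, 2, 3, 4, 5, 6]
toy_sers = [9, 8, 7, 1, 2, 3]
quadrants = upper_quadrant_counts(toy_yields, toy_sers)
```

A ratio close to 1:1 between the two counts, as reported after filtering, would indicate that above-median error is no longer concentrated at lower yields.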

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–8, Supplementary Tables 1–4 and Supplementary Note. (PDF 2340 kb)

About this article

Cite this article

O'Connell, J., Sharp, K., Shrine, N. et al. Haplotype estimation for biobank-scale data sets. Nat Genet 48, 817–820 (2016). https://doi.org/10.1038/ng.3583

Received: 12 November 2015
Accepted: 09 May 2016
Published: 06 June 2016
Issue Date: July 2016
DOI: https://doi.org/10.1038/ng.3583
