Efficient phasing and imputation of low-coverage sequencing data using large reference panels

Rubinacci, Simone; Ribeiro, Diogo M.; Hofmeister, Robin J.; Delaneau, Olivier

doi:10.1038/s41588-020-00756-0

Technical Report
Published: 07 January 2021

Nature Genetics volume 53, pages 120–126 (2021)

A Publisher Correction to this article was published on 20 January 2021

This article has been updated

Abstract

Low-coverage whole-genome sequencing followed by imputation has been proposed as a cost-effective genotyping approach for disease and population genetics studies. However, its competitiveness against SNP arrays is undermined because current imputation methods are computationally expensive and unable to leverage large reference panels. Here, we describe a method, GLIMPSE, for phasing and imputation of low-coverage sequencing datasets from modern reference panels. We demonstrate its remarkable performance across different coverages and human populations. GLIMPSE achieves imputation of a genome for less than US$1 in computational cost, considerably outperforming other methods and improving imputation accuracy over the full allele frequency range. As a proof of concept, we show that 1× coverage enables effective gene expression association studies and outperforms dense SNP arrays in rare variant burden tests. Overall, this study illustrates the promising potential of low-coverage imputation and suggests a paradigm shift in the design of future genomic studies.

**Fig. 2: Performance and running time of low-coverage sequencing phasing and imputation.**

**Fig. 3: Comparison of low-coverage and SNP array imputation.**

**Fig. 4: Functional variant analysis across low-coverage and SNP array call sets.**

Data availability

The 1000 Genomes Project phase 3 dataset sequenced at high coverage by the New York Genome Center is available on the European Nucleotide Archive under accession no. PRJEB31736. The publicly available subset of the HRC dataset is available from the European Genome-phenome Archive at the European Bioinformatics Institute (EBI) under accession no. EGAS00001001710. The Genome in a Bottle data for sample NA12878 are available at the National Center for Biotechnology Information FTP website: ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878. The subset of the 1000 Genomes samples genotyped on Affymetrix6.0 is available at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/hd_genotype_chip/. GnomAD v.3 is available at https://gnomad.broadinstitute.org/downloads. The list of positions used to simulate the SNP arrays is available at https://www.well.ox.ac.uk/~wrayner/strand/. The RNA-seq data are part of the Geuvadis study and are available at the EBI ArrayExpress under accession code no. E-GEUV-1. The ENCODE project was accessed using accession nos. integration_data_jan2011 for the lymphoblastoid cell line-specific protein binding sites, ENCSR000EJD for the DNase-hypersensitive sites and ENCSR000AKC for locations with H3K27ac histone modifications. The results shown in Fig. 3a,b are a subset of the configurations tested. A full view of the results is available at the GLIMPSE website (European population: https://odelaneau.github.io/GLIMPSE/rsquare_eur.html, African-American population: https://odelaneau.github.io/GLIMPSE/rsquare_asw.html). The full raw data used to generate Fig. 3a,b and the benchmark shown on the website are available at the GLIMPSE repository (https://github.com/odelaneau/GLIMPSE/tree/master/docs/data/rsquare). Source data are provided with this paper.

Code availability

GLIMPSE is available from https://github.com/odelaneau/GLIMPSE and https://odelaneau.github.io/GLIMPSE/.

Change history

20 January 2021
An amendment to this paper has been published and can be accessed via a link at the top of the paper.

Acknowledgements

This work was funded by a Swiss National Science Foundation project grant no. PP00P3_176977. The New York Genome Center 1000 Genomes data were generated at the New York Genome Center with funds provided by a National Human Genome Research Institute grant no. 3UM1HG008901–03S1. We thank S. Carmi for useful comments on the preprint version of the manuscript.

Author information

Authors and Affiliations

Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
Simone Rubinacci, Diogo M. Ribeiro, Robin J. Hofmeister & Olivier Delaneau
Swiss Institute of Bioinformatics, University of Lausanne, Lausanne, Switzerland
Simone Rubinacci, Diogo M. Ribeiro, Robin J. Hofmeister & Olivier Delaneau

Authors

Simone Rubinacci
Diogo M. Ribeiro
Robin J. Hofmeister
Olivier Delaneau

Contributions

S.R., D.M.R. and O.D. designed the study, performed the experiments and drafted the paper. S.R. and O.D. developed the algorithm and wrote the software. S.R., R.J.H. and O.D. created the website. O.D. supervised the project. All authors reviewed the final manuscript.

Corresponding author

Correspondence to Olivier Delaneau.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Genetics thanks Garrett Hellenthal, Sam Morris and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Read count distribution of downsampled sequencing data.

The y-axis shows the fractions of genotypes covered by 0 to 11 sequencing reads across multiple downsampled coverages from 0.1x to 4.0x. The color bars show the observed fractions in the downsampled data while the black dots and lines show the expected fractions assuming coverage is Poisson distributed.
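The expected fractions mentioned in this caption follow directly from the Poisson probability mass function. The snippet below is a minimal sketch of that calculation, not the authors' code; the function name `poisson_read_fractions` and the choice to stop at 11 reads (matching the range described in the caption) are illustrative assumptions.

```python
import math


def poisson_read_fractions(coverage, max_reads=11):
    """Expected fraction of genotypes covered by k reads, for k = 0..max_reads,
    assuming the per-site read count is Poisson distributed with mean `coverage`."""
    return [
        math.exp(-coverage) * coverage ** k / math.factorial(k)
        for k in range(max_reads + 1)
    ]


# Compare a few of the downsampled coverages named in the caption.
for cov in (0.1, 1.0, 4.0):
    fractions = poisson_read_fractions(cov)
    print(f"{cov}x: fraction with 0 reads = {fractions[0]:.3f}, "
          f"fraction with >= 1 read = {1 - fractions[0]:.3f}")
```

Under this model, most sites receive no reads at all at 0.1x, which is why imputation from a reference panel is needed to recover genotypes at such coverages.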

Extended Data Fig. 2 Phasing performance of subsets of EUR and ASW samples.

Performance of the GLIMPSE (blue line) and SHAPEIT4 (black line) phasing algorithms. SHAPEIT4 was run to rephase the genotype calls produced by GLIMPSE, as it can only handle hard-called genotypes. Validation genotypes were generated using an Affymetrix 6.0 SNP array. Validation haplotypes were derived from additional genotyped samples, which made it possible to form multiple duos and trios.

Extended Data Fig. 3 Genotype discordance.

Genotype discordance stratified by minor-allele-frequency for the 1x coverage European population dataset on chromosome 1. The reference panels used are a subset of 5,000 samples (solid lines), and 20,000 samples from the HRC (dashed lines). The genotype discordance is shown for (A) all genotypes, and split between (B) major/major, (C) major/minor or (D) minor/minor genotypes in the validation dataset.

Extended Data Fig. 4 Zoomed-in genotype discordance for MAF > 1%.

Genotype discordance stratified by minor-allele-frequency (MAF > 1%) for the 1x coverage European population dataset on chromosome 1. The reference panels used are a subset of 5,000 samples (solid lines), and 20,000 samples from the HRC (dashed lines). The genotype discordance is shown for (A) all genotypes, and split between (B) major/major, (C) major/minor or (D) minor/minor genotypes in the validation dataset.

Extended Data Fig. 5 Non-reference discordance.

Non-reference discordance (NRD) stratified by non-reference allele frequency for the 1x coverage European population dataset on chromosome 1. The reference panels used are a subset of 5,000 samples (solid lines), and 20,000 samples from the HRC (dashed lines). (A.) Non-reference allele frequency > 0.01%; (B.) Non-reference allele frequency > 1%. The NRD is calculated as $(e_{rr} + e_{ra} + e_{aa})/(m_{ra} + m_{aa} + e_{rr} + e_{ra} + e_{aa})$, where $e_{rr}$, $e_{ra}$ and $e_{aa}$ are the counts of the mismatches for the homozygous reference, heterozygous and homozygous alternative genotypes, while $m_{ra}$ and $m_{aa}$ are the counts of the matches at heterozygous and homozygous alternative genotypes.
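The NRD definition in this caption can be computed from paired validation and imputed genotypes. The following is a minimal sketch rather than the authors' implementation; it assumes genotypes are encoded as alternative-allele counts (0, 1 or 2), and the function name `non_reference_discordance` is hypothetical.

```python
import math


def non_reference_discordance(truth, imputed):
    """NRD = (e_rr + e_ra + e_aa) / (m_ra + m_aa + e_rr + e_ra + e_aa).

    Genotypes are alternative-allele counts (assumed encoding):
    0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternative.
    """
    if len(truth) != len(imputed):
        raise ValueError("truth and imputed must have the same length")
    mismatches = 0      # e_rr + e_ra + e_aa: every discordant genotype
    nonref_matches = 0  # m_ra + m_aa: concordant het or hom-alt genotypes
    for t, g in zip(truth, imputed):
        if t != g:
            mismatches += 1
        elif t in (1, 2):
            nonref_matches += 1
        # Concordant homozygous reference calls are deliberately ignored.
    denominator = mismatches + nonref_matches
    return mismatches / denominator if denominator else math.nan


truth = [0, 0, 1, 2, 1, 0]
imputed = [0, 1, 1, 2, 0, 0]
print(non_reference_discordance(truth, imputed))  # 2 mismatches, 2 non-ref matches -> 0.5
```

Because matching homozygous reference genotypes are left out of the denominator, the measure is not inflated by the large number of trivially correct calls at rare variants.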

Extended Data Fig. 6 Calibration of genotype posteriors for 1.0x coverage.

(A.) Calibration of genotype posterior probabilities of different imputation methods for 1.0x coverage European dataset on chromosome 1. The reference panels used are a subset of 5,000 samples (solid lines), and 20,000 samples from the HRC (dashed lines). Imputed genotypes are binned according to the posterior probability distribution (x-axis) and plotted against the percentage of concordance against high coverage data (y-axis). (B.) Number of genotypes per probability bin.
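The binning described in this caption can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the paper: the caption does not specify the bin width, so ten equal-width bins are assumed, and each genotype is assumed to be summarised by the posterior probability of its called genotype plus a flag saying whether that call matches the high-coverage data. The name `calibration_table` is hypothetical.

```python
def calibration_table(posteriors, concordant, n_bins=10):
    """Bin genotype calls by posterior probability and report, per bin,
    (lower edge, upper edge, number of genotypes, percent concordant).

    `posteriors` are probabilities in [0, 1]; `concordant` are booleans saying
    whether each call agrees with the validation genotype. Equal-width bins
    are an assumption made for this sketch.
    """
    counts = [0] * n_bins
    hits = [0] * n_bins
    for p, ok in zip(posteriors, concordant):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        counts[b] += 1
        hits[b] += 1 if ok else 0
    table = []
    for b in range(n_bins):
        percent = 100.0 * hits[b] / counts[b] if counts[b] else None
        table.append((b / n_bins, (b + 1) / n_bins, counts[b], percent))
    return table


for low, high, n, pct in calibration_table([0.95, 0.92, 0.99, 0.55],
                                           [True, True, False, True]):
    if n:
        print(f"[{low:.1f}, {high:.1f}): n = {n}, concordance = {pct:.1f}%")
```

For well-calibrated posteriors, the concordance in each bin should be close to the probabilities that define the bin; the per-bin counts correspond to what panel B reports.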

Extended Data Fig. 7 Running time of imputation methods.

Running time of low-coverage sequencing imputation methods for the European population chromosome 1 dataset. We only ran GENEIMP on 1x coverage data. For BEAGLE and GENEIMP we only show reference panel size up to 5,000 samples due to time limits. The vertical axis is on a log scale.

Extended Data Fig. 8 Memory usage of imputation methods.

Memory usage of low-coverage sequencing methods for the European population chromosome 1 dataset. We only ran GENEIMP on 1x coverage data. For BEAGLE and GENEIMP we only show reference panel size up to 5,000 samples due to time limits. LOIMPUTE imputes a single sample at a time, therefore the reported memory usage is for a single sample, while we report the memory usage for the full cohort of 503 individuals for all other methods. The vertical axis is on a log scale.

Extended Data Fig. 9 Lead eQTL overlap and association p-value mean absolute error.

(A) Overlap between lead eQTLs identified in high-coverage and each low-coverage and SNP array dataset. eQTL mapping was performed independently for each dataset (FDR 5%; MAF ≥ 1%). eGenes in which the lead eQTL p-value was tied with another variant’s p-value (for example due to perfect linkage disequilibrium) were excluded, as the choice of variant for being the lead eQTL in these cases is arbitrary. The total number of genes assessed after filtering was 5,037. (B) Mean absolute error between -log10 p-values of association obtained for high-coverage lead eQTLs and those obtained in each dataset for the same set of variants. All high-coverage lead eQTLs (that is, one variant for each of the 16,894 genes) were considered here, regardless of significance level. The scatterplots detail the -log10 p-values used to calculate the mean absolute errors for several relevant low coverages and SNP arrays.
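The mean absolute error in panel B compares association p-values on the -log10 scale for the same variants in two datasets. A minimal sketch of that metric is given below; the function name `neg_log10_mae` and the toy p-values are illustrative assumptions, not values from the study.

```python
import math


def neg_log10_mae(p_high_coverage, p_other):
    """Mean absolute error between -log10 p-values for the same variants,
    taken from the high-coverage dataset and from another dataset."""
    if len(p_high_coverage) != len(p_other):
        raise ValueError("both p-value lists must cover the same variants")
    if not p_high_coverage:
        raise ValueError("at least one variant is required")
    errors = [
        abs(-math.log10(p_ref) - (-math.log10(p_alt)))
        for p_ref, p_alt in zip(p_high_coverage, p_other)
    ]
    return sum(errors) / len(errors)


# Toy example: one variant loses two orders of magnitude, the other is unchanged.
print(neg_log10_mae([1e-8, 1e-3], [1e-6, 1e-3]))
```

Working on the -log10 scale means the error reflects differences in orders of magnitude of significance, so a shift from 1e-8 to 1e-6 counts the same as a shift from 1e-3 to 1e-1.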

Supplementary information

Supplementary Information

Supplementary Note, Figs. 1–15, and Tables 1 and 2

Reporting Summary

Source data

Source Data Fig. 2

Statistical source data

Source Data Fig. 3

Statistical source data

Source Data Fig. 4

Statistical source data

About this article

Cite this article

Rubinacci, S., Ribeiro, D.M., Hofmeister, R.J. et al. Efficient phasing and imputation of low-coverage sequencing data using large reference panels. Nat Genet 53, 120–126 (2021). https://doi.org/10.1038/s41588-020-00756-0

Received: 20 April 2020
Accepted: 20 November 2020
Published: 07 January 2021
Issue Date: January 2021
DOI: https://doi.org/10.1038/s41588-020-00756-0
