Efficient phasing and imputation of low-coverage sequencing data using large reference panels

A Publisher Correction to this article was published on 20 January 2021

This article has been updated

Abstract

Low-coverage whole-genome sequencing followed by imputation has been proposed as a cost-effective genotyping approach for disease and population genetics studies. However, its competitiveness against SNP arrays is undermined because current imputation methods are computationally expensive and unable to leverage large reference panels. Here, we describe a method, GLIMPSE, for phasing and imputation of low-coverage sequencing datasets from modern reference panels. We demonstrate its remarkable performance across different coverages and human populations. GLIMPSE achieves imputation of a genome for less than US$1 in computational cost, considerably outperforming other methods and improving imputation accuracy over the full allele frequency range. As a proof of concept, we show that 1× coverage enables effective gene expression association studies and outperforms dense SNP arrays in rare variant burden tests. Overall, this study illustrates the promising potential of low-coverage imputation and suggests a paradigm shift in the design of future genomic studies.

Fig. 1: Overview of GLIMPSE.
Fig. 2: Performance and running time of low-coverage sequencing phasing and imputation.
Fig. 3: Comparison of low-coverage and SNP array imputation.
Fig. 4: Functional variant analysis across low-coverage and SNP array call sets.

Data availability

The 1000 Genomes Project phase 3 dataset sequenced at high coverage by the New York Genome Center is available on the European Nucleotide Archive under accession no. PRJEB31736. The publicly available subset of the HRC dataset is available from the European Genome-phenome Archive at the European Bioinformatics Institute (EBI) under accession no. EGAS00001001710. The Genome in a Bottle data for sample NA12878 are available at the National Center for Biotechnology Information ftp website: ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878. The subset of the 1000 Genomes samples genotyped on Affymetrix6.0 is available at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/hd_genotype_chip/. GnomAD v.3 is available at https://gnomad.broadinstitute.org/downloads. The list of positions used to simulate the SNP arrays is available at https://www.well.ox.ac.uk/~wrayner/strand/. The RNA-seq data are part of the Geuvadis study and are available at the EBI ArrayExpress under accession code no. E-GEUV-1. The ENCODE project data were accessed using accession nos. integration_data_jan2011 for the lymphoblastoid cell line-specific protein binding sites, ENCSR000EJD for the DNase-hypersensitive sites and ENCSR000AKC for locations with H3K27ac histone modifications. The results shown in Fig. 3a,b are a subset of the configurations tested. A full view of the results is available at the GLIMPSE website (European population: https://odelaneau.github.io/GLIMPSE/rsquare_eur.html, African-American population: https://odelaneau.github.io/GLIMPSE/rsquare_asw.html). The full raw data used to generate Fig. 3a,b and the benchmark shown on the website are available at the GLIMPSE repository (https://github.com/odelaneau/GLIMPSE/tree/master/docs/data/rsquare). Source data are provided with this paper.

Code availability

GLIMPSE is available from https://github.com/odelaneau/GLIMPSE and https://odelaneau.github.io/GLIMPSE/.

Change history

  • 20 January 2021

    An amendment to this paper has been published and can be accessed via a link at the top of the paper.

Acknowledgements

This work was funded by a Swiss National Science Foundation project grant no. PP00P3_176977. The New York Genome Center 1000 Genomes data were generated at the New York Genome Center with funds provided by a National Human Genome Research Institute grant no. 3UM1HG008901–03S1. We thank S. Carmi for useful comments on the preprint version of the manuscript.

Author information

Contributions

S.R., D.M.R. and O.D. designed the study, performed the experiments and drafted the paper. S.R. and O.D. developed the algorithm and wrote the software. S.R., R.J.H. and O.D. created the website. O.D. supervised the project. All authors reviewed the final manuscript.

Corresponding author

Correspondence to Olivier Delaneau.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Genetics thanks Garrett Hellenthal, Sam Morris and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Read count distribution of downsampled sequencing data.

The y-axis shows the fractions of genotypes covered by 0 to 11 sequencing reads across multiple downsampled coverages from 0.1x to 4.0x. The color bars show the observed fractions in the downsampled data while the black dots and lines show the expected fractions assuming coverage is Poisson distributed.
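The expected fractions in this figure can be reproduced with a few lines of code. The following is a minimal sketch, assuming the per-site read count follows a Poisson distribution whose mean equals the nominal coverage; the function name `poisson_read_fractions` is hypothetical and is not part of GLIMPSE.

```python
import math


def poisson_read_fractions(mean_coverage, max_reads=11):
    """Expected fraction of sites covered by exactly k reads, for k = 0..max_reads.

    Assumption: the read count at a site is Poisson distributed with
    mean `mean_coverage` (for example 1.0 for 1.0x data).
    """
    if mean_coverage < 0:
        raise ValueError("mean_coverage must be non-negative")
    return [
        math.exp(-mean_coverage) * mean_coverage ** k / math.factorial(k)
        for k in range(max_reads + 1)
    ]


if __name__ == "__main__":
    # Coverages chosen from the 0.1x to 4.0x range described in the caption.
    for coverage in (0.1, 1.0, 4.0):
        fractions = poisson_read_fractions(coverage)
        print(f"{coverage:.1f}x: fraction with 0 reads = {fractions[0]:.3f}")
```

Under this assumption, roughly 37% of sites carry no read at all at 1.0x coverage, which is why genotype likelihoods alone are insufficient and imputation from a reference panel is needed.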

Extended Data Fig. 2 Phasing performance of subsets of EUR and ASW samples.

Performance of the GLIMPSE (blue line) and SHAPEIT4 (black line) phasing algorithms. SHAPEIT4 was run to rephase the genotype calls produced by GLIMPSE, as it can only handle hard-called genotypes. Validation genotypes were generated using an Affymetrix 6.0 SNP array. Validation haplotypes were derived using additional genotyped samples, which allowed multiple duos and trios to be formed.

Extended Data Fig. 3 Genotype discordance.

Genotype discordance stratified by minor-allele-frequency for the 1x coverage European population dataset on chromosome 1. The reference panels used are a subset of 5,000 samples (solid lines), and 20,000 samples from the HRC (dashed lines). The genotype discordance is shown for (A) all genotypes, and split between (B) major/major, (C) major/minor or (D) minor/minor genotypes in the validation dataset.

Extended Data Fig. 4 Zoomed-in genotype discordance for MAF > 1%.

Genotype discordance stratified by minor-allele-frequency (MAF > 1%) for the 1x coverage European population dataset on chromosome 1. The reference panels used are a subset of 5,000 samples (solid lines), and 20,000 samples from the HRC (dashed lines). The genotype discordance is shown for (A) all genotypes, and split between (B) major/major, (C) major/minor or (D) minor/minor genotypes in the validation dataset.

Extended Data Fig. 5 Non-reference discordance.

Non-reference discordance (NRD) stratified by non-reference allele frequency for the 1x coverage European population dataset on chromosome 1. The reference panels used are a subset of 5,000 samples (solid lines), and 20,000 samples from the HRC (dashed lines). (A.) Non-reference allele frequency > 0.01%; (B.) Non-reference allele frequency > 1%. The NRD is calculated as \(\left( e_{rr} + e_{ra} + e_{aa} \right)/\left( m_{ra} + m_{aa} + e_{rr} + e_{ra} + e_{aa} \right)\), where \(e_{rr}\), \(e_{ra}\) and \(e_{aa}\) are the counts of the mismatches for the homozygous reference, heterozygous and homozygous alternative genotypes, while \(m_{ra}\) and \(m_{aa}\) are the counts of the matches at heterozygous and homozygous alternative genotypes.
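The NRD definition in this caption translates directly into code. The sketch below is illustrative only: the function name and the example counts are hypothetical, and the argument names simply mirror the symbols in the formula.

```python
def non_reference_discordance(e_rr, e_ra, e_aa, m_ra, m_aa):
    """NRD = (e_rr + e_ra + e_aa) / (m_ra + m_aa + e_rr + e_ra + e_aa).

    e_rr, e_ra, e_aa: mismatch counts at homozygous reference, heterozygous
                      and homozygous alternative genotypes.
    m_ra, m_aa:       match counts at heterozygous and homozygous
                      alternative genotypes.
    Matches at homozygous reference genotypes are deliberately left out of
    the denominator, as in the formula above.
    """
    errors = e_rr + e_ra + e_aa
    denominator = m_ra + m_aa + errors
    if denominator == 0:
        raise ValueError("NRD is undefined when all counts are zero")
    return errors / denominator


if __name__ == "__main__":
    # Made-up counts for illustration: 10 mismatches and 90 non-reference matches.
    print(non_reference_discordance(e_rr=2, e_ra=5, e_aa=3, m_ra=60, m_aa=30))
```

Because homozygous reference matches are excluded, the metric is not dominated by the large number of trivially correct reference calls at rare variants.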

Extended Data Fig. 6 Calibration of genotype posteriors for 1.0x coverage.

(A.) Calibration of genotype posterior probabilities of different imputation methods for 1.0x coverage European dataset on chromosome 1. The reference panels used are a subset of 5,000 samples (solid lines), and 20,000 samples from the HRC (dashed lines). Imputed genotypes are binned according to the posterior probability distribution (x-axis) and plotted against the percentage of concordance against high coverage data (y-axis). (B.) Number of genotypes per probability bin.
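A calibration curve of this kind can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: it assumes equal-width probability bins (the caption does not state the bin width), and the function name `calibration_curve` and its inputs are hypothetical.

```python
def calibration_curve(posteriors, is_concordant, n_bins=10):
    """Bin called genotypes by posterior probability and report concordance.

    posteriors:    posterior probability of each called genotype, in [0, 1].
    is_concordant: True where the call matches the high-coverage genotype.
    n_bins:        number of equal-width bins (an assumption of this sketch).

    Returns one tuple per non-empty bin:
    (bin lower bound, bin upper bound, genotype count, concordance in %).
    """
    counts = [0] * n_bins
    hits = [0] * n_bins
    for p, ok in zip(posteriors, is_concordant):
        if not 0.0 <= p <= 1.0:
            raise ValueError("posterior probabilities must lie in [0, 1]")
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        counts[b] += 1
        if ok:
            hits[b] += 1
    return [
        (b / n_bins, (b + 1) / n_bins, counts[b], 100.0 * hits[b] / counts[b])
        for b in range(n_bins)
        if counts[b] > 0
    ]


if __name__ == "__main__":
    # Toy data: well-calibrated posteriors give concordance close to the bin value.
    rows = calibration_curve([0.95, 0.95, 0.55, 0.55], [True, True, True, False])
    for lower, upper, count, concordance in rows:
        print(f"[{lower:.1f}, {upper:.1f}): n={count}, concordance={concordance:.0f}%")
```

The third element of each tuple corresponds to panel B (number of genotypes per bin), and the fourth to the y-axis of panel A.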

Extended Data Fig. 7 Running time of imputation methods.

Running time of low-coverage sequencing imputation methods for the European population chromosome 1 dataset. We only ran GENEIMP on 1x coverage data. For BEAGLE and GENEIMP we only show reference panel size up to 5,000 samples due to time limits. The vertical axis is on a log scale.

Extended Data Fig. 8 Memory usage of imputation methods.

Memory usage of low-coverage sequencing methods for the European population chromosome 1 dataset. We only ran GENEIMP on 1x coverage data. For BEAGLE and GENEIMP we only show reference panel size up to 5,000 samples due to time limits. LOIMPUTE imputes a single sample at a time, therefore the reported memory usage is for a single sample, while we report the memory usage for the full cohort of 503 individuals for all other methods. The vertical axis is on a log scale.

Extended Data Fig. 9 Lead eQTL overlap and association p-value mean absolute error.

(A) Overlap between lead eQTLs identified in high-coverage and each low-coverage and SNP array dataset. eQTL mapping was performed independently for each dataset (FDR 5%; MAF ≥ 1%). eGenes in which the lead eQTL p-value was tied with another variant’s p-value (for example due to perfect linkage disequilibrium) were excluded, as the choice of variant for being the lead eQTL in these cases is arbitrary. The total number of genes assessed after filtering was 5,037. (B) Mean absolute error between -log10 p-values of association obtained for high-coverage lead eQTLs and those obtained in each dataset for the same set of variants. All high-coverage lead eQTLs (that is, a variant for each of the 16,894 genes) were considered here, regardless of significance level. The scatterplots detail the -log10 p-values used to calculate the mean absolute errors for several relevant low-coverages and SNP arrays.
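The error measure in panel B can be written compactly. The sketch below assumes two equally long lists of association p-values for the same variants, one from the high-coverage data and one from the dataset under comparison; the function name and the toy values are hypothetical.

```python
import math


def neg_log10_mae(reference_pvalues, dataset_pvalues):
    """Mean absolute error between -log10 p-values of two matched lists.

    reference_pvalues: p-values of the lead eQTLs in the high-coverage data.
    dataset_pvalues:   p-values of the same variants in another dataset.
    """
    if len(reference_pvalues) != len(dataset_pvalues):
        raise ValueError("both lists must cover the same variants")
    if not reference_pvalues:
        raise ValueError("at least one variant is required")
    errors = [
        abs(-math.log10(p_ref) - (-math.log10(p_ds)))
        for p_ref, p_ds in zip(reference_pvalues, dataset_pvalues)
    ]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Toy values: -log10 differences of 2 and 1 give a mean absolute error of 1.5.
    print(neg_log10_mae([1e-10, 1e-4], [1e-8, 1e-5]))
```

Working on the -log10 scale weights differences in strongly associated variants by orders of magnitude rather than by raw probability, which matches how the scatterplots in the figure are drawn.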

Supplementary information

Supplementary Information

Supplementary Note, Figs. 1–15, and Tables 1 and 2

Reporting Summary

Source data

Source Data Fig. 2

Statistical source data

Source Data Fig. 3

Statistical source data

Source Data Fig. 4

Statistical source data

About this article

Cite this article

Rubinacci, S., Ribeiro, D.M., Hofmeister, R.J. et al. Efficient phasing and imputation of low-coverage sequencing data using large reference panels. Nat Genet 53, 120–126 (2021). https://doi.org/10.1038/s41588-020-00756-0

Further reading

  • Low-pass sequencing increases the power of GWAS and decreases measurement error of polygenic risk scores compared to genotyping arrays

    Jeremiah H. Li, Chase A. Mazur, Tomaz Berisa & Joseph K. Pickrell

    Genome Research (2021)

  • Characterization of a haplotype-reference panel for genotyping by low-pass sequencing in Swiss Large White pigs

    Adéla Nosková, Meenu Bhati, Naveen Kumar Kadri, Danang Crysnanto, Stefan Neuenschwander, Andreas Hofer & Hubert Pausch

    BMC Genomics (2021)

  • The genomic history of the Aegean palatial civilizations

    Florian Clemente, Martina Unterländer, Olga Dolgova, Carlos Eduardo G. Amorim, Francisco Coroado-Santos, Samuel Neuenschwander, Elissavet Ganiatsou, Diana I. Cruz Dávalos, Lucas Anchieri, Frédéric Michaud, Laura Winkelbach, Jens Blöcher, Yami Ommar Arizmendi Cárdenas, Bárbara Sousa da Mota, Eleni Kalliga, Angelos Souleles, Ioannis Kontopoulos, Georgia Karamitrou-Mentessidi, Olga Philaniotou, Adamantios Sampson, Dimitra Theodorou, Metaxia Tsipopoulou, Ioannis Akamatis, Paul Halstead, Kostas Kotsakis, Dushka Urem-Kotsou, Diamantis Panagiotopoulos, Christina Ziota, Sevasti Triantaphyllou, Olivier Delaneau, Jeffrey D. Jensen, J. Víctor Moreno-Mayar, Joachim Burger, Vitor C. Sousa, Oscar Lao, Anna-Sapfo Malaspinas & Christina Papageorgopoulou

    Cell (2021)

  • Low-coverage sequencing cost-effectively detects known and novel variation in underrepresented populations

    Alicia R. Martin, Elizabeth G. Atkinson, Sinéad B. Chapman, Anne Stevenson, Rocky E. Stroud, Tamrat Abebe, Dickens Akena, Melkam Alemayehu, Fred K. Ashaba, Lukoye Atwoli, Tera Bowers, Lori B. Chibnik, Mark J. Daly, Timothy DeSmet, Sheila Dodge, Abebaw Fekadu, Steven Ferriera, Bizu Gelaye, Stella Gichuru, Wilfred E. Injera, Roxanne James, Symon M. Kariuki, Gabriel Kigen, Karestan C. Koenen, Edith Kwobah, Joseph Kyebuzibwa, Lerato Majara, Henry Musinguzi, Rehema M. Mwema, Benjamin M. Neale, Carter P. Newman, Charles R.J.C. Newton, Joseph K. Pickrell, Raj Ramesar, Welelta Shiferaw, Dan J. Stein, Solomon Teferra, Celia van der Merwe & Zukiswa Zingela

    The American Journal of Human Genetics (2021)
