Extremely low-coverage sequencing and imputation increases power for genome-wide association studies

Pasaniuc, Bogdan; Rohland, Nadin; McLaren, Paul J; Garimella, Kiran; Zaitlen, Noah; Li, Heng; Gupta, Namrata; Neale, Benjamin M; Daly, Mark J; Sklar, Pamela; Sullivan, Patrick F; Bergen, Sarah; Moran, Jennifer L; Hultman, Christina M; Lichtenstein, Paul; Magnusson, Patrik; Purcell, Shaun M; Haas, David W; Liang, Liming; Sunyaev, Shamil; Patterson, Nick; de Bakker, Paul I W; Reich, David; Price, Alkes L

doi:10.1038/ng.2283

Analysis
Published: 20 May 2012

Nature Genetics volume 44, pages 631–635 (2012)

Abstract

Genome-wide association studies (GWAS) have proven to be a powerful method to identify common genetic variants contributing to susceptibility to common diseases. Here, we show that extremely low-coverage sequencing (0.1–0.5×) captures almost as much of the common (>5%) and low-frequency (1–5%) variation across the genome as SNP arrays. As an empirical demonstration, we show that genome-wide SNP genotypes can be inferred at a mean r² of 0.71 using off-target data (0.24× average coverage) in a whole-exome study of 909 samples. Using both simulated and real exome-sequencing data sets, we show that association statistics obtained using extremely low-coverage sequencing data attain similar P values at known associated variants as data from genotyping arrays, without an excess of false positives. Within the context of reductions in sample preparation and sequencing costs, funds invested in extremely low-coverage sequencing can yield several times the effective sample size of GWAS based on SNP array data and a commensurate increase in statistical power.
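The cost argument in the abstract can be made concrete with a small sketch. It assumes the common approximation that association power at a SNP imputed with accuracy r² matches that of N × r² perfectly genotyped samples. The budget is the $300,000 used in Figure 4 and the r² of 0.71 is the mean reported in the abstract. The per-sample costs, the array r² of 1.0 and the function names are hypothetical placeholders chosen for illustration, not figures from the paper.

```python
def effective_sample_size(n_samples: int, r2: float) -> float:
    """Approximate effective sample size, N * r^2, for imputed genotypes."""
    if not 0.0 <= r2 <= 1.0:
        raise ValueError("r2 must lie in [0, 1]")
    return n_samples * r2


def samples_for_budget(budget: float, cost_per_sample: float) -> int:
    """Number of whole samples affordable at a fixed per-sample cost."""
    if cost_per_sample <= 0:
        raise ValueError("cost_per_sample must be positive")
    return int(budget // cost_per_sample)


# Values stated in the source: 909 samples, mean r^2 = 0.71, $300,000 budget.
N_EXOME_SAMPLES = 909
MEAN_R2 = 0.71
BUDGET = 300_000.0

# HYPOTHETICAL per-sample costs in dollars, for illustration only.
ARRAY_COST = 400.0
LOW_COVERAGE_COST = 50.0

# Effective sample size of the 909-sample study at the reported accuracy.
n_eff_exome = effective_sample_size(N_EXOME_SAMPLES, MEAN_R2)

# Fixed-budget comparison. Array genotypes are treated as perfectly
# accurate (r^2 = 1.0), which is a simplifying assumption.
n_array = samples_for_budget(BUDGET, ARRAY_COST)
n_low_coverage = samples_for_budget(BUDGET, LOW_COVERAGE_COST)
n_eff_array = effective_sample_size(n_array, 1.0)
n_eff_low_coverage = effective_sample_size(n_low_coverage, MEAN_R2)
ratio = n_eff_low_coverage / n_eff_array

if __name__ == "__main__":
    print(f"Effective size of the 909-sample study: {n_eff_exome:.1f}")
    print(f"Array design: {n_array} samples, effective {n_eff_array:.0f}")
    print(f"Low-coverage design: {n_low_coverage} samples, "
          f"effective {n_eff_low_coverage:.0f}")
    print(f"Effective sample size ratio: {ratio:.2f}")
```

Under these placeholder costs the low-coverage design gains because the larger sample count outweighs the loss in per-sample accuracy. With real prices the ratio would differ, and r² itself depends on the coverage chosen.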

**Figure 1: Genotype imputation accuracy as function of coverage in 1000 Genomes Project simulations.**

**Figure 2: Observed versus expected association −log₁₀ P values at 103,977 SNPs across the genome in simulated null data sets over 909 samples of the combined data set.**

**Figure 3: Genotype imputation accuracy in IHCS whole-exome data as a function of coverage (solid lines).**

**Figure 4: Coverage and corresponding number of samples for a fixed budget of $300,000.**

Acknowledgements

We would like to acknowledge the ARRA Autism Sequencing Consortium (AASC) principal investigators for use of the autism data sets, including E. Boerwinkle, J.D. Buxbaum, E.H. Cook Jr., M.J. Daly (communicating principal investigator), B. Devlin, R. Gibbs, K. Roeder, A. Sabo, G.D. Schellenberg and J.S. Sutcliffe. We thank T. Lehner, A. Felsenfeld and P. Bender for their support and contribution to the AASC project and to the generation of AUT sequencing data. This research was supported by US National Institutes of Health (NIH) grants (R01 HG006399 to B.P., N.P., D.R. and A.L.P. and R01 MH084676 to S.S.). The IHCS acknowledges generous support from the Mark and Lisa Schwartz Foundation and the Collaboration for AIDS Vaccine Discovery of the Bill and Melinda Gates Foundation. The IHCS was also supported in part by NIH grants (P-30-AI060354 to the Harvard University Center for AIDS Research, AI069513, AI34835, AI069432, AI069423, AI069477, AI069501, AI069474, AI069428, AI69467, AI069415, AI32782, AI27661, AI25859, AI28568, AI30914, AI069495, AI069471, AI069532, AI069452, AI069450, AI069556, AI069484, AI069472, AI34853, AI069465, AI069511, AI38844, AI069424, AI069434, AI46370, AI68634, AI069502, AI069419, AI068636 and RR024975 to the AIDS Clinical Trials Group and AI077505 to D.W.H.). Data generation for the NIMH controls was directly supported by NIH grants (R01MH089208, R01 MH089025, R01 MH089004 and R01 MH089482). SCZ data generation was supported by an NIMH grant (5RC2MH089905; P.S. and S.M.P.) and by the Sylvan Herman Foundation and the Stanley Medical Research Institute (a gift to the Stanley Center for Psychiatric Research).

Author information

David Reich and Alkes L Price: These authors jointly directed this work.

Authors and Affiliations

Department of Epidemiology, Harvard School of Public Health, Boston, Massachusetts, USA
Bogdan Pasaniuc, Noah Zaitlen, Liming Liang & Alkes L Price
Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts, USA
Bogdan Pasaniuc, Noah Zaitlen, Liming Liang & Alkes L Price
Broad Institute of Harvard and Massachusetts Institute of Technology (MIT), Cambridge, Massachusetts, USA
Bogdan Pasaniuc, Nadin Rohland, Paul J McLaren, Kiran Garimella, Noah Zaitlen, Heng Li, Namrata Gupta, Benjamin M Neale, Mark J Daly, Sarah Bergen, Jennifer L Moran, Shaun M Purcell, Liming Liang, Shamil Sunyaev, Nick Patterson, Paul I W de Bakker, David Reich & Alkes L Price
Department of Genetics, Harvard Medical School, Boston, Massachusetts, USA
Nadin Rohland & David Reich
Division of Genetics, Brigham and Women's Hospital, Boston, Massachusetts, USA
Paul J McLaren, Shamil Sunyaev & Paul I W de Bakker
Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts, USA
Benjamin M Neale, Mark J Daly & Shaun M Purcell
Department of Psychiatry, Friedman Brain Institute, Mount Sinai School of Medicine, New York, New York, USA
Pamela Sklar
Institute for Genomics and Multiscale Biology, Mount Sinai School of Medicine, New York, New York, USA
Pamela Sklar
Department of Genetics, University of North Carolina School of Medicine, Chapel Hill, North Carolina, USA
Patrick F Sullivan
Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
Christina M Hultman, Paul Lichtenstein & Patrik Magnusson
Vanderbilt University School of Medicine, Nashville, Tennessee, USA
David W Haas
Department of Medical Genetics, University Medical Center Utrecht, Utrecht, The Netherlands
Paul I W de Bakker
Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht, The Netherlands
Paul I W de Bakker

Contributions

B.P., N.R., N.P., A.L.P. and D.R. conceived and designed the study. B.P. conducted the analyses. L.L., S.S., N.R., P.J.M., N.Z. and H.L. provided bioinformatics and statistical support. P.I.W.d.B., N.G., K.G., B.M.N., M.J.D., P.S., P.F.S., S.B., J.L.M., C.M.H., P.L., P.M., S.M.P. and D.W.H. recruited and provided samples and data for these analyses. B.P., A.L.P. and D.R. wrote the paper. All authors contributed to the final version of the manuscript.

Corresponding authors

Correspondence to Bogdan Pasaniuc, David Reich or Alkes L Price.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Text and Figures

Supplementary Tables 1–6, Supplementary Figures 1–8 and Supplementary Note (PDF 1490 kb)

About this article

Cite this article

Pasaniuc, B., Rohland, N., McLaren, P. et al. Extremely low-coverage sequencing and imputation increases power for genome-wide association studies. Nat Genet 44, 631–635 (2012). https://doi.org/10.1038/ng.2283

Received: 13 December 2011
Accepted: 16 April 2012
Published: 20 May 2012
Issue Date: June 2012
DOI: https://doi.org/10.1038/ng.2283
