  • Analysis

Extremely low-coverage sequencing and imputation increases power for genome-wide association studies

Abstract

Genome-wide association studies (GWAS) have proven to be a powerful method to identify common genetic variants contributing to susceptibility to common diseases. Here, we show that extremely low-coverage sequencing (0.1–0.5×) captures almost as much of the common (>5%) and low-frequency (1–5%) variation across the genome as SNP arrays. As an empirical demonstration, we show that genome-wide SNP genotypes can be inferred at a mean r² of 0.71 using off-target data (0.24× average coverage) in a whole-exome study of 909 samples. Using both simulated and real exome-sequencing data sets, we show that association statistics obtained using extremely low-coverage sequencing data attain P values at known associated variants similar to those obtained from genotyping arrays, without an excess of false positives. Given ongoing reductions in sample preparation and sequencing costs, funds invested in extremely low-coverage sequencing can yield several times the effective sample size of GWAS based on SNP array data and a commensurate increase in statistical power.
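The link between imputation accuracy and effective sample size can be illustrated with a minimal sketch. It assumes the standard approximation that a variant imputed with accuracy r² in N samples carries about the same association information as perfect genotypes in N × r² samples; the abstract does not spell out this formula. The 909 samples and the mean r² of 0.71 come from the abstract, and the $300,000 budget from the Figure 4 caption. The function names and every per-sample cost below are hypothetical placeholders, not figures from the paper.

```python
def effective_sample_size(n_samples: int, mean_r2: float) -> float:
    """Approximate effective sample size as N * r^2 (assumed approximation)."""
    if not 0.0 <= mean_r2 <= 1.0:
        raise ValueError("mean_r2 must lie in [0, 1]")
    return n_samples * mean_r2


def samples_for_budget(budget: float, prep_cost: float,
                       cost_per_1x: float, coverage: float) -> int:
    """Samples affordable if each one costs prep_cost + coverage * cost_per_1x."""
    per_sample = prep_cost + coverage * cost_per_1x
    if per_sample <= 0:
        raise ValueError("per-sample cost must be positive")
    return int(budget // per_sample)


# Values stated in the abstract: 909 samples, mean r^2 = 0.71 at 0.24x coverage.
n_eff_exome = effective_sample_size(909, 0.71)

# HYPOTHETICAL costs, for illustration only (not taken from the paper):
# $25 library preparation per sample, $100 per 1x of sequencing, 0.5x coverage.
BUDGET = 300_000  # fixed budget named in the Figure 4 caption
n_low_coverage = samples_for_budget(BUDGET, prep_cost=25.0,
                                    cost_per_1x=100.0, coverage=0.5)

print(f"Effective sample size of the exome study: {n_eff_exome:.1f}")
print(f"Samples affordable under the hypothetical costs: {n_low_coverage}")
```

To compare study designs, multiply each design's affordable sample count by the imputation r² expected at its coverage; the design with the larger product has more power under this approximation.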

Figure 1: Genotype imputation accuracy as a function of coverage in 1000 Genomes Project simulations.
Figure 2: Observed versus expected association −log₁₀ P values at 103,977 SNPs across the genome in simulated null data sets over 909 samples of the combined data set.
Figure 3: Genotype imputation accuracy in IHCS whole-exome data as a function of coverage (solid lines).
Figure 4: Coverage and corresponding number of samples for a fixed budget of $300,000.

Acknowledgements

We would like to acknowledge the ARRA Autism Sequencing Consortium (AASC) principal investigators for use of the autism data sets, including E. Boerwinkle, J.D. Buxbaum, E.H. Cook Jr., M.J. Daly (communicating principal investigator), B. Devlin, R. Gibbs, K. Roeder, A. Sabo, G.D. Schellenberg and J.S. Sutcliffe. We thank T. Lehner, A. Felsenfeld and P. Bender for their support and contribution to the AASC project and to the generation of AUT sequencing data. This research was supported by US National Institutes of Health (NIH) grants (R01 HG006399 to B.P., N.P., D.R. and A.L.P. and R01 MH084676 to S.S.). The IHCS acknowledges generous support from the Mark and Lisa Schwartz Foundation and the Collaboration for AIDS Vaccine Discovery of the Bill and Melinda Gates Foundation. The IHCS was also supported in part by NIH grants (P-30-AI060354 to the Harvard University Center for AIDS Research, AI069513, AI34835, AI069432, AI069423, AI069477, AI069501, AI069474, AI069428, AI69467, AI069415, AI32782, AI27661, AI25859, AI28568, AI30914, AI069495, AI069471, AI069532, AI069452, AI069450, AI069556, AI069484, AI069472, AI34853, AI069465, AI069511, AI38844, AI069424, AI069434, AI46370, AI68634, AI069502, AI069419, AI068636 and RR024975 to the AIDS Clinical Trials Group and AI077505 to D.W.H.). Data generation for the NIMH controls was directly supported by NIH grants (R01 MH089208, R01 MH089025, R01 MH089004 and R01 MH089482). SCZ data generation was supported by an NIMH grant (5RC2MH089905; P.S. and S.M.P.) and by the Sylvan Herman Foundation and the Stanley Medical Research Institute (a gift to the Stanley Center for Psychiatric Research).

Author information

Contributions

B.P., N.R., N.P., A.L.P. and D.R. conceived and designed the study. B.P. conducted the analyses. L.L., S.S., N.R., P.J.M., N.Z. and H.L. provided bioinformatics and statistical support. P.I.W.d.B., N.G., K.G., B.M.N., M.J.D., P.S., P.F.S., S.B., J.L.M., C.M.H., P.L., P.M., S.M.P. and D.W.H. recruited and provided samples and data for these analyses. B.P., A.L.P. and D.R. wrote the paper. All authors contributed to the final version of the manuscript.

Corresponding authors

Correspondence to Bogdan Pasaniuc, David Reich or Alkes L Price.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Text and Figures

Supplementary Tables 1–6, Supplementary Figures 1–8 and Supplementary Note (PDF 1490 kb)

About this article

Cite this article

Pasaniuc, B., Rohland, N., McLaren, P. et al. Extremely low-coverage sequencing and imputation increases power for genome-wide association studies. Nat Genet 44, 631–635 (2012). https://doi.org/10.1038/ng.2283

  • DOI: https://doi.org/10.1038/ng.2283
