Extremely low-coverage sequencing and imputation increases power for genome-wide association studies

Article metrics

Abstract

Genome-wide association studies (GWAS) have proven to be a powerful method to identify common genetic variants contributing to susceptibility to common diseases. Here, we show that extremely low-coverage sequencing (0.1–0.5×) captures almost as much of the common (>5%) and low-frequency (1–5%) variation across the genome as SNP arrays. As an empirical demonstration, we show that genome-wide SNP genotypes can be inferred at a mean r2 of 0.71 using off-target data (0.24× average coverage) in a whole-exome study of 909 samples. Using both simulated and real exome-sequencing data sets, we show that association statistics obtained using extremely low-coverage sequencing data attain similar P values at known associated variants as data from genotyping arrays, without an excess of false positives. Within the context of reductions in sample preparation and sequencing costs, funds invested in extremely low-coverage sequencing can yield several times the effective sample size of GWAS based on SNP array data and a commensurate increase in statistical power.

Access options

Rent or Buy article

Get time limited or full article access on ReadCube.

from$8.99

All prices are NET prices.

Figure 1: Genotype imputation accuracy as function of coverage in 1000 Genomes Project simulations.
Figure 2: Observed versus expected association −log10 P values at 103,977 SNPs across the genome in simulated null data sets over 909 samples of the combined data set.
Figure 3: Genotype imputation accuracy in IHCS whole-exome data as a function of coverage (solid lines).
Figure 4: Coverage and corresponding number of samples for a fixed budget of $300,000.

References

  1. 1

    Hindorff, L.A. et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc. Natl. Acad. Sci. USA 106, 9362–9367 (2009).

  2. 2

    1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073 (2010).

  3. 3

    DePristo, M.A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43, 491–498 (2011).

  4. 4

    Marchini, J. & Howie, B. Genotype imputation for genome-wide association studies. Nat. Rev. Genet. 11, 499–511 (2010).

  5. 5

    Altshuler, D.M. et al. Integrating common and rare genetic variation in diverse human populations. Nature 467, 52–58 (2010).

  6. 6

    Metzker, M.L. Sequencing technologies—the next generation. Nat. Rev. Genet. 11, 31–46 (2010).

  7. 7

    Nielsen, R., Paul, J.S., Albrechtsen, A. & Song, Y.S. Genotype and SNP calling from next-generation sequencing data. Nat. Rev. Genet. 12, 443–451 (2011).

  8. 8

    Li, Y. et al. Resequencing of 200 human exomes identifies an excess of low-frequency non-synonymous coding variants. Nat. Genet. 42, 969–972 (2010).

  9. 9

    Pickrell, J.K. et al. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature 464, 768–772 (2010).

  10. 10

    Montgomery, S.B. et al. Transcriptome genetics using second generation sequencing in a Caucasian population. Nature 464, 773–777 (2010).

  11. 11

    Rohland, N. & Reich, D. Cost-effective high-throughput DNA sequencing libraries. Genome Res. published online, doi:10.1101/gr.128124.111 (20 January 2012).

  12. 12

    Browning, B.L. & Yu, Z. Simultaneous genotype calling and haplotype phasing improves genotype accuracy and reduces false-positive associations for genome-wide association studies. Am. J. Hum. Genet. 85, 847–861 (2009).

  13. 13

    Pritchard, J.K. & Przeworski, M. Linkage disequilibrium in humans: models and data. Am. J. Hum. Genet. 69, 1–14 (2001).

  14. 14

    Pereyra, F. et al. The major genetic determinants of HIV-1 control affect HLA class I peptide presentation. Science 330, 1551–1557 (2010).

  15. 15

    Suarez, B.K. et al. Genomewide linkage scan of 409 European-ancestry and African American families with schizophrenia: suggestive evidence of linkage at 8p23.3-p21.2 and 11p13.1-q14.1 in the combined sample. Am. J. Hum. Genet. 78, 315–333 (2006).

  16. 16

    O'Donovan, M. C. et al. Analysis of 10 independent samples provides evidence for association between schizophrenia and a SNP flanking fibroblast growth factor receptor 2. Mol. Psychiatry 14, 30–36 (2009).

  17. 17

    The GAIN Collaborative Research Group. New models of collaboration in genome-wide association studies: the Genetic Association Information Network. Nat. Genet. 39, 1045–1051 (2007).

  18. 18

    The International Schizophrenia Consortium. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 460, 748–752 (2009).

  19. 19

    Musunuru, K. et al. Exome sequencing, ANGPTL3 mutations, and familial combined hypolipidemia. N. Engl. J. Med. 363, 2220–2227 (2010).

  20. 20

    Meyerson, M., Gabriel, S. & Getz, G. Advances in understanding cancer genomes through second-generation sequencing. Nat. Rev. Genet. 11, 685–696 (2010).

  21. 21

    Devlin, B. & Roeder, K. Genomic control for association studies. Biometrics 55, 997–1004 (1999).

  22. 22

    Sampson, J., Jacobs, K., Yeager, M., Chanock, S. & Chatterjee, N. Efficient study design for next generation sequencing. Genet. Epidemiol. 35, 269–277 (2011).

  23. 23

    Kim, S.Y. et al. Design of association studies with pooled or un-pooled next-generation sequencing data. Genet. Epidemiol. 34, 479–491 (2010).

  24. 24

    Le, S.Q. & Durbin, R. SNP detection and genotyping from low-coverage sequencing data on multiple diploid samples. Genome Res. 21, 952–960 (2011).

  25. 25

    Prabhu, S. & Pe'er, I. Overlapping pools for high-throughput targeted resequencing. Genome Res. 19, 1254–1261 (2009).

  26. 26

    Bansal, V. et al. Accurate detection and genotyping of SNPs utilizing population sequencing data. Genome Res. 20, 537–545 (2010).

  27. 27

    Li, Y., Willer, C.J., Ding, J., Scheet, P. & Abecasis, G.R. MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet. Epidemiol. 34, 816–834 (2010).

  28. 28

    McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).

  29. 29

    Armitage, P. Tests for linear trends in proportions and frequencies. Biometrics 11, 375–386 (1955).

Download references

Acknowledgements

We would like to acknowledge the ARRA Autism Sequencing Consortium (AASC) principal investigators for use of the autism data sets, including E. Boerwinkle, J.D. Buxbaum, E.H. Cook Jr., M.J. Daly (communicating principal investigator), B. Devlin, R. Gibbs, K. Roeder, A. Sabo, G.D. Schellenberg and J.S. Sutcliffe. We thank T. Lehner, A. Felsenfeld and P. Bender for their support and contribution to the AASC project and to the generation of AUT sequencing data. This research was supported by US National Institutes of Health (NIH) grants (R01 HG006399 to B.P., N.P., D.R. and A.L.P. and R01 MH084676 to S.S.). The IHCS acknowledges generous support from the Mark and Lisa Schwartz Foundation and the Collaboration for AIDS Vaccine Discovery of the Bill and Melinda Gates Foundation. The IHCS was also supported in part by NIH grants (P-30-AI060354 to the Harvard University Center for AIDS Research, AI069513, AI34835, AI069432, AI069423, AI069477, AI069501, AI069474, AI069428, AI69467, AI069415, Al32782, AI27661, AI25859, AI28568, AI30914, AI069495, AI069471, AI069532, AI069452, AI069450, AI069556, AI069484, AI069472, AI34853, AI069465, AI069511, AI38844, AI069424, AI069434, AI46370, AI68634, AI069502, AI069419, AI068636 and RR024975 to the AIDS Clinical Trials Group and AI077505 to D.W.H.). Data generation for the NIMH controls was directly supported by NIH grants (R01MH089208, R01 MH089025, R01 MH089004 and R01 MH089482). SCZ data generation was supported by an NIMH grant (5RC2MH089905; P.S. and S.M.P.) and by the Sylvan Herman Foundation and the Stanley Medical Research Institute (a gift to the Stanley Center for Psychiatric Research).

Author information

B.P., N.R., N.P., A.L.P. and D.R. conceived and designed the study. B.P. conducted the analyses. L.L., S.S., N.R., P.J.M., N.Z. and H.L. provided bioinformatics and statistical support. P.I.W.d.B., N.G., K.G., B.M.N., M.J.D., P.S., P.F.S., S.B., J.L.M., C.M.H., P.L., P.M., S.M.P. and D.W.H. recruited and provided samples and data for these analyses. B.P., A.L.P. and D.R. wrote the paper. All authors contributed to the final version of the manuscript.

Correspondence to Bogdan Pasaniuc or David Reich or Alkes L Price.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Text and Figures

Supplementary Tables 1–6, Supplementary Figures 1–8 and Supplementary Note (PDF 1490 kb)

Rights and permissions

Reprints and Permissions

About this article

Cite this article

Pasaniuc, B., Rohland, N., McLaren, P. et al. Extremely low-coverage sequencing and imputation increases power for genome-wide association studies. Nat Genet 44, 631–635 (2012) doi:10.1038/ng.2283

Download citation

Further reading