Rapid genotype imputation from sequence without reference panels

Journal name: Nature Genetics
Volume: 48
Pages: 965–969
DOI: 10.1038/ng.3594

Abstract

Inexpensive genotyping methods are essential for genetic studies requiring large sample sizes. In human studies, genotyping microarrays and high-density haplotype reference panels allow efficient genotype imputation for this purpose. However, these resources are typically unavailable in non-human settings. Here we describe a method (STITCH) for imputation based only on sequencing read data, without requiring additional reference panels or array data. We demonstrate its applicability even in settings of extremely low sequencing coverage, by accurately imputing 5.7 million SNPs at a mean r² value of 0.98 in 2,073 outbred laboratory mice (0.15× sequencing coverage). In a sample of 11,670 Han Chinese (1.7× sequencing coverage), we achieve accuracy similar to that of alternative approaches that require a reference panel, demonstrating that our approach can work for genetically diverse populations. Our method enables straightforward progression from low-coverage sequence to imputed genotypes, overcoming barriers that at present restrict the application of genome-wide association study technology outside humans.
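
The accuracy values quoted above are r² values, that is, squared correlations between imputed and validated genotypes. As a rough illustration of how such a value could be computed for a single SNP, here is a minimal Python sketch. The function name and the input layout (one imputed dosage and one validation genotype per sample) are assumptions made for this example and are not taken from the STITCH software.

```python
def squared_correlation(imputed_dosages, true_genotypes):
    """Squared Pearson correlation between imputed dosages (0 to 2) and
    validation genotypes (0, 1 or 2) for one SNP across samples."""
    n = len(imputed_dosages)
    if n != len(true_genotypes) or n < 2:
        raise ValueError("need two equal-length vectors with at least 2 samples")
    mean_x = sum(imputed_dosages) / n
    mean_y = sum(true_genotypes) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(imputed_dosages, true_genotypes))
    var_x = sum((x - mean_x) ** 2 for x in imputed_dosages)
    var_y = sum((y - mean_y) ** 2 for y in true_genotypes)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined when either vector is constant
        # (for example, a SNP that is monomorphic in the validation set).
        return float("nan")
    return (cov * cov) / (var_x * var_y)
```

A mean r² over many SNPs would then be the average of these per-SNP values, skipping SNPs for which the correlation is undefined.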

Figures

    Figure 1: Overview of STITCH.

    After initializing various parameters (left), represented here by ancestral haplotypes, 40 EM iterations are performed (middle). Each iteration involves (i) determining hidden haplotype states (going down, left side) using current parameters and sample reads and (ii) parameter updates (going up, right side) using sample reads and haplotype probabilities (hidden states). Once the EM iterations are completed, imputed genotypes are generated using the haplotype probabilities and ancestral haplotypes from the final iteration (right). This example uses real data from CFW mice with K = 4 founder haplotypes for approximately 3,000 bp on chromosome 19 containing 20 imputed SNPs. Each of the SNPs in the four reconstructed haplotypes is shown as a vertical bar split proportionally by the probability of emitting the reference (black) or alternate (gray) allele. Sample reads are similarly colored.
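
The two-step loop described in this caption can be illustrated with a much simplified Python sketch. It is not the STITCH algorithm: it assumes haploid samples, ignores recombination between SNPs (so it is a mixture model rather than a hidden Markov model), and ignores base qualities. The function names, the read layout (a list of `(snp_index, allele)` pairs per sample), and the pseudocount are all assumptions made for this example.

```python
import random

def em_haplotype_mixture(reads, n_snps, k, n_iter=40, theta=None, seed=0):
    """Toy EM over K ancestral haplotypes.

    reads[i] is a list of (snp_index, allele) pairs for sample i, with
    allele 1 = alternate and 0 = reference.
    Returns (theta, gamma): theta[h][s] is P(alt | haplotype h, SNP s) and
    gamma[i][h] is P(sample i copies haplotype h | its reads).
    """
    rng = random.Random(seed)
    if theta is None:
        # Random start; results then depend on the seed.
        theta = [[rng.uniform(0.1, 0.9) for _ in range(n_snps)] for _ in range(k)]
    else:
        theta = [row[:] for row in theta]
    pi = [1.0 / k] * k   # prior weight of each ancestral haplotype
    pseudo = 0.01        # pseudocount keeps theta strictly between 0 and 1
    gamma = []
    for _ in range(n_iter):
        # (i) E-step: haplotype probabilities given current parameters and reads.
        # Plain products are fine for a few reads; use logs for real data.
        gamma = []
        for sample_reads in reads:
            weights = []
            for h in range(k):
                w = pi[h]
                for s, allele in sample_reads:
                    w *= theta[h][s] if allele == 1 else 1.0 - theta[h][s]
                weights.append(w)
            total = sum(weights)
            gamma.append([w / total for w in weights])
        # (ii) M-step: update parameters from reads weighted by those probabilities.
        for h in range(k):
            alt = [pseudo] * n_snps
            depth = [2 * pseudo] * n_snps
            for g, sample_reads in zip(gamma, reads):
                for s, allele in sample_reads:
                    alt[s] += g[h] * allele
                    depth[s] += g[h]
            theta[h] = [a / d for a, d in zip(alt, depth)]
        pi = [sum(g[h] for g in gamma) / len(reads) for h in range(k)]
    return theta, gamma

def impute_dosage(theta, gamma_i, s):
    """Expected alternate-allele dosage (haploid, 0 to 1) for one sample at SNP s."""
    return sum(g * theta[h][s] for h, g in enumerate(gamma_i))
```

In this toy setting, a sample with reads at only two SNPs still receives a dosage at every SNP, because its reads identify which reconstructed haplotype it most likely copies.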

    Figure 2: Performance of STITCH on CFW mice in comparison to external validation.

    The validation data sets include the Illumina MegaMUGA array (left) and 10× Illumina sequencing (right). Results are shown for STITCH (K = 4, diploid mode), Beagle (default settings), and findhap (maxlen = 10,000, minlen = 100, steps = 3, iters = 4) across the genome for n = 2,073 mice featuring 7.07 million SNPs before quality control and 5.72 million SNPs after quality control. STITCH was run with 40 iterations. The SNPs retained after quality control have info >0.4 and Hardy–Weinberg equilibrium P value >1 × 10⁻⁶. MAF, minor allele frequency.
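
The quality-control rule in this caption (info >0.4 and Hardy–Weinberg equilibrium P value >1 × 10⁻⁶) can be sketched in Python as follows. The caption does not say which Hardy–Weinberg test was used, so this example assumes a one-degree-of-freedom chi-square test on hard genotype counts, and it takes the info score as a precomputed input. The function names and arguments are hypothetical.

```python
from math import erfc, sqrt

def hwe_chisq_pvalue(n_ref_hom, n_het, n_alt_hom):
    """Chi-square (1 d.f.) P value for Hardy-Weinberg equilibrium.
    Assumption: a chi-square test stands in for whatever test was used."""
    n = n_ref_hom + n_het + n_alt_hom
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_ref_hom + n_het) / (2 * n)   # reference allele frequency
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0                          # monomorphic site: nothing to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_ref_hom, n_het, n_alt_hom)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return erfc(sqrt(chi2 / 2.0))

def passes_qc(info, n_ref_hom, n_het, n_alt_hom, info_min=0.4, hwe_min=1e-6):
    """Keep a SNP only if it clears both thresholds named in the caption."""
    return info > info_min and hwe_chisq_pvalue(n_ref_hom, n_het, n_alt_hom) > hwe_min
```

For example, a SNP with genotype counts of 25, 50 and 25 matches Hardy–Weinberg proportions exactly and is retained when its info score exceeds 0.4, whereas a SNP with no heterozygotes among 100 samples at allele frequency 0.5 is rejected.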

    Figure 3: Performance of STITCH on CONVERGE humans in comparison to external validation.

    The validation data sets are the Illumina HumanOmniZhongHua-8 array and 10× sequencing. Results are shown for STITCH (K = 40, 38 pseudo-haploid iterations and 2 diploid iterations), Beagle (all SNPs, default settings; reduced SNPs, 3 iterations with a reference panel), and findhap (maxlen = 50,000, minlen = 500, steps = 3, iters = 4) for the first 10 Mb of chromosome 20 for n = 11,670 Han Chinese samples. Reduced SNPs are those also present in the 1000 Genomes Project ASN (Asian) reference panel. The SNPs retained after quality control have info >0.4 and Hardy–Weinberg equilibrium P value >1 × 10⁻⁶.

    Figure 4: Effects of reduced sequence coverage.

    Results are shown for CFW mice (left) and CONVERGE humans using STITCH (middle) and Beagle run without a reference panel (right). Validation was performed using array data, with each value representing the average for common SNPs (allele frequency 5–95%), without correction for quality control after imputation. Downsampling of samples and reads was performed at random, except that samples necessary for accuracy assessment were always retained. STITCH settings were the same as for the full CFW and CONVERGE data sets. The colors representing downsampled sequence depth are the same for the STITCH and Beagle results.
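
The downsampling described in this caption might look like the following Python sketch. The caption states only that samples and reads were downsampled at random and that samples needed for accuracy assessment were always retained; keeping each read independently with a fixed probability is an assumption of this example, as are the function names and arguments.

```python
import random

def downsample_samples(sample_ids, validation_ids, n_keep, seed=0):
    """Randomly keep n_keep samples, always retaining the validation samples."""
    rng = random.Random(seed)
    required = [s for s in sample_ids if s in validation_ids]
    optional = [s for s in sample_ids if s not in validation_ids]
    if not len(required) <= n_keep <= len(sample_ids):
        raise ValueError("n_keep must cover all validation samples "
                         "and not exceed the number of samples")
    return required + rng.sample(optional, n_keep - len(required))

def downsample_reads(reads, fraction, seed=0):
    """Keep each read independently with probability `fraction`
    (for example, target coverage divided by original coverage)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie between 0 and 1")
    rng = random.Random(seed)
    return [read for read in reads if rng.random() < fraction]
```

Because the validation samples are placed in the kept set before any random draw, accuracy can be assessed on the same individuals at every sample size.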

Author information

  1. These authors contributed equally to this work.

    • Simon Myers &
    • Richard Mott

Affiliations

  1. Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK.

    • Robert W Davies,
    • Simon Myers &
    • Richard Mott
  2. Center for Neurobehavioral Genetics, Semel Institute for Neuroscience and Human Behavior, University of California, Los Angeles, Los Angeles, California, USA.

    • Jonathan Flint
  3. Department of Statistics, University of Oxford, Oxford, UK.

    • Simon Myers
  4. UCL Genetics Institute, University College London, London, UK.

    • Richard Mott

Contributions

R.W.D., S.M., and R.M. developed the method. R.W.D. wrote the algorithm and performed analyses. J.F. and R.M. conceived and managed the CFW and CONVERGE projects. All authors contributed to study design, drafted the paper, and reviewed and contributed to the final manuscript.

Competing financial interests

The authors declare no competing financial interests.

Supplementary information

PDF files

  1. Supplementary Text and Figures (1,418 KB)

    Supplementary Tables 1–8 and Supplementary Note.
