Letter | Published:

A universal SNP and small-indel variant caller using deep neural networks

Nature Biotechnology volume 36, pages 983–987 (2018)

Abstract

Despite rapid advances in sequencing technologies, accurately calling genetic variants present in an individual genome from billions of short, errorful sequence reads remains challenging. Here we show that a deep convolutional neural network can call genetic variation in aligned next-generation sequencing read data by learning statistical relationships between images of read pileups around putative variant sites and true genotype calls. The approach, called DeepVariant, outperforms existing state-of-the-art tools. The learned model generalizes across genome builds and mammalian species, allowing nonhuman sequencing projects to benefit from the wealth of human ground-truth data. We further show that DeepVariant can learn to call variants in a variety of sequencing technologies and experimental designs, including deep whole genomes from 10X Genomics and Ion Ampliseq exomes, highlighting the benefits of using more automated and generalizable techniques for variant calling.
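To make the idea of a "read pileup image" concrete, the following is a minimal sketch of how aligned reads around a candidate site could be turned into a small multi-channel grid that a convolutional network might take as input. It is an illustration only and not DeepVariant's actual encoding: the function name `encode_pileup`, the three channels (base identity, scaled base quality, mismatch flag), the window width, and the assumption of ungapped alignments are all hypothetical choices made for this example.

```python
# Hypothetical illustration only; NOT the encoding used by DeepVariant.
# Assumes ungapped alignments and 0-based reference coordinates.

BASE_INTENSITY = {"A": 0.25, "C": 0.5, "G": 0.75, "T": 1.0}
EMPTY_PIXEL = (0.0, 0.0, 0.0)


def encode_pileup(reference, reads, center, width=5, max_reads=4):
    """Encode reads around `center` as a max_reads x width grid of 3-tuples.

    reads: list of (start, bases, quals) tuples (assumed ungapped).
    Each pixel holds (base identity, scaled base quality, mismatch flag).
    """
    window_start = center - width // 2
    image = []
    for start, bases, quals in reads[:max_reads]:
        row = []
        for pos in range(window_start, window_start + width):
            offset = pos - start
            inside_read = 0 <= offset < len(bases)
            inside_ref = 0 <= pos < len(reference)
            if inside_read and inside_ref:
                base = bases[offset]
                row.append((
                    BASE_INTENSITY.get(base, 0.0),          # which base was read
                    min(quals[offset], 40) / 40.0,          # quality capped at Q40
                    1.0 if base != reference[pos] else 0.0,  # differs from reference?
                ))
            else:
                row.append(EMPTY_PIXEL)  # read does not cover this position
        image.append(row)
    # Pad with empty rows so every image has the same height.
    while len(image) < max_reads:
        image.append([EMPTY_PIXEL] * width)
    return image


reference = "ACGTACGTAC"
reads = [
    (2, "GTACG", [30, 30, 40, 30, 30]),  # matches the reference
    (3, "TGCGT", [20, 35, 40, 10, 30]),  # carries a G at position 4 (reference A)
]
image = encode_pileup(reference, reads, center=4)
```

In this toy setup the center column of the second row carries a mismatch flag, which is the kind of signal a classifier would have to weigh against base quality and the evidence from other reads when assigning a genotype. A real pipeline would also need to handle indels, strand, mapping quality, and much deeper coverage.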



Acknowledgements

We thank J. Zook and his collaborators at NIST for their work developing the Genome in a Bottle resources, the Verily sequencing facility for running the NA12878 replicates, and our colleagues at Verily and Google for their feedback on this manuscript and the project in general. This work was supported by internal funding.

Author information

Affiliations

  1. Verily Life Sciences, Mountain View, California, USA.

    • Ryan Poplin
    • Dan Newburger
    • Jojo Dijamco
    • Nam Nguyen
    • Pegah T Afshar
    • Sam S Gross
    • Lizzie Dorfman
    • Cory Y McLean
    • Mark A DePristo
  2. Google Inc., Mountain View, California, USA.

    • Ryan Poplin
    • Pi-Chuan Chang
    • David Alexander
    • Scott Schwartz
    • Thomas Colthurst
    • Alexander Ku
    • Lizzie Dorfman
    • Cory Y McLean
    • Mark A DePristo

Authors

  1. Ryan Poplin

  2. Pi-Chuan Chang

  3. David Alexander

  4. Scott Schwartz

  5. Thomas Colthurst

  6. Alexander Ku

  7. Dan Newburger

  8. Jojo Dijamco

  9. Nam Nguyen

  10. Pegah T Afshar

  11. Sam S Gross

  12. Lizzie Dorfman

  13. Cory Y McLean

  14. Mark A DePristo

Contributions

R.P. and M.A.D. designed the study, analyzed and interpreted results and wrote the paper. R.P., P.-C.C., D.A., S.S., T.C., A.K., D.N., J.D., N.N., P.T.A., S.S.G., L.D., C.Y.M. and M.A.D. performed experiments and contributed to the software.

Competing interests

D.N., J.D., N.N., P.T.A. and S.S.G. are employees of Verily Life Sciences. P.-C.C., D.A., S.S., T.C. and A.K. are employees of Google Inc. R.P., L.D., C.Y.M. and M.A.D. are employees of Verily Life Sciences and Google Inc. This work was internally funded by Verily Life Sciences and Google Inc.

Corresponding author

Correspondence to Mark A DePristo.

Supplementary information

PDF files

  1. Supplementary Text and Figures

    Supplementary Figures 1 and 2, Supplementary Tables 1–8 and Supplementary Notes 1–11

  2. Life Sciences Reporting Summary

Text files

  1. Supplementary Data

    Evaluation metrics

  2. Supplementary Software

    Benchmarking script

About this article

DOI

https://doi.org/10.1038/nbt.4235