A universal SNP and small-indel variant caller using deep neural networks

Poplin, Ryan; Chang, Pi-Chuan; Alexander, David; Schwartz, Scott; Colthurst, Thomas; Ku, Alexander; Newburger, Dan; Dijamco, Jojo; Nguyen, Nam; Afshar, Pegah T; Gross, Sam S; Dorfman, Lizzie; McLean, Cory Y; DePristo, Mark A

doi:10.1038/nbt.4235

Letter
Published: 24 September 2018

Ryan Poplin^1,2,
Pi-Chuan Chang^2,
David Alexander^2,
Scott Schwartz^2,
Thomas Colthurst^2,
Alexander Ku^2,
Dan Newburger^1,
Jojo Dijamco^1,
Nam Nguyen^1,
Pegah T Afshar^1,
Sam S Gross^1,
Lizzie Dorfman^1,2,
Cory Y McLean^1,2 &
Mark A DePristo^1,2

Nature Biotechnology volume 36, pages 983–987 (2018)

Abstract

Despite rapid advances in sequencing technologies, accurately calling genetic variants present in an individual genome from billions of short, errorful sequence reads remains challenging. Here we show that a deep convolutional neural network can call genetic variation in aligned next-generation sequencing read data by learning statistical relationships between images of read pileups around putative variant and true genotype calls. The approach, called DeepVariant, outperforms existing state-of-the-art tools. The learned model generalizes across genome builds and mammalian species, allowing nonhuman sequencing projects to benefit from the wealth of human ground-truth data. We further show that DeepVariant can learn to call variants in a variety of sequencing technologies and experimental designs, including deep whole genomes from 10X Genomics and Ion Ampliseq exomes, highlighting the benefits of using more automated and generalizable techniques for variant calling.
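To make the phrase "images of read pileups" concrete, the following is a minimal sketch of how aligned reads around a candidate site might be turned into an image-like array for a convolutional network. Everything specific here is an assumption for illustration rather than DeepVariant's actual encoding: the function name `encode_pileup`, the three channels (base identity, scaled base quality, mismatch flag), the numeric base codes, and the simplification that reads are gapless.

```python
from typing import List, Tuple

# Hypothetical per-base intensity codes; the abstract does not specify an encoding.
BASE_CODE = {"A": 0.25, "C": 0.5, "G": 0.75, "T": 1.0}

Read = Tuple[int, str, List[int]]   # (start position, bases, base qualities)
Pixel = Tuple[float, float, float]  # (base code, scaled quality, mismatch flag)

EMPTY: Pixel = (0.0, 0.0, 0.0)      # column not covered by the read


def encode_pileup(ref: str, ref_start: int, reads: List[Read],
                  center: int, width: int = 5,
                  max_qual: int = 40) -> List[List[Pixel]]:
    """Encode gapless reads in a window around `center` as rows of pixels.

    Row 0 holds the reference sequence; each later row holds one read.
    """
    left = center - width // 2
    columns = range(left, left + width)

    def ref_base(pos: int) -> str:
        i = pos - ref_start
        return ref[i] if 0 <= i < len(ref) else "N"

    # Reference row: full quality, never a mismatch.
    image = [[(BASE_CODE.get(ref_base(p), 0.0), 1.0, 0.0) for p in columns]]

    for start, bases, quals in reads:
        row: List[Pixel] = []
        for p in columns:
            i = p - start
            if 0 <= i < len(bases):
                base = bases[i]
                quality = min(quals[i], max_qual) / max_qual
                mismatch = 1.0 if base != ref_base(p) else 0.0
                row.append((BASE_CODE.get(base, 0.0), quality, mismatch))
            else:
                row.append(EMPTY)
        image.append(row)
    return image


# Two toy reads; the second carries a G where the reference has A at position 4.
reference = "ACGTACGTAC"
toy_reads = [(2, "GTACG", [40] * 5), (3, "TGCGT", [20] * 5)]
image = encode_pileup(reference, 0, toy_reads, center=4)
```

In this toy layout a classifier would see the mismatch channel light up in the center column of the second read row. A real pipeline would also need to handle indels, strand, and mapping quality, which this sketch leaves out.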

**Figure 1: DeepVariant workflow overview.**

Acknowledgements

We thank J. Zook and his collaborators at NIST for their work developing the Genome in a Bottle resources, the Verily sequencing facility for running the NA12878 replicates, and our colleagues at Verily and Google for their feedback on this manuscript and the project in general. This work was supported by internal funding.

Author information

Authors and Affiliations

Verily Life Sciences, Mountain View, California, USA
Ryan Poplin, Dan Newburger, Jojo Dijamco, Nam Nguyen, Pegah T Afshar, Sam S Gross, Lizzie Dorfman, Cory Y McLean & Mark A DePristo
Google Inc., Mountain View, California, USA
Ryan Poplin, Pi-Chuan Chang, David Alexander, Scott Schwartz, Thomas Colthurst, Alexander Ku, Lizzie Dorfman, Cory Y McLean & Mark A DePristo

Contributions

R.P. and M.A.D. designed the study, analyzed and interpreted results and wrote the paper. R.P., P.-C.C., D.A., S.S., T.C., A.K., D.N., J.D., N.N., P.T.A., S.S.G., L.D., C.Y.M. and M.A.D. performed experiments and contributed to the software.

Corresponding author

Correspondence to Mark A DePristo.

Ethics declarations

Competing interests

D.N., J.D., N.N., P.T.A. and S.S.G. are employees of Verily Life Sciences. P.-C.C., D.A., S.S., T.C. and A.K. are employees of Google Inc. R.P., L.D., C.Y.M. and M.A.D. are employees of Verily Life Sciences and Google Inc. This work was internally funded by Verily Life Sciences and Google Inc.

About this article

Cite this article

Poplin, R., Chang, PC., Alexander, D. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat Biotechnol 36, 983–987 (2018). https://doi.org/10.1038/nbt.4235

Received: 15 December 2017
Accepted: 02 August 2018
Published: 24 September 2018
Issue Date: November 2018
DOI: https://doi.org/10.1038/nbt.4235
