Exploring the limit of using a deep neural network on pileup data for germline variant calling

A preprint version of the article is available at bioRxiv.


Single-molecule sequencing technologies have emerged in recent years and revolutionized structural variant calling, complex genome assembly and epigenetic mark detection. However, the lack of a highly accurate small variant caller has limited these technologies from being more widely used. Here, we present Clair, the successor to Clairvoyante, a program for fast and accurate germline small variant calling, using single-molecule sequencing data. For Oxford Nanopore Technology data, Clair achieves better precision, recall and speed than several competing programs, including Clairvoyante, Longshot and Medaka. Through studying the missed variants and benchmarking intentionally overfitted models, we found that Clair may be approaching the limit of possible accuracy for germline small variant calling using pileup data and deep neural networks. Clair requires only a conventional central processing unit (CPU) for variant calling and is an open-source project available at
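The precision and recall figures mentioned above come from comparing a caller's output against a truth variant set. The following is a minimal sketch of that arithmetic, not the benchmarking procedure used in the study: it assumes variants can be matched by exact (chromosome, position, ref, alt) identity, whereas real benchmarking tools also reconcile different representations of the same variant and check genotypes. The `benchmark` function, the `Metrics` type and the toy variants are hypothetical names and data introduced for illustration only.

```python
from typing import NamedTuple, Set, Tuple

# A variant is keyed as (chromosome, 1-based position, ref allele, alt allele).
# Assumption: exact-match comparison; genotype and representation differences are ignored.
Variant = Tuple[str, int, str, str]


class Metrics(NamedTuple):
    tp: int
    fp: int
    fn: int
    precision: float
    recall: float
    f1: float


def benchmark(calls: Set[Variant], truth: Set[Variant]) -> Metrics:
    """Compare called variants against a truth set and summarise accuracy."""
    tp = len(calls & truth)   # called and present in the truth set
    fp = len(calls - truth)   # called but absent from the truth set
    fn = len(truth - calls)   # in the truth set but missed by the caller
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return Metrics(tp, fp, fn, precision, recall, f1)


# Toy data (hypothetical): three shared variants, one spurious call, one miss.
truth = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr1", 400, "G", "GA"),
    ("chr2", 75, "T", "C"),
}
calls = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr1", 400, "G", "GA"),
    ("chr2", 90, "AT", "A"),
}

m = benchmark(calls, truth)
print(f"TP={m.tp} FP={m.fp} FN={m.fn} "
      f"precision={m.precision:.2f} recall={m.recall:.2f} F1={m.f1:.2f}")
```

Precision penalises false positives (FPs) and recall penalises false negatives (FNs), which is why the two are reported together when callers are compared.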

Fig. 1: Clair network architecture and layer details.
Fig. 2: ONT benchmarking results for SNPs and indels.
Fig. 3: The category distribution of FPs and FNs made by Clair in the 1:168×|2:64× experiment on ONT data and six genome browser screen captures showing examples of different categories.

Data availability

The details of and links to the reference genomes, truth variants, ONT data, PacBio CCS data and Illumina data that support the findings of this study are available in the ‘Data sources’ section of the Supplementary Notes. The variant call format files generated by Clair in this study are available at

Code availability

Clair is open source, and available at Clair is licensed under the BSD 3-Clause licence.




We thank S. Salzberg, M. Schatz and F. Sedlazeck for comments. R.L. was supported by the ECS (grant number 27204518) of the HKSAR government, and by the URC fund at HKU. T.-W.L., C.-L.W., Y.-S.W., C.-I.T., C.-M. Liu and C.-M. Leung were supported by the ITF (grant number ITF/331/17FP) from the Innovation and Technology Commission, HKSAR government.

Author information

R.L. and T.-W.L. conceived the study. R.L., C.-L.W., Y.-S.W., C.-I.T., C.-M. Liu and C.-M. Leung analysed the data and wrote the paper.

Corresponding authors

Correspondence to Ruibang Luo or Tak-Wah Lam.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Notes, Supplementary Tables 1–6 and Supplementary Figure 1.

Supplementary Table

The details of the FP and FN results in the ONT experiment 1:168×|2:64×.


About this article


Cite this article

Luo, R., Wong, CL., Wong, YS. et al. Exploring the limit of using a deep neural network on pileup data for germline variant calling. Nat Mach Intell 2, 220–227 (2020).
