Exploring the limit of using a deep neural network on pileup data for germline variant calling

Abstract

Single-molecule sequencing technologies have emerged in recent years and revolutionized structural variant calling, complex genome assembly and epigenetic mark detection. However, the lack of a highly accurate small variant caller has kept these technologies from being more widely used. Here, we present Clair, the successor to Clairvoyante, a program for fast and accurate germline small variant calling using single-molecule sequencing data. For Oxford Nanopore Technologies (ONT) data, Clair achieves better precision, recall and speed than several competing programs, including Clairvoyante, Longshot and Medaka. By studying the missed variants and benchmarking intentionally overfitted models, we found that Clair may be approaching the limit of possible accuracy for germline small variant calling using pileup data and deep neural networks. Clair requires only a conventional central processing unit (CPU) for variant calling and is an open-source project available at https://github.com/HKU-BAL/Clair.
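
Precision and recall, the accuracy measures named in the abstract, are conventionally derived from counts of true positives (TP), false positives (FP) and false negatives (FN) obtained by comparing calls against a truth set. The following is a minimal Python sketch of that arithmetic; the function name `precision_recall_f1` and the example counts are hypothetical and are not taken from the paper or from Clair's code.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from variant-call counts.

    tp: called variants that match the truth set
    fp: called variants that are absent from the truth set
    fn: truth-set variants that were not called
    """
    # Guard against empty denominators so the function never divides by zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts, for illustration only (not results from the paper).
p, r, f = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

In this convention, a caller that misses many true variants has low recall, while one that reports many spurious variants has low precision; F1 summarizes both in a single figure.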

A preprint version of the article is available at bioRxiv.


Fig. 1: Clair network architecture and layer details.
Fig. 2: ONT benchmarking results for SNPs and indels.
Fig. 3: The category distribution of FPs and FNs made by Clair in the 1:168×|2:64× experiment on ONT data and six genome browser screen captures showing examples of different categories.

Data availability

The details of and links to the reference genomes, truth variants, ONT data, PacBio CCS data and Illumina data that support the findings of this study are available in the ‘Data sources’ section of the Supplementary Notes. The variant call format files generated by Clair in this study are available at http://www.bio8.cs.hku.hk/clair_models/VCFBenchmarked/.

Code availability

Clair is open source, and available at https://github.com/HKU-BAL/Clair. Clair is licensed under the BSD 3-Clause licence.


Acknowledgements

We thank S. Salzberg, M. Schatz and F. Sedlazeck for comments. R.L. was supported by the ECS (grant number 27204518) of the HKSAR government, and by the URC fund at HKU. T.-W.L., C.-L.W., Y.-S.W., C.-I.T., C.-M. Liu and C.-M. Leung were supported by the ITF (grant number ITF/331/17FP) from the Innovation and Technology Commission, HKSAR government.

Author information

Contributions

R.L. and T.-W.L. conceived the study. R.L., C.-L.W., Y.-S.W., C.-I.T., C.-M. Liu and C.-M. Leung analysed the data and wrote the paper.

Corresponding authors

Correspondence to Ruibang Luo or Tak-Wah Lam.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Notes, Supplementary Tables 1–6 and Supplementary Figure 1

Supplementary Table

The details of the FP and FN results in the ONT experiment 1:168×|2:64×

About this article

Cite this article

Luo, R., Wong, C., Wong, Y. et al. Exploring the limit of using a deep neural network on pileup data for germline variant calling. Nat Mach Intell 2, 220–227 (2020). https://doi.org/10.1038/s42256-020-0167-4
