Haplotype-aware variant calling with PEPPER-Margin-DeepVariant enables high accuracy in nanopore long-reads

Shafin, Kishwar; Pesout, Trevor; Chang, Pi-Chuan; Nattestad, Maria; Kolesnikov, Alexey; Goel, Sidharth; Baid, Gunjan; Kolmogorov, Mikhail; Eizenga, Jordan M.; Miga, Karen H.; Carnevali, Paolo; Jain, Miten; Carroll, Andrew; Paten, Benedict

doi:10.1038/s41592-021-01299-w

Article
Published: 01 November 2021

Nature Methods volume 18, pages 1322–1332 (2021)

Abstract

Long-read sequencing has the potential to transform variant detection by reaching currently difficult-to-map regions and routinely linking together adjacent variations to enable read-based phasing. Third-generation nanopore sequence data have demonstrated a long read length, but current interpretation methods for their novel pore-based signal have unique error profiles, making accurate analysis challenging. Here, we introduce a haplotype-aware variant calling pipeline, PEPPER-Margin-DeepVariant, that produces state-of-the-art variant calling results with nanopore data. We show that our nanopore-based method outperforms the short-read-based single-nucleotide-variant identification method at the whole-genome scale and produces high-quality single-nucleotide variants in segmental duplications and low-mappability regions where short-read-based genotyping fails. We show that our pipeline can provide highly contiguous phase blocks across the genome with nanopore reads, contiguously spanning between 85% and 92% of annotated genes across six samples. We also extend PEPPER-Margin-DeepVariant to PacBio HiFi data, providing an efficient solution with superior performance over the current WhatsHap-DeepVariant standard. Finally, we demonstrate de novo assembly polishing methods that use nanopore and PacBio HiFi reads to produce diploid assemblies with high accuracy (Q35+ nanopore-polished and Q40+ PacBio HiFi-polished).
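
The Q35+ and Q40+ figures quoted above can be read as per-base error rates. The following is a minimal sketch, assuming these are standard Phred-scaled quality values (Q = -10 log10 p, where p is the per-base error probability); the function names are illustrative and are not part of the PEPPER-Margin-DeepVariant software.

```python
import math


def phred_to_error_rate(q: float) -> float:
    """Convert a Phred-scaled quality value Q to a per-base error probability."""
    return 10 ** (-q / 10)


def error_rate_to_phred(p: float) -> float:
    """Convert a per-base error probability to a Phred-scaled quality value."""
    return -10 * math.log10(p)


# Report the two assembly-quality thresholds mentioned in the abstract.
for q in (35, 40):
    p = phred_to_error_rate(q)
    print(f"Q{q}: error rate {p:.2e}, about 1 error per {round(1 / p):,} bases")
```

Under this assumption, Q35 corresponds to roughly one error per 3,000 bases and Q40 to one error per 10,000 bases.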

**Fig. 1: Nanopore variant-calling results.**

**Fig. 2: Comparison between Nanopore, Illumina and PacBio HiFi variant calling performance.**

**Fig. 3: Margin and WhatsHap phasing results.**

**Fig. 5: Diploid assembly-polishing results.**

Data availability

We have made the analysis data available publicly (variant calling outputs, genome assemblies) in: https://console.cloud.google.com/storage/browser/pepper-deepvariant-public/analysis_data. The source data for the main figures can be found in: https://console.cloud.google.com/storage/browser/pepper-deepvariant-public/figure_source_data/Figure_source_data/.

For sequencing data, we used several publicly available datasets:

• GIAB consortium^26,28: https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/

• Human Pangenome Reference Consortium (HPRC): https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html

• Telomere-to-telomere consortium^11,12: https://github.com/nanopore-wgs-consortium/CHM13

Please see the Supplementary Notes to find specific links to the sequencing data that we used for our analysis. Source data are provided with this paper.

Code availability

The modules of PEPPER-Margin-DeepVariant are publicly available in these repositories:

• PEPPER: https://github.com/kishwarshafin/pepper

• Margin: https://github.com/UCSC-nanopore-cgl/margin

• DeepVariant: https://github.com/google/deepvariant

The PEPPER-Margin-DeepVariant software⁵⁷ is available at https://doi.org/10.5281/zenodo.5275510; we used version r0.4 for the evaluation presented in this manuscript. For simpler use, we have also created a publicly available Docker container, kishwars/pepper_deepvariant:r0.4, that can run our variant-calling and polishing pipelines.

References

1. Altshuler, D. M. et al. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012).
2. McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
3. Li, W. & Freudenberg, J. Mappability and read length. Front. Genet. 5, 381 (2014).
4. Chaisson, M. J. P. et al. Multi-platform discovery of haplotype-resolved structural variation in human genomes. Nat. Commun. 10, 1784 (2019).
5. Belton, J. M. et al. Hi-C: a comprehensive technique to capture the conformation of genomes. Methods 58, 268–276 (2012).
6. Falconer, E. & Lansdorp, P. M. Strand-seq: a unifying tool for studies of chromosome segregation. Semin. Cell Developmental Biol. 24, 643–652 (2013).
7. Weisenfeld, N. I., Kumar, V., Shah, P., Church, D. M. & Jaffe, D. B. Direct determination of diploid genome sequences. Genome Res. 27, 757–767 (2017).
8. Jain, M. et al. Improved data analysis for the MinION nanopore sequencer. Nat. Methods 12, 351 (2015).
9. Eid, J. et al. Real-time DNA sequencing from single polymerase molecules. Science 323, 133–138 (2009).
10. Jain, C., Rhie, A., Hansen, N., Koren, S. & Phillippy, A. M. A long read mapping method for highly repetitive reference sequences. Preprint at https://doi.org/10.1101/2020.11.01.363887 (2020).
11. Miga, K. H. et al. Telomere-to-telomere assembly of a complete human X chromosome. Nature 585, 79–84 (2020).
12. Logsdon, G. A. et al. The structure, function and evolution of a complete human chromosome 8. Nature 593, 7857 (2021).
13. Shafin, K. et al. Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nat. Biotechnol. 38, 1044–1053 (2020).
14. Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
15. Nurk, S. et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res. 30, 1291–1305 (2020).
16. Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540–546 (2019).
17. Ruan, J. & Li, H. Fast and accurate long-read assembly with wtdbg2. Nat. Methods 17, 155–158 (2020).
18. nanoporetech/medaka: sequence correction provided by ONT Research, https://github.com/nanoporetech/medaka (Oxford Nanopore Technologies, 2018).
19. Luo, R. et al. Exploring the limit of using a deep neural network on pileup data for germline variant calling. Nat. Mach. Intell. 2, 220–227 (2020).
20. Edge, P. & Bansal, V. Longshot enables accurate variant calling in diploid genomes from single-molecule long read sequencing. Nat. Commun. 10, 1–10 (2019).
21. Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 37, 1155–1162 (2019).
22. Ebler, J., Haukness, M., Pesout, T., Marschall, T. & Paten, B. Haplotype-aware diplotyping from noisy long reads. Genome Biol. 20, 116 (2019).
23. Huddleston, J. et al. Discovery and genotyping of structural variation from long-read haploid genome sequence data. Genome Res. 27, 677–685 (2017).
24. Sedlazeck, F. J. et al. Accurate detection of complex structural variations using single-molecule sequencing. Nat. Methods 15, 461–468 (2018).
25. Patterson, M. D. et al. WhatsHap: weighted haplotype assembly for future-generation sequencing reads. J. Comput. Biol. 22, 498–509 (2015).
26. Zook, J. M. et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci. Data 3, 160025 (2016).
27. Wagner, J. et al. Benchmarking challenging small variants with linked and long reads. Preprint at https://doi.org/10.1101/2020.07.24.212712 (2020).
28. Olson, N. D. et al. precisionFDA Truth Challenge V2: calling variants from short-and long-reads in difficult-to-map regions. Preprint at https://doi.org/10.1101/2020.11.13.380741 (2020).
29. Jain, M. et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol. 36, 338 (2018).
30. Jain, M. et al. Linear assembly of a human centromere on the Y chromosome. Nat. Biotechnol. 36, 321 (2018).
31. Fiddes, I. T. et al. Comparative Annotation Toolkit (CAT)—simultaneous clade and personal genome annotation. Genome Res. 28, 1029–1038 (2018).
32. Eichler, E. E., Clark, R. A. & She, X. An assessment of the sequence gaps: unfinished business in a finished human genome. Nat. Rev. Genet. 5, 345 (2004).
33. Euskirchen, P. et al. Same-day genomic and epigenomic diagnosis of brain tumors using real-time nanopore sequencing. Acta Neuropathol. 134, 691–703 (2017).
34. Rang, F. J., Kloosterman, W. P. & de Ridder, J. From squiggle to basepair: computational approaches for improving nanopore sequencing read accuracy. Genome Biol. 19, 90 (2018).
35. Chin, C.-S. et al. A diploid assembly-based benchmark for variants in the major histocompatibility complex. Nat. Commun. 11, 1–9 (2020).
36. Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983 (2018).
37. Edge, P., Bafna, V. & Bansal, V. HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies. Genome Res. 27, 801–812 (2017).
38. Rodriguez, O. L. et al. A novel framework for characterizing genomic haplotype diversity in the human immunoglobulin heavy chain locus. Front. Immunol. 11, 2136 (2020).
39. Chin, C.-S. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods 13, 1050 (2016).
40. Koren, S. et al. De novo assembly of haplotype-resolved genomes with trio binning. Nat. Biotechnol. 36, 1174 (2018).
41. Porubsky, D. et al. Fully phased human genome assembly without parental data using single-cell strand sequencing and long reads. Nat. Biotechnol. 39, 302–308 (2021).
42. Harrow, J. et al. GENCODE: The reference human genome annotation for The ENCODE Project. Genome Res. 22, 1760–1774 (2012).
43. Zook, J. M. et al. An open resource for accurately benchmarking small variant and reference calls. Nat. Biotechnol. 37, 561 (2019).
44. Baid, G. et al. An extensive sequence dataset of gold-standard samples for benchmarking and development. Preprint at https://doi.org/10.1101/2020.12.11.422022 (2020).
45. Frankish, A. et al. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 47, D766–D773 (2019).
46. Heller, D. & Vingron, M. SVIM-asm: Structural variant detection from haploid and diploid genome assemblies. Bioinformatics 36, 22–23 (2020).
47. Zook, J. M. et al. A robust benchmark for detection of germline large deletions and insertions. Nat. Biotechnol. 38, 1347–1355 (2020).
48. Tewhey, R., Bansal, V., Torkamani, A., Topol, E. J. & Schork, N. J. The importance of phase information for human genomics. Nat. Rev. Genet. 12, 215–223 (2011).
49. Browning, S. R. & Browning, B. L. Haplotype phasing: existing methods and new developments. Nat. Rev. Genet. 12, 703–714 (2011).
50. Glusman, G., Cox, H. C. & Roach, J. C. Whole-genome haplotyping approaches and genomic medicine. Genome Med. 6, 1–16 (2014).
51. Li, H. et al. A synthetic-diploid benchmark for accurate variant-calling evaluation. Nat. Methods 15, 595–597 (2018).
52. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
53. Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
54. Krusche, P. et al. Best practices for benchmarking germline small-variant calls in human genomes. Nat. Biotechnol. 37, 555–560 (2019).
55. Cleary, J. G. et al. Joint variant and de novo mutation identification on pedigrees from high-throughput sequencing data. J. Comput. Biol. 21, 405–419 (2014).
56. Newey, W. K. Adaptive estimation of regression models via moment restrictions. J. Econom. 38, 301–339 (1988).
57. Shafin, K. et al. PEPPER-Margin-DeepVariant (version r0.4), https://doi.org/10.5281/zenodo.5275510 (Zenodo, 2021).

Acknowledgements

Research reported in this publication was supported by the National Human Genome Research Institute of the National Institutes of Health under award numbers U41HG010972, R01HG010485, U01HG010961, and OT2OD026682 (K.S., T.P., M.K., J.M.E., K.H.M., M.J., B.P.). We thank Circulomics Inc. for sharing HG001 Nanopore data. We thank J. Zook and J. Wagner from the National Institute of Standards and Technology (NIST) for providing a draft version of the HG005 v4.2.1 benchmark. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Author information

These authors contributed equally: Kishwar Shafin, Trevor Pesout, Pi-Chuan Chang.

Authors and Affiliations

UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
Kishwar Shafin, Trevor Pesout, Mikhail Kolmogorov, Jordan M. Eizenga, Karen H. Miga, Miten Jain & Benedict Paten
Google Inc, Mountain View, CA, USA
Pi-Chuan Chang, Maria Nattestad, Alexey Kolesnikov, Sidharth Goel, Gunjan Baid & Andrew Carroll
Chan Zuckerberg Initiative, Redwood City, CA, USA
Paolo Carnevali

Contributions

B.P. and A.C. designed and executed the study. K.S. developed PEPPER. T.P. developed Margin. P.-C.C. designed candidate import functionality in DeepVariant. K.S., T.P., and P.-C.C. contributed equally to the methods development and core analysis presented. M.N. designed alt-event alignment in DeepVariant. A.K. contributed to haplotype sorting and improvements to DeepVariant runtime. S.G. contributed to the candidate import module of DeepVariant. G.B. designed and executed the post-processing model to improve multiallelic variant accuracy. M.K. designed and evaluated assembly polishing. J.M.E. designed the local phasing metric and contributed to phasing evaluation. K.H.M. provided experimental design guidance, and P.C. generated assemblies and provided guidance on assembly polishing. M.J. performed nanopore sequencing and quality control and helped to design and execute analysis. All authors approve of the final manuscript.

Corresponding authors

Correspondence to Andrew Carroll or Benedict Paten.

Ethics declarations

Competing interests

K.S. has performed paid internships at NVIDIA Corp and Google. P.-C.C., M.N., A.K., S.G., G.B., and A.C. are employees of Google and own Alphabet stock as part of the standard compensation package. M.J. has received reimbursement for travel, accommodation, and conference fees to speak at events organized by ONT. The remaining authors declare no competing interests.

Additional information

Peer review information Nature Methods thanks Ruibang Luo and the other, anonymous, reviewers for their contribution to the peer review of this work. Peer reviewer reports are available. Lin Tang was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Notes, Supplementary Figures 1–10, and Supplementary Tables 1–31

Reporting Summary

Peer Review Information

Source data

Source Data Fig. 1

Statistical source data

Source Data Fig. 2

Statistical source data

Source Data Fig. 3

Statistical source data

Source Data Fig. 4

Statistical source data

Source Data Fig. 5

Statistical source data

About this article

Cite this article

Shafin, K., Pesout, T., Chang, PC. et al. Haplotype-aware variant calling with PEPPER-Margin-DeepVariant enables high accuracy in nanopore long-reads. Nat Methods 18, 1322–1332 (2021). https://doi.org/10.1038/s41592-021-01299-w

Received: 08 March 2021
Accepted: 06 September 2021
Published: 01 November 2021
Issue Date: November 2021
DOI: https://doi.org/10.1038/s41592-021-01299-w
