Abstract
Existing benchmark datasets for evaluating variant-calling accuracy are constructed from a consensus of known short-variant callers and are thus biased toward easy regions that these algorithms can access. We derived a new benchmark dataset from the de novo PacBio assemblies of two fully homozygous human cell lines, which provides a more accurate and less biased estimate of small-variant-calling error rates in a realistic context.
Acknowledgements
We are grateful to E. Eichler (Department of Genome Sciences, University of Washington, Seattle, WA, USA) for providing DNA from CHM cell lines. We thank A. Carrol for testing PacBio’s new consensus caller, Arrow, and M. DePristo, J. Zook and B. Chapman for helpful suggestions. This study was supported by the US National Institutes of Health (NIH) (grants 5U54DK105566-04 and 5U01HG009088-03 to D.M. and B.N.; grant 1R01HG010040-01 to H.L.).
Author information
Contributions
H.L. conceived the study, constructed the benchmark dataset and drafted the manuscript; H.L., J.M.B. and Y.F. designed the experiment; L.G. and M.F. analyzed the data and applied the benchmark; and D.M. and B.N. supervised the project. All of the authors helped to revise the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Software
Syndip evaluation scripts and helper scripts used to generate the benchmark dataset
Supplementary Data 1
Numerical data and gnuplot script used to generate Fig. 2
About this article
Cite this article
Li, H., Bloom, J.M., Farjoun, Y. et al. A synthetic-diploid benchmark for accurate variant-calling evaluation. Nat Methods 15, 595–597 (2018). https://doi.org/10.1038/s41592-018-0054-7
DOI: https://doi.org/10.1038/s41592-018-0054-7