Abstract
Existing benchmark datasets for use in evaluating variant-calling accuracy are constructed from a consensus of known short-variant callers, and they are thus biased toward easy regions that are accessible by these algorithms. We derived a new benchmark dataset from the de novo PacBio assemblies of two fully homozygous human cell lines, which provides a relatively more accurate and less biased estimate of small-variant-calling error rates in a realistic context.
Acknowledgements
We are grateful to E. Eichler (Department of Genome Sciences, University of Washington, Seattle, WA, USA) for providing DNA from CHM cell lines. We thank A. Carrol for testing PacBio’s new consensus caller, Arrow, and M. DePristo, J. Zook and B. Chapman for helpful suggestions. This study was supported by the US National Institutes of Health (NIH) (grants 5U54DK105566-04 and 5U01HG009088-03 to D.M. and B.N.; grant 1R01HG010040-01 to H.L.).
Author information
Contributions
H.L. conceived the study, constructed the benchmark dataset and drafted the manuscript; H.L., J.M.B. and Y.F. designed the experiment; L.G. and M.F. analyzed the data and applied the benchmark; and D.M. and B.N. supervised the project. All of the authors helped to revise the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Supplementary information
Supplementary Software
Syndip evaluation scripts and helper scripts used to generate the benchmark dataset
Supplementary Data 1
Numerical data and gnuplot script used to generate Fig. 2
About this article
Cite this article
Li, H., Bloom, J.M., Farjoun, Y. et al. A synthetic-diploid benchmark for accurate variant-calling evaluation. Nat Methods 15, 595–597 (2018). https://doi.org/10.1038/s41592-018-0054-7
DOI: https://doi.org/10.1038/s41592-018-0054-7