Abstract

Existing benchmark datasets for evaluating variant-calling accuracy are constructed from a consensus of known short-variant callers and are thus biased toward easy regions that are accessible to these algorithms. We derived a new benchmark dataset from the de novo PacBio assemblies of two fully homozygous human cell lines, which provides a more accurate and less biased estimate of small-variant-calling error rates in a realistic context.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Acknowledgements

We are grateful to E. Eichler (Department of Genome Sciences, University of Washington, Seattle, WA, USA) for providing DNA from CHM cell lines. We thank A. Carrol for testing PacBio’s new consensus caller, Arrow, and M. DePristo, J. Zook and B. Chapman for helpful suggestions. This study was supported by the US National Institutes of Health (NIH) (grants 5U54DK105566-04 and 5U01HG009088-03 to D.M. and B.N.; grant 1R01HG010040-01 to H.L.).

Author information

Affiliations

  1. Broad Institute of Harvard and MIT, Cambridge, MA, USA

    • Heng Li
    • Jonathan M. Bloom
    • Yossi Farjoun
    • Mark Fleharty
    • Laura Gauthier
    • Benjamin Neale
    • Daniel MacArthur
  2. Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA

    • Benjamin Neale
    • Daniel MacArthur

Contributions

H.L. conceived the study, constructed the benchmark dataset and drafted the manuscript; H.L., J.M.B. and Y.F. designed the experiment; L.G. and M.F. analyzed the data and applied the benchmark; and D.M. and B.N. supervised the project. All of the authors helped to revise the manuscript.

Competing interests

The authors declare no competing interests.

Corresponding authors

Correspondence to Heng Li or Benjamin Neale or Daniel MacArthur.

Supplementary information

  1. Reporting Summary

  2. Supplementary Software

    Syndip evaluation scripts and helper scripts used to generate the benchmark dataset

  3. Supplementary Data 1

    Numerical data and gnuplot script used to generate Fig. 2

About this article

DOI

https://doi.org/10.1038/s41592-018-0054-7