A synthetic-diploid benchmark for accurate variant-calling evaluation

Abstract

Existing benchmark datasets for evaluating variant-calling accuracy are constructed from a consensus of known short-variant callers and are therefore biased toward easy regions that are accessible to these algorithms. We derived a new benchmark dataset from the de novo PacBio assemblies of two fully homozygous human cell lines; it provides a more accurate and less biased estimate of small-variant-calling error rates in a realistic context.
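
To make concrete what estimating error rates against a benchmark involves, the following is a minimal sketch. It is not the Syndip evaluation scripts. It assumes a truth set, a call set and a set of confident regions are already loaded in memory, and it compares variants by exact match, which ignores the fact that the same indel can be represented in more than one way. The names `Variant`, `in_regions` and `evaluate` are hypothetical.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Variant(NamedTuple):
    chrom: str
    pos: int  # 1-based position, as in VCF
    ref: str
    alt: str


# Confident regions: chromosome -> list of 0-based, half-open (start, end) intervals, as in BED.
Regions = Dict[str, List[Tuple[int, int]]]


def in_regions(v: Variant, regions: Regions) -> bool:
    """Return True if the variant position falls inside any confident interval."""
    return any(start < v.pos <= end for start, end in regions.get(v.chrom, []))


def evaluate(truth: Iterable[Variant], calls: Iterable[Variant], regions: Regions) -> Dict[str, float]:
    """Estimate error rates of `calls` against `truth`, restricted to `regions`.

    Simplifying assumption: two variants are the same only if chromosome,
    position, REF and ALT are all identical.
    """
    truth_set = {v for v in truth if in_regions(v, regions)}
    call_set = {v for v in calls if in_regions(v, regions)}
    tp = len(truth_set & call_set)  # called and present in the truth set
    fn = len(truth_set - call_set)  # in the truth set but missed
    fp = len(call_set - truth_set)  # called but absent from the truth set
    region_bp = sum(end - start for ivs in regions.values() for start, end in ivs)
    return {
        "fnr": fn / (tp + fn) if tp + fn else 0.0,
        "fdr": fp / (tp + fp) if tp + fp else 0.0,
        "fp_per_mb": fp * 1e6 / region_bp if region_bp else 0.0,
    }


# Toy data, invented purely for illustration.
regions = {"chr1": [(0, 1000)]}
truth = [
    Variant("chr1", 100, "A", "G"),
    Variant("chr1", 200, "C", "T"),
    Variant("chr1", 2000, "G", "A"),  # outside the confident regions, so ignored
]
calls = [
    Variant("chr1", 100, "A", "G"),  # true positive
    Variant("chr1", 300, "T", "C"),  # false positive
]
result = evaluate(truth, calls, regions)
print(result)
```

Restricting both sets to the confident regions before comparing them is the step where bias can enter: if those regions cover only the easy parts of the genome, the estimated error rates will look better than they are in practice.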

Fig. 1: Construction of the Syndip benchmark dataset.
Fig. 2: Evaluation of variant-calling accuracy with Syndip.

Acknowledgements

We are grateful to E. Eichler (Department of Genome Sciences, University of Washington, Seattle, WA, USA) for providing DNA from CHM cell lines. We thank A. Carrol for testing PacBio’s new consensus caller, Arrow, and M. DePristo, J. Zook and B. Chapman for helpful suggestions. This study was supported by the US National Institutes of Health (NIH) (grants 5U54DK105566-04 and 5U01HG009088-03 to D.M. and B.N.; grant 1R01HG010040-01 to H.L.).

Author information

Contributions

H.L. conceived the study, constructed the benchmark dataset and drafted the manuscript; H.L., J.M.B. and Y.F. designed the experiment; L.G. and M.F. analyzed the data and applied the benchmark; and D.M. and B.N. supervised the project. All of the authors helped to revise the manuscript.

Corresponding authors

Correspondence to Heng Li or Benjamin Neale or Daniel MacArthur.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Reporting Summary

Supplementary Software

Syndip evaluation scripts and helper scripts used to generate the benchmark dataset

Supplementary Data 1

Numerical data and gnuplot script used to generate Fig. 2

Cite this article

Li, H., Bloom, J.M., Farjoun, Y. et al. A synthetic-diploid benchmark for accurate variant-calling evaluation. Nat Methods 15, 595–597 (2018). https://doi.org/10.1038/s41592-018-0054-7
