An open resource for accurately benchmarking small variant and reference calls


Benchmark small variant calls are required for developing, optimizing and assessing the performance of sequencing and bioinformatics methods. Here, as part of the Genome in a Bottle (GIAB) Consortium, we apply a reproducible, cloud-based pipeline to integrate multiple short- and linked-read sequencing datasets and provide benchmark calls for human genomes. We generate benchmark calls for one previously analyzed GIAB sample, as well as six genomes from the Personal Genome Project. These new genomes have broad, open consent, making this a ‘first of its kind’ resource that is available to the community for multiple downstream applications. We produce 17% more benchmark single nucleotide variations, 176% more indels and 12% larger benchmark regions than previously published GIAB benchmarks. We demonstrate that this benchmark reliably identifies errors in existing callsets and highlight challenges in interpreting performance metrics when using benchmarks that are not perfect or comprehensive. Finally, we identify strengths and weaknesses of callsets by stratifying performance according to variant type and genome context.


Fig. 1: Arbitration process used to form our benchmark set from multiple technologies and callsets.
Fig. 2: Complex variant discordant between GIAB and Illumina PG.

Data availability

Raw sequence data were previously published in Scientific Data ( and were deposited in the NCBI SRA with the accession codes SRX1049768–SRX1049855, SRX847862–SRX848317, SRX1388368–SRX1388459, SRX1388732–SRX1388743, SRX852932–SRX852936, SRX847094, SRX848742–SRX848744, SRX326642, SRX1497273 and SRX1497276. 10x Genomics Chromium bam files used are available at The benchmark vcf and bed files resulting from work in this manuscript are available in the NISTv.3.3.2 directory under each genome on the GIAB FTP release folder and, in the future, updated calls will be in the ‘recent’ directory under each genome. The data used in this manuscript and other datasets for these genomes are available at, as well as in NCBI BioProject No. PRJNA200694.

Code availability

All code for analyzing genome sequencing data to generate benchmark variants and regions developed for this manuscript is available in a GitHub repository at Publicly available software used to generate input callsets includes novoalign v.3.02.07, samtools v.0.1.18, GATK v.3.5, Freebayes v.0.9.20, Complete Genomics tools v., Torrent Variant Caller v.4.4, LifeScope v.2.5.1, LongRanger v.2.0, GenomeWarp, rtg-tools v.3.7.1 and Sentieon v.201611.rc1.




We thank the many contributors to GIAB Consortium discussions. We especially thank R. Saldana and the Sentieon team for advice on running the Sentieon pipeline; A. Carroll and the DNAnexus team for advice on implementing the pipeline in DNAnexus; F. Hyland, S. Ghosh, K. Zhao and J. Bodeau at ThermoFisher for advice on integrating Ion exome and SOLiD genome data; D. Church and V. Schneider for helpful discussions about GRCh38; and many individuals for providing feedback on the current version and previous versions of our calls. Selected commercial equipment, instruments or materials are identified to specify the adequacy of experimental conditions or reported results. Such identification does not imply recommendation or endorsement by the National Institute of Standards and Technology, nor does it imply that the equipment, instruments or materials identified are necessarily the best available for the purpose. C.X. and S.S. were supported by the Intramural Research Program of the National Library of Medicine, National Institutes of Health. J.Z., M.S., N.O. and J.W. were supported by the National Institute of Standards and Technology and an interagency agreement with the Food and Drug Administration.

Author information


J.M.Z., L.T., N.D.O., J.W. and M.S. wrote the manuscript. J.M.Z., J.M., F.M.D., N.D.O., J.W., M.S. and H.P. designed and implemented the integration process. H.H., J.M. and J.M.Z. analyzed and integrated the 10x Genomics data. R.T., J.M. and J.M.Z. analyzed and integrated the Complete Genomics data. S.A.I., L.T., F.M.D., J.M. and J.M.Z. designed and implemented the phasing and robust trio analysis. C.Y.M., J.M. and J.M.Z. designed and implemented the robust GRCh38 liftover analysis. C.X. and S.S. managed and analyzed data. All authors contributed to GIAB discussions planning this work.

Corresponding author

Correspondence to Justin M. Zook.

Ethics declarations

Competing interests

R.T. is an employee of, and holds stock in, Invitae. H.H. was an employee of 10x Genomics. S.A.I. and L.T. are employees of Real Time Genomics. C.Y.M. is an employee of Verily Life Sciences and Google.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Fraction of each chromosome covered by benchmark regions for each genome.

Fraction of the assembled (i.e., non-N) bases in GRCh37 that are covered by the benchmark regions for each genome (HG001 to HG007), separated by chromosome.
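The per-chromosome fraction described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the figure: it assumes the benchmark regions are given as BED lines (0-based, half-open intervals) and that the number of non-N bases per chromosome has already been counted from the reference. The function names `merged_length` and `coverage_fraction` are hypothetical.

```python
from collections import defaultdict

def merged_length(intervals):
    """Total bases covered by half-open [start, end) intervals, counting overlaps once."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Close the previous run (if any) and open a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current run.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total

def coverage_fraction(bed_lines, non_n_bases):
    """Map chromosome -> fraction of its non-N bases inside the BED intervals.

    non_n_bases is an assumed, precomputed dict of chromosome -> non-N base count.
    """
    by_chrom = defaultdict(list)
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        by_chrom[chrom].append((int(start), int(end)))
    return {chrom: merged_length(by_chrom.get(chrom, [])) / n
            for chrom, n in non_n_bases.items()}
```

Intervals are merged before summing so that overlapping BED records are not counted twice.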

Supplementary Figure 2 Overall flow of execution of the code used to integrate VCF and BED files from each method and form benchmark VCF and BED files.

Diagram of the input files (light orange boxes) and output files (dark orange boxes) of each script (blue boxes) used to integrate callsets from each method and form the benchmark set.

Supplementary Figure 3 Preprocessing and merging of VCF and BED files from each input callset.

The Callset Table gives metadata about each input callset, including which difficult regions to exclude from each callset’s callable bed file. This table is used to generate callable bed files for each callset and form a merged vcf that includes the genotype from each callset and annotations that indicate whether it falls in each callset’s callable bed file.
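As a rough sketch of the merging step, the Python below builds a union of sites across callsets and records, for each callset, its genotype and whether the site lies in that callset's callable regions after the excluded difficult regions are removed. The in-memory dictionaries stand in for the VCF and BED files, and all function names and data layouts are assumptions made for illustration; the actual pipeline operates on files.

```python
from bisect import bisect_right

def build_index(intervals):
    """Sort and merge half-open intervals; return parallel lists of starts and ends."""
    starts, ends = [], []
    for start, end in sorted(intervals):
        if ends and start <= ends[-1]:
            ends[-1] = max(ends[-1], end)
        else:
            starts.append(start)
            ends.append(end)
    return starts, ends

def contains(index, pos0):
    """True if the 0-based position falls inside any merged interval."""
    starts, ends = index
    i = bisect_right(starts, pos0) - 1
    return i >= 0 and pos0 < ends[i]

def index_regions(regions):
    """regions: {callset: {chrom: [(start, end), ...]}} -> same shape with indexes."""
    return {cs: {chrom: build_index(iv) for chrom, iv in per_chrom.items()}
            for cs, per_chrom in regions.items()}

def inside(indexes, callset, chrom, pos0):
    idx = indexes.get(callset, {}).get(chrom)
    return idx is not None and contains(idx, pos0)

def merge_callsets(calls, callable_regions, excluded_regions):
    """calls: {callset: {(chrom, pos0): genotype}}.

    Returns {site: {callset: {"gt": genotype or None, "callable": bool}}}.
    """
    keep = index_regions(callable_regions)
    drop = index_regions(excluded_regions)
    sites = sorted({site for per_cs in calls.values() for site in per_cs})
    union = {}
    for chrom, pos0 in sites:
        union[(chrom, pos0)] = {
            cs: {"gt": per_cs.get((chrom, pos0)),
                 "callable": inside(keep, cs, chrom, pos0)
                             and not inside(drop, cs, chrom, pos0)}
            for cs, per_cs in calls.items()
        }
    return union
```

Keeping the excluded regions separate from the callable regions mirrors the role of the Callset Table, which lists per callset which difficult regions to remove.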

Supplementary Figure 4 Processing union VCF to arbitrate between callsets and form benchmark VCF.

Process used to determine if a consensus genotype call can be made from all trusted input callsets for each line in the union VCF. In the first iteration, each callset’s callable regions are used to determine if a callset can be trusted, and calls where all trusted callsets agree and at least two different platforms support the call are used to train the one class filtering model in Supplementary Figure 5. In the second iteration, each callset’s callable regions are again used to determine if a callset can be trusted, but filtered calls are also excluded. To be included in the benchmark set, all trusted callsets must have the same genotype, and support from only one platform is needed.
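The two iterations can be expressed as one function with two parameters, as in the Python sketch below. It is a simplified reading of the description above, not the published implementation: each site is assumed to carry one record per callset with hypothetical keys `platform`, `gt`, `callable` and `filtered`, and the platform labels in the usage example are illustrative.

```python
def arbitrate(site_calls, min_platforms, use_filters):
    """Return the consensus genotype for one site, or None if no consensus exists.

    site_calls: list of dicts with keys 'platform', 'gt', 'callable', 'filtered'.
    A callset is trusted when the site is in its callable regions and, if
    use_filters is set, the call was not flagged by the outlier filter.
    """
    trusted = [c for c in site_calls
               if c["callable"] and not (use_filters and c["filtered"])]
    genotypes = {c["gt"] for c in trusted}
    if len(genotypes) != 1:
        # No trusted callset, or the trusted callsets disagree.
        return None
    platforms = {c["platform"] for c in trusted}
    if len(platforms) < min_platforms:
        return None
    return genotypes.pop()

site = [
    {"platform": "Illumina", "gt": "0/1", "callable": True, "filtered": False},
    {"platform": "Complete Genomics", "gt": "0/1", "callable": True, "filtered": True},
    {"platform": "SOLiD", "gt": "1/1", "callable": False, "filtered": False},
]
# First iteration: callable regions only, two platforms required (training set).
training_gt = arbitrate(site, min_platforms=2, use_filters=False)
# Second iteration: filtered calls are also untrusted, one platform suffices.
benchmark_gt = arbitrate(site, min_platforms=1, use_filters=True)
```

In this example the SOLiD call is ignored in both iterations because the site is outside its callable regions, so its discordant genotype does not block the consensus.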

Supplementary Figure 5 One-class model used to filter calls from each input callset that have outlier annotations.

To determine whether a call from each input callset can be trusted, we use a simple one-class model that finds calls from each callset that have outlier values for any of the user-specified annotations. For the training set, we use the sites from each input callset that agree with the consensus calls supported by at least two technologies (found in the first iteration of the process in Supplementary Figure 4). The filtered bed files from each callset are used to annotate the union VCF used in the second and final iteration of Supplementary Figure 4.
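One simple way to realize such a one-class filter is to learn an accepted range for each annotation from the training calls and flag any call with a value outside that range. The Python sketch below does this with trimmed empirical ranges. The `tail` fraction, the `pad` parameter, the function names and the dictionary layout of a call are all assumptions for illustration; the text above does not specify how the outlier cutoffs are chosen.

```python
def fit_bounds(training_calls, annotations, tail=0.05):
    """Learn a (low, high) range per annotation from the training calls.

    tail is the assumed total fraction of training values treated as
    outliers, split evenly between the low and high ends.
    """
    bounds = {}
    for name in annotations:
        values = sorted(c[name] for c in training_calls if name in c)
        if not values:
            continue
        k = int(len(values) * tail / 2)  # values trimmed from each end
        bounds[name] = (values[k], values[len(values) - 1 - k])
    return bounds

def is_outlier(call, bounds):
    """True if any annotated value of the call lies outside its learned range."""
    return any(name in call and not (low <= call[name] <= high)
               for name, (low, high) in bounds.items())

def filtered_intervals(calls, bounds, pad=0):
    """BED-like (chrom, start, end) tuples around calls with outlier annotations.

    pad is a hypothetical number of bases added on each side of the site.
    """
    return [(c["chrom"], max(0, c["pos0"] - pad), c["pos0"] + 1 + pad)
            for c in calls if is_outlier(c, bounds)]
```

The model is one-class in the sense that it is fitted only on calls believed to be correct, so a call is rejected for looking unlike the training set rather than for resembling known errors.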

Supplementary information

Supplementary Information

Supplementary Figures 1–5, Supplementary Tables 1–5 and Supplementary Notes 1–10

Reporting Summary

Supplementary Data 1

Detailed manual curation results for discordant sites


About this article


Cite this article

Zook, J.M., McDaniel, J., Olson, N.D. et al. An open resource for accurately benchmarking small variant and reference calls. Nat Biotechnol 37, 561–566 (2019).
