Analysis

Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls

Nature Biotechnology volume 32, pages 246–251 (2014)


Clinical adoption of human genome sequencing requires methods that output genotypes with known accuracy at millions or billions of positions across a genome. Because of substantial discordance among calls made by existing sequencing methods and algorithms, there is a need for a highly accurate set of genotypes across a genome that can be used as a benchmark. Here we present methods to make high-confidence, single-nucleotide polymorphism (SNP), indel and homozygous reference genotype calls for NA12878, the pilot genome for the Genome in a Bottle Consortium. We minimize bias toward any method by integrating and arbitrating between 14 data sets from five sequencing technologies, seven read mappers and three variant callers. We identify regions for which no confident genotype call could be made, and classify them into different categories based on reasons for uncertainty. Our genotype calls are publicly available on the Genome Comparison and Analytic Testing website to enable real-time benchmarking of any method.
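To make the benchmarking idea concrete, the following is a minimal sketch of how a call set might be scored against benchmark genotypes while ignoring positions outside the confident regions. It is an illustration under stated assumptions, not the authors' integration code or the GCAT implementation: the names `benchmark` and `in_confident_region`, the in-memory dictionaries, and the example coordinates are all hypothetical, and a real comparison would read VCF and BED files and normalize variant representation first.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory types for illustration only; a real pipeline
# would parse VCF (calls) and BED (confident regions) files.
Site = Tuple[str, int]          # (chromosome, 1-based position)
Calls = Dict[Site, str]         # site -> genotype string, e.g. "0/1"
Region = Tuple[str, int, int]   # (chromosome, start, end), 1-based inclusive


def in_confident_region(site: Site, regions: List[Region]) -> bool:
    """Return True if the site lies inside any confident region."""
    chrom, pos = site
    return any(c == chrom and start <= pos <= end for c, start, end in regions)


def benchmark(query: Calls, truth: Calls, regions: List[Region]) -> Dict[str, float]:
    """Score variant calls, counting only sites inside confident regions.

    Inside a confident region, a site absent from `truth` is treated as
    homozygous reference, so a query variant there is a false positive.
    A query call with the wrong genotype counts as both a false positive
    and a false negative.
    """
    q = {s: g for s, g in query.items() if in_confident_region(s, regions)}
    t = {s: g for s, g in truth.items() if in_confident_region(s, regions)}
    tp = sum(1 for s, g in q.items() if t.get(s) == g)
    fp = len(q) - tp
    fn = len(t) - tp
    return {
        "tp": tp,
        "fp": fp,
        "fn": fn,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


# Made-up example data: position 900 falls outside the confident region
# and is therefore excluded from both call sets before scoring.
truth = {("chr1", 100): "0/1", ("chr1", 200): "1/1", ("chr1", 900): "0/1"}
query = {("chr1", 100): "0/1", ("chr1", 300): "0/1", ("chr1", 900): "0/1"}
regions = [("chr1", 50, 500)]
result = benchmark(query, truth, regions)
```

Restricting both call sets to the confident regions is the design choice that matters here: positions where no confident genotype could be made contribute neither true nor false calls.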



Acknowledgements

We thank J. Johnson and A. Varadarajan from the Archon Genomics X Prize and EdgeBio for contributing their whole-genome sequencing data from SOLiD and Illumina, Complete Genomics and Life Technologies for providing bam files for NA12878, and the Broad Institute and 1000 Genomes Project for making publicly available bam and VCF files for NA12878. The Illumina exome data on GCAT were given to the Mittelman laboratory by M. Linderman at Icahn Institute of Genomics and Multiscale Biology of the Icahn School of Medicine at Mount Sinai. We thank the US Food and Drug Administration High Performance Computing staff for their support in running the bioinformatics analyses. Harvard School of Public Health contributions were funded by the Archon Genomics X PRIZE. Certain commercial equipment, instruments or materials are identified in this document. Such identification does not imply recommendation or endorsement by the National Institute of Standards and Technology, nor does it imply that the products identified are necessarily the best available for the purpose.

Author information


  1. Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, Maryland, USA.

    • Justin M Zook & Marc Salit

  2. Bioinformatics Core, Department of Biostatistics, Harvard School of Public Health, Cambridge, Massachusetts, USA.

    • Brad Chapman, Oliver Hofmann & Winston Hide

  3. Arpeggi, Inc., Austin, Texas, USA.

    • Jason Wang & David Mittelman

  4. Virginia Bioinformatics Institute and Department of Biological Sciences, Blacksburg, Virginia, USA.

    • David Mittelman


Contributions

J.M.Z., M.S., B.C., O.H. and W.H. conceived the integration methods. J.M.Z. wrote the code for the integration methods and wrote the main manuscript. D.M. and J.W. designed the GCAT platform, implemented comparison to our genotype calls, and generated figures.

Competing interests

D.M. and J.W. are partners and equity holders in Gene by Gene Ltd., which offers clinical and direct-to-consumer genetic testing.

Corresponding author

Correspondence to Justin M Zook.

Supplementary information

PDF files

  1. Supplementary Text and Figures

    Supplementary Figures 1–38, Supplementary Discussion and Supplementary Tables 1–7
