Technical Report

Comprehensive variation discovery in single human genomes


Complete knowledge of the genetic variation in individual human genomes is a crucial foundation for understanding the etiology of disease. Genetic variation is typically characterized by sequencing individual genomes and comparing reads to a reference. Existing methods do an excellent job of detecting variants in approximately 90% of the human genome; however, calling variants in the remaining 10% of the genome (largely low-complexity sequence and segmental duplications) is challenging. To improve variant calling, we developed a new algorithm, DISCOVAR, and examined its performance on improved, low-cost sequence data. Using a newly created reference set of variants from the finished sequence of 103 randomly chosen fosmids, we find that some standard variant call sets miss up to 25% of variants. We show that the combination of new methods and improved data increases sensitivity by several fold, with the greatest impact in challenging regions of the human genome.




We thank the Broad Institute Genomics Platform for generating data for this project and J. Bochicchio for project management. We thank the 1000 Genomes Project for early access to 250-base reads from the Centre d'Etude du Polymorphisme Humain (CEPH) trio. We thank Illumina for the use of sequence data from the 17-member CEPH pedigree and M. Eberle for detailed information on these data. We thank Z. Iqbal for detailed advice regarding running Cortex on 250-base reads and M. Carneiro and E. Banks for detailed advice regarding running GATK on the same data. We thank H. Li for early access to bwa-mem and for his extensive examination of our manuscript. We thank S. Fisher for help preparing the data cost estimate. We thank M. Daly, M. DePristo, S. Gabriel, Z. Iqbal, D. MacArthur, D. Neafsey and M. Zody for helpful comments. This project has been funded in part with federal funds from the National Human Genome Research Institute, US National Institutes of Health, US Department of Health and Human Services, under grants R01HG003474 and U54HG003067, and in part with federal funds from the National Institute of Allergy and Infectious Diseases, US National Institutes of Health, US Department of Health and Human Services, under contract HHSN272200900018C.

Author information


  1. The Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA.

    Neil I Weisenfeld, Shuangye Yin, Ted Sharpe, Bayo Lau, Ryan Hegarty, Laurie Holmes, Brian Sogoloff, Diana Tabbaa, Louise Williams, Carsten Russ, Chad Nusbaum, Eric S Lander, Iain MacCallum & David B Jaffe



D.B.J., I.M. and C.R. designed and managed the project. S.Y., N.I.W., T.S., B.L., I.M. and D.B.J. designed algorithms and analyzed data. R.H., L.H., B.S., D.T. and L.W. designed and executed laboratory processes. D.B.J., C.N. and E.S.L. shaped the results and developed the manuscript, to which all authors contributed. All authors read and approved the manuscript.

Competing interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to David B Jaffe.

Supplementary information

PDF files

  1. Supplementary Text and Figures

    Supplementary Figures 1–6, Supplementary Tables 1–10 and Supplementary Note.

Text files

  1. Variants called by each caller in fosmid regions.

    This data set provides a tab-delimited list of the variants called in the fosmid regions, showing the fosmid identifier, position, reference allele, alternate allele and a comma-separated list of notes. Each note is either “fosmid,” indicating that the variant is present in the fosmid reference set, or of the form caller=state, where caller is Platinum-100, GATK-250, DISCOVAR or Cortex and state is het, hom or hom+. Here het means that both the variant and reference were called, hom means that the variant but not the reference was called and hom+ means that the variant was called as homozygous. This data set was automatically generated. See also the manually identified defects noted in Supplementary Table 1a.
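
    As an illustration of the format described above, the following Python sketch reads such a file and tallies, for each caller, the fraction of fosmid reference variants that the caller reported. It is a reader's sketch and not part of the published analysis. It assumes that the five columns appear in the order listed, that the file has no header line and that any caller=state note counts as a detection. The function names and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter

# Caller names as listed in the data set description.
CALLERS = ("Platinum-100", "GATK-250", "DISCOVAR", "Cortex")


def parse_variants(handle):
    """Yield (fosmid_id, position, ref, alt, in_reference, calls) per row.

    Assumes five tab-separated columns in the order given in the
    description and no header line (both are assumptions).
    """
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 5:
            continue  # skip blank or malformed lines
        fosmid_id, position, ref, alt, notes = row
        in_reference = False
        calls = {}
        for note in notes.split(","):
            note = note.strip()
            if note == "fosmid":
                in_reference = True  # variant is in the fosmid reference set
            elif "=" in note:
                caller, state = note.split("=", 1)
                calls[caller] = state  # state is het, hom or hom+
        yield fosmid_id, int(position), ref, alt, in_reference, calls


def sensitivity_by_caller(records):
    """Fraction of fosmid reference variants reported by each caller."""
    found = Counter()
    total = 0
    for *_, in_reference, calls in records:
        if not in_reference:
            continue  # only reference-set variants count toward sensitivity
        total += 1
        for caller in calls:
            found[caller] += 1
    return {c: (found[c] / total if total else 0.0) for c in CALLERS}


# Made-up rows for illustration only; not taken from the real data set.
SAMPLE = (
    "1\t1000\tA\tG\tfosmid,DISCOVAR=het,GATK-250=het\n"
    "1\t2000\tC\tCT\tfosmid,DISCOVAR=hom+\n"
    "2\t3000\tG\tT\tCortex=het\n"
)

result = sensitivity_by_caller(parse_variants(io.StringIO(SAMPLE)))
for caller, value in result.items():
    print(f"{caller}\t{value:.2f}")
```

    Variants that carry a caller note but no “fosmid” note are excluded from this tally; they could be counted separately as calls absent from the reference set.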