  • Review Article

Variant calling and benchmarking in an era of complete human genome sequences

Abstract

Genetic variant calling from DNA sequencing has enabled understanding of germline variation in hundreds of thousands of humans. Sequencing technologies and variant-calling methods have advanced rapidly, routinely providing reliable variant calls in most of the human genome. We describe how advances in long reads, deep learning, de novo assembly and pangenomes have expanded access to variant calls in increasingly challenging, repetitive genomic regions, including medically relevant regions, and how new benchmark sets and benchmarking methods illuminate their strengths and limitations. Finally, we explore the possible future of more complete characterization of human genome variation in light of the recent completion of a telomere-to-telomere human genome reference assembly and human pangenomes, and we consider the innovations needed to benchmark their newly accessible repetitive regions and complex variants.

Fig. 1: Challenges of mapping and variant calling in simple repetitive regions.
Fig. 2: Remaining challenges in representing and benchmarking complex variants.
Fig. 3: Mapping challenges in segmental duplications and large structural variants.
Fig. 4: Workflow strategies for variant calling.
Fig. 5: Considerations when generating and using benchmark sets for evaluating variant-calling methods.

References

  1. Olson, N. D. et al. PrecisionFDA Truth Challenge V2: calling variants from short and long reads in difficult-to-map regions. Cell Genom. 2, 100129 (2022). The latest iteration of the precisionFDA Truth Challenge, which serves as a baseline for variant call performance from short and long reads in easy versus more difficult regions using the GIAB v4.2.1 benchmark.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  2. Pan, B. et al. Assessing reproducibility of inherited variants detected with short-read whole genome sequencing. Genome Biol. 23, 2 (2022).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  3. Foox, J. et al. Performance assessment of DNA sequencing platforms in the ABRF Next-Generation Sequencing Study. Nat. Biotechnol. 39, 1129–1140 (2021).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  4. Jain, M. et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol. 36, 338–345 (2018).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  5. Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 37, 1155–1162 (2019). Initial demonstration of the value of accurate long reads for variant calling and assembly.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  6. Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  7. Dwarshuis, N. et al. StratoMod: predicting sequencing and variant calling errors with interpretable machine learning. Preprint at bioRxiv https://doi.org/10.1101/2023.01.20.524401 (2023).

  8. Meacham, F. et al. Identification and correction of systematic error in high-throughput sequence data. BMC Bioinformatics 12, 451 (2011).

    Article  PubMed  PubMed Central  Google Scholar 

  9. Nurk, S. et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res. 30, 1291–1305 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  10. Hannan, A. J. Tandem repeats mediating genetic plasticity in health and disease. Nat. Rev. Genet. 19, 286–298 (2018).

    Article  CAS  PubMed  Google Scholar 

  11. Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020). Public resource of allele frequencies from 141,456 individuals using short reads, made available through the gnomAD genome browser.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  12. Weisburd, B., VanNoy, G. & Watts, N. The addition of short tandem repeat calls to gnomAD. gnomAD https://gnomad.broadinstitute.org/news/2022-01-the-addition-of-short-tandem-repeat-calls-to-gnomad/ (2022).

  13. Ren, J., Gu, B. & Chaisson, M. J. P. vamos: VNTR annotation using efficient motif sets. Preprint at bioRxiv https://doi.org/10.1101/2022.10.07.511371 (2022).

  14. Bakhtiari, M., Shleizer-Burko, S., Gymrek, M., Bansal, V. & Bafna, V. Targeted genotyping of variable number tandem repeats with adVNTR. Genome Res. 28, 1709–1719 (2018).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  15. Vollger, M. R. et al. Segmental duplications and their variation in a complete human genome. Science 376, eabj6965 (2022). Initial analysis of complex segmental duplication variation using the T2T-CHM13 reference.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  16. Zhao, X. et al. Expectations and blind spots for structural variation detection from long-read assemblies and short-read genome sequencing technologies. Am. J. Hum. Genet. 108, 919–928 (2021).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  17. Lincoln, S. E. et al. One in seven pathogenic variants can be challenging to detect by NGS: an analysis of 450,000 patients with implications for clinical sensitivity and genetic test implementation. Genet. Med. 23, 1673–1680 (2021). Results from a large clinical laboratory showing that one in seven pathogenic variants are challenging for short reads owing to low mappability or variant type.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  18. Chin, C.-S. et al. Multiscale analysis of pangenome enables improved representation of genomic diversity for repetitive and clinically relevant genes. Preprint at bioRxiv https://doi.org/10.1101/2022.08.05.502980 (2022).

  19. Behera, S. et al. FixItFelix: improving genomic analysis by fixing reference errors. Genome Biol. 24, 31 (2023).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  20. Aganezov, S. et al. A complete reference genome improves analysis of human genetic variation. Science 376, eabl3533 (2022). Initial analysis showing that a complete human genome reference improves variant calling by fixing reference errors and adding new sequences.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  21. Wagner, J. et al. Curated variation benchmarks for challenging medically relevant autosomal genes. Nat. Biotechnol. 40, 672–680 (2022). Latest benchmark from GIAB, demonstrating that diploid assembly can be used to form reliable small-variant and SV benchmarks for a set of 273 challenging medically relevant genes, and providing a prototype for future assembly-based benchmarks.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  22. Vollger, M. R. et al. Increased mutation rate and interlocus gene conversion within human segmental duplications. Preprint at bioRxiv https://doi.org/10.1101/2022.07.06.498021 (2022).

  23. Sudmant, P. H. et al. An integrated map of structural variation in 2,504 human genomes. Nature 526, 75–81 (2015).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  24. Vollger, M. R. et al. Long-read sequence and assembly of segmental duplications. Nat. Methods 16, 88–94 (2019).

    Article  CAS  PubMed  Google Scholar 

  25. Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  26. Altemose, N. et al. Complete genomic and epigenetic maps of human centromeres. Science 376, eabl4178 (2022).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  27. Mc Cartney, A. M. et al. Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies. Nat. Methods 19, 687–695 (2022).

    Article  Google Scholar 

  28. Goodwin, S., McPherson, J. D. & McCombie, W. R. Coming of age: ten years of next-generation sequencing technologies. Nat. Rev. Genet. 17, 333–351 (2016).

    Article  CAS  PubMed  Google Scholar 

  29. Sedlazeck, F. J., Lee, H., Darby, C. A. & Schatz, M. C. Piercing the dark matter: bioinformatics of long-range sequencing and mapping. Nat. Rev. Genet. 19, 329–346 (2018).

    Article  CAS  PubMed  Google Scholar 

  30. Derrien, T. et al. Fast computation and applications of genome mappability. PLoS ONE 7, e30377 (2012).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  31. Treangen, T. J. & Salzberg, S. L. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat. Rev. Genet. 13, 36–46 (2011).

    Article  PubMed  PubMed Central  Google Scholar 

  32. Ou, S. et al. Effect of sequence depth and length in long-read assembly of the maize inbred NC358. Nat. Commun. 11, 2288 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  33. Ebbert, M. T. W. et al. Systematic analysis of dark and camouflaged genes reveals disease-relevant genes hiding in plain sight. Genome Biol. 20, 97 (2019).

    Article  PubMed  PubMed Central  Google Scholar 

  34. Hardwick, S. A., Deveson, I. W. & Mercer, T. R. Reference standards for next-generation sequencing. Nat. Rev. Genet. 18, 473–484 (2017).

    Article  CAS  PubMed  Google Scholar 

  35. Byrska-Bishop, M. et al. High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. Cell 185, 3426–3440.e19 (2022).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  36. Halldorsson, B. V. et al. The sequences of 150,119 genomes in the UK Biobank. Nature 607, 732–740 (2022).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  37. Roy, S. et al. Standards and guidelines for validating next-generation sequencing bioinformatics pipelines: a joint recommendation of the Association for Molecular Pathology and the College of American Pathologists. J. Mol. Diagn. 20, 4–27 (2018).

    Article  CAS  PubMed  Google Scholar 

  38. Arslan, S. et al. Sequencing by avidity enables high accuracy with low reagent consumption. Preprint at bioRxiv https://doi.org/10.1101/2022.11.03.514117 (2022).

  39. Vergult, S. et al. Mate pair sequencing for the detection of chromosomal aberrations in patients with intellectual disability and congenital malformations. Eur. J. Hum. Genet. 22, 652–659 (2014).

    Article  CAS  PubMed  Google Scholar 

  40. Mahmoud, M. et al. Structural variant calling: the long and the short of it. Genome Biol. 20, 246 (2019). Review of variant-calling methods for SVs, to complement our more general review of variant calling.

    Article  PubMed  PubMed Central  Google Scholar 

  41. Marks, P. et al. Resolving the full spectrum of human genome variation using Linked-Reads. Genome Res. 29, 635–645 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  42. Peters, B. A. et al. Accurate whole-genome sequencing and haplotyping from 10 to 20 human cells. Nature 487, 190–195 (2012).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  43. Weisenfeld, N. I., Kumar, V., Shah, P., Church, D. M. & Jaffe, D. B. Direct determination of diploid genome sequences. Genome Res. 27, 757–767 (2017).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  44. Chen, Z. et al. Ultralow-input single-tube linked-read library method enables short-read second-generation sequencing systems to routinely generate highly accurate and economical long-range sequencing information. Genome Res. 30, 898–909 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  45. Belton, J.-M. et al. Hi-C: a comprehensive technique to capture the conformation of genomes. Methods 58, 268–276 (2012).

    Article  CAS  PubMed  Google Scholar 

  46. Garg, S. et al. Chromosome-scale, haplotype-resolved assembly of human genomes. Nat. Biotechnol. 39, 309–312 (2021).

    Article  CAS  PubMed  Google Scholar 

  47. Rhie, A. et al. Towards complete and error-free genome assemblies of all vertebrate species. Nature 592, 737–746 (2021).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  48. Sanders, A. D., Falconer, E., Hills, M., Spierings, D. C. J. & Lansdorp, P. M. Single-cell template strand sequencing by Strand-seq enables the characterization of individual homologs. Nat. Protoc. 12, 1151–1176 (2017).

    Article  CAS  PubMed  Google Scholar 

  49. Porubsky, D. et al. Fully phased human genome assembly without parental data using single-cell strand sequencing and long reads. Nat. Biotechnol. 39, 302–308 (2021).

    Article  CAS  PubMed  Google Scholar 

  50. Rhoads, A. & Au, K. F. PacBio sequencing and its applications. Genomics Proteomics Bioinformatics 13, 278–289 (2015).

    Article  PubMed  PubMed Central  Google Scholar 

  51. Eid, J. et al. Real-time DNA sequencing from single polymerase molecules. Science 323, 133–138 (2009).

    Article  CAS  PubMed  Google Scholar 

  52. Shafin, K. et al. Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nat. Biotechnol. 38, 1044–1053 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  53. Chaisson, M. J. P. et al. Multi-platform discovery of haplotype-resolved structural variation in human genomes. Nat. Commun. 10, 1784 (2019).

    Article  PubMed  PubMed Central  Google Scholar 

  54. Chaisson, M. J. P. et al. Resolving the complexity of the human genome using single-molecule sequencing. Nature 517, 608–611 (2015).

    Article  CAS  PubMed  Google Scholar 

  55. Audano, P. A. et al. Characterizing the major structural variant alleles of the human genome. Cell 176, 663–675.e19 (2019). Analysis using assemblies to show the prevalence of structural variation in the human genome.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  56. Wick, R. R., Judd, L. M. & Holt, K. E. Performance of neural network basecalling tools for Oxford Nanopore sequencing. Genome Biol. 20, 129 (2019).

    Article  PubMed  PubMed Central  Google Scholar 

  57. Xu, Z. et al. Fast-bonito: a faster deep learning based basecaller for nanopore sequencing. Artif. Intell. Life Sci. 1, 100011 (2021).

    CAS  Google Scholar 

  58. Shafin, K. et al. Haplotype-aware variant calling with PEPPER-Margin-DeepVariant enables high accuracy in nanopore long-reads. Nat. Methods 18, 1322–1332 (2021). Recent iteration of the deep learning-based tool DeepVariant to call small variants from noisy long reads.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  59. Payne, A., Holmes, N., Rakyan, V. & Loose, M. BulkVis: a graphical viewer for Oxford nanopore bulk FAST5 files. Bioinformatics 35, 2193–2198 (2019).

    Article  CAS  PubMed  Google Scholar 

  60. Logsdon, G. A. et al. The structure, function and evolution of a complete human chromosome 8. Nature 593, 101–107 (2021).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  61. Cao, H. et al. Rapid detection of structural variation in a human genome using nanochannel-based genome mapping technology. Gigascience 3, 34 (2014).

    Article  PubMed  PubMed Central  Google Scholar 

  62. Kaiser, M. D. et al. Automated structural variant verification in human genomes using single-molecule electronic DNA mapping. Preprint at bioRxiv https://doi.org/10.1101/140699 (2017).

    Article  Google Scholar 

  63. Yuan, Y., Chung, C. Y.-L. & Chan, T.-F. Advances in optical mapping for genomic research. Comput. Struct. Biotechnol. J. 18, 2051–2062 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  64. Mantere, T. et al. Optical genome mapping enables constitutional chromosomal aberration detection. Am. J. Hum. Genet. 108, 1409–1422 (2021).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  65. Gerding, W. M. et al. Optical genome mapping reveals additional prognostic information compared to conventional cytogenetics in AML/MDS patients. Int. J. Cancer 150, 1998–2011 (2022).

    Article  CAS  PubMed  Google Scholar 

  66. Coster, W. D., De Coster, W., Weissensteiner, M. H. & Sedlazeck, F. J. Towards population-scale long-read sequencing. Nat. Rev. Genet. 22, 572–587 (2021). Recent review of how long-read sequencing is increasingly used to study variation in large numbers of samples.

    Article  PubMed  PubMed Central  Google Scholar 

  67. Poplin, R., Zook, J. M. & DePristo, M. Challenges of accuracy in germline clinical sequencing data. JAMA 326, 268–269 (2021).

    Article  PubMed  PubMed Central  Google Scholar 

  68. Cortés-Ciriano, I., Gulhan, D. C., Lee, J. J.-K., Melloni, G. E. M. & Park, P. J. Computational analysis of cancer genome sequencing data. Nat. Rev. Genet. 23, 298–314 (2022). Recent review of somatic variant calling, to complement the focus on germline variants in this Review.

    Article  PubMed  Google Scholar 

  69. Jain, C., Rhie, A., Hansen, N. F., Koren, S. & Phillippy, A. M. Long-read mapping to repetitive reference sequences using Winnowmap2. Nat. Methods 19, 705–710 (2022).

    Article  CAS  PubMed  Google Scholar 

  70. Jain, C. et al. Weighted minimizer sampling improves long read mapping. Bioinformatics 36, i111–i118 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  71. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  72. Sedlazeck, F. J. et al. Accurate detection of complex structural variations using single-molecule sequencing. Nat. Methods 15, 461–468 (2018).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  73. Prodanov, T. & Bansal, V. Sensitive alignment using paralogous sequence variants improves long-read mapping and variant calling in segmental duplications. Nucleic Acids Res. 48, e114 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  74. Zheng, Z. et al. Symphonizing pileup and full-alignment for deep learning-based long-read variant calling. Nat. Comput. Sci. 2, 797–803 (2022).

    Article  Google Scholar 

  75. AlDubayan, S. H. et al. Detection of pathogenic variants with germline genetic testing using deep learning vs standard methods in patients with prostate cancer and melanoma. JAMA 324, 1957–1969 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  76. Whalen, S., Schreiber, J., Noble, W. S. & Pollard, K. S. Navigating the pitfalls of applying machine learning in genomics. Nat. Rev. Genet. 23, 169–181 (2022).

    Article  CAS  PubMed  Google Scholar 

  77. Sapoval, N. et al. Current progress and open challenges for applying deep learning across the biosciences. Nat. Commun. 13, 1728 (2022).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  78. Baid, G. et al. DeepConsensus improves the accuracy of sequences with a gap-aware sequence transformer. Nat. Biotechnol. 41, 232–238 (2023).

    CAS  PubMed  Google Scholar 

  79. Almogy, G. et al. Cost-efficient whole genome-sequencing using novel mostly natural sequencing-by-synthesis chemistry and open fluidics platform. Preprint at bioRxiv https://doi.org/10.1101/2022.05.29.493900 (2022).

  80. Sahraeian, S. M. E. et al. Deep convolutional neural networks for accurate somatic mutation detection. Nat. Commun. 10, 1041 (2019).

    Article  PubMed  PubMed Central  Google Scholar 

  81. Luo, R., Sedlazeck, F. J., Lam, T.-W. & Schatz, M. C. A multi-task convolutional deep neural network for variant calling in single molecule sequencing. Nat. Commun. 10, 998 (2019).

    Article  PubMed  PubMed Central  Google Scholar 

  82. Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).

    Article  CAS  PubMed  Google Scholar 

  83. Van der Auwera GA & O’Connor BD. Genomics in the Cloud: Using Docker, GATK, and WDL in Terra 1st edn (O’Reilly, 2020).

  84. Cooke, D. P., Wedge, D. C. & Lunter, G. A unified haplotype-based method for accurate and comprehensive variant calling. Nat. Biotechnol. 39, 885–892 (2021).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  85. Freed, D. et al. DNAscope: high accuracy small variant calling using machine learning. Preprint at bioRxiv https://doi.org/10.1101/2022.05.20.492556 (2022).

  86. Avsec, Ž. et al. The Kipoi repository accelerates community exchange and reuse of predictive models for genomics. Nat. Biotechnol. 37, 592–600 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  87. Dolzhenko, E. et al. ExpansionHunter: a sequence-graph-based tool to analyze variation in short tandem repeat regions. Bioinformatics 35, 4754–4756 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  88. Willems, T. et al. Genome-wide profiling of heritable and de novo STR variations. Nat. Methods 14, 590–592 (2017).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  89. Mousavi, N., Shleizer-Burko, S., Yanicky, R. & Gymrek, M. Profiling the genome-wide landscape of tandem repeat expansions. Nucleic Acids Res. 47, e90 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  90. Tang, H. et al. Profiling of short-tandem-repeat disease alleles in 12,632 human whole genomes. Am. J. Hum. Genet. 101, 700–715 (2017).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  91. Hall, C. L. et al. Accurate profiling of forensic autosomal STRs using the Oxford Nanopore Technologies MinION device. Forensic Sci. Int. Genet. 56, 102629 (2022).

    Article  CAS  PubMed  Google Scholar 

  92. Fang, L. et al. DeepRepeat: direct quantification of short tandem repeats on signal data from nanopore sequencing. Genome Biol. 23, 108 (2022).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  93. PacificBiosciences. Tandem repeat genotyping and visualization from PacBio HiFi data. GitHub https://github.com/PacificBiosciences/trgt (2023).

  94. Patterson, M. et al. WhatsHap: weighted haplotype assembly for future-generation sequencing reads. J. Comput. Biol. 22, 498–509 (2015).

    Article  CAS  PubMed  Google Scholar 

  95. Edge, P., Bafna, V. & Bansal, V. HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies. Genome Res. 27, 801–812 (2017).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  96. Garg, S. et al. A haplotype-aware de novo assembly of related individuals using pedigree sequence graph. Bioinformatics 36, 2385–2392 (2020).

    Article  CAS  PubMed  Google Scholar 

  97. Chin, C.-S. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods 13, 1050–1054 (2016).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  98. Koren, S. et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27, 722–736 (2017).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  99. Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540–546 (2019).

    Article  CAS  PubMed  Google Scholar 

  100. Chin, C.-S. & Khalak, A. Human genome assembly in 100 minutes. Preprint at bioRxiv https://doi.org/10.1101/705616 (2019).

  101. Jarvis, E. D. et al. Semi-automated assembly of high-quality diploid human reference genomes. Nature 611, 519–531 (2022).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  102. Liao, W.-W. et al. A draft human pangenome reference. Preprint at bioRxiv https://doi.org/10.1101/2022.07.09.499321 (2022). First manuscript from the Human Pangenome Reference Consortium about their initial pangenome formed from accurate diploid assemblies, which can be used to improve variant calling.

  103. Kulski, J. K., Suzuki, S. & Shiina, T. Human leukocyte antigen super-locus: nexus of genomic supergenes, SNPs, indels, transcripts, and haplotypes. Hum. Genome Var. 9, 49 (2022).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  104. Sherman, R. M. & Salzberg, S. L. Pan-genomics in the human genome era. Nat. Rev. Genet. 21, 243–254 (2020). Review of pangenomes, including how past work on pangenomes for other species can inform work on human pangenomes.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  105. Rakocevic, G. et al. Fast and accurate genomic analyses using genome graphs. Nat. Genet. 51, 354–362 (2019).

    Article  CAS  PubMed  Google Scholar 

  106. Tetikol, H. S. et al. Pan-African genome demonstrates how population-specific genome graphs improve high-throughput sequencing data analysis. Nat. Commun. 13, 4384 (2022).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  107. Ebler, J. et al. Pangenome-based genome inference allows efficient and accurate genotyping across a wide spectrum of variant classes. Nat. Genet. 54, 518–525 (2022).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  108. Sirén, J. et al. Pangenomics enables genotyping of known structural variants in 5202 diverse genomes. Science 374, abg8871 (2021).

    Article  PubMed  PubMed Central  Google Scholar 

  109. Li, H., Feng, X. & Chu, C. The design and construction of reference pangenome graphs with minigraph. Genome Biol. 21, 265 (2020).

    Article  PubMed  PubMed Central  Google Scholar 

  110. Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).

    Article  PubMed  Google Scholar 

  111. Dewey, F. E. et al. Phased whole-genome genetic risk in a family quartet using a major allele reference sequence. PLoS Genet 7, e1002280 (2011).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  112. Shumate, A. et al. Assembly and annotation of an Ashkenazi human reference genome. Genome Biol. 21, 129 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  113. Chen, N.-C., Solomon, B., Mun, T., Iyer, S. & Langmead, B. Reference flow: reducing reference bias using multiple population genomes. Genome Biol. 22, 8 (2021).

    Article  PubMed  PubMed Central  Google Scholar 

  114. Krusche, P. et al. Best practices for benchmarking germline small-variant calls in human genomes. Nat. Biotechnol. 37, 555–560 (2019). Primary product of the GA4GH Benchmarking Team, including a summary of best practices for benchmarking variant calls.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  115. Zook, J. M. et al. An open resource for accurately benchmarking small variant and reference calls. Nat. Biotechnol. 37, 561–566 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  116. Chin, C.-S. et al. A diploid assembly-based benchmark for variants in the major histocompatibility complex. Nat. Commun. 11, 4794 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  117. Zook, J. M. et al. A robust benchmark for detection of germline large deletions and insertions. Nat. Biotechnol. 38, 1347–1355 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  118. Li, H. Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics 30, 2843–2851 (2014).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  119. Zook, J. M. et al. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat. Biotechnol. 32, 246–251 (2014).

    Article  CAS  PubMed  Google Scholar 

  120. Eberle, M. A. et al. A reference data set of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree. Genome Res. 27, 157–164 (2017).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  121. Cleary, J. G. et al. Comparing variant call files for performance benchmarking of next-generation sequencing variant calling pipelines. Preprint at bioRxiv https://doi.org/10.1101/023754 (2015).

  122. Ewing, A. D. et al. Combining tumor genome simulation with crowdsourcing to benchmark somatic single-nucleotide-variant detection. Nat. Methods 12, 623–630 (2015).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  123. Li, H. et al. A synthetic-diploid benchmark for accurate variant-calling evaluation. Nat. Methods 15, 595–597 (2018).

    Article  PubMed  PubMed Central  Google Scholar 

  124. Jones, W. et al. A verified genomic reference sample for assessing performance of cancer panels detecting small variants of low allele frequency. Genome Biol. 22, 111 (2021).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  125. Zhao, Y. et al. Whole genome and exome sequencing reference datasets from a multi-center and cross-platform benchmark study. Sci. Data 8, 296 (2021).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  126. Fang, L. T. et al. Establishing community reference samples, data and call sets for benchmarking cancer mutation detection using whole-genome sequencing. Nat. Biotechnol. 39, 1151–1160 (2021).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  127. Wagner, J. et al. Benchmarking challenging small variants with linked and long reads. Cell Genom. 2, 100128 (2022).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  128. Cleary, J. G. et al. Joint variant and de novo mutation identification on pedigrees from high-throughput sequencing data. J. Comput. Biol. 21, 405–419 (2014).

    Article  CAS  PubMed  Google Scholar 

  129. English, A. C. et al. Assessing structural variation in a personal genome — towards a human reference diploid genome. BMC Genomics 16, 286 (2015).

    Article  PubMed  PubMed Central  Google Scholar 

  130. Mu, J. C. et al. Leveraging long read sequencing from a single individual to provide a comprehensive resource for benchmarking variant calling methods. Sci. Rep. 5, 14493 (2015).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  131. Zhou, B. et al. Extensive and deep sequencing of the Venter/HuRef genome for developing and benchmarking genome analysis tools. Sci. Data 5, 180261 (2018).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  132. Jun, G. et al. muCNV: genotyping structural variants for population-level sequencing. Bioinformatics 37, 2055–2057 (2021).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  133. Collins, R. L. et al. A structural variation reference for medical and population genetics. Nature 581, 444–451 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  134. Chen, X. et al. Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications. Bioinformatics 32, 1220–1222 (2016).

    Article  CAS  PubMed  Google Scholar 

  135. Chen, S. et al. Paragraph: a graph-based structural variant genotyper for short-read sequence data. Genome Biol. 20, 291 (2019).

    Article  PubMed  PubMed Central  Google Scholar 

  136. Chiang, C. et al. SpeedSeq: ultra-fast personal genome analysis and interpretation. Nat. Methods 12, 966–968 (2015).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  137. Kirsche, M. et al. Jasmine and Iris: population-scale structural variant comparison and analysis. Nat. Methods 20, 408–417 (2023).

    Article  CAS  PubMed  Google Scholar 

  138. Chowdhury, M., Pedersen, B. S., Sedlazeck, F. J., Quinlan, A. R. & Layer, R. M. Searching thousands of genomes to classify somatic and novel structural variants using STIX. Nat. Methods 19, 445–448 (2022).

  139. Rhie, A. et al. The complete sequence of a human Y chromosome. Preprint at bioRxiv https://doi.org/10.1101/2022.12.01.518724 (2022).

  140. Lee, A. Y. et al. Combining accurate tumor genome simulation with crowdsourcing to benchmark somatic structural variant detection. Genome Biol. 19, 188 (2018).

  141. Samadian, S., Bruce, J. P. & Pugh, T. J. Bamgineer: introduction of simulated allele-specific copy number variants into exome and targeted sequence data sets. PLoS Comput. Biol. 14, e1006080 (2018).

  142. Li, Z. et al. VarBen: generating in silico reference data sets for clinical next-generation sequencing bioinformatics pipeline evaluation. J. Mol. Diagn. 23, 285–299 (2021).

  143. Xia, L. C. et al. SVEngine: an efficient and versatile simulator of genome structural variations with features of cancer clonal evolution. Gigascience 7, (2018).

  144. Duncavage, E. J. et al. A model study of in silico proficiency testing for clinical next-generation sequencing. Arch. Pathol. Lab. Med. 140, 1085–1091 (2016).

  145. Duncavage, E. J. et al. Recommendations for the use of in silico approaches for next generation sequencing bioinformatic pipeline validation: a joint report of the Association for Molecular Pathology, Association for Pathology Informatics, and College of American Pathologists. J. Mol. Diagn. 25, 3–16 (2023).

  146. Reis, A. L. M. et al. Using synthetic chromosome controls to evaluate the sequencing of difficult regions within the human genome. Genome Biol. 23, 19 (2022).

  147. Griffith, M. et al. Optimizing cancer genome sequencing and analysis. Cell Syst. 1, 210–223 (2015).

  148. Shand, M. et al. A validated lineage-derived somatic truth data set enables benchmarking in cancer genome analysis. Commun. Biol. 3, 744 (2020).

  149. Craig, D. W. et al. A somatic reference standard for cancer genome sequencing. Sci. Rep. 6, 24607 (2016).

  150. Jeffares, D. C. et al. Transient structural variations have strong effects on quantitative traits and reproductive isolation in fission yeast. Nat. Commun. 8, 14061 (2017).

  151. English, A. C., Menon, V. K., Gibbs, R. A., Metcalf, G. A. & Sedlazeck, F. J. Truvari: refined structural variant comparison preserves allelic diversity. Genome Biol. 23, 271 (2022). Describes the Truvari tool, which has been important for benchmarking SVs and tandem repeats by comparing different representations of variants.

  152. Alser, M. et al. From molecules to genomic variations: accelerating genome analysis via intelligent algorithms and architectures. Comput. Struct. Biotechnol. J. 20, 4579–4599 (2022).

  153. Shneiderman, B. in The Craft of Information Visualization (eds. Bederson, B. B. & Shneiderman, B.) 364–371 (Morgan Kaufmann, 2003).

  154. Belyeu, J. R. et al. SV-plaudit: a cloud-based framework for manually curating thousands of structural variants. Gigascience 7, giy064 (2018).

  155. Chapman, L. M. et al. A crowdsourced set of curated structural variants for the human genome. PLoS Comput. Biol. 16, e1007933 (2020).

  156. Guarracino, A., Heumos, S., Nahnsen, S., Prins, P. & Garrison, E. ODGI: understanding pangenome graphs. Bioinformatics 38, 3319–3326 (2022).

  157. Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at arXiv https://doi.org/10.48550/arXiv.1303.3997 (2013).

  158. Marçais, G. et al. MUMmer4: a fast and versatile genome alignment system. PLoS Comput. Biol. 14, e1005944 (2018).

  159. Eggertsson, H. P. et al. GraphTyper2 enables population-scale genotyping of structural variation using pangenome graphs. Nat. Commun. 10, 5402 (2019).

  160. Mitchell, M. et al. in Proceedings of the Conference on Fairness, Accountability, and Transparency 220–229 (Association for Computing Machinery, 2019).

  161. Medvedev, P. The theoretical analysis of sequencing bioinformatics algorithms and beyond. Preprint at arXiv https://doi.org/10.48550/arXiv.2205.01785 (2022).

Acknowledgements

The authors thank the members of the Genome in a Bottle Consortium, Human Pangenome Reference Consortium and Telomere to Telomere Consortium for helpful discussions about the strengths and limitations of the various technologies and bioinformatics methods. Certain commercial equipment, instruments or materials are identified to specify adequately experimental conditions or reported results. Such identification does not imply recommendation or endorsement by the National Institute of Standards and Technology, nor does it imply that the equipment, instruments or materials identified are necessarily the best available for the purpose.

Author information

Contributions

All authors contributed to all aspects of the manuscript.

Corresponding author

Correspondence to Justin M. Zook.

Ethics declarations

Competing interests

F.J.S. has received support from Oxford Nanopore Technologies, Pacific Biosciences, Illumina and Genentech. The other authors declare no competing interests.

Peer review

Peer review information

Nature Reviews Genetics thanks Kai Ye and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Related links

GA4GH/GIAB stratifications: https://github.com/genome-in-a-bottle/genome-stratifications

Genome in a Bottle Consortium: http://www.genomeinabottle.org/

gnomAD: https://gnomad.broadinstitute.org/

Human Pangenome Reference Consortium: https://humanpangenome.org/

T2T-CHM13: https://github.com/marbl/CHM13

Glossary

Acrocentric arms

Short arms of human chromosomes 13, 14, 15, 21 and 22, which are known to be enriched with satellite DNA, segmental duplications and transposable element insertions. They also contain long tracts of ribosomal DNAs. They are highly similar in repeat structure and sequence content.

Admixed ancestries

Individuals with ancestors coming from multiple populations that had previously diverged.

Benchmarking variants

The process of comparing a variant callset (the query callset) to the benchmark callset in the benchmark regions in order to identify true positives, false positives and false negatives.
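As a rough illustration of this definition, the sketch below classifies calls as true positives, false positives and false negatives by exact matching within benchmark regions. The `Variant` record, the region layout and the function names are hypothetical choices made for this example. Exact matching is a simplifying assumption: dedicated comparison tools also reconcile different representations of the same variant, which this sketch does not attempt.

```python
from typing import Dict, Iterable, List, NamedTuple, Set, Tuple


class Variant(NamedTuple):
    """Hypothetical minimal variant record used only for this sketch."""
    chrom: str
    pos: int  # 1-based position
    ref: str
    alt: str


# Hypothetical region layout: (chrom, start, end), 1-based and inclusive.
Region = Tuple[str, int, int]


def in_regions(variant: Variant, regions: List[Region]) -> bool:
    """Return True if the variant position falls inside any benchmark region."""
    return any(
        chrom == variant.chrom and start <= variant.pos <= end
        for chrom, start, end in regions
    )


def benchmark(
    query: Iterable[Variant],
    truth: Iterable[Variant],
    regions: List[Region],
) -> Dict[str, Set[Variant]]:
    """Compare a query callset to a benchmark callset inside benchmark regions.

    Assumption: two calls match only if chrom, pos, ref and alt are identical.
    """
    query_in = {v for v in query if in_regions(v, regions)}
    truth_in = {v for v in truth if in_regions(v, regions)}
    return {
        "tp": query_in & truth_in,  # called and in the benchmark
        "fp": query_in - truth_in,  # called but not in the benchmark
        "fn": truth_in - query_in,  # in the benchmark but not called
    }


if __name__ == "__main__":
    truth = [Variant("chr1", 100, "A", "G"), Variant("chr1", 200, "C", "T")]
    query = [Variant("chr1", 100, "A", "G"), Variant("chr1", 300, "T", "C")]
    result = benchmark(query, truth, [("chr1", 1, 500)])
    print({key: len(value) for key, value in result.items()})
```

Calls outside the benchmark regions are ignored entirely, so they count neither for nor against the query callset.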

Benchmark sets

Set of variants and regions defined to reliably identify false positives and false negatives, also sometimes called ‘high-confidence’, ‘truth’, ‘baseline’ and ‘gold standard’.

Centromeres

Genomic regions, one per chromosome, that map the location of kinetochore assembly, typically marked as a primary constriction on a metaphase chromosome.

Circular consensus sequencing

A sequencing method in which a single molecule is circularized and sequenced multiple times to improve accuracy (for example, in Pacific Biosciences HiFi sequencing).

De novo assembly

Analysis of DNA reads to produce the genome sequence of an individual without mapping individual reads to a reference genome. Increasingly, human genome assemblies can be haplotype-resolved (phased), such that separate assembled sequences are produced for the copies of each chromosome coming from the mother and father.

Genome in a Bottle Consortium

(GIAB). A public–private–academic consortium formed by the US National Institute of Standards and Technology (NIST) in 2013, involving a broad community from government, academia, commercial technology developers and clinical laboratories. Its aim is to develop authoritatively characterized genomes that can be used to benchmark human genome variant calls.

Germline variant

A variant attributed to the initial sequence of an organism at conception, and typically found in all the cells in an individual.

Haplotype

A region of DNA containing multiple variants (or alleles) that are frequently inherited together.

Indels

Variants that are insertions and deletions of sequence, typically 1 to 49 bp in size.

Long interspersed nuclear elements

(LINEs). A family of transposons, with approximately 100,000 truncated copies and a few thousand full-length 6,000-bp copies in the human genome, causing mapping challenges.

N50

A summary measure of read length distribution: at least 50% of the bases in the reads are in reads of length equal to or greater than the N50 value. Similarly, for de novo assemblies, at least 50% of the bases in the assembled contigs are in contigs of length equal to or greater than the N50 value.
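A minimal sketch of how N50 can be computed from a list of read or contig lengths is given below. The function name `n50` is a hypothetical choice for this example, and the sketch follows the common convention of returning the length at which the running total, taken from longest to shortest, first reaches half of all bases.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of read or contig lengths.

    Lengths are visited from longest to shortest; the N50 is the length at
    which the cumulative sum first reaches at least half of the total bases.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("n50 requires at least one length")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point issues with total / 2.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for positive lengths")


if __name__ == "__main__":
    print(n50([2, 3, 4, 5, 6]))
```

In the usage example the total is 20 bases; the two longest entries (6 and 5) sum to 11, which passes the halfway point, so the N50 is 5.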

Pangenome references

Collection of many genomes used as references (sometimes, but not always, represented as graphs) in addition to the standard linear genome reference assemblies.

Pericentromeric heterochromatin regions

Typically multi-megabase-sized regions directly adjacent to centromeres that are enriched with satellite DNA, segmental duplications and transposable elements. These regions are associated with darkly staining constitutive heterochromatin.

Phasing

The process of assigning heterozygous variants to the same haplotype (for example, the maternal copy of the chromosome contains both variants) or to opposite haplotypes (one variant is on the maternal copy and the other is on the paternal copy).

Precision

The fraction of query variants in the benchmark regions that match the benchmark variants, or true positives/(true positives + false positives).

Read mapping

Aligning a given read to a reference.

Reads

Small sequence fragments from larger molecules generated by a given sequencing technology; the length can range from 100 bp to >1 million bp, depending on the sequencing method.

Recall

The fraction of benchmark variants that are matched by query variants, or true positives/(true positives + false negatives).
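The precision and recall formulas given in this glossary can be written out as a short sketch. The function names are hypothetical, and returning NaN when a denominator is zero is an assumption made here for illustration rather than a convention taken from the article.

```python
import math


def precision(true_positives: int, false_positives: int) -> float:
    """true positives / (true positives + false positives)."""
    denominator = true_positives + false_positives
    # Assumption: report NaN when there are no query calls to evaluate.
    return true_positives / denominator if denominator else math.nan


def recall(true_positives: int, false_negatives: int) -> float:
    """true positives / (true positives + false negatives)."""
    denominator = true_positives + false_negatives
    # Assumption: report NaN when there are no benchmark variants to find.
    return true_positives / denominator if denominator else math.nan


if __name__ == "__main__":
    # Hypothetical counts: 90 matching calls, 10 extra calls, 30 missed variants.
    print(precision(90, 10), recall(90, 30))
```

With these hypothetical counts, 90 of 100 query calls match the benchmark and 90 of 120 benchmark variants are recovered.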

Reference genome assembly

A haploid genome assembly to which sequencing reads are mapped and variants are called. The current versions in common use are GRCh37 (also known as hg19), GRCh38 (also known as hg38) and T2T-CHM13.

Reference material

A material that is sufficiently stable (over time) and homogeneous (between vials) for its applications. For example, genomic reference materials from the US National Institute of Standards and Technology (NIST) are extensively characterized to develop benchmark variants and regions to reliably identify false positives and false negatives.

Satellite DNA

Highly repetitive regions that originally were defined by their density owing to a unique composition of A, C, G and T bases. Satellite DNA regions are often characterized by tandem repeats organized in very long arrays and are embedded in regions known to be enriched in silent, constitutive heterochromatin.

Scaffolding

The process of connecting assembled contigs even when the intervening sequence is unknown.

Segmental duplications

Long DNA sequences that are highly similar to each other in the reference genome assembly, typically at least 1,000 bp in length and not a transposable element, tandem repeat or satellite DNA. There is some overlap between variable number tandem repeat (VNTR) and segmental duplication annotations, particularly for tandem repeat unit sizes longer than 1,000 bp, as occurs in the medically relevant genes LPA and CR1.

Sequencing Quality Control Consortium

(SEQC). A consortium formed by the US Food and Drug Administration (FDA) to compare sequencing methods and understand sources of variability.

Short tandem repeats

(STRs). Many consecutive repeats of 2-bp to 6-bp sequence units.

Single-nucleotide variants

(SNVs). Variants that are single-base substitutions. They are also commonly called single-nucleotide polymorphisms (SNPs) when they occur at an appreciable frequency (typically >1%) in the germ lines of the wider population.

Somatic variant

A variant attributed to a mutation after conception. Only some cells in the organism will have this variant; they are most frequently detected in cancer tissues or blood.

Structural variants

(SVs). Typically defined as variants of at least 50 bp in size.

Variable number tandem repeats

(VNTRs). Many consecutive repeats of >6-bp sequence units.
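The size thresholds used in this glossary for SNVs, indels, structural variants, STRs and VNTRs can be summarized in a small sketch. The function names and the "other" label are hypothetical, and measuring size as the difference between REF and ALT allele lengths is a simplifying assumption that does not handle symbolic alleles or balanced events such as inversions.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Classify a variant from its REF and ALT alleles using glossary sizes.

    Assumption: size is the absolute difference in allele lengths.
    """
    if len(ref) == 1 and len(alt) == 1 and ref != alt:
        return "SNV"  # single-base substitution
    size = abs(len(ref) - len(alt))
    if 1 <= size <= 49:
        return "indel"  # insertions and deletions, typically 1 to 49 bp
    if size >= 50:
        return "SV"  # structural variants, typically at least 50 bp
    return "other"  # e.g. multi-base substitutions, not defined in the glossary


def classify_tandem_repeat(unit_length: int) -> str:
    """Classify a tandem repeat by the length of its repeat unit in bp."""
    if 2 <= unit_length <= 6:
        return "STR"  # 2-bp to 6-bp units
    if unit_length > 6:
        return "VNTR"  # units longer than 6 bp
    return "other"  # 1-bp units (homopolymers) fall outside both definitions


if __name__ == "__main__":
    print(classify_variant("A", "G"))
    print(classify_variant("A", "ATT"))
    print(classify_variant("A", "A" + "T" * 50))
    print(classify_tandem_repeat(3), classify_tandem_repeat(7))
```

The glossary describes these boundaries as typical rather than strict, so the cut-offs here should be read as conventions, not hard rules.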

Cite this article

Olson, N.D., Wagner, J., Dwarshuis, N. et al. Variant calling and benchmarking in an era of complete human genome sequences. Nat Rev Genet 24, 464–483 (2023). https://doi.org/10.1038/s41576-023-00590-0

  • DOI: https://doi.org/10.1038/s41576-023-00590-0