A robust benchmark for detection of germline large deletions and insertions

Zook, Justin M.; Hansen, Nancy F.; Olson, Nathan D.; Chapman, Lesley; Mullikin, James C.; Xiao, Chunlin; Sherry, Stephen; Koren, Sergey; Phillippy, Adam M.; Boutros, Paul C.; Sahraeian, Sayed Mohammad E.; Huang, Vincent; Rouette, Alexandre; Alexander, Noah; Mason, Christopher E.; Hajirasouliha, Iman; Ricketts, Camir; Lee, Joyce; Tearle, Rick; Fiddes, Ian T.; Barrio, Alvaro Martinez; Wala, Jeremiah; Carroll, Andrew; Ghaffari, Noushin; Rodriguez, Oscar L.; Bashir, Ali; Jackman, Shaun; Farrell, John J.; Wenger, Aaron M.; Alkan, Can; Soylev, Arda; Schatz, Michael C.; Garg, Shilpa; Church, George; Marschall, Tobias; Chen, Ken; Fan, Xian; English, Adam C.; Rosenfeld, Jeffrey A.; Zhou, Weichen; Mills, Ryan E.; Sage, Jay M.; Davis, Jennifer R.; Kaiser, Michael D.; Oliver, John S.; Catalano, Anthony P.; Chaisson, Mark J. P.; Spies, Noah; Sedlazeck, Fritz J.; Salit, Marc

doi:10.1038/s41587-020-0538-8

Resource
Published: 15 June 2020

Nature Biotechnology volume 38, pages 1347–1355 (2020)

An Author Correction to this article was published on 22 July 2020

This article has been updated

Abstract

New technologies and analysis methods are enabling genomic structural variants (SVs) to be detected with ever-increasing accuracy, resolution and comprehensiveness. To help translate these methods to routine research and clinical practice, we developed a sequence-resolved benchmark set for identification of both false-negative and false-positive germline large insertions and deletions. To create this benchmark for a broadly consented son in a Personal Genome Project trio with broadly available cells and DNA, the Genome in a Bottle Consortium integrated 19 sequence-resolved variant calling methods from diverse technologies. The final benchmark set contains 12,745 isolated, sequence-resolved insertion (7,281) and deletion (5,464) calls ≥50 base pairs (bp). The Tier 1 benchmark regions, for which any extra calls are putative false positives, cover 2.51 Gbp and 5,262 insertions and 4,095 deletions supported by ≥1 diploid assembly. We demonstrate that the benchmark set reliably identifies false negatives and false positives in high-quality SV callsets from short-, linked- and long-read sequencing and optical mapping.
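The abstract describes using the benchmark to identify false negatives and false positives in a callset. The bookkeeping behind such a comparison can be sketched as follows, assuming a deliberately simple matching rule that is not the authors' comparison method: two calls match when they share chromosome and type, start within a fixed distance of each other, and have similar sizes. The `SV` class, the `matches` and `compare` functions, and the thresholds `max_dist` and `min_size_ratio` are hypothetical names and values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SV:
    chrom: str
    pos: int      # 1-based start position
    svtype: str   # "DEL" or "INS"
    size: int     # length of the deleted or inserted sequence in bp (> 0)


def matches(a: SV, b: SV, max_dist: int = 1000, min_size_ratio: float = 0.7) -> bool:
    """Hypothetical rule: same chromosome and type, nearby start, similar size."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return False
    if abs(a.pos - b.pos) > max_dist:
        return False
    return min(a.size, b.size) / max(a.size, b.size) >= min_size_ratio


def compare(benchmark: list[SV], query: list[SV]) -> dict[str, int]:
    """Greedily pair each benchmark SV with at most one unused query SV."""
    unmatched_query = list(query)
    tp = 0
    for b in benchmark:
        hit = next((q for q in unmatched_query if matches(b, q)), None)
        if hit is not None:
            unmatched_query.remove(hit)
            tp += 1
    # Benchmark SVs without a partner are false negatives;
    # query SVs without a partner are putative false positives.
    return {"TP": tp, "FN": len(benchmark) - tp, "FP": len(unmatched_query)}


benchmark = [SV("1", 10000, "DEL", 300), SV("1", 50000, "INS", 120), SV("2", 7000, "DEL", 60)]
query = [SV("1", 10020, "DEL", 290), SV("1", 90000, "INS", 500), SV("2", 7000, "INS", 60)]
print(compare(benchmark, query))  # {'TP': 1, 'FN': 2, 'FP': 2}
```

Greedy pairing is the simplest choice and can miss the best pairing in dense regions. A real comparison would also restrict both callsets to the benchmark regions first, since extra calls count as putative false positives only inside those regions.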

**Fig. 1: Pairwise comparison of sequence-resolved SV callsets obtained from multiple technologies and SV callers for SVs ≥50 bp from HG002.**

**Fig. 2: Process to integrate SV callsets and diploid assemblies from different technologies and analysis methods and form the benchmark set.**

**Fig. 3: Size distributions of deletions and insertions in the benchmark set.**

**Fig. 4: Support for benchmark SVs by long reads, short reads and optical mapping.**

**Fig. 5: Summary of manual curation of putative false positives and false negatives when benchmarking short and long reads against the v0.6 benchmark set.**

**Fig. 6: Inverse cumulative distribution showing the number of discovery methods that supported each SV.**

**Fig. 7: Fraction of SVs for each number of discovery callsets that estimated exactly matching sequence changes.**

Data availability

Raw sequence data were previously published in Scientific Data (https://doi.org/10.1038/sdata.2016.25) and deposited in the National Center for Biotechnology Information (NCBI) Sequence Read Archive with the accession codes SRX847862 to SRX848317, SRX1388732 to SRX1388743, SRX852933, SRX5527202, SRX5327410 and SRX1033793 to SRX1033798. 10× Genomics Chromium bam files used are available at ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/10XGenomics_ChromiumGenome_LongRanger2.2_Supernova2.0.1_04122018/. The data used in this paper and other data sets for these genomes are available at ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/ and in the NCBI BioProject PRJNA200694.

The v0.6 SV benchmark set for HG002 on GRCh37 is available in dbVar accession nstd175 and at ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/. When benchmarking, compare only to variants in the Tier 1 VCF that lie inside the Tier 1 BED regions and have the FILTER value ‘PASS’.
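The restriction described above (FILTER of PASS, inside the Tier 1 BED) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' tooling: the function names `load_bed`, `in_regions` and `filter_benchmark` are hypothetical, the BED intervals are assumed not to overlap one another, and a variant is treated as "inside" when its POS falls in a region. A stricter rule would require the whole variant span to be contained.

```python
import bisect
from collections import defaultdict


def load_bed(lines):
    """Parse BED lines into per-chromosome sorted (start, end) lists (0-based, half-open)."""
    regions = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.rstrip("\n").split("\t")[:3]
        regions[chrom].append((int(start), int(end)))
    for intervals in regions.values():
        intervals.sort()
    return regions


def in_regions(regions, chrom, pos):
    """True if a 1-based VCF position lies in a BED interval (assumes no overlaps)."""
    intervals = regions.get(chrom, [])
    pos0 = pos - 1  # convert to the 0-based coordinate used by BED
    # Locate the last interval whose start is <= pos0.
    i = bisect.bisect_right(intervals, (pos0, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos0 < intervals[i][1]


def filter_benchmark(vcf_lines, regions):
    """Yield VCF data lines with FILTER == PASS whose POS lies inside the regions."""
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, filt = fields[0], int(fields[1]), fields[6]
        if filt == "PASS" and in_regions(regions, chrom, pos):
            yield line
```

In practice the inputs would be the decompressed Tier 1 VCF and BED files opened as text, for example `filter_benchmark(open(vcf_path), load_bed(open(bed_path)))`, where both paths are placeholders for local copies of the files.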

Input SV callsets, assemblies and other analyses for this trio are available at ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/.

Code availability

Scripts for integrating candidate structural variants to form the benchmark set in this paper are available in a GitHub repository at https://github.com/jzook/genome-data-integration/tree/master/StructuralVariants/NISTv0.6. This repository includes Jupyter notebooks for the comparisons to HGSVC, GRC, vg, paragraph and Bionano. Publicly available software used to generate input callsets is described in the Methods.

Change history

22 July 2020
An amendment to this paper has been published and can be accessed via a link at the top of the paper.

Acknowledgements

We thank many GIAB Consortium Analysis Team members for helpful discussions about the design of this benchmark. We thank J. Monlong and G. Hickey for sharing genotypes for HG002 from vg and paragraph. We thank T. Hefferon at NIH/NCBI for assistance with the dbVar submission. Certain commercial equipment, instruments or materials are identified to specify adequately experimental conditions or reported results. Such identification does not imply recommendation or endorsement by the National Institute of Standards and Technology, nor does it imply that the equipment, instruments or materials identified are necessarily the best available for the purpose. C.X. and S.S. were supported by the Intramural Research Program of the National Library of Medicine, National Institutes of Health. N.F.H., J.C.M., S.K. and A.M.P. were supported by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health. J.M.Z. and N.D.O. were supported by the National Institute of Standards and Technology and an interagency agreement with the Food and Drug Administration. C.E.M. acknowledges the XSEDE Supercomputing Resources, STARR I13-0052 and NIH R01AI151059.

Author information

Authors and Affiliations

Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
Justin M. Zook, Nathan D. Olson & Lesley Chapman
National Human Genome Research Institute, National Institutes of Health, Rockville, MD, USA
Nancy F. Hansen, James C. Mullikin, Sergey Koren & Adam M. Phillippy
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
Chunlin Xiao & Stephen Sherry
Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA, USA
Paul C. Boutros
Roche Sequencing Solutions, Belmont, CA, USA
Sayed Mohammad E. Sahraeian
Ontario Institute for Cancer Research, Toronto, Ontario, Canada
Vincent Huang
Charles-Bruneau Cancer Centre, Division of Hematology-Oncology, CHU Sainte-Justine, Montreal, Quebec, Canada
Alexandre Rouette
Molecular Biology Institute, University of California, Los Angeles, Los Angeles, CA, USA
Noah Alexander
Department of Physiology and Biophysics, Institute for Computational Biomedicine, Weill Cornell Medicine, New York, NY, USA
Christopher E. Mason, Iman Hajirasouliha & Camir Ricketts
The HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, Weill Cornell Medicine, New York, NY, USA
Christopher E. Mason
The WorldQuant Initiative for Quantitative Prediction, Weill Cornell Medicine, New York, NY, USA
Christopher E. Mason
The Feil Family Brain and Mind Research Institute, Weill Cornell Medicine, New York, NY, USA
Christopher E. Mason
Bionano Genomics, Inc., San Diego, CA, USA
Joyce Lee
Davies Research Centre, School of Animal and Veterinary Sciences, University of Adelaide, Roseworthy, SA, Australia
Rick Tearle
10× Genomics, Pleasanton, CA, USA
Ian T. Fiddes & Alvaro Martinez Barrio
Broad Institute of Harvard and MIT, Cambridge, MA, USA
Jeremiah Wala
Google, Mountain View, CA, USA
Andrew Carroll
Department of Computer Science, Roy G. Perry College of Engineering, Prairie View A&M University, Prairie View, TX, USA
Noushin Ghaffari
Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
Oscar L. Rodriguez & Ali Bashir
BC Cancer Genome Sciences Centre, Vancouver, British Columbia, Canada
Shaun Jackman
Biomedical Genetics, Department of Medicine, Boston University Medical School, Boston, MA, USA
John J. Farrell
Pacific Biosciences, Menlo Park, CA, USA
Aaron M. Wenger
Department of Computer Engineering, Bilkent University, Ankara, Turkey
Can Alkan
Department of Computer Engineering, Konya Food and Agriculture University, Konya, Turkey
Arda Soylev
Departments of Computer Science and Biology, Johns Hopkins University, Baltimore, MD, USA
Michael C. Schatz
Department of Genetics, Harvard Medical School, Boston, MA, USA
Shilpa Garg & George Church
Heinrich Heine University, Medical Faculty, Düsseldorf, Germany
Tobias Marschall
Department of Bioinformatics and Computational Biology, MD Anderson Cancer Center, Houston, TX, USA
Ken Chen
Department of Computer Science, Rice University, Houston, TX, USA
Xian Fan
Bioinformatics R&D, Spiral Genetics, Seattle, WA, USA
Adam C. English
Rutgers Cancer Institute of New Jersey, New Brunswick, NJ, USA
Jeffrey A. Rosenfeld
Department of Pathology, Robert Wood Johnson Medical School, New Brunswick, NJ, USA
Jeffrey A. Rosenfeld
Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, MI, USA
Weichen Zhou & Ryan E. Mills
Nabsys 2.0, LLC, Providence, RI, USA
Jay M. Sage, Jennifer R. Davis, Michael D. Kaiser, John S. Oliver & Anthony P. Catalano
Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
Mark J. P. Chaisson
Joint Initiative for Metrology in Biology, SLAC National Accelerator Lab, Stanford University, Stanford, CA, USA
Noah Spies & Marc Salit
Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
Fritz J. Sedlazeck

Contributions

J.M.Z. contributed project design, manuscript writing, generating SV input callsets and integrating SV calls. N.D.O. contributed SV integration and figures. L.M.C. contributed benchmark evaluation. N.F.H. contributed SV callsets, benchmark evaluation, SV integration and manuscript editing. J.C.M. contributed SV callsets and SV integration. C.X. contributed data management, SV callsets, benchmark evaluation and manuscript editing. S.S. contributed data management and SV callsets. S.K. contributed de novo assemblies. A.M.P. contributed de novo assemblies. P.C.B. contributed manuscript writing, SV callsets and benchmark evaluation. S.M.E.S. contributed SV input callsets, benchmark evaluation and manuscript editing. V.H. contributed SV callsets and benchmark evaluation. A.R. contributed SV callsets and benchmark evaluation. N.A. contributed benchmark evaluation. C.E.M. contributed project design, manuscript editing and benchmark evaluation. I.H. contributed project design, manuscript editing and SV callsets. C.R. contributed SV callsets. J.L. contributed SV callsets and benchmark evaluation. R.T. contributed provision and interpretation of Complete Genomics data and formats. I.T.F. contributed SV callsets, benchmark evaluation and de novo assemblies. A.M.B. contributed SV callsets, benchmark evaluation and de novo assemblies. J.W. contributed SV callsets. A.C. contributed SV callsets and benchmark evaluation. N.G. contributed genome assembly of the Ashkenazi trio, DISCOVER de novo and manuscript editing. O.L.R. contributed SV callsets and de novo assemblies. A.B. contributed SV callsets and de novo assemblies. S.J. contributed de novo assemblies. J.J.F. contributed SV callsets. A.M.W. contributed SV callsets and benchmark evaluation. C.A. contributed SV callsets. A.S. contributed SV callsets. M.C.S. contributed project design and manuscript editing. S.G. contributed integrative phasing short variant calls. G.C. contributed integrative phasing short variant calls. T.M. contributed haplotype phasing. K.C. contributed SV callsets. X.F. contributed SV callsets. A.C.E. contributed SV callsets, benchmark evaluations and SV integration. J.A.R. contributed SV callsets and project design. W.Z. contributed SV callsets. R.E.M. contributed SV callsets. J.M.S. contributed data collection, SV callsets and benchmark evaluation. J.R.D. contributed data collection, SV callsets and benchmark evaluation. M.D.K. contributed SV callsets, benchmark evaluation and SV-Verify development. J.S.O. contributed SV callsets and benchmark evaluation. A.P.C. contributed data collection. N.S. contributed SV integration (svviz2 development). M.J.P.C. contributed SV callsets. F.J.S. contributed SV callsets, manuscript editing and SV integration. M.S. contributed project design and manuscript writing.

Corresponding author

Correspondence to Justin M. Zook.

Ethics declarations

Competing interests

A.M.W. is an employee and shareholder of Pacific Biosciences. A.M.B. and I.T.F. are employees and shareholders of 10× Genomics. G.M.C. is the founder and holds leadership positions of many companies described at http://arep.med.harvard.edu/gmc/tech.html. F.J.S. has received sponsored travel from Oxford Nanopore and Pacific Biosciences and received a 2018 sequencing grant from Pacific Biosciences. J.L. is an employee and shareholder of Bionano Genomics. A.C. is an employee of Google and is a former employee of DNAnexus. J.M.S., J.R.D., M.D.K., J.S.O. and A.P.C. are employees of Nabsys 2.0. A.C.E. is an employee and shareholder of Spiral Genetics. S.M.E.S. is an employee of Roche.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Number of long reads supporting the SV allele vs. the reference allele in the benchmark set.

Variants are colored by heterozygous (blue) and homozygous (dark orange) genotype, and are stratified into deletions and insertions, and into SVs overlapping and not overlapping tandem repeats longer than 100 bp in the reference.

Extended Data Fig. 2 Mendelian contingency table for sites with consensus genotypes from svviz in the son, father, and mother.

SVs in boxes highlighted in red violate the expected Mendelian inheritance pattern. Variants on chromosomes X and Y are excluded.
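As a hedged illustration of what a Mendelian violation means for a trio at an autosomal, biallelic site, the check below encodes each genotype as the number of SV alleles carried (0, 1 or 2) and asks whether the son's genotype can be formed from one allele transmitted by each parent. The function name `is_mendelian_consistent` is hypothetical, and the sketch ignores de novo events and genotyping error; it is not the procedure used to build the figure.

```python
from itertools import product


def is_mendelian_consistent(child: int, father: int, mother: int) -> bool:
    """Genotypes are alt-allele counts (0, 1 or 2) at an autosomal biallelic site."""

    def gametes(genotype: int) -> set[int]:
        # Alleles a parent with this genotype can transmit (0 = reference, 1 = SV).
        return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]

    possible_children = {f + m for f, m in product(gametes(father), gametes(mother))}
    return child in possible_children


# A heterozygous son with two homozygous-reference parents violates inheritance.
print(is_mendelian_consistent(child=1, father=0, mother=0))  # False
# A heterozygous son with one heterozygous parent is consistent.
print(is_mendelian_consistent(child=1, father=1, mother=0))  # True
```

Sites on chromosomes X and Y need different rules for a male child, which is consistent with the caption's note that those chromosomes are excluded.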

Extended Data Fig. 3 Comparison of false negative rates for the union of all long read-based SV discovery methods, the union of all short read-based discovery methods, and paired-end and mate-pair short read genotyping of known SVs.

Variants are stratified into deletions (top) and insertions (bottom), and into SVs overlapping (right) and not overlapping (left) tandem repeats longer than 100 bp in the reference. SVs are also stratified by size into 50 bp to 99 bp, 100 bp to 299 bp, 300 bp to 999 bp, and ≥1000 bp.
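The size strata named in this caption can be expressed as a small binning helper. This is only a sketch: the function name `size_bin` and the label strings are hypothetical, and the absolute value is taken on the assumption that deletion lengths may be stored as negative numbers, as is common for the VCF `SVLEN` field.

```python
def size_bin(svlen: int) -> str:
    """Assign an SV length in bp to one of the four size strata from the caption."""
    size = abs(svlen)  # deletions are often recorded with a negative length
    if size < 50:
        raise ValueError("variants shorter than 50 bp are outside the SV size range")
    if size < 100:
        return "50-99 bp"
    if size < 300:
        return "100-299 bp"
    if size < 1000:
        return "300-999 bp"
    return ">=1000 bp"


print(size_bin(-75))   # 50-99 bp
print(size_bin(1000))  # >=1000 bp
```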

Extended Data Fig. 4 Known limitations of the v0.6 benchmark.

It is important to understand the limitations of any benchmark, such as the limitations below for v0.6, when interpreting the resulting performance metrics.

Supplementary information

Supplementary Information

Supplementary Notes 1–4.

Reporting Summary

Supplementary Table 1

Variant callsets used to develop the benchmark (‘discovery’) and to evaluate the benchmark’s reliability in identifying false positives and false negatives (‘evaluation’).

Supplementary Table 2

Detailed results from manual curation of (1) putative false positives and false negatives from evaluation of the benchmark set and (2) deletions not in v0.6 that were in the population-based gnomAD-SV v2.1 callset, limited to deletions for which fewer than 5% of individuals of European ancestry were homozygous reference and at least 1,000 Europeans had the variant.

About this article

Cite this article

Zook, J.M., Hansen, N.F., Olson, N.D. et al. A robust benchmark for detection of germline large deletions and insertions. Nat Biotechnol 38, 1347–1355 (2020). https://doi.org/10.1038/s41587-020-0538-8

Received: 16 July 2019
Accepted: 28 April 2020
Published: 15 June 2020
Issue Date: November 2020
DOI: https://doi.org/10.1038/s41587-020-0538-8
