New technologies and analysis methods are enabling genomic structural variants (SVs) to be detected with ever-increasing accuracy, resolution and comprehensiveness. To help translate these methods to routine research and clinical practice, we developed a sequence-resolved benchmark set for identification of both false-negative and false-positive germline large insertions and deletions. To create this benchmark for a broadly consented son in a Personal Genome Project trio with broadly available cells and DNA, the Genome in a Bottle Consortium integrated 19 sequence-resolved variant calling methods from diverse technologies. The final benchmark set contains 12,745 isolated, sequence-resolved insertion (7,281) and deletion (5,464) calls ≥50 base pairs (bp). The Tier 1 benchmark regions, for which any extra calls are putative false positives, cover 2.51 Gbp and 5,262 insertions and 4,095 deletions supported by ≥1 diploid assembly. We demonstrate that the benchmark set reliably identifies false negatives and false positives in high-quality SV callsets from short-, linked- and long-read sequencing and optical mapping.
Subscribe to Journal
Get full journal access for 1 year
only $20.83 per issue
All prices are NET prices.
VAT will be added later in the checkout.
Rent or Buy article
Get time limited or full article access on ReadCube.
All prices are NET prices.
Raw sequence data were previously published in Scientific Data (https://doi.org/10.1038/sdata.2016.25) and deposited in the National Center for Biotechnology Information (NCBI) Sequence Read Archive with the accession codes SRX847862 to SRX848317, SRX1388732 to SRX1388743, SRX852933, SRX5527202, SRX5327410 and SRX1033793 to SRX1033798. 10× Genomics Chromium bam files used are available at ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/10XGenomics_ChromiumGenome_LongRanger2.2_Supernova2.0.1_04122018/. The data used in this paper and other data sets for these genomes are available at ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/ and in the NCBI BioProject PRJNA200694.
The v0.6 SV benchmark set (only compare to variants in the Tier 1 vcf inside the Tier 1 bed with the FILTER ‘PASS’) for HG002 on GRCh37 is available in dbVar accession nstd175 and at ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/.
Input SV callsets, assemblies and other analyses for this trio are available at ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/.
Scripts for integrating candidate structural variants to form the benchmark set in this paper are available in a GitHub repository at https://github.com/jzook/genome-data-integration/tree/master/StructuralVariants/NISTv0.6. This repository includes Jupyter notebooks for the comparisons to HGSVC, GRC, vg, paragraph and Bionano. Publicly available software used to generate input callsets is described in the Methods.
Sebat, J. et al. Strong association of de novo copy number mutations with autism. Science 316, 445–449 (2007).
Merker, J. D. et al. Long-read genome sequencing identifies causal structural variation in a Mendelian disease. Genet. Med. 20, 159–163 (2018).
Mantere, T., Kersten, S. & Hoischen, A. Long-read sequencing emerging in medical genetics. Front. Genet. 10, 426 (2019).
Roses, A. D. et al. Structural variants can be more informative for disease diagnostics, prognostics and translation than current SNP mapping and exon sequencing. Expert Opin. Drug Metab. Toxicol. 12, 135–147 (2016).
Chiang, C. et al. The impact of structural variation on human gene expression. Nat. Genet. 49, 692–699 (2017).
Chaisson, M. J. P. et al. Multi-platform discovery of haplotype-resolved structural variation in human genomes. Nat. Commun. 10, 1784 (2019).
Ball, M. P. et al. A public resource facilitating clinical use of genomes. Proc. Natl Acad. Sci. USA 109, 11920–11927 (2012).
Zook, J. M. et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci. Data 3, 160025 (2016).
Zook, J. M. et al. An open resource for accurately benchmarking small variant and reference calls. Nat. Biotechnol. 37, 561–566 (2019).
Sebat, J. et al. Large-scale copy number polymorphism in the human genome. Science 305, 525–528 (2004).
Spies, N. et al. Genome-wide reconstruction of complex structural variants using read clouds. Nat. Methods 14, 915–920 (2017).
Marks, P. et al. Resolving the full spectrum of human genome variation using Linked-Reads. Genome Res. 29, 635–645 (2019).
Karaoglanoglu, F. et al. VALOR2: characterization of large-scale structural variants using linked-reads. Genome Biol. 21, 72 (2020).
Weisenfeld, N. I., Kumar, V., Shah, P., Church, D. M. & Jaffe, D. B. Direct determination of diploid genome sequences. Genome Res. 27, 757–767 (2017).
Sedlazeck, F. J. et al. Accurate detection of complex structural variations using single-molecule sequencing. Nat. Methods 15, 461–468 (2018).
Cretu Stancu, M. et al. Mapping and phasing of structural variation in patient genomes using nanopore sequencing. Nat. Commun. 8, 1326 (2017).
Chaisson, M. J. P. et al. Resolving the complexity of the human genome using single-molecule sequencing. Nature 517, 608–611 (2014).
Chin, C.-S. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods 13, 1050–1054 (2016).
Koren, S. et al. De novo assembly of haplotype-resolved genomes with trio binning. Nat. Biotechnol. https://doi.org/10.1038/nbt.4277 (2018).
Kaiser, M. D. et al. Automated structural variant verification in human genomes using single-molecule electronic DNA mapping. Preprint at https://www.biorxiv.org/content/10.1101/140699v1.full (2017).
Lam, E. T. et al. Genome mapping on nanochannel arrays for structural variation analysis and sequence assembly. Nat. Biotechnol. 30, 771–776 (2012).
Barseghyan, H. et al. Next-generation mapping: a novel approach for detection of pathogenic structural variants with a potential utility in clinical diagnosis. Genome Med. 9, 90 (2017).
Zook, J. M. et al. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat. Biotechnol. 32, 246–251 (2014).
Krusche, P. et al. Best practices for benchmarking germline small-variant calls in human genomes. Nat. Biotechnol. 37, 555–560 (2019).
Cleveland, M. H., Zook, J. M., Salit, M. & Vallone, P. M. Determining performance metrics for targeted next-generation sequencing panels using reference materials. J. Mol. Diagn. 20, 583–590 (2018).
Wenger, A. M. et al. Highly-accurate long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 37, 1155-1162 (2019).
Sudmant, P. H. et al. An integrated map of structural variation in 2,504 human genomes. Nature 526, 75–81 (2015).
Conrad, D. F. et al. Origins and functional impact of copy number variation in the human genome. Nature 464, 704–712 (2010).
Parikh, H. et al. svclassify: a method to establish benchmark structural variant calls. BMC Genomics 17, 64 (2016).
Pang, A. W. et al. Towards a comprehensive structural variation map of an individual human genome. Genome Biol. 11, R52 (2010).
Mu, J. C. et al. Leveraging long read sequencing from a single individual to provide a comprehensive resource for benchmarking variant calling methods. Sci. Rep. 5, 14493 (2015).
Huddleston, J. et al. Discovery and genotyping of structural variation from long-read haploid genome sequence data. Genome Res. 27, 677–685 (2017).
English, A. C. et al. Assessing structural variation in a personal genome-towards a human reference diploid genome. BMC Genomics 16, 286 (2015).
Audano, P. A. et al. Characterizing the major structural variant alleles of the human genome. Cell 176, 663–675 (2019).
Wala, J. A. et al. SvABA: genome-wide detection of structural variants and indels by local assembly. Genome Res. 28, 581–591 (2018).
Cameron, D. L. et al. GRIDSS: sensitive and specific genomic rearrangement detection using positional de Bruijn graph assembly. Genome Res. 27, 2050–2060 (2017).
Nattestad, M. et al. Complex rearrangements and oncogene amplifications revealed by long-read DNA and RNA sequencing of a breast cancer cell line. Genome Res. 28, 1126–1135 (2018).
Lee, A. Y. et al. Combining accurate tumor genome simulation with crowdsourcing to benchmark somatic structural variant detection. Genome Biol. 19, 188 (2018).
Xia, L. C. et al. SVEngine: an efficient and versatile simulator of genome structural variations with features of cancer clonal evolution. Gigascience 7, https://doi.org/10.1093/gigascience/giy081 (2018).
Jeffares, D. C. et al. Transient structural variations have strong effects on quantitative traits and reproductive isolation in fission yeast. Nat. Commun. 8, 14061 (2017).
Spies, N., Zook, J. M., Salit, M. & Sidow, A. svviz: a read viewer for validating structural variants. Bioinformatics 31, 3994–3996 (2015).
Song, J. H. T., Lowe, C. B. & Kingsley, D. M. Characterization of a human-specific tandem repeat associated with bipolar disorder and Schizophrenia. Am. J. Hum. Genet. 103, 421–430 (2018).
Chapman, L. M. et al. SVCurator: a crowdsourcing app to visualize evidence of structural variants for the human genome. Preprint at https://www.biorxiv.org/content/10.1101/581264v1 (2019).
Collins, R. L. et al. An open resource of structural variation for medical and population genetics. Preprint at https://www.biorxiv.org/content/10.1101/578674v1 (2019).
Hickey, G. et al. Genotyping structural variants in pangenome graphs using the vg toolkit. Genome Biol. 21, 35 (2020).
Chen, S. et al. Paragraph: a graph-based structural variant genotyper for short-read sequence data. Genome Biol. 20, 291 (2019).
Falconer, E. et al. DNA template strand sequencing of single-cells maps genomic rearrangements at high resolution. Nat. Methods 9, 1107–1112 (2012).
Miga, K. H. et al. Telomere-to-telomere assembly of a complete human X chromosome. Preprint at https://www.biorxiv.org/content/10.1101/735928v3 (2019).
Jain, M. et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol. 36, 338–345 (2018).
We thank many GIAB Consortium Analysis Team members for helpful discussions about the design of this benchmark. We thank J. Monlong and G. Hickey for sharing genotypes for HG002 from vg and paragraph. We thank T. Hefferon at NIH/NCBI for assistance with the dbVar submission. Certain commercial equipment, instruments or materials are identified to specify adequately experimental conditions or reported results. Such identification does not imply recommendation or endorsement by the National Institute of Standards and Technology, nor does it imply that the equipment, instruments or materials identified are necessarily the best available for the purpose. C.X. and S.S. were supported by the Intramural Research Program of the National Library of Medicine, National Institutes of Health. N.F.H., J.C.M., S.K. and A.M.P. were supported by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health. J.M.Z. and N.D.O. were supported by the National Institute of Standards and Technology and an interagency agreement with the Food and Drug Administration. C.E.M. acknowledges the XSEDE Supercomputing Resources, STARR I13-0052 and NIH R01AI151059.
A.M.W. is an employee and shareholder of Pacific Biosciences. A.M.B. and I.T.F. are employees and shareholders of 10× Genomics. G.M.C. is the founder and holds leadership positions of many companies described at http://arep.med.harvard.edu/gmc/tech.html. F.J.S. has received sponsored travel from Oxford Nanopore and Pacific Biosciences and received a 2018 sequencing grant from Pacific Biosciences. J.L. is an employee and shareholder of Bionano Genomics. A.C. is an employee of Google and is a former employee of DNAnexus. J.M.S., J.R.D., M.D.K., J.S.O. and A.P.C. are employees of Nabsys 2.0. A.C.E. is an employee and shareholder of Spiral Genetics. S.M.E.S. is an employee of Roche.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended Data Fig. 1 Number of long reads supporting the SV allele vs. the reference allele in the benchmark set.
Variants are colored by heterozygous (blue) and homozygous (dark orange) genotype, and are stratified into deletions and insertions, and into SVs overlapping and not overlapping tandem repeats longer than 100 bp in the reference.
Extended Data Fig. 2 Mendelian contingency table for sites with consensus genotypes from svviz in the son, father, and mother.
SVs in boxes highlighted in red violate the expected Mendelian inheritance pattern. Variants on chromosomes X and Y are excluded.
Extended Data Fig. 3 Comparison of false negative rates for the union of all long read-based SV discovery methods, the union of all short read-based discovery methods, and paired-end and mate-pair short read genotyping of known SVs.
Variants are stratified into deletions (top) and insertions (bottom), and into SVs overlapping (right) and not overlapping (left) tandem repeats longer than 100 bp in the reference. SVs are also stratified by size into 50 bp to 99 bp, 100 bp to 299 bp, 300 bp to 999 bp, and ≥1000 bp.
It is important to understand the limitations of any benchmark, such as the limitations below for v0.6, when interpreting the resulting performance metrics.
Supplementary Notes 1–4.
Variant callsets used to develop the benchmark (‘discovery’) and to evaluate the benchmark’s reliability in identifying false positives and false negatives (‘evaluation’).
Detailed results from manual curation of putative false positives and false negatives from evaluation of benchmark set and of deletions not in v0.6 that were in the population-based gnomAD-SV v2.1 callset that were homozygous reference in less than 5% of individuals of European ancestry, and at least 1,000 Europeans had the variant.
About this article
Cite this article
Zook, J.M., Hansen, N.F., Olson, N.D. et al. A robust benchmark for detection of germline large deletions and insertions. Nat Biotechnol (2020). https://doi.org/10.1038/s41587-020-0538-8