  • Brief Communication

SVision: a deep learning approach to resolve complex structural variants

Abstract

Complex structural variants (CSVs) encompass multiple breakpoints and are often missed or misinterpreted. We developed SVision, a deep-learning-based multi-object-recognition framework, to automatically detect and characterize CSVs from long-read sequencing data. SVision outperforms current callers at identifying the internal structure of complex events and has revealed 80 high-quality CSVs with 25 distinct structures from an individual genome. SVision directly detects CSVs without matching known structures, allowing sensitive detection of both common and previously uncharacterized complex rearrangements.

Fig. 1: Workflow and evaluation of SVision’s detection of CSVs.
Fig. 2: Application of SVision on HG00733 HiFi data.

Data availability

HG002 ONT and HiFi data were downloaded from ftp://ftp.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/UCSC_Ultralong_OxfordNanopore_Promethion/ and https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/PacBio_CCS_15kb/, respectively. The NA12878 HiFi data was downloaded from http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/HGSVC2/release/v1.0/assemblies/20200628_HHU_assembly-results_CCS_v12/haploid_reads.

The HG00733 HiFi and ONT data were downloaded from http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/HGSVC2/working/20190925_PUR_PacBio_HiFi/ and http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/hgsv_sv_discovery/working/20181210_ONT_rebasecalled/, respectively. The HG00733 assembly was downloaded from http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/HGSVC2/working/20200417_Marschall-Eichler_NBT_hap-assm/.

The human reference genome hg19 was downloaded from ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz. The human reference genome GRCh38 was downloaded from http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/.

The HG00733 PAV callset was downloaded from http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/HGSVC2/working/20210806_PAV_VCF/. The merged PAV callset of 35 samples was downloaded from http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/HGSVC2/release/v2.0/integrated_callset/.

The RNA-seq data were downloaded from the Sequence Read Archive under project ID PRJNA720779.

All results generated by this study are available in the Supplementary Notes of the article.

Code availability

The SVision program (v1.3.6) and trained model are provided at GitHub (https://github.com/xjtu-omics/SVision), which is available under GNU General Public License v3.0. SVision is free for non-commercial use by academic, government and non-profit/not-for-profit institutions. Please contact the corresponding author for more information about commercial usage. A Code Ocean capsule of the package is provided (https://doi.org/10.24433/CO.8937098.v1).

Acknowledgements

We thank X. Zhao, P. Balachandran, A. Wenger, and other members of the Human Genome Structural Variation Consortium for helpful discussions on methods development and structural variants analysis. K. Y. and X. Y. are supported by National Science Foundation of China (32125009, 32070663 and 62172325), the Key Construction Program of the National ‘985’ Project, the World-Class Universities (Disciplines), the Fundamental Research Funds for the Central Universities, and the Characteristic Development Guidance Funds for the Central Universities. C. R. B., P. A. A., and J. I. F. are supported by the National Institutes of Health R35GM133600 through the NIGMS and pilot funding from the Jackson Laboratory Cancer Center (P30 CA034196). D. M. is supported by the National Science Foundation of China (61721002) and the Macao Science and Technology Development Fund under Grant (061/2020/A2).

Author information

Contributions

K. Y. designed and supervised research; J. L. and S. W. developed the algorithm and software; D. M. contributed to the assessment and analysis of the deep-learning model; W. K., T. M. and P. A. provided constructive suggestions for the algorithm; J. L. performed the algorithm benchmarking on real data and CSV analysis; S. W. performed algorithm benchmarking on the simulated data. P. A. A., J. I. F., and C. R. B. contributed to the analysis and experimental validation of complex structural variants; P. J. and X. Y. contributed to the sequencing data processing; J. L., W. K., P. A. A., C. R. B., and K. Y. wrote the paper with input from all other authors. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Kai Ye.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks Ryan Layer and the other, anonymous, reviewer for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling editor: Lin Tang, in collaboration with the Nature Methods team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Diagram of example simple and complex structural variants and their aberrant alignment patterns.

a, The diagram and alignment pattern of a simple deletion. b, The diagram and alignment pattern of a deletion associated with an inversion, where the inverted segment lies in the 3’ flanking region of the deletion.

Extended Data Fig. 2 Performance evaluation of callers with HG002 truthset at different coverages and platforms.

a, F-scores of callers on different platforms, evaluated with Truvari. The boxplots for HiFi data show the F-scores measured for each caller at 5X, 10X and 28X coverage. Each box contains three values (n = 3): SVision (0.83, 0.89 and 0.90), SVIM (0.83, 0.89 and 0.89), pbsv (0.65, 0.79 and 0.82), cuteSV (0.83, 0.89 and 0.89) and Sniffles (0.72, 0.79 and 0.85). The boxplots for ONT data show the F-scores measured for each caller at 5X, 10X and 47X coverage. Each box again contains three values (n = 3): SVision (0.76, 0.84 and 0.92), SVIM (0.74, 0.82 and 0.89), pbsv (0.67, 0.78 and 0.84), cuteSV (0.77, 0.85 and 0.91) and Sniffles (0.74, 0.82 and 0.90). Each boxplot shows the median (Q2, 50th percentile), first quartile (Q1, 25th percentile) and third quartile (Q3, 75th percentile). The bounds of the box span the interquartile range (IQR), from Q1 to Q3. The minima and maxima are defined as Q1 - 1.5*IQR and Q3 + 1.5*IQR, respectively. The whiskers cover values between the minima and Q1 and between Q3 and the maxima. b, The precision (x-axis), recall (y-axis) and F-score (F, dotted line) measurements of detecting SVs from HiFi data at different coverages. c, The precision and recall measurements of detecting SVs from ONT data at different coverages. Note that this evaluation ignored SV genotypes and was performed at the event level only.
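The F-score and boxplot definitions in this legend can be checked with a short calculation. The following is a minimal sketch, assuming the F-score is the usual harmonic mean of precision and recall and assuming an inclusive quartile interpolation; the legend does not state which quartile method was used, so the Q1 and Q3 values computed here may differ from those plotted. The function names are illustrative only.

```python
from statistics import quantiles


def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1); 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def boxplot_summary(values):
    """Quartiles, IQR and the limits Q1 - 1.5*IQR and Q3 + 1.5*IQR.

    Assumption: quartiles use the 'inclusive' interpolation of
    statistics.quantiles; the legend does not name a method.
    """
    q1, q2, q3 = quantiles(sorted(values), n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "Q1": q1, "Q2": q2, "Q3": q3, "IQR": iqr,
        "lower_limit": q1 - 1.5 * iqr,
        "upper_limit": q3 + 1.5 * iqr,
    }


# SVision F-scores on HiFi data at 5X, 10X and 28X, taken from the legend.
svision_hifi = [0.83, 0.89, 0.90]
summary = boxplot_summary(svision_hifi)
print(summary)
```

With only three values per box, the median is simply the middle value, and the quartiles depend strongly on the interpolation rule, which is why the method is flagged as an assumption.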

Extended Data Fig. 3 Simulated complex structural variant types and performance of detecting complex structural variant subcomponents.

a, The diagrams of simulated complex structural variants (CSVs). Each type has a unique ID and a type definition. b, The size distribution of simulated CSVs smaller than 1 kbp (1,200 events). c, The size distribution of simulated CSVs larger than 1 kbp (1,800 events). d, The region-match recall rates of model-based callers for detecting subcomponents of CSVs (that is, DUP, duplication; DEL, deletion; INV, inversion).

Extended Data Fig. 4 The diagrams and alignment patterns of two unclassified complex structural variants.

a, SVision correctly detected a deleted sequence replaced with a dispersed duplication and an inverted duplication. b, SVision characterized a complex insertion consisting of two dispersed duplications and one inverted duplication. Both (a) and (b) are labeled as unclassified (NA) in the 1KGP call set. The top panels of (a) and (b) show the discordant alignments derived from short-read sequencing (that is, one end unmapped and discordant alignment). The bottom panels of (a) and (b) show the abnormal alignments from long-read sequencing.

Extended Data Fig. 5 One example of simple deletion misinterpreted as complex event by short-read data due to local repeats.

a, Two dot plots created with Gepard illustrate the local repeats at the variant locus on the reference genome (left) and the breakpoints when comparing a HiFi read (READ, y-axis) with the reference genome (REF, x-axis). b, The IGV view at this locus, with reads grouped by pair orientation and colored by insert size and pair orientation.
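For readers unfamiliar with dot plots, the idea can be illustrated with exact k-mer matches. This is a minimal sketch only: it is neither Gepard's algorithm nor SVision's image encoding, and the function name, toy sequences and word size `k` are hypothetical choices. A collinear match between a read and the reference lies along a diagonal, a deletion in the read shifts the diagonal along the reference axis, and repeated k-mers produce additional off-diagonal dots.

```python
from collections import defaultdict


def kmer_dotplot(ref: str, read: str, k: int = 4):
    """Return sorted (ref_pos, read_pos) pairs where the k-mers are identical."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    dots = []
    for j in range(len(read) - k + 1):
        for i in index.get(read[j:j + k], []):
            dots.append((i, j))
    return sorted(dots)


# Toy example: the read lacks the reference segment "GGGGCCCC" (a deletion).
ref = "ACGTTGCAGGGGCCCCTTAGCATC"
read = "ACGTTGCATTAGCATC"
dots = kmer_dotplot(ref, read, k=4)
for ref_pos, read_pos in dots:
    print(ref_pos, read_pos)
```

Plotting `ref_pos` on the x-axis and `read_pos` on the y-axis gives the same orientation as the REF and READ axes described in the legend.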

Extended Data Fig. 6 Examples of reported complex structural variant types identified by SVision.

a, One of the 12 inverted duplication events detected by SVision and classified as CSV graph structure ‘12’. b, One of the eight deletion associated with inversion events detected by SVision and classified as CSV graph structure ‘15’. c, One of the five multiple-deletion with spacer events detected by SVision and classified as CSV graph structure ‘27’. d, One of the ten deletion with inverted duplication events detected by SVision and classified as CSV graph structure ‘23’. e, One of the five deletion with duplication events detected by SVision and classified as CSV graph structure ‘28’. In panels (a) to (e), the dot plots in the left column are SVision one-variant images created with the variant feature sequence (VAR, y-axis) and the reference sequence (REF, x-axis) at the variant loci, while the dot plots in the right column are created with variant-spanning HiFi assemblies (CONTIG, y-axis) and the reference sequence (REF, x-axis) at the variant loci.

Extended Data Fig. 7 The HiFi assembly reconstruction of the expanded allele and complex structural variant allele affecting CNTN5.

The grey region indicates the repeat expansion. The dark red region indicates exon 4 of CNTN5, while the light red region is the 5’ flanking region of the exon.

Extended Data Fig. 8 The IGV screenshot of duplicated CNTN5 exon signature observed in RNA-Seq data.

a, RNA-Seq data from the primary visual cortex of a female with Alzheimer’s disease. b, RNA-Seq data from the precuneus of a male control. In (a) and (b), the green bars indicated by red arrows are duplication-like read-pair signatures; there are four supporting discordant read pairs in (a) and two in (b). Moreover, a read-depth change (fitted by the purple line) on the exon is observed in both (a) and (b). The RNA-Seq data for (a) and (b) were obtained from the Sequence Read Archive (SRA) under accession numbers SRR14194220 and SRR14194206, respectively.

Extended Data Fig. 9 The ancestral state of one genome segment revealed by a complex structural variant.

a, The structure and breakpoint junction sequence of the variant derived from the HiFi assembly. b, BLASTN results of the inserted sequence mapped to primate genomes; the top hits include Pan troglodytes and gorilla.

Extended Data Fig. 10 Examples of graph and symmetric graphs as well as two special complex events identified by SVision.

a, An example of a complex structural variant (CSV) graph whose path is interpreted as S1+S3-S3-S4+. b, Examples of isomorphic graphs representing two different CSV events. c, An SVision-detected CSV classified as a local target site duplication. d, An SVision-detected CSV classified as a tandem duplication. Although events with the structures depicted in (c) and (d) were computed as complex events, they are considered simple events from a biological perspective.
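The path notation in (a) can be read as an ordered list of segment nodes, each followed by a strand sign. The sketch below parses such a string under that reading; it assumes every node is written as a label such as `S3` followed by `+` or `-`, which is an interpretation of the legend rather than SVision's documented format, and `parse_graph_path` is a hypothetical helper.

```python
import re

# Assumed node syntax: a segment label (S + digits) followed by a strand sign.
NODE = re.compile(r"(S\d+)\s*([+-])")


def parse_graph_path(path: str):
    """Split a path string into (segment, strand) tuples, ignoring spaces."""
    return NODE.findall(path)


path = parse_graph_path("S1+S3-S3-S4+")
print(path)

# Under this reading, a segment that appears more than once or on the
# reverse strand is a candidate duplicated or inverted subcomponent.
segments = [seg for seg, _ in path]
repeated = {seg for seg in segments if segments.count(seg) > 1}
inverted = {seg for seg, strand in path if strand == "-"}
print(repeated, inverted)
```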

Supplementary information

Supplementary Information

Supplementary Notes and Supplementary Files 1–5

Reporting Summary

Peer Review File

Supplementary Table

Supplementary Tables 1–13

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Lin, J., Wang, S., Audano, P.A. et al. SVision: a deep learning approach to resolve complex structural variants. Nat Methods 19, 1230–1233 (2022). https://doi.org/10.1038/s41592-022-01609-w

  • DOI: https://doi.org/10.1038/s41592-022-01609-w
