The importance of structural variant detection in cancer

Genomic aberrations acquired in cancer genomes encompass a broad spectrum of types and sizes. These range from single nucleotide variants (SNVs) to larger structural variants (SVs) that impact genome organization (Fig. 1, Table 1)1,2. SVs are a major contributor to genomic variation: they affect more base pairs in the genome than SNVs3 and can have serious phenotypic impact4,5. Some SVs are known to drive carcinogenesis, and SVs resulting in gene fusions were the first recurrent mutations observed in many pediatric cancers6,7. With at least 30% of cancer genomes affected by a pathogenic SV, detection of SVs is essential for both diagnosis and treatment stratification6,7,8,9,10,11. In addition, discovering new oncogenic SV driver events is beneficial for understanding cancer etiology. However, research into the role of SVs in cancer has been limited by difficulties in their detection, which have partly resulted from co-opting sequencing technologies designed for SNV detection.

Fig. 1: Major SV types and their characteristic read-alignment patterns.

Alignment of paired-end sequencing reads to a reference genome is used to infer sites of discontinuity or breakpoints. Structural variants (SVs) are generally defined as larger than 50 base pairs and further classified in five major SV types: deletions, insertions of non-reference sequence or mobile elements, duplications, inversions and translocations. Clusters of breakpoints in a genomic region which cannot be classified are considered “complex SVs” and likely result from either progressive rearrangements or a major genomic disturbance. SVs (red blocks) are characterized by patterns in breakpoints and reads aligned to flanking reference sequences (blue blocks). The reads directly below the sample DNA strand represent the distance and orientation at which they are generated during sequencing. If the reads align differently than expected to the reference strand this is indicative of an SV. Changes in read depth (RD) or coverage indicate mostly larger duplications or deletions and are useful for detecting copy number variants (CNVs). Discordant pairs (DP) align to the reference at a different relative distance or orientation than expected. DPs are best suited for detecting large SVs such as inter-chromosomal translocations or inversions. Split reads (SR) span breakpoints and can only be partially aligned. SR can detect small variants with base-pair resolution, especially those smaller than the size of the read.

Table 1 Glossary of key terms.

Advances in sequencing technologies have increased the number of SVs identified per genome from ~2.1–2.5k in the 1000 Genomes Project to more than 27k in recent multi-platform sequencing efforts3,4,12. Specifically for the cancer genomics community, recent contributions of the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium have provided an extensive resource of paired tumor-normal genomes13. The insights obtained from multi-platform analyses also highlight current SV blind spots in cancer variant databases like COSMIC. Despite technological innovations, confident SV detection in cancer genomes remains challenging due to biological factors including contamination from healthy tissue, intratumor heterogeneity and polyploidy. Identification of variants acquired in tumor cells requires discerning tumor-specific somatic SVs (TSSVs) from variants in the germline and mosaic variants present in unaffected cells14. This is often done by differential analysis between paired tumor-normal samples15. The classification of SVs as tumor-specific or normal is confounded by inconsistencies in detected breakpoints and derived variant types, as well as the biological complexity of some rearrangements.

Confident SV detection and subsequent classification of variants as either germline, tumor-specific or mosaic variation in healthy tissue is not only important for diagnostics and cancer etiology but also for research into cancer predisposition and genetic interactions. In addition, the genetic context of somatic variants and interplay with germline variants may influence their tumorigenic potential16. Here, we focus on the detection of TSSVs from paired tumor-normal WGS data. First, we explore current approaches for SV detection and their integration, whilst accounting for challenges specific to cancer samples. Second, we address different approaches aimed at distinguishing TSSVs from normal SVs. Third, we highlight the impact that long-read sequencing can have on somatic SV detection. Last, we explore how orthogonal sequencing technologies can be combined to improve TSSV detection.

Detection of somatic SVs in short-read WGS data

SVs can be detected using short-read sequencing data based on patterns in aligned reads (Fig. 1). These reads are sequenced as paired ends of 150–250 bp each. Changes in read-depth (RD) are used to derive copy-number variants (CNVs). Discordant read-pairs (DP) that align with an abnormal distance and/or orientation to the reference genome are suited for detecting large SVs. Split or soft-clipped reads (SR) are partially mapped reads and can indicate breakpoints with base-pair resolution17. Both the alignment method and the reference genome used influence the performance of SV detection algorithms17,18. BWA-MEM is predominantly used for alignment prior to SV detection, as it provides secondary alignments to reads mapping to multiple locations rather than placing the reads randomly19,20. However, alignment uncertainty is inherent to short-read sequencing data. In parallel, the reference genome continues to evolve, resulting in improved alignments and fewer false-positive variants in studies which adopted GRCh38 (hg38) compared to GRCh37 (hg19)8,21,22,23.
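The discordant-pair and split-read signals described above can be illustrated with a minimal Python sketch. It is not the logic of any published caller: the `ReadPair` record, the label names, and the default thresholds (mean insert size 400 bp, standard deviation 50 bp, 20 bp minimum soft clip) are assumptions chosen for illustration, and read depth is left out because it is computed over windows rather than per pair.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """Hypothetical, simplified view of one aligned read pair."""
    chrom1: str
    pos1: int            # leftmost aligned position of mate 1
    strand1: str         # "+" or "-"
    chrom2: str
    pos2: int            # leftmost aligned position of mate 2
    strand2: str
    longest_soft_clip: int = 0  # longest soft-clipped stretch on either mate


def classify_pair(pair, mean_insert=400, sd_insert=50, n_sd=3, min_clip=20):
    """Return the set of SV evidence labels carried by one read pair."""
    evidence = set()
    # Split-read evidence: a mate that could only be partially aligned.
    if pair.longest_soft_clip >= min_clip:
        evidence.add("SR")
    # Mates on different chromosomes are discordant by definition.
    if pair.chrom1 != pair.chrom2:
        evidence.add("DP_interchromosomal")
        return evidence
    left, right = sorted([(pair.pos1, pair.strand1), (pair.pos2, pair.strand2)])
    # A concordant pair points inward: left mate forward, right mate reverse.
    if (left[1], right[1]) != ("+", "-"):
        evidence.add("DP_orientation")
    # Distance between leftmost positions is used as a rough insert-size proxy.
    if abs((right[0] - left[0]) - mean_insert) > n_sd * sd_insert:
        evidence.add("DP_distance")
    return evidence


concordant = ReadPair("chr1", 1000, "+", "chr1", 1400, "-")
too_far = ReadPair("chr1", 1000, "+", "chr1", 9000, "-")
print(classify_pair(concordant))  # set()
print(classify_pair(too_far))     # {'DP_distance'}
```

In practice the insert-size distribution is estimated from the data and evidence is aggregated over many pairs before a breakpoint is called; a single discordant pair is not an SV call.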

Combinatorial algorithms integrate multiple read-alignment patterns

The latest generation of SV detection algorithms that combine multiple read-alignment patterns can detect SVs across a broad range of types and sizes. At present, many different strategies and methods exist (Table 2). How these combinatorial algorithms integrate read-alignment patterns influences their ability to detect specific variant classes (Fig. 2A)24,25. As a result, no single algorithm performs best across the full spectrum of SVs, implying that integration of multiple algorithms is beneficial25. Although most studies comparing SV algorithms focus on germline SVs, these findings were recently also confirmed for somatic SV detection26. The methodology used by DELLY, LUMPY, Manta, SvABA, and GRIDSS for detecting SVs (Box 1) achieves high performance in detecting both germline and somatic SVs25,26.

Table 2 SV detection algorithms.
Fig. 2: Data integration to improve tumor-specific SV detection.

a Alignment of sequencing data against a reference is used to infer SVs by detecting aberrant patterns of read-alignment: discordant pairs (DP), split reads (SR), read depth (RD) and (local) assembly (top, see also Fig. 1). Algorithms that combine multiple read-alignment patterns can resolve more SVs (middle). Likewise, read-level integration of technologies can aid SV detection, i.e., combining short and long reads (bottom). b Comparison of SV callsets requires merging variants from the same genomic rearrangement based on e.g., reciprocal overlap or breakpoint distance (top). These merging approaches can yield different outcomes as shown by how only a small segment of the deletion overlaps between tools and not all breakpoints could be matched. Intersection of callsets identifies the SVs with support from multiple algorithms or technologies. Alternatively, sensitivity can be increased by taking the union of callsets or their pairwise intersections (bottom). c Identification of tumor-specific SVs (red) requires tumor-normal differential analysis of reads or events. A tumor sample (purple) is expected to contain tumor-specific variants (red, bottom strand), as well as germline variants (blue, top strand). Tumor/normal reads can be distinguished prior to SV inference or afterwards by comparison of the variants or breakpoints as in b. If multiple SV tools are used, differential analysis can be done after merging tumor and normal callsets (bottom left) or first by using each algorithm’s somatic filtering feature (bottom right).

SV-level integration of multiple algorithms improves precision

Since the optimal detection algorithm differs between SV type and size range, full-spectrum SV detection with high recall and precision currently requires multiple algorithms25,27. The optimal method to combine the resulting callsets remains a largely unanswered question and a variety of tools and in-house pipelines are currently used4,13,25,28. To compare and combine SV callsets, variants from the same genomic rearrangement need to be merged first; this is complicated by diversity in breakpoint resolution and SV typing (Fig. 2B). The recent review by Ho et al. addresses different “ensemble” integration approaches currently in use in germline SV research4. In general, simple integration strategies use (reciprocal) overlap or breakpoint distance to merge SVs whilst more complex solutions combine this with read-evidence integration, local assembly or machine learning29,30,31,32.
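The two simple merging criteria mentioned above, reciprocal overlap and breakpoint distance, can be sketched as follows. The dictionary layout of a call, the function names, and the thresholds (50% reciprocal overlap, 100 bp breakpoint distance) are illustrative assumptions and not the defaults of any particular merging tool.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Smallest fraction of either interval that is covered by the other."""
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a_end - a_start), overlap / (b_end - b_start))


def match_criteria(a, b, min_ro=0.5, max_bp_dist=100):
    """Evaluate both merging criteria for two intra-chromosomal calls."""
    comparable = a["chrom"] == b["chrom"] and a["type"] == b["type"]
    ro = reciprocal_overlap(a["start"], a["end"], b["start"], b["end"])
    close = (abs(a["start"] - b["start"]) <= max_bp_dist
             and abs(a["end"] - b["end"]) <= max_bp_dist)
    return {
        "reciprocal_overlap": comparable and ro >= min_ro,
        "breakpoint_distance": comparable and close,
    }


# A large deletion whose end is placed differently by two tools.
large_a = {"chrom": "chr1", "type": "DEL", "start": 10_000, "end": 60_000}
large_b = {"chrom": "chr1", "type": "DEL", "start": 10_000, "end": 40_000}
# A small deletion shifted by 50 bp between two tools.
small_a = {"chrom": "chr1", "type": "DEL", "start": 1_000, "end": 1_060}
small_b = {"chrom": "chr1", "type": "DEL", "start": 1_050, "end": 1_110}

print(match_criteria(large_a, large_b))
# {'reciprocal_overlap': True, 'breakpoint_distance': False}
print(match_criteria(small_a, small_b))
# {'reciprocal_overlap': False, 'breakpoint_distance': True}
```

The two toy pairs show why the choice of criterion matters: reciprocal overlap is lenient for large events and strict for small ones, whereas a fixed breakpoint distance behaves the other way around.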

After overlapping variants are merged, integration of SV callsets from multiple algorithms can either be performed by taking the union or intersection (Fig. 2B). Since achieving high precision takes priority in most cancer research and clinical applications, an intersection strategy is often preferred but reduces recall. The precision/recall trade-off can be optimized by carefully selecting which tools to intersect25 and by taking the union of pairwise intersections26.
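The three integration strategies named above can be compared on a toy example. The sketch assumes that overlapping variants have already been merged, so that each callset is simply a set of shared event identifiers; the tool names and event IDs are placeholders.

```python
from itertools import combinations


def union(callsets):
    """Events reported by at least one tool (highest recall)."""
    return set().union(*callsets.values())


def intersection(callsets):
    """Events reported by every tool (highest precision)."""
    return set.intersection(*callsets.values())


def union_of_pairwise_intersections(callsets):
    """Events reported by at least two tools."""
    supported = set()
    for a, b in combinations(callsets.values(), 2):
        supported |= a & b
    return supported


callsets = {
    "tool_A": {"sv1", "sv2", "sv3"},
    "tool_B": {"sv2", "sv3", "sv4"},
    "tool_C": {"sv1", "sv3", "sv5"},
}

print(sorted(intersection(callsets)))                     # ['sv3']
print(sorted(union_of_pairwise_intersections(callsets)))  # ['sv1', 'sv2', 'sv3']
print(sorted(union(callsets)))                            # ['sv1', 'sv2', 'sv3', 'sv4', 'sv5']
```

The union of pairwise intersections sits between the two extremes: it keeps events with support from any two tools, which recovers calls that a strict intersection would drop while still excluding single-tool calls.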

Distinguishing somatic from germline SVs

TSSV detection aims to identify variants that uniquely occur in a patient’s tumor cells. Typically paired tumor-normal samples are used to classify SVs as either germline, mosaic-normal or tumor-specific variants15. Detection of TSSVs is a two-step process that involves the detection of SVs in both samples, followed by differential analysis of the callsets (Fig. 2C). Also, cancer genomes can have highly complex rearrangements. Alternatively, if patient-derived healthy material is not available, SVs can be filtered using a panel-of-normals. A sufficiently large panel-of-normals can provide more statistical power for filtering recurrent germline variants, but is less effective than a patient-derived normal sample when filtering rare or private germline variants4. Also, strictly filtering out regions with germline CNVs excludes potentially interesting genomic regions from SV analysis, which are susceptible to rearrangements because of their architecture33.

Tools for somatic SV detection in WGS data

Somatic SV detection algorithms differ in their approach to identify TSSVs from paired tumor-normal samples and as a result can classify the same event differently26. Despite their differences, DELLY, LUMPY, SvABA, Manta, and GRIDSS have successfully been used to report somatic SVs in various studies34,35,36,37. DELLY and LUMPY use ad hoc filtering whereby SVs supported by at least one read from the normal sample are removed from the tumor SV callset34,35, which is highly sensitive to contamination. In contrast, Manta uses a probabilistic scoring system for somatic SVs integrating evidence from tumor and normal reads36. SvABA uses both the tumor and normal data during assembly before distinguishing somatic variants38. GRIDSS has yet another approach and applies extensive rule-based filtering to both single break-ends and breakpoints37,39.

Specialized somatic SV detection tools such as Lancet and Varlociraptor account for challenges specific to the identification of TSSVs (Box 2)31,40. The first challenge in comparing tumor and normal SV callsets is that SV breakpoints and types can differ, analogous to the issues with overlapping SV callsets of different algorithms25. Second, somatic SVs are often complex, which can be problematic for algorithms that are not equipped to resolve these complex SV signatures and instead infer (false-positive) small indels41. As an alternative to ad hoc filtering of SV callsets, Varlociraptor and Lancet, respectively, compare breakpoints and aberrant reads between tumor-normal samples at an earlier stage of the analysis (Fig. 2C). Specifically, Varlociraptor compares the statistical support for an altered reference with a simulated variant versus an unadjusted reference (Box 2)31. Using read-level or breakpoint-level comparison can account for subsequent mutations at germline variant locations, as these mutations may confound somatic-germline comparisons. Third, issues inherent to analyzing tumor samples such as contamination, polyploidy, and heterogeneity are accounted for by Varlociraptor and Lancet (Box 2).

Challenges for accurate SV detection in cancer genomes

The analysis of tumor-normal paired samples is confounded by challenges inherent to cancer samples, including polyploidy, heterogeneity and contamination17. First, potential aneuploidy of tumor cells complicates haplotype reconstruction and phasing reads12,42. Second, intratumor heterogeneity can result in multiple subclonal variants which have low allele frequency (AF) and few supporting reads, making them difficult to detect. Third, contamination of the tumor sample with healthy material and vice versa complicates differential analysis between paired samples due to mislabelled reads. This can result in algorithms falsely discarding somatic variants with one or more supporting reads from the control sample. Adjusting the filtering threshold based on an estimated contamination fraction is a balance between precision and sensitivity for detecting low-AF variants.
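The effect of tumor-in-normal contamination on a read-count filter can be made concrete with a small sketch. It is a simplified illustration, not the filter of any named tool: the field names, the `contamination` parameter (assumed fraction of reads in the normal sample that derive from tumor cells), and the rule of rounding the expected read count up are all assumptions.

```python
import math


def allowed_normal_reads(sv, contamination):
    """Variant reads expected in the normal sample from contamination alone."""
    if sv["tumor_depth"] == 0:
        return 0
    tumor_af = sv["tumor_alt"] / sv["tumor_depth"]
    expected = contamination * tumor_af * sv["normal_depth"]
    return math.ceil(expected)


def is_tumor_specific(sv, contamination=0.0):
    """Keep the SV if normal support does not exceed the contamination allowance.

    With contamination=0.0 this reduces to the strict rule of discarding any
    SV that has at least one supporting read in the normal sample.
    """
    return sv["normal_alt"] <= allowed_normal_reads(sv, contamination)


somatic_sv = {"tumor_alt": 16, "tumor_depth": 40, "normal_alt": 1, "normal_depth": 30}
germline_sv = {"tumor_alt": 20, "tumor_depth": 40, "normal_alt": 14, "normal_depth": 30}

print(is_tumor_specific(somatic_sv))                       # False (strict rule)
print(is_tumor_specific(somatic_sv, contamination=0.05))   # True
print(is_tumor_specific(germline_sv, contamination=0.05))  # False
```

Raising the assumed contamination fraction rescues somatic variants with stray support in the control, but it also widens the allowance for low-AF mosaic or germline variants, which is the precision and sensitivity balance described above.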

The detection of rare TSSVs is limited by sequencing depth and AF. In practice, a minimum of 20% AF is required for reliable variant detection from tumor-normal pairs26,31. Increasing sequencing depth to 75×–90× for tumor samples improves the sensitivity of detection, especially for variants below 20% AF, whilst maintaining precision26. In addition, interpretation of TSSV allele frequencies is not straightforward since they can reflect intratumor heterogeneity and/or multiple alleles within a polyploid tumor genome. Note that the SV type should be considered during AF interpretation43. For diploid normal cells, variants are expected to have an AF of 0%, 50%, 100%, or 33% in case of a heterozygous duplication. However, mosaic-normal variants can occur at varying AF and be difficult to distinguish from TSSVs14. Computational modeling with AF can provide insight into intratumor heterogeneity and clonal architecture, both of which are important for therapeutic resistance and relapse44. The majority of SV tools operate under a diploid genome assumption. A multitude of tools independently quantify purity and ploidy of tumor samples; however, benchmarking studies show little consensus39,45. These tools can rely solely on CNV deletion events to model the cell purity and ploidy, and/or incorporate heterozygous known SNPs into their probabilistic models. At present, only SVclone uses SVs to estimate intra-tumor heterogeneity due to the complexities of calculating variant AF for SVs43.
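The expected allele frequencies quoted above follow from counting variant-carrying copies against all copies at a locus. The sketch below extends that counting to an impure tumor sample using a simple mixture model; the formula and parameter names are an assumption made for illustration (a commonly used approximation), and, as noted above, real AF estimation for SVs is more involved.

```python
def expected_af(variant_copies, total_copies_tumor, purity=1.0, total_copies_normal=2):
    """Expected AF of a clonal variant in a mixture of tumor and normal cells.

    variant_copies      copies per tumor cell that carry the variant
    total_copies_tumor  total copies of the locus per tumor cell
    purity              fraction of cells in the sample that are tumor cells
    """
    variant = purity * variant_copies
    total = purity * total_copies_tumor + (1.0 - purity) * total_copies_normal
    return variant / total


# Pure diploid cases from the text: heterozygous, homozygous, heterozygous duplication.
print(round(expected_af(1, 2), 2))  # 0.5
print(round(expected_af(2, 2), 2))  # 1.0
print(round(expected_af(1, 3), 2))  # 0.33

# Clonal single-copy variant at 60% purity, diploid versus tetraploid tumor.
print(round(expected_af(1, 2, purity=0.6), 4))  # 0.3
print(round(expected_af(1, 4, purity=0.6), 4))  # 0.1875
```

Under these assumptions a clonal variant on one of four copies in a 60% pure sample already falls below the 20% AF mentioned above, before any subclonality is considered.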

Computational challenges of complex variant detection

Genomic instability in cancer genomes results in more breakpoints and more complex SVs compared to germline variation46. Complex SVs are characterized by signatures of many breakpoints clustering together and are hypothesized to be caused by a single catastrophic process followed by repair or progressive rearrangements47. The presence of breakpoint clusters complicates the inference of the underlying genomic rearrangements and therefore also the identification of tumor-specific events. Alternatively, when breakpoint clusters confound confident SV calling, breakpoint-level differential analysis can be used to identify tumor-specific events. In addition, unsupervised clustering can discern complex from simple SVs and help to study both events more accurately41.
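As a toy illustration of how breakpoint clusters might be flagged, the sketch below groups breakpoints on the same chromosome when neighbouring positions lie within a fixed gap and reports dense groups as candidate complex regions. This is a deliberately simple single-linkage heuristic, not the unsupervised method of the cited work; the 10 kb gap and the minimum of five breakpoints are arbitrary assumptions.

```python
from collections import defaultdict


def cluster_breakpoints(breakpoints, max_gap=10_000):
    """Group (chrom, pos) breakpoints whose neighbours are within max_gap bp."""
    by_chrom = defaultdict(list)
    for chrom, pos in breakpoints:
        by_chrom[chrom].append(pos)

    clusters = []
    for chrom in sorted(by_chrom):
        positions = sorted(by_chrom[chrom])
        current = [positions[0]]
        for pos in positions[1:]:
            if pos - current[-1] <= max_gap:
                current.append(pos)       # extend the running cluster
            else:
                clusters.append((chrom, current))
                current = [pos]           # start a new cluster
        clusters.append((chrom, current))
    return clusters


def complex_candidates(clusters, min_breakpoints=5):
    """Clusters dense enough to be inspected as possible complex SVs."""
    return [c for c in clusters if len(c[1]) >= min_breakpoints]


breakpoints = [
    ("chr1", 1_000), ("chr1", 4_000), ("chr1", 9_000),
    ("chr1", 12_000), ("chr1", 15_500),        # dense region on chr1
    ("chr1", 5_000_000), ("chr1", 5_002_000),  # simple event, two breakpoints
    ("chr2", 300_000),
]
clusters = cluster_breakpoints(breakpoints)
print(complex_candidates(clusters))
# [('chr1', [1000, 4000, 9000, 12000, 15500])]
```

Breakpoints in the flagged regions could then be compared between tumor and normal at the breakpoint level rather than forced into simple SV types.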

Technical limitations of short-read WGS influence SV detection

The detection of SVs is also influenced by technical limitations of the sequencing platform, most notably genome coverage bias and alignment uncertainty. Illumina (IL) is currently the most commonly used short-read sequencing platform since it is relatively affordable, fast and has a high nucleotide accuracy (>99%)48. However, IL sequencing has inherent biases in genome coverage in regions that have a very low or very high GC content (<10% or >85% GC) or long homopolymers49. Although PCR-free library preparation reduces GC biases, it requires a large amount of input DNA (Table 3)49.
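Regions at risk of GC-related coverage bias can be located by scanning a reference sequence in windows. The sketch below uses the <10% and >85% GC limits quoted above; the 100 bp non-overlapping window and the function names are assumptions for illustration, and homopolymer runs are not handled.

```python
def gc_fraction(seq):
    """GC fraction among unambiguous bases, or None if the window has none."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return None
    return (seq.count("G") + seq.count("C")) / acgt


def flag_extreme_gc(sequence, window=100, low=0.10, high=0.85):
    """Return (start, end, gc) for non-overlapping windows outside [low, high]."""
    flagged = []
    for start in range(0, len(sequence) - window + 1, window):
        gc = gc_fraction(sequence[start:start + window])
        if gc is not None and (gc < low or gc > high):
            flagged.append((start, start + window, gc))
    return flagged


# Toy sequence: an AT-only window, a balanced window, and a GC-only window.
toy = "A" * 100 + "ACGT" * 25 + "GC" * 50
print(flag_extreme_gc(toy))
# [(0, 100, 0.0), (200, 300, 1.0)]
```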

Table 3 Comparison of long-read and short-read sequencing technologies.

The detection of SVs relies on identifying aberrant read alignment patterns (Fig. 1). Reads derived from highly homologous regions, such as pseudogenes and segmental duplications, are often not long enough to uniquely map to the reference genome50. Yet repeat-rich regions comprise about half of the human genome and are vulnerable to SVs due to homologous recombination errors and replication slippage33,51. Depending on the alignment algorithm, uncertainty usually results in either random placement of reads or multi-mapping to all possible locations52. Multi-mapping, for example as done by BWA-MEM, causes unequal genome coverage altering the signal-to-noise ratio52. Hence, alignment uncertainty is problematic for accurate SV detection and should be addressed with a sound statistical model30,31,52. Current estimates suggest ~55 Mb of GRCh38 are “dark regions” inaccessible to IL sequencing due to alignment ambiguity (i.e., repeat-rich regions) or the sequencing chemistry (i.e., GC content)53. The over 4000 affected gene bodies53 also include disease-related genes, such as the TERT promoter which was found to be mutated in 9% of tumors in the PCAWG study but mutations can be missed due to its high GC content13.

Impact of long-read sequencing

Single-molecule long-read sequencing technologies by Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) are valuable for SV detection54. PacBio and ONT generate reads of ~10+ kb versus ~250 bp from IL; the longer reads reduce alignment ambiguity and do not have a GC bias, resulting in improved coverage of “dark” regions in the genome55. In addition, long reads allow for haplotype phasing of variants and de novo assembly of complex rearrangements56. For example, sequencing lung cancer cell lines with PromethION detected both known cancer-driver SNVs and revealed large previously unknown genomic rearrangements, including an 8 Mb amplification of MYC57. Similarly, direct comparison of a PacBio assembly with IL sequencing shows ~2.5× more uniquely identified SVs (~48k and ~20k, respectively), in particular more inversions and 50 bp–2 kb insertions/deletions located in repeat-rich regions12.

Limitations of long-read sequencing

The disadvantages of PacBio and ONT platforms include costs and sample requirements, which are substantial compared to IL sequencing and can be problematic for tumor samples (Table 3)55. In addition, they have a lower nucleotide accuracy of ~85% for single molecule sequencing and up to 99% using consensus sequencing of the same DNA molecule58,59,60,61. Continuous improvements in algorithms for base calling and error correction have increased the accuracy of these platforms58,59. Since low nucleotide accuracy can impede read-alignment, error correction potentially improves SV detection by increasing the fraction of aligned reads62. However, error-correction strategies come with trade-offs for SV detection. Long reads can be aligned to each other as a self-correction strategy when sufficient coverage (~50×) is available55. However, haplotyping information is lost as a result of using the consensus of reads with mixed molecular origin. This makes the consensus sequence unsuitable for variant phasing or for studying intra-tumor heterogeneity or polyploidy. Alternatively, short reads can be used for error correction by aligning them to the long reads, but this approach only improves accuracy of genomic regions accessible to IL sequencing55,61.

Long-read data requires specialized algorithms

Long-read SV detection algorithms are either based on de novo assembly or read-alignment to a reference genome. Assembly-based strategies have a higher sensitivity for detecting non-template insertions and homozygous SVs. During assembly, contigs are compared to the reference genome and can provide more evidence than individual reads32,55. However, variant calling using alignment requires less coverage than assembly (~20× versus ~50×) and statistical significance when identifying SVs is achieved relatively easily due to the low alignment uncertainty of long reads32,50,55. Compared to assembly methods, alignment-based approaches are more suited to identify heterozygous SVs and more robust to amplifications in highly homologous regions such as low-complexity regions12,55. Within clinical applications, often insufficient resources are available to perform long-read sequencing of tumor-normal pairs to depths required for de novo assembly (Table 3). Therefore, we focus on using alignment-based strategies (Table 2).

Alignment of long reads differs from short reads due to the increase in base pairs to align and different error profiles55. Although BWA-MEM offers support for long reads, it often infers many small gaps during alignment and misses large indels63,64. Specialized long-read alignment algorithms have been developed to overcome these issues. In contrast to short-read data, there is no best practice for which aligner should be used when performing SV detection63,64,65,66. Preliminary comparisons suggest that NGMLR and minimap2 perform well and both algorithms are designed to handle the higher error rates and adjust for the 1 bp indels in long reads12.

Alignment-based SV detection algorithms for long-read data

Currently, many tools are actively developed to detect SVs from alignment of ONT and PacBio data (Table 2). However, studies comparing long-read SV detection tools have been scarce and predominantly show the limitations of available truth sets by identifying many novel variants12,67. At present only nanomonsv reports somatic SVs from long-read data68. The commonly used tools SVIM and Sniffles have shown good precision and sensitivity in multiple performance assessments63,67,69. They were among the first to process both ONT and PacBio data despite their different error profiles and have been followed by additional tools like NanoVar and CuteSV (Table 2). Similar to short-read SV detection tools, long-read tools combine multiple read-alignment patterns to detect SVs. They infer patterns similar to split reads and discordant pairs using intra-alignment and inter-alignment signatures, despite long reads not being paired-end. Similar to short-read tools, using a consensus callset created by intersecting multiple long-read SV detection algorithms can increase precision32,67. Alternatively, machine learning approaches can attain greater improvements in precision and sensitivity than ad hoc intersection, given a truth set is available for training32.
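An intra-alignment signature can be read directly from the CIGAR string of a single long-read alignment: a long deletion or insertion operation inside the alignment marks a candidate SV. The sketch below is a minimal illustration and not the implementation of any of the tools named above; the function name is hypothetical and the 50 bp minimum mirrors the SV size definition used in this review. Inter-alignment signatures, which require comparing the primary and supplementary alignments of a read, are not covered.

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM format.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def intra_alignment_signatures(ref_start, cigar, min_size=50):
    """Return (type, reference_position, size) for large indels in one alignment."""
    signatures = []
    ref_pos = ref_start
    for length_text, op in CIGAR_OP.findall(cigar):
        length = int(length_text)
        if op == "D" and length >= min_size:
            signatures.append(("DEL", ref_pos, length))
        elif op == "I" and length >= min_size:
            signatures.append(("INS", ref_pos, length))
        # Only these operations advance the position on the reference.
        if op in "MDN=X":
            ref_pos += length
    return signatures


# A read with a 120 bp deletion, a 2 bp sequencing-error insertion, and a 75 bp insertion.
print(intra_alignment_signatures(1000, "500M120D300M2I200M75I100M"))
# [('DEL', 1500, 120), ('INS', 2120, 75)]
```

The 2 bp insertion is ignored by the size threshold, which is one simple way small indel errors typical of long reads are kept out of SV signatures; real tools additionally cluster signatures across reads before calling a variant.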

Multi-platform data integration to improve detection of somatic SVs in cancer

Limitations in both short-read and long-read WGS can potentially be overcome by using a multi-platform approach and as such improve the identification of TSSVs. Integration can improve both precision and sensitivity by combining read-alignment patterns (Fig. 2A) and integrating SV callsets from multiple algorithms or technologies (Fig. 2B).

Gene fusion detection by combined analysis of RNA and WGS

Integration of genomic and transcriptomic data can further improve variant detection and provide insight into the phenotypic effect of SVs; specifically resolving gene fusions, splice variants and linking SVs to altered gene expression70. RNA sequencing of tumor samples offers unique advantages such as tissue specificity and time specificity, but obtaining high-quality RNA can be problematic. In addition, sufficient expression is necessary to detect events, which may impede detection of low AF variants.

RNA-seq is especially suitable for detecting gene fusion events through their chimeric transcripts. Gene fusions have high clinical relevance since they are often cancer drivers and otherwise occur rarely in the general population6,70. Specialized gene fusion algorithms predict gene fusions from chimeric transcripts by using read-alignment patterns such as SR crossing exonic junctions and DP mapping to both gene partners71. However, these algorithms can suffer from a high false positive rate which requires extensive filtering72. Chimeric transcripts can occur without genomic rearrangement, for example through intergenic splicing (trans-splicing and cis-splicing) or transcriptional slippage on short homologous sequences73. Since these chimeric transcripts are also present in healthy cells, this advocates for tissue matched RNA-seq of paired tumor-normal samples to allow the identification of tumor-specific events.

Combining RNA-seq with WGS data could resolve specificity issues and improve gene fusion detection. By itself, WGS can detect gene fusions, but not the occurrence of functional transcripts. Although sometimes used for validation purposes74, there are no established algorithms which integrate WGS and RNA-seq such that they both contribute to detection. The advantages of combining WGS, RNA-seq and exome sequencing have been demonstrated for detecting SVs in heterogeneous pediatric cancers75. Similarly, joint analysis of RNA-seq and short-read WGS in the PCAWG study identified the underlying SV for 82% of gene fusions. The remaining fusions were either the result of RNA-only alterations such as transcriptional read-through or underdetection of SVs5.

Integration of short-read and long-read WGS

Short-read and long-read data can complement each platform’s strengths and overcome individual limitations12. Combining SV callsets after detection can increase sensitivity and requiring orthogonal support for variants across platforms can increase their confidence. However, the union or intersection of callsets is still affected by platform-specific technical biases. Read-level integration can overcome some of these issues as illustrated by error correction approaches which use IL reads to improve the accuracy of PacBio/ONT reads55. Likewise, hybrid assembly of short and long reads benefits from their respective high accuracy and scaffolding properties. Localized hybrid assembly tailored to SV detection as implemented by HySA shows that problematic SVs can be detected that have too little support in either PacBio or IL76. However, HySA cannot infer somatic SVs and some variants were missed due to few supporting aberrant IL reads and PacBio alignment issues. Hybrid assembly can also reduce coverage requirements for de novo assembly77.

As an alternative to long-read technologies, linked-read sequencing from 10× Genomics (10×) performs well for haplotype construction and variant phasing12. A read-barcode is added during library preparation to trace the molecule of origin at costs similar to IL sequencing78 (Table 3). In addition, 10× can report variants in repeat-rich regions not accessible by standard short-read IL sequencing79,80. Integration of short-read WGS and 10× enabled chromosome-scale haplotyping and phasing of detected variants of the polyploid cancer cell line HepG281,82. Variant phasing can help to gain biological insights, as shown for associated regulatory and coding mutations in treatment-resistant prostate cancer83 and identification of SVs as potential cancer drivers by altering cis-regulation of genes84.

Discovery of large, complex variants by chromatin assays

Combining sequencing data with technologies that provide insight into genomic organization can elucidate large complex rearrangements. Technologies such as Bionano Genomics (BNG) and Hi-C have shown limitations of SV detection using sequencing. The combination of short-read WGS, BNG, and Hi-C on a cancer cell line showed that most of the large (>1 Mb) intra-chromosomal and inter-chromosomal SV events were uniquely detected by a single technology, with only ~20–35% validated by multiple platforms8. Each platform has its own scope of variant detection. Short-read WGS detected the largest number of variants across a broad range, whilst BNG and Hi-C lack base-pair resolution but can detect >1 kb deletions in repeat-rich regions unlike short-read WGS8. BNG has promising diagnostic applications as it can confidently detect large variants with low input requirements (Table 3). Also, BNG had full concordance with standard diagnostic assays in pediatric ALL and identified additional variants85.

Incorporating pre-existing technologies in ongoing studies

Continuous technological improvements provide exciting new data and SV discoveries, but this does not make existing datasets obsolete. The phenotypic effect of CNVs is often better understood than for SVs and established technologies have had more opportunity to collect samples, including rare cancer types. Currently many samples are available in repositories that profile genomic imbalances either via SNV array or exome sequencing technologies13,86. Challenges in integrating these datasets result from differences between technologies, such as breakpoint resolution and platform-specific biases, and systematic solutions are rare87. The widely varying detection resolution of different technologies invalidates callset intersection strategies, as smaller events are below the detection limits for lower resolution arrays, and exome sequencing is limited to events involving multiple exons. The absence of an event in a callset should not be considered proof that the event does not exist. Gene-centric approaches based on unions seem the most applicable. Although integration of pre-existing datasets assayed with different technologies with recently acquired datasets provides a complex computational challenge and is often ignored, it is likely to be an ongoing issue as technologies and platforms continue to evolve.

Challenges in using sequencing for precision oncology

In clinical practice, next-generation sequencing (NGS) is increasingly used to replace targeted assays subject to budgetary and sample requirements. NGS can simultaneously detect different variant types and discover new biomarkers, and is more cost-effective than a series of single-gene assays. Although turn-around times are often longer, sensitivity and precision are maintained88 provided sufficient sequencing depth is achieved26,31. As a result, NGS makes pan-cancer biomarker testing feasible, leading to the approval of drugs based on molecular alterations shared by different cancer types, such as the use of TRK inhibitors for all solid tumors with an NTRK fusion88. However, the distribution of NGS data over multiple repositories and lack of data harmonization complicates clinical decision-making and prevents precision medicine from reaching its full potential.

Variant interpretation, a major challenge in precision oncology, is often done by expert panels such as interdisciplinary molecular tumor boards88. Despite its challenges, integration of multi-omics data is increasingly being used to improve variant interpretation and increase the number of identified drivers or actionable targets5,88,89. However, standards on variant interpretation and prioritization are still emerging90. As a result, there is low concordance between the recommendations of different molecular tumor boards when given identical case studies, especially for complex genomic alterations90.

Recent initiatives have attempted to resolve this need for standardization in variant assessment and clinical decision-making through the Molecular Tumor Board Portal91 and Somatic Working Group of the Clinical Genome92. Both harmonize different variant repositories, curated knowledge bases and computational predictions to acquire insights into variant-gene-drug-disease relationships with the focus on clinical use. Although extremely valuable, these efforts focus only on SNVs and, to a limited extent, gene fusions. Similar initiatives for SVs and complex genomic alterations are currently lacking, largely because tumor-specific SVs are not yet commonly used as molecular targets or biomarkers to guide patient-specific treatment. We anticipate that improved confidence of TSSV detection will enable the subsequent research necessary for the use of the full spectrum of variants in precision oncology.

Conclusion

The field of SV detection is continuously improving through advancements in sequencing technologies and tools. These advancements will contribute to discoveries into the role of SVs in cancer, as well as the incorporation of SVs in precision oncology programs. Nevertheless, SV detection and interpretation in tumor samples is complicated by unique biological and technical challenges, i.e., contamination, intra-tumor heterogeneity and aneuploidy. These challenges are addressed by algorithms specialized in identifying TSSVs from tumor-normal paired sequencing data, which requires both SV detection and distinguishing tumor-specific variants.

Based on studies of normal genomic variation, a multi-platform approach is necessary to detect the full spectrum of variants and reduce false positives. Truth sets and procedures developed for SV detection from short-read data show that combining multiple tools improves precision and recall. Despite this, short-read sequencing has inherent limitations such as GC coverage bias and mapping ambiguities leading to inaccessible genomic regions. Long-read sequencing technologies can resolve large, complex SVs and improve coverage, but have lower per-nucleotide accuracy, higher costs and sample requirements. SV detection tools for long-read data have yet to mature with performance assessments and truth sets lacking.

Integration of long-read and short-read data is likely required for complete characterization of tumor genomes. However, adopting sequencing technologies in clinical laboratories requires a clear added value compared to the standardized assays, as well as being fast and affordable. Considering IL and 10× provide high accuracy WGS at low sample requirements, they are most feasible for tumor-normal sequencing in a clinical setting. Supplementary low-coverage sequencing with ONT can cover regions inaccessible to short-read WGS and aid in variant phasing. Alternatively, RNA sequencing has proven to be highly beneficial in a clinical setting for the detection of gene fusion events.

In conclusion, improving detection of TSSVs by integrating data derived from multiple platforms and detection tools enables the use of TSSVs in precision oncology and research into their role in cancer. With accurate TSSV datasets becoming more available, previously uncharted territories of variant types can be explored to potentially discover novel SV cancer driver events.