Searching thousands of genomes to classify somatic and novel structural variants using STIX

Structural variants are associated with cancers and developmental disorders, but challenges with estimating population frequency remain a barrier to prioritizing mutations over inherited variants. In particular, variability in variant calling heuristics and filtering limits the use of current structural variant catalogs. We present STIX, a method that, instead of relying on variant calls, indexes and searches the raw alignments from thousands of samples to enable more comprehensive allele frequency estimation.


Challenges with calling and genotyping SVs
Part of the challenge when moving from SNVs to SVs is the substantial increase in the uncertainty of the underlying data. For example, the allele balance for heterozygous SNVs and SVs from the Genome In a Bottle Consortium 1,2 sample shows a shift from the expected peak at 0.5 allele balance in SNVs (Fig. 1A) to 0.3 in SVs (Fig. 1B). The reason for this shift is that SV detection and genotyping from short-read data are complicated by evidence that does not provide direct information about the location of the variant (e.g., read depth and discordant read pairs). Together, this uncertainty and indirect evidence result in fundamentally different detection and genotyping strategies for SVs. Instead of explicitly testing for the existence of every possible SV (which is intractable), read alignment evidence is clustered, and a consensus breakpoint (which is often not at single-base resolution) and genotype are inferred. The two major issues with this type of clustering are instances where spurious alignments overlap by chance, causing false positives, and where fluctuations in coverage create false negatives or incorrect genotypes. Both of these cases produce SVs with a wide range of per-sample evidence depths, and summarizing each sample into just three states (homozygous reference, heterozygous, and homozygous alternate) hides information that can be important when determining whether a newly observed variant is common, rare, or noise. Genotype quality scores capture some of this uncertainty, but in practice these scores are used only to exclude problematic samples from an analysis. This highlights the need for new metrics that can represent the full extent of structural variant evidence in a population.
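As a minimal illustration of the allele balance statistic discussed above, the sketch below computes the fraction of reads supporting the alternate allele at a heterozygous site. The function name and the read counts are hypothetical and were chosen only to reproduce the 0.5 and 0.3 peaks described in the text; they are not taken from the Genome In a Bottle data.

```python
def allele_balance(ref_reads: int, alt_reads: int) -> float:
    """Return the fraction of reads at a site that support the alternate allele."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads cover this site")
    return alt_reads / total


# Hypothetical read counts, for illustration only.
# A well-behaved heterozygous SNV: reads split evenly between alleles.
snv_ab = allele_balance(ref_reads=15, alt_reads=15)

# A heterozygous SV where some supporting reads were lost or misaligned.
sv_ab = allele_balance(ref_reads=21, alt_reads=9)

print(f"SNV allele balance: {snv_ab:.2f}")
print(f"SV allele balance:  {sv_ab:.2f}")
```

Both sites would be reported as heterozygous, which is the loss of information that collapsing evidence into three genotype states implies.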

COSMIC/PCAWG SVs present in STIX 1KG/SGDP database
Because STIX scales to thousands of genomes, we can use it to improve somatic SV calls by scanning those genomes for corroborating evidence. Among the 46,185 deletions in the Catalogue of Somatic Mutations in Cancer 3 (COSMIC), 12,270 (26.5%) appeared in the 1KG STIX database (Fig. 2A), 12,902 (27.9%) were in the SGDP STIX database (Fig. 2B), and 13,295 (28.8%) were in the combined cohort database (see Supplementary Table 2). Even though PCAWG had matched normal tissue for every sample, 1,732 (2.1%) of the 84,083 somatic deletions it reported were in 1KG (Fig. 2D), 2,833 (3.4%) were in SGDP (Fig. 2E), and 3,237 (3.8%) appeared in either population (see Supplementary Table 3). The SVs found by STIX are likely either germline or recurrent mutations and are unlikely to be driving tumor evolution. These results highlight the value of using STIX in future studies to incorporate larger reference populations when prioritizing SVs.
Scanning a large population for recurring SVs can improve somatic calling, but relying on an SV call set of the population is insufficient. While STIX found some evidence for 12,270 COSMIC SVs in the 1KG cohort, the published 1KG SV call set 4 recovered only 454 variants (Fig. 2C). Similarly, only 193 PCAWG variants were in the 1KG catalog versus the 1,668 found by STIX (Fig. 2F), and many of the missing SVs were at high frequency (x=0 for Figs. 2C and 2F). SV call sets from larger cohorts are similarly limited in sensitivity. For example, gnomAD SV 5, which included 14,918 genomes, found only 893 COSMIC SVs and 433 PCAWG SVs.
In addition to somatic SVs, we used STIX to study de novo variation in a large family study 6. Since de novo SVs are new events, they should be rare in the population if mutations arise largely at random. Our analysis found strong evidence (at least three supporting reads) for 57 of 698 de novo SVs in either 1KG or SGDP (8.7% deletions, 5.6% duplications, 30% inversions) (see Supplementary Table 5). Most (47) de novo SVs were observed in a single 1KG sample, and one was observed in six samples. Given the massive number of possible SV combinations, the low de novo SV rate (0.16 events per genome 6), and the likelihood that these SVs are true de novo variants, finding any evidence in these populations highlights the plausibility of recurring alleles, which have been shown in other species 7 and in some complex diseases 8. Only five of the reported de novo deletions appear in the 1KG catalog. STIX again shows its utility in uncovering novel insights into SV dynamics by enabling an accessible and comprehensive assessment from population data for variants that are often not reported in SV catalogs.
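The "strong evidence" criterion above (at least three supporting reads) can be applied to per-sample evidence depths as in the following sketch. The sample names, depths, and function name are hypothetical placeholders and do not reflect STIX's actual output format; only the threshold of three reads comes from the text.

```python
from typing import Dict, List

MIN_SUPPORTING_READS = 3  # "strong evidence" threshold stated in the text


def samples_with_strong_evidence(
    depths: Dict[str, int], min_reads: int = MIN_SUPPORTING_READS
) -> List[str]:
    """Return the sorted names of samples whose evidence depth meets the threshold."""
    return sorted(name for name, depth in depths.items() if depth >= min_reads)


# Hypothetical per-sample counts of reads supporting one queried SV.
evidence_depths = {"sample_A": 0, "sample_B": 5, "sample_C": 2, "sample_D": 3}

supporting = samples_with_strong_evidence(evidence_depths)
print(f"{len(supporting)} of {len(evidence_depths)} samples: {supporting}")
```

Keeping the raw depths, rather than a genotype per sample, lets the threshold be tightened or relaxed after the search has run.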

STIX query resolution
When a paired-end read spanning an SV breakpoint is aligned to a reference genome, it will often have a notably different configuration from the vast majority of the other paired-end read alignments. For example, the ends of a pair spanning a deletion will align to loci that are further apart than expected. While these "discordant pairs" are a primary signal for short-read SV callers, they only convey indirect evidence of an SV since the breakpoint is not sequenced by either end. The result is ambiguity in the exact breakpoint location. STIX also uses discordant pairs when assessing the number of samples that contain evidence supporting an SV, and the uncertainty inherent to the evidence affects the resolution of the results. For example, queries against the 1KG cohort have a resolution between 200 bp and 400 bp (which is close to the mean insert size) (Supplementary Figure 4). The resolution of split-read evidence is better since the breakpoint is fully sequenced and can be more accurately localized.
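To make the source of this ambiguity concrete, the sketch below shows one simplified way a discordant pair could be matched to a queried deletion. It is an illustration under stated assumptions, not STIX's implementation: the `ReadPair` fields, the z-score cutoff, the `slop` window, and all coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    left_end: int         # last reference base covered by the upstream read
    right_start: int      # first reference base covered by the downstream read
    template_length: int  # outer distance spanned by the pair on the reference


def is_discordant(pair: ReadPair, mean_insert: float, sd_insert: float, z: float = 3.0) -> bool:
    """Flag pairs whose ends align much further apart than the library predicts."""
    return pair.template_length > mean_insert + z * sd_insert


def supports_deletion(pair: ReadPair, del_start: int, del_end: int, slop: int) -> bool:
    """Check that the two ends flank the deletion within `slop` bases of each breakpoint."""
    left_ok = del_start - slop <= pair.left_end <= del_start
    right_ok = del_end <= pair.right_start <= del_end + slop
    return left_ok and right_ok


# Hypothetical library statistics and query; `slop` is on the order of the insert size.
MEAN_INSERT, SD_INSERT, SLOP = 400.0, 50.0, 400
DEL_START, DEL_END = 10_000, 12_000

spanning = ReadPair(left_end=9_900, right_start=12_050, template_length=2_350)
normal = ReadPair(left_end=9_500, right_start=9_700, template_length=400)

for name, pair in [("spanning", spanning), ("normal", normal)]:
    hit = is_discordant(pair, MEAN_INSERT, SD_INSERT) and supports_deletion(
        pair, DEL_START, DEL_END, SLOP
    )
    print(name, "supports the deletion" if hit else "does not support the deletion")
```

Because neither read crosses the breakpoint, any deletion whose breakpoints fall within the same windows would match the spanning pair equally well, which is why the resolution is tied to the insert size.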