Modeling tissue-specific breakpoint proximity of structural variations from 2,382 whole-genomes to identify cancer drivers

Structural variations (SVs) in cancer cells often impact large genomic regions with functional consequences. However, little is known about the genomic features related to the breakpoint distribution of SVs in different cancers, a prerequisite to distinguish loci under positive selection from those with neutral evolution. We developed a method that uses a generalized additive model to investigate the breakpoint proximity curves from 2,382 whole-genomes of 32 cancer types. We find that a multivariate model, which includes linear and nonlinear partial contributions of various tissue-specific features and their interaction terms, can explain up to 57% of the observed deviance of breakpoint proximity. In particular, three-dimensional genomic features such as topologically associating domains (TADs), TAD-boundaries and their interaction with other features show significant contributions. The model is validated by identification of known cancer genes and revealed putative drivers in novel cancers that have previous evidence of therapeutic relevance in other cancers.


Introduction 23
Whole-genome sequencing of cancer genomes has revealed that they contain a wide variety of DNA 24 structural variations (SVs) that include deletions, duplications, translocations and other complex events 25 [1]. The SVs in cancer cells arise from different mechanisms and vary in size from kilobases to whole 26 chromosomal rearrangements [1][2][3][4]. Consequently, SVs usually span several genes and their associated 27 regulatory elements. While it is well known that genomic rearrangements and copy number variations 28 (CNVs) can lead to dysregulation of tumor-suppressors or oncogenes and act as drivers of cancer 29 progression [1,[5][6][7][8], identification of SVs under positive selection in cancer remains a challenging task. 30 This is because SVs are heterogeneously distributed across the genome leading to many genomic 31 regions recurrently altered in multiple samples due to neutral background processes [4]. To identify the 32 events under positive selection, the null background distribution of SV breakpoints must be 33 characterized accounting for the genomic covariates [5,9]. Additionally, the identification of the 34 specific functional element that is the target of positive selection (i.e., the coding sequence of a gene, its 35 cis-regulatory regions, or non-coding RNAs) constitutes another challenge due to the large genomic 36 span of SVs. 37 While numerous computational methods have been developed to model the background distribution of 38 single-nucleotide variants (SNVs) and identify drivers in a tissue-specific manner, similar methods for 39 SVs are lacking [5,[10][11][12][13]. The Pan-Cancer Analysis of Whole Genomes (PCAWG) SV Working Group 40 used a Gamma-Poisson fit to model the breakpoint density from 2,658 whole-genomes using eight 41 covariates to identify the significant driver genes in several cancers [5]. However, this analysis was 42 performed at the pan-cancer level without accounting for tissue-specific covariates. Since there is ample 43 evidence of different SV distributions and putative mechanisms across cancer types [1,2,14], it is 44

Results 67
We analyzed a set of 324,838 high confidence somatic SVs derived from whole-genome sequencing of 68 2,382 patients of 32 distinct cancer types from 15 different organ systems (Supplementary Table 1). The 69 cancer types include those with a high SV burden from the PCAWG project [1,18] and metastatic 70 prostate cancer samples [19,20]. Based on the rationale that tissue-specific covariates can influence the 71 rearrangement landscape, the cancer types from different organ systems were modeled separately. 72 Furthermore, the prostatic primary and metastatic cohorts were analyzed separately since they can have 73 distinct drivers. 74

Breakpoint proximity curve to model genome-wide SV distribution 75
To describe the genome-wide distribution of SV breakpoints in a given cohort, we included all unique 76 breakpoint coordinates for each sample. Then we computed the breakpoint proximity curve (BPpc) 77 based upon the breakpoint neighbor reachability (BPnri) for each individual breakpoint (BPi) in the 78 cohort. This metric captures the genomic regions with high or low proximity between breakpoints 79 (Methods). The BPpc is defined as the smooth curve resulting from the nonparametric local polynomial 80 regression (locally estimated scatterplot smoothing, LOESS) fitted to the dataset of BPnri, after reverse 81 scale normalization -log10(BPnri+1) (Fig. 1a). BPpc allows the use of a peak-calling approach to 82 directly identify the regions with higher breakpoint clustering relative to the surrounding area (i.e., peak 83 summits), thereby overcoming the inherent issues associated with pre-defining a genomic bin size for 84 computing breakpoint density along the genome [5,21]. This is important because functionally relevant 85 breakpoint clustering events may occur over a wide range of genomic lengths. Thus, BPpc models the 86 underlying distribution of breakpoints in a more robust and unsupervised manner compared to the 87 computation of breakpoint densities. 88 within vs. outside of LADs for some cohorts (Fig. 2d). The two-dimensional (2D) contour plots allow us 133 to visualize the partial contributions of features to BPpc. Most cohorts exhibit peaks and valleys on the 134 smooth function for the partial effects of most covariates demonstrating the importance of non-linear 135 modeling (Figs. 3a, 3b, Supplementary Fig. 3). For instance, the partial contribution of TAD-recurrence 136 shows non-linear behavior in brain, colorectal, kidney and prostate cancers although it is linear in breast 137 cancer regardless of the LAD status (Fig. 3a). The behavior of some features may also vary within vs. 138 outside LADs. For example, the contribution of GC content in breast cancer is linear inside LADs but 139 non-linear outside LADs (Fig. 3b). 140 We find a statistically significant contribution for the interaction terms of covariates across all the three 141 functional classes of TAD segments for most cohorts (Fig. 2d). However, their effect sizes show distinct 142 behavior for each cancer type and often across TAD segment classes as evident from the 2D contour 143 plots (Fig. 3c, Supplementary Fig. 3). Interestingly, the distinct behavior is also apparent for primary vs. 144 metastatic prostate cancers, likely pointing to different processes contributing to early vs. late genome-145 wide SV distribution [29,30]. 146 In general, there is wide variability in the partial contributions of different features across cancer types 147 demonstrating the importance of tissue-sepcific modeling. 148

Putative driver candidates identified at significantly recurrent peaks in BPpc 149
We obtained the adjusted !"#$ = !"#$ − !"#$ % by correcting the observed curve with the expected 150 model (Fig. 1C). Values around zero in the adjusted curve correspond to the observed BPpc close to the 151 expected one from the GAM. Peaks in positive values of !"#$ correspond to regions where the 152 breakpoints are closer than expected while the valleys in negative values are loci where the breakpoints 153 are sparser than expected ( Fig. 4a and Supplementary Fig. 4). As expected, there is a wide variability of 154 the peaks and valleys representing the differential landscape of rearrangements for each cohort (Fig. 4a  155 and Supplementary Fig. 4 ) 156 To identify the peaks that potentially correspond to positively selected loci across the BPpc 157 rearrangement landscape, we computed the peak recurrence score (PRs) (Methods). 158 where Nsmp is the number of unique samples in the peak, Nsv is the number of unique SVs, PeakA is the 160 area under the peak and PeakGR is the genome range of the peak. Thus, PRs is highest for peaks where 161 many samples (high Nsmp) contribute same SVs (low Nsv) to create tight clusters of breakpoints (high 162 PeakA) over narrow regions (low PeakGR). Next, for each cohort, we identified peaks with PRs 163 significantly higher than the fitted theoretical density using a gamma distribution (FDR < 0.2) (Fig. 1d, 164 We identify 79 significantly recurrent peaks that potentially correspond to positively selected loci 166 (Supplementary Table 3). The peak summits corresponding to putative driver candidates range from 179 167 bp to 10.71 Mb, with a median of 822.96 kb, highlighting the strength of our approach to capture 168 breakpoint clustering over varied genomic lengths (Fig. 4c). The number of significant peaks ranges 169 from 0 in bone cancer to a maximum of 13 in colon cancer (Supplementary Fig. 6 (Fig. 5). Besides these known cancer genes in specific cohorts, we also identified 190 several others with known roles in multiple cancers (Fig. 5). Interestingly, most of these common 191 candidates are large cancer genes within fragile sites, such as FHIT, WWOX,CCSER1,IMMP2L,192 CDKN2A and CDKN2B [26,[41][42][43][44] Overall, out of the 73 genes whose exons or enhancers are identified as putative drivers, 47 are known 194 cancer genes (Fig. 5). Genes identified as potential drivers in our analysis that are known to be 195 oncogenic in another cancer type [45,46] are of high interest since they are often therapeutic targets 196 under investigation (Fig. 6). For example, DMD is known to be a cancer gene [45]  and we fnd it is a putative driver gene in esophageal cancer where 54% of the cohort carries SVs in this 199 region (Fig. 6a). We also identified LRRN3 as a driver candidate in esophageal cancer with SVs in 200 20.7% of the samples (Fig. 6b). Other members of the LRRN gene family have been found as drivers in 201 multiple cancer types, including neuroblastoma and gastric cancer [50], and the role of LRRN3, in 202 particular, has been studied in fibrosarcoma [51]. As shown in Fig. 6, clear peaks in BPpc correspond to 203 tight clustering of breakpoints allowing us to confidently pinpoint these genes as putative drivers. 204 Another interesting result is our finding in ovarian cancer of a region affected in 25% of the cohort 205 where our method points to NF1 as a potential driver (Fig. 6c). Previous studies have shown the 206 importance of NF1 in several other cancers including glioblastoma, melanoma, breast and lung cancers 207 [36,[52][53][54]. The gene CDK12 is a candidate within a region affected in 14.9% of the uterine cancer 208 cohort (Fig. 6d). CDK12 is a well-studied target in other female reproductive cancers, such as ovarian 209 and breast [55], and it has also been studied in stomach and prostate cancers [20,56]. In colorectal 210 cancer, we predict WWOX as a putative driver in a region that shows SVs in 25% of the samples (Fig.  211 6e). WWOX is within a known fragile site and has been reported in several studies of different tumor 212 types including breast, prostate, lung, esophagus, cervical, ovarian and bladder cancers [57][58][59][60][61]. 213 Limited availability of RNA-seq data prevented analysis of impact on gene expression for many 214 candidates (Supplementary Table 6). Among the cohorts with sufficient sample sizes for RNA-seq 215 analysis, we find that an enhancer of CTDSP2 is a putative driver in brain cancer (glioblastoma 216 multiforme) and the SVs are associated with its differential expression (Fig. 6f). CTDSP2 is known to be 217 important in other cancer types [62,63]. The genes MACROD2 in pancreatic cancer (Fig. 6h) and 218 CDKN2B in liver cancer (Fig. 6g) also show significant differential expression between samples with 219 SVs at these loci relative to those without SVs. 220 These findings highlight the importance of both the tissue-specific and clustering-based approaches used 221 in our method to capture the significantly rearranged regions in different cancers. Notably, while 222 breakpoint clustering clearly points to these specific candidates as seen in Fig. 6, breakpoint density over 223 fixed bins would require different bin sizes to capture the ones with maximum density. This is because 224 using a fixed-size window for all these instances will likely generate several bins with high density 225 without clearly pointing to the most probable target element of selection. 226

CSVDriver: Computational tool to identify SV drivers from whole-genome sequences 227
The computational approach developed in this work to identify SVs and functional elements under 228 positive selection in cancer whole-genomes is implemented in CSVDriver, Cancer Structural Variation 229 Driver, a user-friendly shiny app tool (https://github.com/khuranalab/CSVDriver). The input for the tool 230 is SV calls and researchers can provide the tissue-specific co-variates data or use existing datasets to run 231 the tool on their cancer cohort/s. Besides the list of functional elements that are putative drivers in a 232 given cohort, the app provides the graphical visualization of the GAM results including the analysis of 233 non-linearity and covariates interaction. 234

Discussion 235
One of the major challenges in cancer genomics is the accurate estimation of the expected 236 heterogeneous distribution of passenger variants. This distribution represents the null background, which 237 can be used to identify loci that exhibit significantly recurrent variants likely due to positive selection. 238 While this problem has received considerable attention for SNVs, studies for tissue-specific neutral 239 models of the genomic distribution of SVs are lacking. We find that a GAM is able to describe the 240 breakpoint proximity distribution of SVs in cancer genomes with the explained deviance ranging from 241 10% in lung to 57% in lymph-nodes cancer. The use of a GAM allows us to interpret the results 242 graphically and provides estimates for both the magnitude and statistical significance of the contribution 243 of each feature. We find that the 3D chromosomal conformation plays an important role in the genome-244 wide distribution of SVs for most cancer types with TAD-recurrence, TAD-B recurrence and their 245 interaction terms with other covariates contributing significantly to the model. 246 Our method is able to identify the known cancer drivers and nominate novel candidates as those that 247 exhibit higher breakpoint proximity than expected by random chance, such as DMD and LRRN3 in 248 esophageal cancer, NF1 in ovarian cancer, CDK12 in uterine cancer and WWOX in colorectal cancer. 249 We note that the pan-cancer analysis by PCAWG allowed the identification of regions that are not likely 250 to gain significance in single cancer analyses due to limited cohort sizes but failed to identify many 251 novel regions that gain significance in our tissue-specific analysis. While our method is able to identify 252 24 enhancers as putative drivers, they are usually in the vicinity of other coding exons with similarly 253 high element rearrangement scores and further functional validation will be needed to decipher their role 254 in tumorigenesis. However, one prominent example supported by RNA-seq data is CTDSP2 enhancer 255 that is a candidate driver in glioblastoma. 256 Finally, while we could identify the features that contribute significantly to the breakpoint proximity 257 curve, we find that the relationships are usually tissue-specific, complex and non-linear, often forbidding 258 straightforward interpretations. Furthermore, while we focused our analysis on the peaks in BPpc, the 259 valleys could potentially provide insights about the loci showing depletion of breakpoints due to 260 negative selection in future studies. 261

Cancer somatic structural variations data 263
All somatic SVs were obtained from the PCAWG project Working Group 6 consensus calls [1,18]. We  We included two higher-order structural 3D chromosomal conformations, TADs and LADs, in the set of 296 covariates. For TADs, we use the recurrence of the regions as well as their boundaries (TAD-B) 297 annotated across 37 datasets obtained from the 3D Genome Browser [24]. We also classify each sub-298 segment of TAD that shows different recurrence by computing the coverage of all 15 chromatin marks 299 within every TAD segment, similar to the approach used in [16] (Supplementary Fig. 7). We found that 300 three principal components explain most of the variance of this coverage ( Supplementary Fig. 8). Thus, 301 we grouped the TAD segments in three clusters according to the coverage of chromatin marks. The three 302 clusters are quiescent/heterochromatin, low-active and active. We confirm that the clustering is robust 303 across the 16 cohorts analyzed in this study ( Supplementary Fig. 9 ). More details of the sources of 304 genomic feature annotations are shown in Supplementary Table 2.  305 The putative functional effect for each significantly rearranged locus is predicted by annotating the 306 potential drivers on the basis of the impact of the SVs breakpoints on the coding and noncoding 307 elements within these regions. For the coding drivers, we use the subset of protein-coding genes 308 extracted from the comprehensive gene list obtained from the Genecode Release 29 (GRCh37). For the 309 non-coding elements, we use the active tissue-specific cCRE marks gathered from ENCODE 3 [25]. 310

CSVDriver workflow. 311
CSVDriver aims to identify cancer-driving rearrangement events by analyzing the focal trend of 312 breakpoint clustering. 313 Input and preprocessing of data 314 The method takes as input the data of cancer somatic SVs calls and analyzes the combined impact of all 315 rearrangement types including insertions (Ins), deletions (Del), inversions (Inv) and translocations (Tra). 316 The expected file format is SV- Establishing the observed breakpoint proximity-curve (BPpc) 323 The method bases the SV analysis on the description of the genome-wide distribution of breakpoint 324 proximity taking all unique breakpoint coordinates for each distinct sample in the given cohort. Thus, if 325 a breakpoint coordinate is present in two samples, it is represented twice. Then we sort all breakpoints 326 based on their coordinates and obtained an ordered list of breakpoints encompassing all samples, each 327 one represented by BPi, where i is the ordered index. Next, we annotate each BPi by computing their 328 neighbor reachability (BPnri,) defined as the average genome distance to reach the two adjacent 329 breakpoints: 330 Thereupon we define the breakpoint proximity (BPpi) as the normalized reverse scale, applying 332 logarithmic transformation: BPpi = -log10(BPRi+1). Then, we compute the breakpoint proximity curve 333 (BPpc), which is a smooth curve resulting from the nonparametric local polynomial regression (LOESS) 334 fitted to the BPnri values (Fig. 1A). This curve shows the trend of focal clustering of the breakpoints 335 because the fitting result is weighted toward the nearest surrounding values. The span argument (a = 336 0.2) controls the size of the near surrounding. It reflects the interval as a proportion of the total 337 breakpoints and regulates the grade of smoothness in the resulting BPpc. This curve depicts the 338 breakpoint genomic distribution and captures the regions of high and low proximity between 339 breakpoints. Chromosomes with a low load of breakpoints that show no clear trend of clustering will not 340 reflect a reliable signal. Therefore, we do not consider them. That is the case for small chromosomes 341 (e.g. chr21, chr22) in some cancer cohorts. 342 Modeling the expected BPpc by using generalized additive model (GAM) with tissue-specific genomic 343 covariates 344 The model is conceived to expand the linear regression analysis of genomic covariates by introducing 345 the capacity to investigate the potential non-linear relationships between genomic covariates and the 346 distribution of breakpoints. Additionally, it accounts for the contribution of the covariates' interaction to 347 the model. Thus, CSVDriver models the expected background BPpc using GAM, a parametric 348 regression method, which models the BPpc (dependent variable) with respect to tissue-specific data of 349 genomic covariates (predictors or independent covariates). of the tensor product interaction (te) for the interplay between GeneDensity and TAD recurrence; 367 GeneDensity and RT; and TAD and RT. This accounts for the status of TAD-segment class. 368 Nonetheless, a high number of interaction terms increases the chance of over-fitting. Furthermore, GAM 369 can be computationally expensive to reach convergence. Consequently, we try to balance the complexity 370 of the model and its ability to explain the deviance by covariate interactions reasonably by including the 371 interaction only of covariates that contribute at least 10% in at least one cohort in single predictor 372 models. 373 Computing the adjusted BPpc 374 The goal of CSVDriver is to capture the loci where breakpoint clusters potentially arise due to selective 375 pressure unlike the clusters associated with neutral non-selective forces. Therefore, the adjusted 376 curve !"#$ = !"#$ − !"#$ % represents a corrected BPpc where regions with values close to the !"#$ % 377 will be considered expected biases. The signal in the y-axis depicts how close the breakpoints are in a 378 given region (Fig. 1c) The method takes the regions corresponding to the top 25% of the summit of each peak, which represent 384 the regions of local maximum breakpoint clustering. Then we identify the loci that potentially 385 correspond to the positively selected regions (Fig. 1d). Each peak is described by their peak recurrence 386 score (PRs) defined as: 387 where Nsmp is the number of unique samples in the peak, Nsv is the number of unique SVs, PeakA is the 389 area under the peak and PeakGR is the genome range of the peak. Thus, PRs is the highest for peaks 390 where many samples (high Nsmp) contribute to create tight clusters of breakpoints (high PeakA) over 391 narrow regions (low PeakGR). The score is square root transformed to reduce the dispersion in the values 392 while keeping the trend of interest. We take into account the inherent limitation of the background 393 model and the differential explained deviance per tissue type (Fig. 2a). Consequently, the approach does 394 not use any absolute pre-established threshold, instead, it considers the cohort-specific distributions of 395 all PRs (Fig. 4b density distribution) as the expectation of the combined effect of several processes. 396 Next, for each cohort, we identify peaks with PRs significantly higher (Fig. 4b, QQ-plots) than the fitted 397 gamma distribution theoretical density using the 'fitdistrplus' R Package [69]. After controlling the false 398 discovery rate (FDR) [70] for multiple hypothesis testing, the significant loci were defined as those with 399 FDR < 0.2. 400 Detecting the driver candidates within the significantly rearranged peaks 401 The potential functional effect of the significantly rearranged regions for each cancer is directly 402 associated with the effects on coding and non-coding elements by their deletion, disruption or relocation. 403 CSV-Driver determines the functional elements that are the most likely targets of positive selection 404 within each significantly rearranged peak. 405 For each functional element (protein-coding exons, lncRNA, enhancer, CTCF-insulator) located at a 406 significantly rearranged peak, the method computes the element rearrangement-score (RSE) 407 The method reports the element rearrangement scores for the highest scoring elements of all types 421 (protein-coding exons, enhancer or lncRNA) at a given peak (Supplementary Fig. 6 and Supplementary 422 Table 5). While at a specific locus altered by SVs, it is fair to assume that affected coding exons of 423 genes are more likely to have greater impact than the non-coding elements, CSV-Driver allows the 424 analysis of the potential importance of all possible driver elements. One peak may contain multiple 425 driver elements which can represent alternative paths of disrupted regulation, or even some subgroup of 426 samples with slightly different, yet related genomic drivers. In the current analysis, if a peak contains 427 multiple elements of different type, all elements with highest rearrangement scores are reported in 428 Supplementary Fig. 6 and Supplementary Table 5. However, if one of those elements is or is associated 429 with a known cancer gene, we consider it as the most likely candidate in a given peak shown in Fig. 5. 430

Results report 431
For each input cohort, CSVDriver reports the graphics of the BPpc annotated with the driver candidates 432 (coding genes and non-coding elements) at each significantly rearranged peak (Supplementary Fig. 6). It 433 also provides the summary tables with the catalog of drivers (Supplementary Table 5 Figure1: CSVDriver workflow. The input data is the standard cancer somatic SV calls that include a pair of breakpoint coordinates (BP1, BP2) for each SV_id per sample. (a) Step 1 computes the breakpoint neighbor reachability (BPnri grey dots) for each breakpoint (BP i ), where 'i' is the sorted index. Then, for each chromosome, the BPpc is generated as the smooth curve (grey line) resulting from the nonparametric local polynomial regression fitted to the dataset of BPnr i , after reverse scale normalization -log10(BPnr i +1). (b) In step 2, based on the observed BPpc distribution (solid line), the method assesses the expected (!"#$) background distribution (dashed line) by using a generalized additive model (GAM) that includes multiple tissue-specific breakpoint covariates. (c) In step 3, it computes adjusted BPpc (observed -expected). (d) In step 4, the method calls peaks across the adjusted BPpc and identifies those that potentially correspond to positively selected loci. It computes the peak recurrence score (PRs) and based on the empirical density of PRs (dashed line distribution) it identifies the peaks with PRs significantly higher (QQ-plot FDR < 0.2) than the fitted theoretical density (red line distribution). (e) In the last step 5, the driver candidates are identified as the genomic elements (CDS (coding sequence), enhancer, CTCF-Insulator, and lncRNA) with the highest rearrangement scores within the peak region.     the smooth function of the partial contribution of GC content. These graphics represent the linear or non-linear behavior of the partial contribution of each covariate analyzed. The x-axis is the value of the covariate and the y-axis is the corresponding partial effect of the covariate. The confident interval is represented by the grey area. (c) 2D graphics for the partial contribution of the interplay between gene density and TAD recurrence for the interaction with the three functional classes of TAD segments. The scale from light yellow to red represents partial contribution for higher to lower values of the distribution. The full set of graphics for all cohorts is shown in Supplementary Fig. 3 skin Examples of cancer-specific BPpc Empirical and theoretical density Q−Q plot  Figure 4: (a) Examples of cancer-specific BPpc and the peaks of significantly recurrent rearrangements for representative cancer types. Each peak shows a dot colored in the scale of significance for the corresponding peak recurrence score (PRs). The peaks that come mostly from a unique sample (likely a chromothripsis event) are marked with an asterisk. The known driver candidates detected within the significantly rearranged peaks are marked (green for coding sequence and red for enhancers). (b) Empirical and theoretical density of PRs for each cohort and the corresponding QQ-plots which show the p-values follow a uniform distribution. (c) Scatter plots of the genome length of the peak region vs. number of SVs for each peak, the size of the dots shows the recurrence in the cohort. Histogram of the genome length distribution of the significant peaks.