Medium levels of transcription and replication related chromosomal instability are associated with poor clinical outcome

Genomic instability (GI) influences treatment efficacy and resistance, and an accurate measure of it is lacking. Current measures of GI are based on counts of specific structural variation (SV) and mutational signatures. Here, we present a holistic approach to measuring GI based on the quantification of the steady-state equilibrium between DNA damage and repair as assessed by the residual breakpoints (BP) remaining after repair, irrespective of SV type. We use the notion of Hscore, a BP “hotspotness” magnitude scale, to measure the propensity of genomic structural or functional DNA elements to break more than expected by chance. We then derived new measures of transcription- and replication-associated GI that we call iTRAC (transcription-associated chromosomal instability index) and iRACIN (replication-associated chromosomal instability index). We show that iTRAC and iRACIN are predictive of metastatic relapse in Leiomyosarcoma (LMS) and that they may be combined to form a new classifier called MAGIC (mixed transcription- and replication-associated genomic instability classifier). MAGIC outperforms the gold standards FNCLCC and CINSARC in stratifying metastatic risk in LMS. Furthermore, iTRAC stratifies chemotherapeutic response in LMS. We finally show that this approach is applicable to other cancers.


Results
Since the LMS genome is highly rearranged, the question arises whether DNA breakages occur randomly or are due to mechanisms associated with structural and/or functional DNA elements and are thus specific to certain regions of the genome. To address this question, we tested the "hotspotness" of three different types of genomic DNA structure: regulatory DNA elements, and NBD and DNA repeats. We propose that Hscore may be used to assess the propensity of DNA elements to break more than expected by chance under the random breakage model (RBM) (Fig. 1A). Hscore for a given type of DNA element is defined as the − log10 of the probability of having more BP than those observed in those elements, given the total number of BP, the cumulative size of the DNA elements and the readable genome size ("Materials and methods" section). Hscore is very intuitive: the higher it is, the more significantly a given DNA element is more broken than expected. To determine whether a given genomic feature is a hotspot, we computed Hscore in sliding windows upstream and downstream and made profile plots ( Fig. 1A; see "Materials and methods" section). Two significantly broken structures emerged: hotspots and hot regions. We define the characteristics of hotspots as follows: (1) DNA elements are significantly more broken than expected by chance (Hscore ≥ 3) while the immediate surrounding regions are not (Hscore < 3); (2) if the DNA elements and the immediate surrounding regions have a Hscore ≥ 3, a DNA element is considered as a hotspot if its Hscore is at least 1.5 times the Hscore of both surrounding regions. A hot region is inferred if an Hscore ≥ 3 is observed for both the DNA element and its surrounding regions, provided that Hscore of the DNA element is less than 1.5 times the Hscore of the surrounding regions.
The total BP count per LMS (TBPc) was highly skewed (ranging from 26 to 1200; mean = 222.1, median = 181; Supplemental Fig. S1A) and did not fit a normal distribution (Shapiro-Wilk test P = 1.4 × 10 -12 ; Supplemental Fig. S1A), but TBPc follows a log-normal distribution (Shapiro-Wilk test P = 0.81; Supplemental Fig. S1B). This is not surprising since log-normal distributions are common in nature and reflect forces acting independently and whose interactions result in multiplicative effects 39 . In cancer, these forces may include genetic, environmental, physiological, immune, sub-cellular and supra-cellular constraints.  18 . Sequences prone to form NBD are widespread in the human genome and are associated with GI 40 . The formation of NBD requires unwinding of the DNA sequence, as occurs during replication and transcription 18 . NBD comprise A-Phased Repeats (APR), Direct Repeats (DR), G-Quadruplex (GQ), Inverted Repeats (IR), Mirror Repeats (MR), Short Tandem Repeats (STR), Z DNA (Z) and RLoops Forming Sequences (RLFS). We found that all NBD except APR are hotspots (Fig. 1B), especially GQ which are so highly broken so that Hscore tends to infinity (P = 0). The DNA repeats investigated in this study comprise high-copy repeats: MicroSatellite (MS), Low Complexity (LC), simple repeats (SR); and low copy repeats like Self Chain Segments (SCS), which we split into self-aligned (SCS-S) and gapped (SCS-G), Long Terminal Repeats (LTR) and Retro Transposons (RT). While SR, MS, LC are hotspots, the viral origin repeats LTR and RT are not (Fig. 1C). Concordantly, no viral insertion was observed in LMS genomes when using HGT tools (Ref. 41 and data not shown). Furthermore, while SCS-S were hotspots, SCS-G were not. Genes promoters are hotspots; gene enhancers are hot regions. The promoters of transcriptionally active genes have repeatedly been shown to recurrently harbor double-strand breaks (DSB) 13 . Furthermore, DNase Hyper sensitive (DHS) DNA elements as well as active chromatin marks have been shown to colocalize with DSB 42 . The regulatory DNA elements used in this study comprise CpG islands (CpGi), Cis regulatory Modules (CRM), and DHS of promoter type (DHS_prom), of enhancer type (DHS_enh), of dyadic type (both enhancer and promoter signatures) (DHS_dyadic), and of other types (DHS_rest) 43 . We found that promoterassociated DNA elements (i.e. DHS_prom and CpGi) are hotspots for DNA breakage (Fig. 1D) while DNA elements associated with enhancer activity (DHS_Enh) are hot regions (Fig. 1D). Furthermore, promoter-associated regulatory elements (CpGi, DHS_prom) break more significantly than Enhancers (DHS_enh), which break  www.nature.com/scientificreports/ more significantly than Dyadic regulatory elements (DHS_dyadic) and the rest of DHS (DHS_rest) (Fig. 1D). Interestingly, CRM, which comprises overlapping regulatory regions (promoters, enhancers, dyadic) 44 , is a hotspot and has an intermediate Hscore. This further underlines the robustness of our BP hotspotness magnitude scale. Finally, DHS which are not labeled to have any gene regulatory function (DNS_rest), are not significantly broken more than random (Fig. 1D). Taken together, these results show that DNA breakages are more enriched in promoters than in enhancers and dyadic regions. Furthermore, BP are more significantly enriched in open chromatin structures (DHS) with attributable gene regulatory functions like enhancers and promoters than DHS, with no attributable gene regulatory function, like DHS_res. Therefore, because DHS has an open DNA structure both inside and outside genes, we hypothesize that they do not break merely due to the openness of their chromatin but rather due to their genic context. We thus refined our analysis by splitting each DNA element into two categories: located inside or outside genes.
Regulatory elements are almost exclusively "hot" inside genes and not outside them. To address the impact of the functional genomic context of a DNA element on the BP frequency, we split them into those located inside genes from the transcription start site (TSS) to the transcription end site (TES) (including DNA elements with at least 1 bp overlapping with genes) and those located outside genes, and then computed Hscore for each group in sliding windows. We found that all regulatory elements (Fig. 2) were more significantly broken when they were located inside genes than outside, independently of their structure (sequence type like CpGi vs DNA-protein complexes) or function (promoter type vs enhancer type). CRM were always broken more frequently than by chance when located inside genes, with an Hscore of 67.82 inside genes and less than 3 outside ( Fig. 2A, Supplemental Table S3). Although CpGi located outside genes were slightly significantly broken (Hscore = 6.41), this significance was much higher when they were located inside the genes (Hscore = 83.30) (Fig. 2B), with a ratio of Hscores inside/outside genes (RHscore i/o) of 12.99 (Supplemental Table S3). Similarly, all types of DHS regulatory elements were broken more significantly inside genes than outside ( Fig. 2C-F). These results suggest that the transcriptional and/or epigenetic context of genes strongly modulates the ability of DNA regulatory elements to harbor DNA BP. Since regulatory elements have an open DNA structure both inside and outside genes when they are active, we rule out the possibility that their propensity to harbor DNA BP is merely due to the openness of their chromatin.
NBD and DNA repeats are hotspots both inside and outside genes. We also split NBD and DNA repeats into those located inside genes and those located outside them (Fig. 3). MR, IR, DR, and STR were significantly broken to a similar extent inside and outside genes (Ratio Hscore i/o 1.02, 1.12, 1.02, 1.13 respectively) (Fig. 3A-D). While RLFS were highly significantly enriched in BP outside genes (Hscore = 66.24), they were even more significantly broken inside them (  Table S3). GQ were also highly significantly enriched in BP outside genes (Hscore = 180.53) and tended to be more broken inside them (Hscore = 230.17, RHscore i/o = 1.27) (Fig. 3F). Z DNA tended to be more frequently broken outside genes than inside them (RHscore i/o = 0.73) (Fig. 4G). Hscores of DNA repeats located inside and outside genes were almost identical except for SCS-S, which were significantly more broken than random inside genes (Hscore = 8.94) but not outside them (Hscore = 1.18 < 3), while SCS-G were indistinguishable from random ( Fig. 4). Together these results suggest that, except for SCS-S, GQ and RLFS, NBD and DNA repeats are insensitive to gene context and may cause GI by a mechanism largely independent from transcription and which is probably replication-dependent. Interestingly, GQ and RLFS share the properties of the three DNA element types: they are sensitive to their genic context as regulatory elements and are breakable when located outside genes, as are DNA repeats and NBD.
Not all LMS present BP distributed as hotspots. The abovementioned results obtained on the whole cohort were global so we still did not know what happens in each patient. To evaluate the DNA breakage mechanisms present in each patient, we computed Hscore for regulatory elements, NBD and DNA repeats in each LMS tumor sample and made a hierarchical clustering of LMS patients based on these Hscores (Fig. 5, see "Materials and methods" section). We found that not all LMS patients had BP distributed as hotspots and that there was a gradient of hotspotness. Interestingly, the level of hotspotness does not seem to correlate with total BP counts ( Fig. 5) showing that Hscore is not a measure of tumor mutational burden, but rather a measure of the propensity of tumor DNA BP to be concentrated in given DNA elements. Furthermore, each LMS sample had its specific profile: while some patients had no detectable hotspots, others had BP hotspots mainly in the regulatory regions, and others still had them mainly in NBD and DNA repeats. There were even some patients who had them in both (Fig. 5). Interestingly, metastatic and non-metastatic patients were not evenly distributed over the gradient of hotspotness, the hotspot side of the heatmap having fewer metastatic events than the opposite side (Fig. 5). We therefore address the question of the relation between BP hotspotness and patient prognosis with different approaches in the next sections.
The LMS cohort can be stratified into clinically relevant groups. Because (a) regulatory elements, NBD and DNA repeats have been thoroughly documented to impede transcription and replication in both a transcription-associated 18 and replication-associated 3 manner and to cause GI, and (b) GI is predictive of poor prognosis 45 , we hypothesized that there is a link between transcription-and replication-associated DNAbreakage mechanisms and metastatic clinical outcome in LMS. To test this hypothesis, we sought to quantify the overall transcription-associated and replication-associated GI. We used our BP hotspotness magnitude scale and derived genomic indexes for both transcription-dependent and transcription-independent DNA-breakage and genome instability. www.nature.com/scientificreports/ inside or outside genes, we considered these elements as transcription-independent and thus as replicationassociated chromosomal instability elements (RACINe). Conversely, we considered RLFS, GQ, CpGi, CRM, SCS-S and DHS as transcription-associated chromosomal instability elements (TRACe), because they are more frequently broken than chance inside genes and not outside them. By consolidating TRACe and RACINe into one functional group and computing Hscores (see "Materials and methods" section), we derived a TRAC index (iTRAC) and a RACIN index (iRACIN), respectively. iTRAC and iRACIN allow the quantification of the overall contribution of RACINe and TRACe to DNA breakage and therefore to GI in each LMS patient. To address the relationship between iTRAC, iRACIN and metastatic clinical outcome, we developed a method called Iterative multi-threshold PARTioning (iPART) (see "Materials and methods" section). First, we used iPART to establish a threshold for iTRAC and iRACIN that splits the LMS cohort into two groups with a maximum difference in MFS (Metastasis-Free Survival). Figure 6A,B left panels show that iTRAC has several thresholds giving P-values less than 0.05 (red horizontal dashed line), while iRACIN has none of www.nature.com/scientificreports/ them. We also noted the presence of several local minima suggesting that there could be more than one threshold to split the LMS cohort into groups of different MFS. To test this idea, we applied iPART using all combinations of two thresholds from all local minima below 0.10 for iTRAC and 0.3 for iRACIN (blue horizontal dashed line) (materials and methods). Accordingly, both iTRAC and iRACIN significantly stratified the LMS cohort (P = 3.08 × 10 -5 and P = 4.13 × 10 -5 , respectively) into three groups with distinct outcomes (Fig. 6A,B right panels). Strikingly, it was the groups of LMS with medium iTRAC or medium iRACIN which had the most unfavorable outcome (Fig. 6A,B right panels). We also applied iPART on TBPc, nBPSVintra, nBPSVinter. Although TBPc, nBPSVintra and nBPSVinter were slightly significant (Supplemental Fig. S2), they were not comparable to the level of significance we obtained by iTRAC and iRACIN. Thus, iTRAC and iRACIN have far more added value in the stratification of metastatic risk in LMS than mere BP counts, further stressing the relevance our approach.

Mixed transcription and replication-associated genomic instability classifier (MAGIC).
To translate these results into a clinically relevant stratification tool, we sought to integrate both iTRAC and iRA-CIN into one classifier called the Mixed transcription-and replication-associated genomic instability classifier (MAGIC). Given that both indexes have high and comparable statistical significance in stratifying LMS, we considered at high risk (MAGIC High-risk) any patient classified as medium level by either iTRAC or iRACIN, and the rest of patients as low risk (MAGIC Low-risk). MAGIC achieved a very high level of significance in stratifying LMS samples (P = 8.75 × 10 -8 ; MAGIC outperforms histologic FNCLCC and molecular CINSARC gradings. The histological FNCLCC grading system predicting patient evolution is the current standard in sarcomas 34,46 . Complexity INdex in SARComa (CINSARC), which is the best molecular signature of sarcomas, is challenging the histological gold   31 . Both approaches individually did not significantly split the LMS cohort into groups with different metastatic evolution (Fig. 7B,C). This finding precludes the introduction of either of these grading systems in multivariate analysis, so we conclude that MAGIC strongly outperforms both the FNCLCC grading system and CINSARC in LMS metastasis risk stratification.
iPART stratifies a Pan-Cancer cohort of twelve cancer types significantly into clinically relevant groups. The intermediary level of GI resulting in poor clinical outcome compared to low and high levels was also reported in a Pan-Cancer study of 12 cancer types (TCGA cohort) 47 . The authors (Andor et al.) used CNV abundance as a measure of GI and found that CNV affecting between 25 and 75% of a tumor's meta-genome was predictive of poor survival. We thus hypothesized that what we observed in LMS might be a general mechanism associated with tumor aggressiveness. To tackle this question in a reasonable time scale with our computational resources, we used the iPART algorithm directly on CNV abundance from the Pan-Cancer study 47 as a proxy for iTRAC/iRACIN. Those authors used the arbitrary and commonly used method of splitting the data into quartiles based on thresholds of 25%, 50%, and 75% to segment the cohort into four groups based on CNV abundance in the tumor meta-genome. Using iPART, we split their cohort into 2, 3, 4, 5, 6, 7 and 8 groups and evaluated the influence of each split on the risk of mortality using the Log-rank test and hazard ratio (Fig. 8, Supplemental Fig. S4). The best data segmentation corresponded to 5 groups split with the log-rank test P = 6.9 × 10 -10 and maximal HR 4.7 (P = 3.42 × 10 -9 ) (Supplemental Fig. S4). On the other hand, the data segmentation applied by Andor et al. had a log-rank test P = 5 × 10 -6 ( Fig. 5b from Ref. 47 ) and a maximum HR no exceeding 1.9 with P < 0.005 (Fig. 5c from Ref. 47    .08], P = 0.0032), no therapeutic benefit was found in the Medium and High risk iTRAC groups (Fig. 9B). In the High risk iTRAC group, there was a tendency for chemotherapy to be beneficial, but there was not enough statistical power (only three events in the chemotherapeutic arm) to test this hypothesis. Further patient inclusion would be needed to test the relevance of the prediction of the response to chemotherapy. Conversely, none of the iRACIN groups was relevant for stratifying chemotherapeutic response, again probably due to the small sample of treated patients (Fig. 9C). Interestingly, the ability of iTRAC to stratifiy chemotherapeutic response in LMS remains significant using overall survival (OS) as clinical endpoint (Supplemental Fig. S5) further stressing the relevance of our approach.

Discussion
This study describes new tools, measures and insights that address the question of the clinical outcome and its relationship with genomic rearrangement. By deciphering the mechanisms of GI, we have produced the iTRAC and iRACIN indexes which account for transcription-and replication-related GI. We also generated the MAGIC classifier and demonstrated its prognostic value for LMS in outperforming both the current gold standard histologic and molecular challenger grading systems. Moreover, our indexes are potentially applicable to the Pan-Cancer cohort, as shown by the significant prognosis we established in twelve different cancer types using iPART and CNV abundance as a proxy for iTRAC/iRACIN.  www.nature.com/scientificreports/ nation (HR)-deficient samples [48][49][50][51][52][53][54] . However, most of these studies were merely based on the per patient counts of BRCA1/2 mutations, genome SNPs, loss of heterozygosity (LOH), or specific structural variations (CNV, Telomeric Allelic Imbalance, etc.). These studies assume that an a priori selection of chosen genomic alteration types can recapitulate the dynamics of GI in a tumor that is complex, heterogeneous and subject to selection pressure from the tumor micro-environment. Here we present a new measure of GI that is based on the quantification of all identified DNA BP in a tumor sample, irrespective of their SV type. We introduce the notion of Hscore, which is a BP hotspotness magnitude scale that measures the propensity of a given functional and/or structural genomic DNA element to harbor BP more than expected by chance. We also applied Hscore on TRACe and RACINe as a holistic approach to measure GI based on measuring the steady-state equilibrium between DNA damage and repair. The method quantifies the residual DNA BP remaining after unsuccessful repair, irrespective of SV type, and quantifies the relative contribution of two main contributors to GI: TRAC and RACIN. Hscore is measured on all tumor BPs without any biological-rational selection bias toward SV types. Values are comparable between different DNA elements in a patient and between patients. The higher the Hscore, the more unlikely it is that the observed BPs are due to random events. Therefore, the likelihood is greater that they are due to the structural and/or functional properties of those DNA elements.

MAGIC: towards a new standard for LMS grading? LMS prognostication is still challenging and man-
datory to evaluate which therapeutic strategy can benefit which patient. Since some patients have a poor prognosis associated with a transcription stress and others with a replication stress, we produced a simple patient-stratification tool called MAGIC. While neither histologic FNCLCC nor molecular CINSARC grading are suitable predictive methods in an LMS cohort, MAGIC achieved a high level of significance, with the high-risk group having a median MFS = 1.8 years, i.e. fivefold lower than the low-risk group: 10.5 years. MAGIC could be used to spot patients with a high risk of metastasis. Prospective validation of this hypothesis is now mandatory.

Prognostic relevance of iTRAC and iRACIN for metastatic risk stratification of other cancers.
Unexpectedly, the risk of metastasis was found to follow a lambda (Λ) shape with the increase in iTRAC and iRACIN: both low and high GI levels correspond to a lower metastatic risk, while a medium level indicates a higher metastatic risk. Interestingly, similar results were previously reported in a Pan-Cancer analysis of 12 cancer types 47 concerning the abundance of copy number variations (CNV). We improved Andor et al. results using iPART and we further highlighted the added value of quantifying transcription-and replication-associated GI and iPART algorithm versus a simple SV count. These results strongly suggest that our approach would work on different cancer types and and different GI measures on other highly remodeled cancer genomes.

Predictive relevance of iTRAC and iRACIN for chemotherapeutic response in LMS. The overall
contribution of curative and adjuvant cytotoxic chemotherapy to 5-year survival in adults was estimated to be 2.3% in Australia and 2.1% in the USA 55 . Furthermore, it has been estimated that any class of cancer drugs is ineffective in 75% of patients (Personalized Medicine Coalition; The personalized medicine report Opportunity, Challenges, and the Future 2017; http:// www. perso naliz edmed icine coali tion. org/ Userfi les/ PMC-Corpo rate/ file/ The-Perso naliz ed-Medic ine-Repor t1. pdf). Thus, predicting which patients are eligible for which treatment and those who are not is the holy grail of precision medicine. Here we show that chemotherapy for patients with a low iTRAC would be detrimental for their MFS and should therefore be prohibited. In addition, chemotherapy is likely to have no clinical benefit for patients with a medium iTRAC so another therapeutic strategy should be used for them. No conclusion can be drawn for patients with a high iTRAC because of the low statistical power of the log-rank test due to the low number of patients with a high iTRAC which underwent chemotherapy. Nevertheless, high-iTRAC patients would probably benefit from chemotherapy and more inclusions are needed to address this question. On the other hand, iRACIN was not predictively relevant for stratifying the chemotherapeutic response. Nevertheless, we expect it to be relevant for stratifying targeted therapies based on targeting replication and replication-associated repair. We do not consider using MAGIC in this setting as it is a combination of groups of patients albeit similar in their metastatic risk occurrence, but who have different GI mechanisms. Therefore, iPART/iTRAC/iRACIN could serve as a toolset to tackle the question of the relation of GI to therapeutic intervention based on the genomic processes undermining genomic integrity. www.nature.com/scientificreports/ Targeted therapy as a therapeutic strategy in LMS. Our understanding of GI may serve both for prognosis and to orient the choice of therapeutic agent. In patients with germline BRCA1 or 2 mutations, PARP inhibitors have led to major therapeutic advances in patients with ovarian cancer 56 and to a lesser extent, breast cancer 57 over the past years. Furthermore, it has been proposed that most LMS tumors display hallmarks of "BRCAness", including alterations in homologous recombination DNA repair genes, enrichment of specific mutational signatures, and cultured LMS cells sensitive to olaparib and cisplatin 58 . Thus, PARP1 inhibitors are a candidate treatment in LMS. The use of iTRAC/iRACIN in clinical settings would allow the selection of LMS patients who potentially would have a significant therapeutic response to PARP1 inhibitors.

Future directions.
Given the far-reaching consequences of GI for treatment success, therapeutic choices and clinical care, an accurate measure of GI and its dynamics is paramount in precision medicine. The development of robust biomarkers enabling GI dynamics to be captured is crucial if we are to leverage the potential of GI for patient stratification purposes and for exploiting this feature for making therapeutic choices. Deriving minimally invasive approaches that enable clinicians to assess whether GI is at play within a given tumor sample might be crucial for its efficient exploitation in clinical settings. Recent advances in whole genome/whole exome sequencing of formalin-fixed, paraffin-embedded (FFPE) tissues allows the application of this approach in routine clinical settings. Finally, prospective findings now need to be validated by measuring iTRAC/iRACIN in circulating tumor DNA (ctDNA) or in circulating tumor cells (CTC) and in subsequent clinical trials.

Materials and methods
Samples. LMS Samples (112) used in this study were collected as part of the ICGC program (International Cancer Genome Consortium; https:// icgc. org/) with patient consent. Samples were frozen tissues provided by pathologists and a blood sample for each included patient provided by medical oncologists. All cases were systematically reviewed by expert pathologists of the French Sarcoma Group according to the World Health Organization recommendations 59 .
DNA extraction. Genomic DNA from frozen samples was isolated using a standard phenol-chloroform extraction protocol 60 . DNA was quantified using a Nanodrop 1000 spectrophotometer according to manufacturer's recommendations (Thermo Scientific, Waltham, MA, USA). Blood material from included patients was also available. Genomic DNA from blood samples was extracted using customized automated purification of DNA from compromised blood samples on the Autopure LS protocol according to the manufacturer's recommendations (Qiagen, Hilden, Germany), with increased centrifugation of 10 min for DNA precipitation and DNA wash.
Whole genome sequencing and analysis. To construct short-insert paired-end libraries, a no-PCR protocol was used with the TruSeq™DNA Sample Preparation Kit v2 (Illumina Inc., San Diego, CA, USA) and the KAPA Library Preparation kit (Kapa Biosystems, Basel, Switzerland). Briefly, 2 µg of genomic DNA were sheared on a Covaris™ E220, size-selected and concentrated using AMPure XP beads (Agencourt, Beckman Coulter, Brea, CA, USA) in order to reach a fragment size of 220-480 bp. Fragmented DNA was end-repaired, adenylated and ligated to Illumina-specific indexed paired-end adapters. DNA sequencing was performed in paired-end mode in lanes of HiSeq2000 Flowcell v3 (2 × 100 bp) or Flowcell v4 (2 × 125 bp) or v4 (2 × 125 bp) or in sequencing lanes of NovaSeq 6000 Flowcell S4 (2 × 150 bp) (Illumina Inc., San Diego, CA, USA) to analyze tumor or matched normal blood samples (from the same patient) and to reach a minimal yield of 145 or 85 Gb, respectively. Two tumor samples (LMS2T and LMS5T) were sequenced in 20 lanes of HiSeq2000 Flowcell v3 to reach a minimal yield of 560 Gb. Image analysis, base calling and quality scoring of the run were processed using the manufacturer's software Real Time Analysis (RTA 1.13.48) and followed by generation of FASTQ sequence files by CASAVA (Illumina Inc., San Diego, CA, USA).
DNA reads were trimmed of the 5ʹ and 3ʹ low-quality bases (PHRED cut-off 20, maximum trimmed size: 30 nucleotides (nt)) and sequencing adapters were removed with Sickle2 (Joshi NA, Fass JN. 2011 available at https:// github. com/ najos hi/ sickle). Sickle: A sliding-window, adaptive, quality-based trimming tool for FastQ files (Version 1. ; where x = ni, n = Li, p = Pu. The probability of observing more than ni BP under RBM is defined as P(X > ni) = 1 − P(X ≤ ni) = 1 -n i=0 f (i) ; where n = ni, and X is the random variable accounting for the number of observed BP. For the bed file of each DNA element, overlapping intervals were merged (bedtools) and then BP number (ni) and interval size (Li)  www.nature.com/scientificreports/ computed as follows: ni was computed as the sum of BP in all the merged intervals and Li was computed as the sum of all merged interval sizes. Hscore is then computed as the − log10 of P(X > ni) under RBM. We define readable genome size as the total ungapped genome length as defined at https:// www. ncbi. nlm. nih. gov/ assem bly/ GCF_ 00000 1405. 26/. It is equal to 2,948,611,470 bp.

Breakpoint identification and detection of structural variants. Structural variants (SV) were
detected from paired tumor/normal whole genome high-quality sequencing data. Paired-end reads were aligned using Bowtie v2.2.1.0 64 , which is a sensitive local option allowing soft-clipped sequences. The algorithm has three main steps: (i) identification of potential breakpoints, (ii) characterization of the second side of the breakpoints, and (iii) selection of high-confidence breakpoints. All parameters were set to analyze 60× tumor and 30× normal sequencing depth. Very conservative filters were used to minimize false positive detection.
(i) Identification at this step, reads with at least one soft-clipped end were analyzed as singletons. A position was considered as a potential breakpoint if it was covered by at least 4 soft-clipped reads, 5 soft-clipped bases (with at least two occurrences of two different bases), and if they represented more than 5% of the total amount of reads at this position in the tumor sample. We selected potential somatic events by discarding positions covered by at least one read and one base in a surrounding 5-nucleotide window in the normal sample. We refer to them as the "first side" of the breakpoint. (ii) Characterization to determine the genomic positions of the soft-clipped sequence from selected reads, we used the UCSC blat server 65 . If no match was returned, the reverse complement sequence was pulled to test. If there was still no match, the BAM file was investigated for some soft-clip somatic position around the discordant or oversized-insert read mate (hereafter named abnormal) location from the first side of the breakpoint. Because of the small size of the soft-clipped sequence, multiple matches can be found. We used soft-clipped abnormal read mates to select matches with the most coherent chromosomic locations. We refer to them as the "second side" of the breakpoint. (iii) Selection Positions detected from both the first and second sides (in a 5-nucleotide window) were defined as the common pool. We considered as artifacts (due to repeat regions for instance) any couples of positions covered with reads and associated soft-clipped sequences separated by fewer than 15 nucleotides and discarded them. We classified the breakpoints in three groups: high-confidence breakpoints, breakpoints needing investigation, and unique position breakpoints. If a breakpoint was covered by reads and associated soft-clipped sequences having both positions belonging to the common pool, it was classified in the first group. If a breakpoint was covered by reads and associated soft-clipped sequences having only one of the positions belonging to the common pool, it was classified in the second group. Then the missing position was searched among the filtered positions. If it was present in the normal sample, the position was discarded and the breakpoint was completed otherwise. Finally, the third group corresponds to breakpoints with both sides outside the common pool and considered as unique: these were discarded. The sides of breakpoints were sorted according to their chromosomic positions to avoid duplicates. Data for CpG islands, microsatellites, simple repeats, low complexity, retrotransposons, long terminal repeats, self-chains and sequencing gaps were obtained from the UCSC Genome Browser website (http:// genome. ucsc. edu/; genome assembly hg38). All Non-B DNA except RLFS were generated using the non-B DNA research tool from the non-B DNA database 66 . RLFS data were generated using QmRLFS-finder 67 . CRM data were obtained from Remap2018 44 and data were downloaded from (http:// pedag ogix-tagc. univ-mrs. fr/ remap/). DNase I-accessible regulatory regions (with − log10(p) ≥ 2) were downloaded from the roadmap epigenomics project at https:// perso nal. broad insti tute. org/ meule man/ reg2m ap/ Honey Badge r2_ relea se/ and coordinates were converted from genome assembly hg19 to hg38 using Liftover, the UCSC coordinates conversion tool 68 . TRAC index/RACIN index. The DNA elements were sorted in separate indexes depending on whether their enrichment in BP was dependent or not on their presence inside or outside the genes (see "Ingene/outgene split" section). DNA elements enriched in BP independently of their position inside or outside the genes were sorted as Replication-Associated Chromosomal INstability elements (RACINe). DNA elements enriched in BP according to their position inside the genes were sorted as TRanscription-Associated Chromosomal instability elements (TRACe). For each TRACe and RACINe element, we pooled all bed files of the corresponding DNA elements into one file and sorted them according to interval positions (bedtools) and merged (bedtools) all overlapping intervals to obtain the corresponding index iTRAC and iRACIN, each as a single bed file. BP counts and interval sizes were computed for each index. These were then used to compute Hscores under RBM.
Heatmap. Hscore was computed with Holm's adjusted P-values procedure. Two patients (LMS78T and LMS131T) had maximal Hscores of 170.53 and 50.16 while all the other patients had a maximal Hscore less than 39. Therefore, the heatmap was unexploitable as only these two maximal data points were visible. For the sake of clarity, Hscores of patients LMS78T and LMS131T were normalized by dividing them by 170.53 and multiplying them by 39.
Ingene/outgene split. A DNA element was considered as inside a gene if overlapped by at least 1 bp the gene interval delimited by its Transcription Start Site (TSS) and Transcription End Site (TES). Gene coordinates were taken from curated RefSeq entries from the UCSC table browser page (https:// genome. ucsc. edu/ cgi-bin/ hgTab les; group = genes and genes prediction; track = NCBI RefSeq; table = RefSeq Curated). Only genes that had expression data in these tumors were considered (for list of genes, see Supplemental Table S4).

SCS-S/SCS-G.
Self-chains (SC) were prepared as in Ref. 69 except that we split SC segments (SCS) into those are self-aligned (SCS-S) and those are gapped (SCS-G) that is having spacing intervals separting each pair of SC. SCS are defined as the segment of any paired SCs in the same chromosome and their spacing gap. The paired SCs located in different chromosomes and those in the same chromosome but having long spacing intervals (SCS size 30 kb) were filtered out to account only for local interactions. In addition, any SCS-S/SCS-G overlapping with the human genome gaps and segmental duplications was further filtered out.
Statistical analysis. All statistical tests and the heatmap were carried out using R (R Core Team (2020); R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria;http:// www.r-proje ct. org/ index. html).
Leave-one-out cross-validation. Each time one patient was removed from the cohort of patients, thresholds of iTRAC and iRACIN was computed and used to classify the removed patient into Low, Medium, or High.
iPART . We consider iPART (Iterative multi-thresholds PARTitioning) to be an unsupervised decision tree (UDT). It is a method that combines the properties and objectives of both unsupervised clustering and decision trees (DT). Hence, iPART looks for thresholds that maximize the differences in groups instead of computing pairwise distances and constructing hierarchical clusters. It resembles DT and regression trees (RT) by using thresholds to split groups. It differs from DT in that it is unsupervised. It also differs from RT in that it does not attempt to predict quantitative variables. The fundamental difference with both RT and DT is its ability to use binary (splitting data into two groups) and ternary (splitting data into three groups) modes, i.e. a crucial feature in our method. It also differs from DT and RT in using the Kaplan-Meier (KM) estimate instead of the GINI purity index or information gain index and sum of squared residuals for DT and RT, respectively. It also resembles unsupervised machine learning like hierarchical clustering and k-means by aiming to find natural patterns/groups in data. On the other hand, it differs from them in that it does not compute pairwise distances or try to construct groups by minimizing their intra-group variance. Instead, it iterates all possible thresholds, find thresholds that maximize the difference between the split groups in terms of the speed at which metastatic events occur in the two groups by minimizing the P-value of the KM test. Briefly, it establishes natural frontiers that maximize the differences between groups instead of constructing groups that minimize intra-group and overall variance.
Ethics declaration. The ethics of this study was validated by French committee for protection of persons (CPP).

Data availability
The code to repreoduce the results presented in this paper is available at CODE OCEAN: https:// codeo cean. com/ capsu le/ 97034 39/ tree.