Etiology of oncogenic fusions in 5,190 childhood cancers and its clinical and therapeutic implication

Oncogenic fusions formed through chromosomal rearrangements are hallmarks of childhood cancer that define cancer subtype, predict outcome, persist through treatment, and can be ideal therapeutic targets. However, mechanistic understanding of the etiology of oncogenic fusions remains elusive. Here we report a comprehensive detection of 272 oncogenic fusion gene pairs by using tumor transcriptome sequencing data from 5190 childhood cancer patients. We identify diverse factors, including translation frame, protein domain, splicing, and gene length, that shape the formation of oncogenic fusions. Our mathematical modeling reveals a strong link between differential selection pressure and clinical outcome in CBFB-MYH11. We discover 4 oncogenic fusions, including RUNX1-RUNX1T1, TCF3-PBX1, CBFA2T3-GLIS2, and KMT2A-AFDN, with promoter-hijacking-like features that may offer alternative strategies for therapeutic targeting. We uncover extensive alternative splicing in oncogenic fusions including KMT2A-MLLT3, KMT2A-MLLT10, C11orf95-RELA, NUP98-NSD1, KMT2A-AFDN and ETV6-RUNX1. We discover neo splice sites in 18 oncogenic fusion gene pairs and demonstrate that such splice sites confer therapeutic vulnerability for etiology-based genome editing. Our study reveals general principles on the etiology of oncogenic fusions in childhood cancer and suggests profound clinical implications including etiology-based risk stratification and genome-editing-based therapeutics.

CRISPR targeting in HAL-01. Shown are NGS results for guides g1 (non-template insertion in cryptic exon; panel a), g2 (neo splice donor; panel d), g3 (neo splice acceptor; panel e), g4 (upstream negative control; panel b), and g5 (downstream negative control; panel c). For panel a, the rate of on-target editing (On-Target) and the rates of lethal and non-lethal on-target editing are shown as a heatmap for three replicates from day 3 to day 19. For panels b and c, each indel was tagged as frameshift or in-frame according to its length, rather than as lethal or non-lethal, because the target region is genuinely intronic (and thus serves as a negative control). For panels d and e, induced indels that happen to fall into the coding region and lead to a frameshift of TCF3-HLF are categorized into the "Coding" group, and indels that directly disrupt the splice site are categorized into the "Loss" group. An induced indel may also leave a residual GT/AG site. We calculated the binding affinity of such residual splice sites using a position-specific weight matrix (PWM) approach (see Methods) and grouped indels into bins according to their binding affinity scores (e.g., <2, 3-4, etc.). In e, most of the induced indels targeting the neo acceptor fall into the coding region, so that only the "Loss" category shows a sharp decrease of NGS read abundance, from ~15% at day 3 to ~0% at day 19 post editing. On the other hand, ~15% of edits at day 3 resulted in a splice acceptor binding affinity in the 5-6 bin; these edits conferred better fitness on the host cells, so that their NGS read abundance increased to >50% at day 19, a >3-fold increase. Replicate 2 of day 13 did not generate enough NGS reads for analysis and is indicated in black. Due to rounding, row sums can be 100 or 99.
Source data are provided in sheet Supplementary Fig.8 in Source Data file.

In panel c, the putative effect of indel length combinations is analyzed for isoforms α, β, and δ, respectively, according to frame status (I=in-frame; O=out-of-frame). The final impact on host cells is indicated by lethality (Y=Yes, N=No). Indel length is presented modulo 3 (the remainder of division by 3). Source data are provided in sheet Supplementary Fig.9 in Source Data file.
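The length-based frame tagging used for the negative-control panels, and the modulo-3 presentation of indel length, can be illustrated with a minimal sketch. The function names `indel_mod3` and `frame_tag` are hypothetical, and signing insertions as positive and deletions as negative is an assumption made here for illustration only.

```python
def indel_mod3(indel_length: int) -> int:
    """Return the indel length modulo 3 (remainder of division by 3).

    Assumption: insertions are given as positive lengths and deletions
    as negative lengths; only the magnitude matters for the frame.
    """
    return abs(indel_length) % 3


def frame_tag(indel_length: int) -> str:
    """Tag an indel as 'in-frame' or 'frameshift' from its length alone."""
    # A length that is a multiple of 3 preserves the reading frame.
    return "in-frame" if indel_mod3(indel_length) == 0 else "frameshift"


if __name__ == "__main__":
    for length in (1, -2, 3, -6, 7):
        print(length, indel_mod3(length), frame_tag(length))
```

This tag depends only on indel length, which is why it suits the intronic negative controls, where no lethal or non-lethal outcome is expected.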

Clinically recognized oncogenic fusions that generate chimeric proteins
Since the discovery of the BCR-ABL1 fusion oncoprotein in Philadelphia-chromosome leukemia 1 , classical cytogenetic methods have revealed many additional cancer subtypes, such as t(1;19)(q23;p13) 2 that was later determined to generate TCF3-PBX1 3 , t(17;19) 4 that was later determined to generate TCF3-HLF 5 , and t(11;22)(q24;q12) that was determined to generate EWSR1-FLI1 6 . With the advent of next-generation sequencing technologies, oncogenic fusions have increasingly been discovered in the past decade by simultaneously interrogating the genome (DNA sequencing) and the transcriptome (RNA sequencing) of tumors from patient cohorts of similar diagnosis, where DNA and RNA data cross-validate each other. These include C11orf95-RELA (C11orf95 was recently renamed ZFTA) in pediatric ependymomas (EPD) 7 and KIAA1549-BRAF in pediatric low-grade glioma (LGG) 8 . Supporting evidence from both DNA (on the rearrangement) and RNA (on the generation of the chimeric protein) is a critical feature of these findings.

Clinically recognized oncogenic fusions that lead to aberrant expression of proto-oncogenes
In addition to the above "conventional" oncogenic fusions where a chimeric fusion oncoprotein is generated, there is another category of "promoter-hijacking" fusions. In this category, a constitutively active promoter or enhancer region is brought (via chromosomal rearrangement) to a proto-oncogene that is typically silenced in the corresponding lineage of the cancer cells. Such a rearrangement leads to aberrant expression of the proto-oncogene. Prominent examples of this category include CRLF2/DUX4/EPOR aberrant expression via rearrangement to the immunoglobulin heavy chain (IGH) region in B-ALL 9,10 , TAL1/TAL2 aberrant expression via rearrangement to the T-cell receptor (TCR) region in T-ALL 11 , GFI1 aberrant expression via intra-chromosomal rearrangements to active enhancers in medulloblastoma 12 , CRLF2 aberrant expression via intra-chromosomal rearrangement to the P2RY8 promoter 13 , as well as our newly discovered BCL11B aberrant expression in lineage-ambiguous leukemia 14 . Because no chimeric proteins are generated, this fusion category is typically termed "promoter/enhancer-hijacking". Interestingly, other mutational mechanisms can also lead to such aberrant expression of proto-oncogenes. For example, the seminal work by Thomas Look and colleagues 15 has demonstrated that small insertions/deletions in enhancer regions of the proto-oncogene TAL1 can be sufficient to lead to its aberrant expression in pediatric T-ALL. Although the corresponding tumors do not have chromosomal rearrangements (or fusion events) involving TAL1, we still consider these tumors as belonging to the TAL1 category.

Functional evidence of clinically recognized oncogenic fusions
Although experimentally challenging, putative oncogenic fusions such as ZFTA-RELA (also known as C11orf95-RELA) have recently been shown to be sufficient to drive pediatric ependymoma 16 . On the other hand, the success of imatinib against BCR-ABL1 17 , and the genetic knockout of oncogenic fusions such as TCF3-HLF in this work, have demonstrated that these oncogenic fusions, or more precisely the fusion oncoproteins they encode, play an essential role in the survival of host cancer cells. This forms the basis of the "oncogene addiction" hypothesis, which posits the therapeutic value of targeting these oncogenic fusions.

Clinically recognized oncogenic fusions as initiating drivers that are invariant to clonal evolution
Comparison of tumors collected at initial diagnosis and at relapse for pediatric leukemia 13,18 has reinforced the notion that subtype-defining oncogenic fusions are cancer initiating events 19 . In these studies, the oncogenic fusions are always conserved between diagnosis and relapse tumors, although other subclonal mutations (e.g., CDKN2A loss and NT5C2 gain-of-function mutations) can be either eradicated or de novo acquired from diagnosis to relapse 13 . The clonal nature (i.e., being present in all cancer cells) of oncogenic fusions thus renders them ideal therapeutic targets.

Oncogene versus tumor suppressor gene (TSG)
In addition to the many oncogenic fusions mentioned above, extensive genome sequencing efforts in the past decade have led to the discovery of many additional significantly mutated genes, also known as cancer drivers, in both adult 20,21 and childhood cancers 9,22 . In observation of these many cancer driver genes, Bert Vogelstein and colleagues 23 have pioneered the concept of classifying cancer driver genes into "tumor suppressor genes (TSG)" and "oncogenes", where a "TSG" is a gene that, when inactivated by mutation, increases the selective growth advantage of the cell in which it resides, while an "oncogene" is a gene that, when activated by mutation, increases the selective growth advantage of the cell in which it resides. Under this broad concept, the oncogenic fusions mentioned above belong to the category of "oncogene" because the corresponding fusion oncoproteins are hyperactive. On the other hand, the well-known TSGs including CDKN2A and RB1 9,22 typically demonstrate inactivating (also known as loss-of-function) mutations, including gain of stop codon, protein-frame shifting, splice site altering, whole gene loss due to large deletion, or partial gene truncation due to focal deletion. A model of the functional consequences of diverse mutation types on TSGs and oncogenes is illustrated in Supplementary Fig. 12-17, with data in the figures adapted (Oct 15, 2022) from https://pecan.stjude.cloud/ 24 .

Supplementary Fig. 12 Diverse mutation types to disrupt a tumor suppressor gene (TSG; a) and less diverse mutation types for hyperactivation (b). In a TSG, a mutation (#1) that disrupts the promoter or enhancer and leads to expression loss, a mutation (#2) that disrupts the translation start codon ATG, a mutation (#3) that disrupts the gene structure via an intronic breakpoint, a mutation (#4) that disrupts the splice sites, and a mutation (#5) that disrupts a protein codon can all lead to loss of function (total gene deletion not illustrated). On the other hand, there are limited ways to make an oncogene hyperactive (panel b), which include a stronger promoter/enhancer via mutation #1, an activating amino acid change via mutation #3, or formation of a chimeric protein via rearrangement mutation #2.

Supplementary Fig. 16 Example tumor suppressor gene RB1. The RB1 gene has diverse mutation types in pediatric cancers, including stop-gain mutations such as W78* in 8 specimens and R320* in 5 specimens. We also observed frameshifting mutations such as A74fs, L64fs, L317fs, and D856fs. Moreover, the half-white half-black circles indicate enrichment of structural rearrangements, such as to gene RCBTB2 in 5 patients and to another region in chr13 in a further 9 patients. Further, focal deletions that removed the last several exons of RB1 were detected in pediatric T-ALL (6%) and B-ALL (2%).

Supplementary Fig. 17 Example tumor suppressor gene CDKN2A. Unlike the TSG RB1, CDKN2A is enriched with copy number loss in pediatric T-ALL (54%) and B-ALL (11%). Although in some tumors the deletion can be so focal that only a few exons are affected (in which case it is possible to detect a truncating "fusion" from RNAseq data), in many tumors the deletion can be as large as arm level, so that no truncating "fusion" transcripts are expected in RNAseq data.

Clinically recognized fusion-negative samples
Although oncogenic fusions have been routinely used for clinical subtyping, not all human cancers are fusion positive. For example, in pediatric B-ALL it has long been known that fusion-negative subtypes exist, including hyperdiploid (those with >50 chromosomes) and hypodiploid (those with <45 chromosomes) B-ALL 25 . In pediatric neuroblastoma, extensive study of whole genome, exome, and transcriptome sequencing data has not identified clinically meaningful oncogenic fusions for most samples, apart from the well-known high MYCN amplification in ~20% of patients 9,26,27 . In malignant rhabdoid tumors, SMARCB1 homozygous loss is the only hallmark of nearly all patient tumors 28 . Similarly, RB1 homozygous loss is the only hallmark of nearly all retinoblastoma tumors 29 . Clearly, candidate fusions detected in tumors of fusion-negative subtypes such as hyperdiploid B-ALL, neuroblastoma, rhabdoid tumors, or retinoblastoma are more likely to be passenger events, if not artefacts, and scrutiny is warranted before accepting them as true oncogenic fusions, as will be discussed in section SN 11g on the mutual exclusivity pattern among oncogenic fusions. These data highlight the critical need for knowledge of well-defined tumor subtypes to ensure scientific rigor in reporting novel oncogenic fusions. In fact, clinically relevant fusion-negative subtypes continue to be discovered, such as the novel subtype of UBTF-ITD in pediatric AML among the known clinical fusion-negative subtypes of NPM1 and CEBPA 30 .

Remarks on clinically recognized oncogenic fusions
The above data highlight a few characteristics of clinically recognized oncogenic fusions such as BCR-ABL1: 1) to date all of these fusions are in-frame and activating (i.e., TSGs do not belong to the category of oncogenic fusions); 2) promoter/enhancer-hijacking can be regarded as a different category of oncogenic fusion because it does not generate chimeric proteins; 3) these fusions are subtype-defining, so that typically we see no more than one fusion per tumor, also known as the mutual exclusivity rule 30 that will be discussed in section SN 11g; 4) despite extensive clonal evolution during the life span of a tumor, subtype-defining oncogenic fusions typically remain intact; 5) like ZFTA-RELA (also known as C11orf95-RELA) and TCF3-HLF, these fusions are expected to be functionally sufficient and necessary to the host cancer cells; 6) not all human cancers are expected to have oncogenic fusions. Interestingly, to date all clinically recognized oncogenic fusions in pediatric cancers have supporting evidence from both DNA and RNA sequencing data whenever both data types are available, highlighting a critical bioinformatic pattern for the technical evaluation of candidate oncogenic fusions.

Study design of this work
The above molecular mechanistic insights led us to the following strategy for the design of this study.

8.a) Tumor suppressor genes.
Because diverse mutation types (including substitutions (SNVs), small insertions/deletions (Indels), copy number loss (CNVs), and structural alterations (SVs)) can all lead to loss of function, we always rely on DNA sequencing (especially whole genome sequencing) to definitively ascertain the mutation status of TSGs. Although truncating mutations can occasionally be detected in RNAseq, we deem that a whole-genome sequencing cohort would better serve the goal of studying etiology comprehensively and without bias. In fact, we are currently drafting a manuscript on the signatures of rearrangements (SVs) using whole genome sequencing (WGS) in >1,500 pediatric cancer patients. With this consideration, we decided NOT to include tumor suppressor genes in this study, which is designed to focus on oncogenic fusions like BCR-ABL1. However, in response to Reviewer #2's request, we analyzed the highly frequent CDKN2A and NBAS truncating fusions in section SN 12.

8.b) Oncogenic fusions in the category of promoter/enhancer hijacking.
In this category, a rearrangement can bring a strong promoter/enhancer to an otherwise silenced proto-oncogene and lead to its aberrant expression. When the novel promoter is far away, the proto-oncogene may start its transcription from its own transcription start site, thereby leaving no split reads or discordant read pairs in RNAseq data for bioinformatic detection (Supplementary Fig. 18a). On the other hand, the transcripts may contain part of the novel promoter sequence when the novel promoter is closer (Supplementary Fig. 18b). Moreover, it is also possible that a point mutation in the native promoter converts it to a strong active promoter (such as TAL1 in pediatric T-ALL 15 ) and leads to aberrant expression (Supplementary Fig. 18c); biologically, this scenario is not promoter/enhancer hijacking per se. Clearly, without DNA (preferentially whole genome) sequencing data, scenarios a) and c) cannot be resolved by transcriptome sequencing, and a forced analysis would result in biased conclusions that do not meet our standard of scientific rigor. Instead, such patterns are best studied in our ongoing project on the signatures of rearrangements (SVs) using whole genome sequencing in >1,500 pediatric cancer patients. Nevertheless, to address Reviewer #2's request, we provide results on known oncogenic fusions in the promoter/enhancer-hijacking category (CRLF2, DUX4, EPOR, BCL11B; Supplementary Data 26-30), although we did not perform systematic discovery.

Supplementary Fig. 18 Promoter/enhancer-hijacking. In hypothetical scenario (a), the blue chromosome (with a strong enhancer/promoter highlighted by the blue oval) was brought into proximity to the orange proto-oncogene, leading to its aberrant expression. Transcription only involves the proto-oncogene because of the space between the blue promoter and the orange gene. In (b), the blue promoter contacts the transcription start site of the orange gene, so that transcription involves both the proto-oncogene and a small part of the blue chromosome. In (c), a point mutation in the promoter region may convert it to a strong active promoter that initiates expression of the proto-oncogene without a fusion event (such as the TAL1 enhancer mutation in Mansour et al (2014) 15 ). Split reads or discordant read pairs are expected for scenario (b) but not for scenario (a) or (c). DNA (preferentially whole genome) sequencing is needed to ascertain the fusion status. Nevertheless, the aberrantly high expression of such a proto-oncogene can typically help ascertain the tumor subtype.

As shown in Fig. 1a, oncogenic fusions that generate chimeric proteins are obligated to have split read or discordant read pair signals in RNAseq data, with either the poly(T) protocol or the total RNA protocol. It is for this exact category that our large cohort of 5,190 RNAseq datasets can be used to generate scientifically rigorous discoveries.

Predicting DNA breakpoints from RNAseq data
In this work, in addition to RNA junctions, we attempted to detect DNA breakpoints (here termed d-events) from RNAseq data to interrogate the uniformity of DNA breakpoints in relative intronic regions. As shown in the model of Supplementary Fig. 19, although RNA splicing breakpoints (here termed m-events; also known as "fusions" or splice junctions) are guaranteed to be observed in mRNA species that have undergone splicing, d-events are theoretically only observed in total RNA sequencing but NOT in poly(T)-based mRNA sequencing. As a sanity check, we compared our d-event detections in RNAseq data against the ground-truth d-events defined in DNA (whole genome) sequencing datasets and demonstrated that 91% of our detections are within 5 bp of the ground truth (Supplementary Fig. 3a). The accuracy revealed by this sanity check enables us to reach the reliable conclusion that DNA breakpoints are uniformly distributed in relative introns of oncogenic fusions.

Supplementary Fig. 19 Detecting DNA and RNA breakpoints from next generation sequencing data. (a) During gene expression, genetic information encoded in wildtype human chromosomes (gray; thin lines indicate intronic/intergenic regions, thick boxes indicate exons) is first transcribed into pre-spliced transcripts (pre-mRNA; black), which are in turn spliced (to remove introns and retain exons) to generate mature RNA species (mRNA; green). (b) This Central Dogma also applies to the cancer genome (here blue and orange indicate two different chromosomal regions joined together by rearrangement). In this model, an intronic rearrangement (here termed "d-event" to stress that it is observed from DNA) happened between the two involved genes, shown in blue and orange, in the cancer genome. This d-event is observable in pre-mRNA but typically is not observed in mRNA because the intronic regions are spliced out. On the other hand, in mRNA, the rearrangement is manifested as splicing junctions (here termed "m-events" to indicate that they are observed in mRNA; commonly referred to as "fusion" events), which typically are not directly observed in the DNA of the cancer genome, although biological inference is possible (with alternative splicing being the confounding factor to consider). (c) In next generation sequencing, we can perform DNA sequencing (such as whole genome sequencing (WGS) or targeted capture) to detect d-events and RNA sequencing to detect m-events. Earlier RNA sequencing practices typically utilized the poly(T) protocol, which can only interrogate mRNA species and therefore can NOT be used to detect d-events. Recent RNA sequencing practices typically utilize the total-RNA protocol, which can simultaneously interrogate mRNA and pre-mRNA species and enables simultaneous detection of m-events and d-events (detection of the latter requires sufficient pre-mRNA in the total RNA and is therefore not guaranteed).
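The sanity check described in this section, which compares d-events predicted from RNAseq against WGS-defined d-events, amounts to computing the fraction of predictions that fall within a tolerance of a ground-truth breakpoint. The following is a minimal sketch under stated assumptions: the function name `fraction_within`, the `(chromosome, position)` tuple representation, and the inclusive 5 bp tolerance are all illustrative choices rather than the authors' implementation.

```python
def fraction_within(predicted, truth, tol=5):
    """Fraction of predicted breakpoints within `tol` bp of any truth breakpoint.

    predicted, truth: lists of (chromosome, position) tuples (hypothetical format).
    The tolerance is treated as inclusive, which is an assumption.
    """
    # Index ground-truth positions by chromosome for direct lookup.
    truth_by_chrom = {}
    for chrom, pos in truth:
        truth_by_chrom.setdefault(chrom, []).append(pos)

    hits = 0
    for chrom, pos in predicted:
        if any(abs(pos - t) <= tol for t in truth_by_chrom.get(chrom, [])):
            hits += 1
    # An empty prediction list is reported as 0.0 to avoid dividing by zero.
    return hits / len(predicted) if predicted else 0.0


if __name__ == "__main__":
    rna_calls = [("chr1", 100), ("chr1", 210), ("chr2", 50)]
    wgs_truth = [("chr1", 103), ("chr1", 200), ("chr2", 50)]
    print(fraction_within(rna_calls, wgs_truth))
```

In this toy input, two of the three RNA-derived calls lie within 5 bp of a WGS breakpoint on the same chromosome.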

Mutual exclusivity of subtype-defining genetic alterations such as oncogenic fusions
It has long been recognized that pediatric cancers can be subtyped using genetic alterations. In childhood B-ALL, clinically well-defined subtypes 25 include hyperdiploid and hypodiploid, caused by chromosomal gains and losses, respectively, and oncogenic fusions/translocations including BCR-ABL1, KMT2A rearrangements, ETV6-RUNX1, etc. These data indicate two critical facts: 1) a childhood tumor may harbor no oncogenic fusions, because genetic alterations such as hyperdiploidy and hypodiploidy can also define subtypes. This is also true in other childhood cancers; for example, in neuroblastoma, high MYCN amplification defines a genetic category. 2) A functional oncogenic fusion is sufficient to drive a distinct transcriptional program, so that the corresponding subtypes are well separated from other subtypes, as clearly shown in childhood B-ALL 31 and AML 30 . These two facts further imply mutual exclusivity among oncogenic fusions: typically, no childhood tumor can harbor ≥2 functional oncogenic fusions. To illustrate mutual exclusivity, we first analyzed 63 clinically well-defined oncogenic fusions (termed "training set" hereafter) including 1) leukemias:  (Supplementary Data 29). These data clearly demonstrate the exclusivity pattern among subtype-defining oncogenic fusions.

Integration of predictions from 4 fusion calling methods
The four selected fusion calling methods (Arriba, Cicero, FusionCatcher, and STAR-Fusion) together produced 5,781,630 predictions (230,565 for STAR-Fusion, 252,718 for Arriba, 1,632,086 for Cicero and 3,666,261 for FusionCatcher), which are too many to review manually. We therefore developed a majority voting strategy to enable efficient and reproducible analysis.

11.a) Ambiguity in chromosomal coordinate difference between calling methods
We first studied potential ambiguity in the chromosomal coordinates between calling methods. For this, we defined two fusion candidates (of the same sample) to be the same event if their breakpoints are within K base pairs. We varied K from 100 to 50, 20, 10, 5, 2, and 1. As can be seen from Supplementary Fig. 20a, the number of shared fusion predictions remained robust until K=5 for all pairwise comparisons. Therefore, we chose K=10 to determine whether two fusion candidates are the same event. In this way, we classified all >5.7 million predicted candidate fusions from these methods into 1) 6,431 with 4 votes; 2) 9,917 with 3 votes; 3) 65,013 with 2 votes; and 4) 5,443,485 with 1 vote (Supplementary Fig. 20b).

Supplementary Fig. 20 Identifying oncogenic fusions from 4 prediction methods (Arriba, Cicero, FusionCatcher, and STAR-Fusion) by majority voting. (a) Harmonization of fusion predictions. Chromosomal coordinates of the same fusion event frequently differ between methods. We defined two predictions to be the same event if the difference is within a predefined buffer length (from 100bp to 2bp), and the data suggest 10bp to be a good cutoff. Because FusionCatcher and Cicero each have more than 1 million predictions, and because we use majority voting, we chose to compare Arriba against the other methods to determine the cutoff. (b) Recurrence of predictions. With the 10bp cutoff, we tallied the frequency of predictions with 1, 2, 3, or 4 votes from different methods. Consistent with expectation, only 6,431 and 9,917 predictions have 4 and 3 votes, respectively. This number of predictions (9,917+6,431=16,348) is amenable to manual review. Therefore, we next tried to establish the enrichment of clinically relevant oncogenic fusions in these two bins by using 63 clinically well-known oncogenic fusions (c, d). First, we found that essentially all (>99.7%) fusion-positive tumors have exactly 1 oncogenic fusion (c), which is known as the "mutual exclusivity rule". Second, we found that >93% of oncogenic fusions have 3+ votes. These data clearly establish the enrichment of clinically relevant oncogenic fusions within the high-vote bins, thus allowing us to efficiently analyze the whole cohort with a low false negative rate (<7%). Source data are provided accordingly as sheets Supplementary Fig.20a, Supplementary Fig.20b, Supplementary Fig.20c and Supplementary Fig.20d in Source Data file.
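The harmonization and voting steps described in this section can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Call` record, the functions `same_event` and `tally_votes`, the inclusive K bp comparison, and the greedy strategy of comparing each call against the first call of a cluster are all assumptions introduced here.

```python
from collections import Counter
from typing import NamedTuple


class Call(NamedTuple):
    """One fusion prediction (hypothetical record layout)."""
    method: str
    sample: str
    chrom_a: str
    pos_a: int
    chrom_b: str
    pos_b: int


def same_event(x: Call, y: Call, k: int = 10) -> bool:
    """Two calls in one sample are the same event if both breakpoints are within k bp."""
    return (
        x.sample == y.sample
        and x.chrom_a == y.chrom_a
        and x.chrom_b == y.chrom_b
        and abs(x.pos_a - y.pos_a) <= k
        and abs(x.pos_b - y.pos_b) <= k
    )


def tally_votes(calls: list[Call], k: int = 10) -> list[tuple[Call, int]]:
    """Greedily cluster calls and count distinct methods (votes) per cluster."""
    clusters: list[list[Call]] = []
    for call in calls:
        for cluster in clusters:
            # Simplification: compare against the cluster's first (seed) call only.
            if same_event(cluster[0], call, k):
                cluster.append(call)
                break
        else:
            clusters.append([call])
    return [(c[0], len({call.method for call in c})) for c in clusters]


if __name__ == "__main__":
    calls = [
        Call("Arriba", "s1", "chr12", 100, "chr21", 200),
        Call("STAR-Fusion", "s1", "chr12", 103, "chr21", 198),
        Call("Cicero", "s1", "chr12", 108, "chr21", 200),
        Call("FusionCatcher", "s1", "chr12", 150, "chr21", 200),
    ]
    print(Counter(votes for _, votes in tally_votes(calls, k=10)))
```

Rerunning `tally_votes` over a range of `k` values mirrors the sensitivity analysis over K, and the resulting vote histogram corresponds to the 1-vote to 4-vote bins.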

11.b) Primary and secondary calling of the same fusion in a sample and their votes
As illustrated in Fig. 4, oncogenic fusions can be subject to alternative splicing, which in turn results in multiple fusion candidates for fusion detection methods. Clearly, transcript isoforms with low read support are less likely to be detected by multiple methods. To account for such multi-calling, we classified fusion isoforms detected from the same sample into two categories: 1) the "Primary call" has the highest read support; 2) all other calls of the same fusion are termed "Secondary calls". We then studied the enrichment of clinically defined well-known fusions (the training set in section SN 10) within the voting categories defined in section SN 11.a, for Primary and Secondary calls. Using Arriba as an example (Supplementary Fig.20c-d), 1% and 1.5% of oncogenic fusion events have only 1 and 2 votes, respectively. As a result, 97.5% of oncogenic fusion events have 3 or 4 votes, indicating significant enrichment of information in the high-vote bins for Arriba. Taken together, across the four fusion detection methods, >93% of primary calls (the value for FusionCatcher; >95.8% for the other three methods) are in the 3-vote and 4-vote categories. Therefore, we decided to focus on the 3-vote and 4-vote categories for discovery of novel oncogenic fusions, which can result in <7% false negatives. To further reduce the impact of these potential 7% false negatives, we check the newly discovered fusions in the 1-vote and 2-vote categories as detailed in section SN 11.e. This iterative process can be applied multiple times when effort permits.
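The Primary/Secondary labeling described above can be illustrated with a minimal sketch. The dictionary keys `sample`, `fusion`, and `reads`, the function name `label_calls`, and the tie-breaking rule (the first call encountered wins) are assumptions for illustration; the source does not state how ties in read support are handled.

```python
from collections import defaultdict


def label_calls(calls):
    """Label each call 'primary' or 'secondary' within its (sample, fusion) group.

    calls: list of dicts with hypothetical keys 'sample', 'fusion', 'reads'.
    Returns a list of labels aligned with the input order.
    """
    groups = defaultdict(list)
    for index, call in enumerate(calls):
        groups[(call["sample"], call["fusion"])].append(index)

    labels = [""] * len(calls)
    for indices in groups.values():
        # max() returns the first maximal element, so ties go to the earliest call.
        best = max(indices, key=lambda i: calls[i]["reads"])
        for i in indices:
            labels[i] = "primary" if i == best else "secondary"
    return labels


if __name__ == "__main__":
    example = [
        {"sample": "s1", "fusion": "KMT2A-MLLT3", "reads": 40},
        {"sample": "s1", "fusion": "KMT2A-MLLT3", "reads": 120},
        {"sample": "s2", "fusion": "ETV6-RUNX1", "reads": 15},
    ]
    print(label_calls(example))
```

Keeping the labels aligned with the input makes it straightforward to tabulate vote counts separately for primary and secondary calls.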

11.c) Establishing "blacklists" using exclusivity pattern
As illustrated in section SN 11.b, 1,894 samples were determined to harbor a well-known oncogenic fusion. Applying the mutual exclusivity rule among oncogenic fusions, we collected all fusion candidates (except the known oncogenic fusions) from these samples to establish a "negative control" set, which is frequently referred to as a "blacklist" in the field of cancer genomics, such as that used by Arriba 32 .
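The blacklist construction follows directly from the exclusivity rule: in a sample that already carries a known oncogenic fusion, every other candidate is treated as a negative. The following is a minimal sketch, assuming candidates are represented as gene-pair strings; the function name `build_blacklist` and the input layout are hypothetical.

```python
def build_blacklist(candidates_by_sample, known_fusions):
    """Collect non-driver candidates from samples that carry a known fusion.

    candidates_by_sample: dict mapping sample ID -> iterable of fusion names.
    known_fusions: set of clinically well-known oncogenic fusion names.
    """
    blacklist = set()
    for candidates in candidates_by_sample.values():
        candidate_set = set(candidates)
        # Only samples positive for a known fusion contribute negatives.
        if candidate_set & known_fusions:
            blacklist |= candidate_set - known_fusions
    return blacklist


if __name__ == "__main__":
    known = {"ETV6-RUNX1", "BCR-ABL1"}
    by_sample = {
        "s1": ["ETV6-RUNX1", "GENEA-GENEB"],
        "s2": ["GENEC-GENED"],  # no known fusion: contributes nothing
        "s3": ["BCR-ABL1", "GENEA-GENEB", "GENEE-GENEF"],
    }
    print(sorted(build_blacklist(by_sample, known)))
```

Candidates from samples without a known fusion are deliberately left out of the blacklist, since such samples may still carry a novel oncogenic fusion.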

11.d) Manual review of candidate fusions
The blacklist approach in section SN 11.c resulted in 7 iii) low quality predictions as indicated by fusion detection methods (such as "low" and "medium" confidence and "readthrough" by Arriba), low mutant read count or low allele fraction, ambiguous mapping, or a frameshift version of the oncogenic fusion that is labeled as "byproduct". Interestingly, a complex rearrangement 33 (also known as a 3-way fusion) between KMT2A, MLLT10 and PIP4K2A was detected in one sample (SJAML065570_D1, Supplementary Data 21), and we only considered KMT2A-MLLT10 in this patient. Although this procedure started with 63 clinically well-known oncogenic fusions, the later steps are de novo discovery methods and ensure unbiased detection of oncogenic fusions.

11.e) Systematic identification of oncogenic fusions in all samples
With the comprehensive list of oncogenic fusions, including the training set (section SN 10) and 218 newly identified rare oncogenic fusion gene pairs (SN 11d), we extracted all predictions with 2 or more votes to maximize our detection. Upon manual review (Supplementary Data 31), we determined 2239 samples (involving 2005 patients) to be positive for chimeric oncogenic fusions.

11.f) Determining orientation of oncogenic fusions
Due to frequent balanced translocations (at the DNA level), in RNAseq we can frequently detect reciprocal fusions for the same oncogenic fusion, such as RUNX1-ETV6 in ETV6-RUNX1 patients. To account for this possibility, we counted the number of patients/samples supporting each of the two possible orientations. We discovered that the larger of these two numbers matches the known oncogenic fusion orientation in the literature for all 52 fusion gene pairs detected in ≥4 patients/samples. For oncogenic fusions with recurrence <3, 143 fusion gene pairs were discovered with only one orientation. The orientation of the remaining 77 fusions was determined based on a common fusion gene (Supplementary Data 32).
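The orientation rule described above, which takes the orientation supported by more patients/samples, can be sketched as follows. The function name `majority_orientation` and the `(5' gene, 3' gene)` tuple input are hypothetical, and returning `None` on a tie is an assumption; the source resolves the remaining cases using a common fusion gene, which is not modeled here.

```python
from collections import Counter


def majority_orientation(observed):
    """Pick the better-supported orientation for each unordered gene pair.

    observed: iterable of (five_prime_gene, three_prime_gene) tuples,
    one per supporting patient/sample.
    Returns a dict: frozenset({gene1, gene2}) -> (5' gene, 3' gene) or None on a tie.
    """
    counts = Counter(observed)
    best = {}
    for (five, three), n in counts.items():
        key = frozenset((five, three))
        reciprocal_n = counts.get((three, five), 0)
        if n > reciprocal_n:
            best[key] = (five, three)
        elif n == reciprocal_n:
            # Equal support: leave unresolved for a separate rule.
            best.setdefault(key, None)
    return best


if __name__ == "__main__":
    obs = [("ETV6", "RUNX1")] * 5 + [("RUNX1", "ETV6")] * 2
    print(majority_orientation(obs))
```

In this toy input, five observations of ETV6-RUNX1 outweigh two of the reciprocal RUNX1-ETV6, so ETV6-RUNX1 is selected.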

11.g) Confirmation of mutual exclusivity with complete list of oncogenic fusions
With the above curated oncogenic fusion gene pairs and corresponding samples, we re-evaluated the exclusivity rule as described in section SN 10. Out of 2005 patients with subtype-defining oncogenic fusions, only 7 (0.35%) have ≥2 oncogenic fusions. Among these 7 patients, we discovered that SJBALL020141_D1 and SJBALL020142_D1 (both have oncogenic fusions MEF2D-DAZAP1 and KMT2A-MATR3) were sequenced on the same Illumina instrument (HWI-ST1188) on the same flowcell (C49UKACXX), which may have caused cross-contamination and led to our observation. This mechanism may also apply to another two patients/samples, SJCBF124_D and SJCBF149_D (both have RUNX1-RUNX1T1 and CBFB-MYH11). In another patient, PT_7DTGJYA7, fusion FGFR1-TACC1 was detected in one sample (a9a478fc-6897-4c9c-b9f6-ffd2130f5166) and fusion FGFR3-TACC3 in another sample (b3a2e094-e0fc-48a9-918a-efac4aa7f2fd). Therefore, we believe there are no more than 2 patients (0.1%) with ≥2 oncogenic fusions out of 2005 fusion-positive patients.
When promoter-hijacking was considered, we observed frequent overlap between HOXA/HOXB cluster alterations and KMT2A-rearranged AML, consistent with a previous report 34 , as well as overlap between MYC enhancer alteration 35 and several oncogenic fusions such as RUNX1-RUNX1T1 in various cancer types, which may be worth further functional validation. After excluding HOX and MYC, we observed an additional 7 patients (out of 2138 fusion-positive patients, 0.33%) with double oncogenic fusions, and these 7 patients all have CRLF2 promoter hijacking, indicating a low but notable potential for CRLF2 aberrant expression as a nested oncogenic fusion (Supplementary Data 33).
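The exclusivity re-evaluation in this section is a patient-level tally: fusions from all samples of a patient are pooled, and patients with two or more distinct oncogenic fusions are flagged. The following is a minimal sketch under stated assumptions; the function name `exclusivity_summary` and both input mappings are hypothetical, and the identifiers in the example are placeholders rather than cohort data.

```python
from collections import defaultdict


def exclusivity_summary(sample_fusions, sample_to_patient):
    """Return (patients with >=2 fusions, their percentage among fusion-positive patients).

    sample_fusions: dict sample ID -> set of oncogenic fusion names.
    sample_to_patient: dict sample ID -> patient ID.
    """
    per_patient = defaultdict(set)
    for sample, fusions in sample_fusions.items():
        # Pool fusions across all samples from the same patient.
        per_patient[sample_to_patient[sample]].update(fusions)

    positive = {p: f for p, f in per_patient.items() if f}
    multi = sorted(p for p, f in positive.items() if len(f) >= 2)
    pct = 100.0 * len(multi) / len(positive) if positive else 0.0
    return multi, pct


if __name__ == "__main__":
    fusions = {
        "sampleA": {"FGFR1-TACC1"},
        "sampleB": {"FGFR3-TACC3"},
        "sampleC": {"ETV6-RUNX1"},
        "sampleD": set(),
    }
    patients = {"sampleA": "P1", "sampleB": "P1", "sampleC": "P2", "sampleD": "P3"}
    print(exclusivity_summary(fusions, patients))
    # The percentages quoted in the text follow from the same ratio:
    print(round(100 * 7 / 2005, 2), round(100 * 2 / 2005, 1), round(100 * 7 / 2138, 2))
```

Pooling at the patient level matters for cases where the two fusions come from different samples of one patient, which a per-sample tally would miss.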