Transcriptomic classes of BCR-ABL1 lymphoblastic leukemia

In BCR-ABL1 lymphoblastic leukemia, treatment heterogeneity to tyrosine kinase inhibitors (TKIs), especially in the absence of kinase domain mutations in BCR-ABL1, is poorly understood. Through deep molecular profiling, we uncovered three transcriptomic subtypes of BCR-ABL1 lymphoblastic leukemia, each representing a maturation arrest at a stage of B-cell progenitor differentiation. An earlier arrest was associated with lineage promiscuity, treatment refractoriness and poor patient outcomes. A later arrest was associated with lineage fidelity, durable leukemia remissions and improved patient outcomes. Each maturation arrest was marked by specific genomic events that control different transition points in B-cell development. Interestingly, these events were absent in BCR-ABL1+ preleukemic stem cells isolated from patients regardless of subtype, which supports that transcriptomic phenotypes are determined downstream of the leukemia-initialing event. Overall, our data indicate that treatment response and TKI efficacy are unexpected outcomes of the differentiation stage at which this leukemia transforms.

In BCR-ABL1 lymphoblastic leukemia, treatment heterogeneity to tyrosine kinase inhibitors (TKIs), especially in the absence of kinase domain mutations in BCR-ABL1, is poorly understood. Through deep molecular profiling, we uncovered three transcriptomic subtypes of BCR-ABL1 lymphoblastic leukemia, each representing a maturation arrest at a stage of B-cell progenitor differentiation. An earlier arrest was associated with lineage promiscuity, treatment refractoriness and poor patient outcomes. A later arrest was associated with lineage fidelity, durable leukemia remissions and improved patient outcomes. Each maturation arrest was marked by specific genomic events that control different transition points in B-cell development. Interestingly, these events were absent in BCR-ABL1 + preleukemic stem cells isolated from patients regardless of subtype, which supports that transcriptomic phenotypes are determined downstream of the leukemia-initialing event. Overall, our data indicate that treatment response and TKI efficacy are unexpected outcomes of the differentiation stage at which this leukemia transforms.
The BCR-ABL1 fusion drives two hematopoietic malignancies-chronic myelogenous leukemia (CML), a myeloproliferative disorder, and BCR-ABL1 lymphoblastic leukemia, an acute leukemia mostly of the B-cell lineage. Despite sharing the same initiating event, patients with these diseases show vastly different outcomes. Most CML patients can expect to live a near-normal lifespan owing to targeted therapy with tyrosine kinase inhibitors (TKIs), whereas the survival of BCR-ABL1 lymphoblastic leukemia patients remains dismal despite TKI intervention 1,2 . Approximately half of the patients diagnosed with BCR-ABL1 lymphoblastic leukemia die within 5 years of being diagnosed 2 . Most patients relapse due to the emergence of clones with kinase domain mutations, but 30-40% of patients relapse without kinase domain mutations, which is poorly understood 3 . TKI resistance is mostly viewed through the lens of kinase domain mutations, but whether kinase domain-independent mechanisms of therapy resistance contribute to the emergence of kinase domain-mutant clones is unknown.
The genomics of BCR-ABL1 lymphoblastic leukemia is well described. Loss of IKZF1 is present in 80% of patients 4 and mouse models have shown that IKZF1 alterations are linked to TKI resistance 5 . As many patients with BCR-ABL1 lymphoblastic leukemia demonstrate durable responses to treatment, some of them must harbor IKZF1 losses given the high frequency of this alteration. This discordance remains unexplained. Genetic alterations in other B-cell differentiation genes, such as PAX5, CDKN2A/B and EBF1, are also frequent in this disease 4

.
Article https://doi.org/10.1038/s41588-023-  were all expressed at the highest levels in C3. Notably, CD20, a mature B-cell antigen, was often absent in C1 and C2 but frequently expressed in C3. This may have implications for targeted therapy using monoclonal CD20 antibodies, such as rituximab 8 . C1 cluster also showed the highest expression of the stem cell antigen, CD34 (P = 0.18, Kruskal-Wallis test; Supplementary Fig. 7b). C1 frequently expressed both myeloid and lymphoid antigens, whereas C3 displayed more restricted expression of lymphoid antigens (further analysis is given in Supplementary Note). Surface antigen expression and gene expression of lineage markers closely mirrored each other ( Supplementary Fig. 8), indicating that transcriptomic clusters associate with surface phenotypes.
MPALs are enriched for the BCR-ABL1 fusion 9,10 . Because C1 leukemias co-expressed lymphoid and myeloid antigens, we investigated why these were not diagnosed as MPAL. Ninety-seven percent of C1 were diagnosed as either B-ALL (n = 32/36, 89%) or CML-LBC (n = 3/36, 8%). Despite the expression of surface myeloid antigens, levels of cyMPO were below the diagnostic threshold of 10% for MPAL 11 . However, cyMPO levels were still significantly greater in C1 than in C2 and C3 (median 2% versus 0%; Fig. 1c and Supplementary Table 5). RNA-seq showed a greater dynamic range of detecting low levels of MPO compared with flow cytometry (Supplementary Fig. 7e) 12 . It is possible that current diagnostic guidelines may underestimate the number of BCR-ABL1 lymphoblastic leukemia with mixed-lineage phenotype. Notably, despite the large number of gene expression studies in ALL, our focused approach identifies distinct transcriptomic subtypes of the disease.

Transcriptomic clusters align with B-cell differentiation stages
We compared the transcriptomic clusters to their normal counterparts in human cord blood including HSCs, multipotent progenitors (MPPs), myeloid progenitors (common myeloid progenitors (CMPs), megakaryocyte-erythroid progenitors (MEPs), granulocyte/macrophage progenitors (GMPs)) and eight different lymphoid and B-cell progenitors (lymphoid-primed MPPs (LMPPs), multilymphoid progenitors (MLPs), early pro-B, pro-B, pre-B I, pre-B II, immature B and mature B; Supplementary Table 6 and Supplementary Fig. 9a,b). Principal component analysis (PCA) revealed that C1 leukemias are closely associated with early pro-B cells (Fig. 2a), despite their strong stem/myeloid gene expression pattern. Early pro-B is the first step in B-cell development and retains myeloid and T-lymphoid lineage potentials 13 . C2 and C3 leukemias are associated with pro-B and pre-B I cells, which are progenitors fully committed to the B-cell lineage 14 . To further delineate the degree of differentiation in C2 and C3 leukemias, we assessed the expression of immunoglobulin heavy and light chains. Expression of rearranged heavy chain (IGH) was similar across all clusters ( Supplementary Fig. 10a), whereas expression of rearranged light chains (IGK or IGL) was most frequent in C3 (Fig. 2b). C3 leukemias also showed marked upregulation of B-cell transcription factors, EBF1 and PAX5, and B-cell receptor signaling genes (Fig. 2c,d). This suggests a later differentiation block in C3 compared with C2. These analyses support that C1, C2 and C3 leukemias are blocked at different stages of early B-cell development. Similar findings were observed from human adult BM progenitors 15 (Supplementary Fig. 9c). Taken together, we labeled the transcriptomic subtypes as follows: C1, early B-cell progenitor ('Early-Pro'); C2, intermediate B-cell progenitor ('Inter-Pro'); and C3, late B-cell progenitor ('Late-Pro'; Fig. 2e).

Early-Pro leukemias display significant hematopoietic lineage plasticity
We then applied single-cell RNA-seq (scRNA-seq) to nine samples (four Early-Pro, one Inter-Pro and four Late-Pro; n = 29,411 total single cells). Nonleukemic cells were removed based on lineage marker genes (for example, DNTT and CD19; Methods). SingleR was used to match individual leukemic cells to their closest normal cell counterparts 16 .
The co-occurrence of multiple driver alterations makes it difficult to decipher their roles in the heterogeneity of treatment outcomes among patients. Furthermore, this disease presents two drastically different molecular phenotypes-mixed-lineage and classic B-cell precursor. These phenotypes are thought to reflect the differentiation potential of the cell-of-origin, but this remains controversial. There are two major isoforms of BCR-ABL1. The long isoform (p210) has been shown to originate in a hematopoietic stem cell (HSC), whereas the short isoform (p190) has been shown to originate in a B-cell progenitor 6 . Rationally, this leads to the notion that mixed-lineage phenotype is related to the p210 isoform originating in an HSC, whereas the classic B-cell phenotype is related to the p190 isoform originating in a committed B-cell progenitor. However, recent work tracking residual disease in patients shows that p190 isoform can be sustained in HSC but can produce acute leukemia with a typical B-cell phenotype. The origins of these disease phenotypes and how they influence treatment outcomes are not well understood 7 .
To address the issues raised above, we collected a cohort of BCR-ABL1 lymphoblastic leukemia patient samples with extensive treatment response data. Our data propose a model of pathogenesis that explains the molecular and clinical heterogeneities observed in this disease.

Distinct transcriptomic clusters of BCR-ABL1 lymphoblastic leukemia
Two patient cohorts spanning 20 years (1992-2019) with clinical follow-up data were collected (Extended Data Fig. 1a). The first cohort of 57 samples from 53 patients (Supplementary Table 1; 46 de novo BCR-ABL1 B-cell acute lymphoblastic leukemia (B-ALL), 5 CML in lymphoid blast crisis and 2 mixed phenotype acute leukemia (MPAL)) was purified for blasts and subjected to RNA sequencing (RNA-seq; Extended Data Fig. 1b). Non-negative matrix factorization (NMF) followed by consensus clustering uncovered three distinct transcriptomic clusters, initially labeled C1, C2 and C3 (Fig. 1a Table 2). These clusters were replicated in the second cohort of 40 patients (Supplementary Figs. 3 and 4 and Supplementary Table 3). In the combined cohort (97 samples, 93 patients), the proportions of C1, C2 and C3 were 40%, 22% and 38%, respectively. These clusters were not associated with age, sex, sample source (blood or bone marrow (BM)) or BCR-ABL1 isoform (p190 or p210); white blood cell counts at diagnosis were greater in C1 compared with C2 and C3 (Extended Data Fig. 3 and Supplementary  Table 4). We also found these clusters in Ph-like ALL, suggesting they may apply to other forms of ALL ( Supplementary Fig. 5).
We first used gene set enrichment analysis (GSEA) to examine these transcriptomic clusters. C1 was marked by the aberrant expression of stem and myeloid lineage genes (for example KIT, CSF3R and MECOM; Fig. 1b) and was enriched for pathways related to innate immunity (TNFα, inflammatory response; Supplementary Fig. 6, top). C2 showed some expression of both myeloid (CSF2RA and CSF1R) and lymphoid genes (MS4A1/CD20 and IL7R) and was enriched for pathways related to ERK signaling, unfolded protein response (UPR) and other pathways not related to hematopoiesis (myogenesis, epithelial to mesenchymal transition (EMT); Supplementary Fig. 6, middle). C3 showed the highest expression of genes associated with B-cell differentiation, such as IL7R, MS4A1, BACH2 and TCL1A, and was strongly enriched for cell cycle pathways (E2F targets, G2M checkpoint) and B-cell receptor signaling ( Supplementary Fig. 6, bottom).
We then compared these transcriptomic clusters to flow cytometry data. Eight antigens were differentially expressed between the clusters (P = 0.0059-0.046, Kruskal-Wallis test; Fig. 1c and Supplementary Fig. 7a-d). Myeloid antigens, CD13, CD33 and cytoplasmic myeloperoxidase (cyMPO), and a T-cell antigen, CD7, were expressed at the highest levels in C1, whereas B-cell antigens, CD10, CD19 and CD20, Article https://doi.org/10.1038/s41588-023-01429-4 In the four Late-Pro samples, 95.3% of leukemic cells were assigned as either pro-B (46.0 ± 7.3%) or pre-B I cells (48.8 ± 3.7%) ( Fig. 2f and Extended Data Fig. 4a,b). The Inter-Pro sample showed decreased assignment to pro-B (36.6%) and pre-B I (40.5%) cells and increased assignment to early pro-B cells (21.4%), consistent with an earlier block in differentiation than Late-Pro leukemias. Early-Pro samples showed significantly more lineage heterogeneity compared with Inter-Pro and Late-Pro samples. They were composed of more early pro-B cells (40.4 ± 7.5%) and fewer pro-B (8.8 ± 5.7%) and pre-B I cells (11.0 ± 3.2%) than the other two subtypes (Fig. 2g). Early-Pro was the only subtype where we observed transcriptional similarity to earlier stages of lymphoid development, such as LMPP (6.0 ± 1.4%) and MLP (0.7 ± 0.7%), as well as GMP (23.6 ± 9.7%). Interestingly, GMPs in both mice and humans show lymphoid potential 17,18 . The Shannon indices of Early-Pro leukemias were significantly greater than those of Late-Pro leukemias (median of 1.49 versus 0.76; P = 0.029, Wilcoxon rank-sum test; Extended Data Fig. 4c) confirming that they are more heterogeneous. Similar results were obtained when using adult BM 19 ( Supplementary Fig. 11a-c). Pseudotime analysis of Early-Pro samples showed GMP-like and pro-B/pre-B I-like cells emerge from early pro-B-like cells, suggesting that mixed-lineage program originates from a lymphoid state ( Supplementary Fig. 11d). Together, Early-Pro leukemias show considerable lymphomyeloid lineage plasticity compared with Inter-Pro and Late-Pro leukemias.
Overall, Early-Pro leukemias harbor alterations that deregulate early steps of B-lymphoid commitment.
Although IKZF1 alterations were present among all transcriptomic subtypes, bi-allelic losses were significantly enriched in Inter-Pro leukemias (n = 7/8, 88%; P = 2.0 × 10 −7 ; Fig. 3a and Extended Data Fig. 5c). Inter-Pro leukemias were also enriched for the dominant negative Ik6 deletion of IKZF1, which lacks the DNA binding domain (62%, 36% and 17% in Inter-Pro, Late-Pro and Early-Pro, respectively; P = 0.012; Extended Data Fig. 5d). Inactivation of IKZF1 activates stromal genes 5 , and accordingly, Inter-Pro samples were significantly enriched for aberrant expression of these genes (Extended Data Fig. 5e). Interestingly, one Late-Pro leukemia, Ph36-D, harbored a dominant negative mutation in IKZF1 (p.N159T) 24 and showed a similar pattern of stromal cell disruption and clustered close to Inter-Pro leukemias ( Supplementary  Fig. 12). This suggests that complete inactivation of IKZF1 drives the Inter-Pro subtype.
Late-Pro cases were enriched for PAX5 deletions (n = 15/22, 68%; P = 0.0022). PAX5 is activated downstream of EBF1 and facilitates pro-B to the pre-B cell transition, immunoglobulin gene rearrangements and pre-B cell receptor signaling 25,26 . Increased frequency of PAX5 deletions is consistent with a later block in B-cell differentiation in Late-Pro leukemias. Late-Pro leukemias were also significantly enriched for homozygous deletions of CDKN2A/B and RB1 (P = 0.0013 and 0.0047). CDKN2A/B locus was often lost together with PAX5 as part of chromosome 9p losses. CDKN2A/B and RB1 regulate the pre-B cell receptor checkpoint to ensure cells with nonproductive IGH rearrangements undergo cell cycle arrest and apoptosis 27 . Together, losses of CDKN2A/B, RB1 and PAX5 reconcile how Late-Pro leukemias escape the pre-B cell  Table 7). Hyperdiploidy was also only observed in Late-Pro leukemias (n = 3/22, 14%; P = 0.12). Mutation loads of single-nucleotide variants (SNVs) and indels were similar across all transcription subtypes (Extended Data Fig. 5g). These patterns of genetic alterations suggest that each transcriptomic subtype transforms at a distinct stage of B-cell development.   ****P < 5 × 10 −5 < ***P < 5 × 10 −4 < **P < 0.005 < *P < 0.05.

DNA footprints of lymphoid enzymes inform on the timing of transformation
Recombination-activating gene (RAG) endonuclease (RAG1/2) introduces DNA double-strand breaks during V(D)J recombination in lymphoid progenitors at a specific motif known as the recombination signal sequences (RSS; Extended Data Fig. 6a) 28 . The presence of the RSS motifs at a structural variant (SV) outside of the antigen receptor genes indicates off-target RAG activity and is common in lymphoid leukemias 29 . From the WGS data, 632 SVs were resolved at the nucleotide level for RSS motif analysis (Supplementary Table 8; Methods). Nineteen percent of these SVs (n = 120/632) were associated with BCR-ABL1 translocations, and the remaining 81% (n = 512/632) were associated with cooperating events (for example, deletions in IKZF1, PAX5 and CDKN2A/B; Fig. 3b and Supplementary Fig. 15). SVs related to cooperating events were significantly enriched for the RSS motif compared with SVs from BCR-ABL1 translocations (78% versus 8%; P < 1 × 10 −16 , Fisher's exact test; Fig. 3c-e). Only SVs related to cooperating events showed a decline in breakpoint frequency as a function of distance from the RSS motif and formed a negative binomial distribution ( Fig. 3f and Extended Data Fig. 6b,c). Interestingly, leukemias with deletions of SLX4IP harbored significantly more RAG-mediated SVs ( Fig. 3g and Supplementary Note). Early-Pro, Inter-Pro and Late-Pro leukemias showed similar numbers of RAG-mediated SVs (Extended Data Fig. 6d), supporting that this mutational process drives transformation in all transcriptomic subtypes. Terminal deoxynucleotidyl transferase (TdT), encoded by DNTT, is another key enzyme in V(D)J recombination. Similar to RAG1/2 genes, DNTT is expressed in early lymphoid progenitors but is absent in multipotent cells such as HSC. TdT introduces nontemplate sequences (NTS) during repair of double-strand breaks. NTS insertions (mean length of 5.58 nucleotides) were detected in 94.2% of SVs related to cooperating events with RSS motifs (n = 376/399; Supplementary Table 8 and Supplementary Fig. 16a,b). This is consistent with RAG-mediated recombination in B-cell progenitors. Approximately two-thirds of the inserted nucleotides were G or C, reflecting the preferential usage of G:C over A:T bases by TdT 30 (Supplementary Fig. 16c). Interestingly, 65.5% of SVs related to cooperating events without RSS motifs (n = 74/113) also showed NTS insertions with a similar mean length (5.93 nucleotides). This suggests that TdT was active at SVs not created by RAG recombination. For SVs associated with BCR-ABL1 translocation, 91.8% (n = 109/120) did not harbor an NTS. The mean length of NTS for the small fraction of BCR-ABL1-associated SVs with NTS (9.2%; n = 11/120) was 2.55 nucleotides ( Supplementary Fig. 16a,b). This supports that TdT was not active when the BCR-ABL1 translocation occurred. To summarize, SVs related to cooperating events in the transcriptomic subtypes carried DNA footprints of lymphoid enzyme activity, which suggests the cooperating events defining each subtype were conceived in a lymphoid cell downstream of BCR-ABL1.

BCR-ABL1 translocation arises independently and upstream of cooperating events
To investigate this notion, hematopoietic stem and progenitor cells (CD34 + CD19 − CD45RA − ) from 14 patients (six Early-Pro, two Inter-Pro and six Late-Pro) were sorted and subjected to colony formation assay, an assay that removes residual blasts and promotes stem/progenitor differentiation into myeloid cells (Fig. 4a, Extended Data Fig. 7a and Supplementary Note). Colonies (n = 771; 21-87 per sample) were picked and analyzed for patient-specific genomic alterations using nested PCR (Extended Data Fig. 7b and Supplementary Table 9). BCR-ABL1 translocations were detected in granulocyte/macrophage (GM) and erythroid colonies in 4 of 14 patients (29%; three p210 and one p190; Fig. 4b, Extended Data Fig. 7c and Supplementary Table 10), a frequency consistent with previous studies 7, 31 . We expected more Early-Pro cases to harbor BCR-ABL1 + colonies given that mixed-lineage leukemias have been shown to arise from HSC 32 . Interestingly, leukemias with BCR-ABL1 + preleukemic clones were not restricted to the Early-Pro subtype (P = 0.78, Fisher's exact test). No myeloid colonies with the BCR-ABL1 translocation carried cooperating events, suggesting that the BCR-ABL1 translocation likely occurs in a multilineage progenitor (HSC/MPP) before lymphoid commitment. This is in accordance with previous reports of detecting rare BCR-ABL1 + clones in blood from healthy individuals 33,34 . Although our numbers are small, we did not find evidence that the mixed-lineage phenotype of Early-Pro subtype is due to the transformation of a multipotent cell.
Four patients had paired samples from diagnosis and relapse. Three had relapsed within 2 years from diagnosis (two Early-Pro and one Late-Pro; Fig. 4c). In these patients, most variants present at diagnosis were found at relapse (Fig. 4d and Supplementary Table 11). The transcriptomic subtypes, immunophenotypes and antigen receptor rearrangements also remained unchanged between diagnosis and relapse ( Fig. 4e and Supplementary Figs. 17 and 18). This indicates that relapse emerged from the major diagnostic clone in these three patients. The fourth patient (Ph53) relapsed after a 4-year remission. Although the same BCR-ABL1 translocation was shared, only 14% of the diagnostic mutations were present at relapse (Fig. 4d, bottom, and Extended Data Fig. 8a,d). Relapse was not driven by a kinase domain mutation in BCR-ABL1. There were significant changes in immunophenotypes and antigen receptor loci at relapse, indicating a new clone had emerged (Supplementary Figs. 17d and 18d and Supplementary Table 12). From RNA-seq, the diagnostic leukemia was classified as Late-Pro, whereas the relapse leukemia was classified as Early-Pro, indicating a switch in transcriptomic subtype (Fig. 4e). Genomic analysis showed an inactivating SV in PAX5 at diagnosis, consistent with its Late-Pro subtype, but it was not detected at relapse (Fig. 4f and Extended Data Fig. 8d). Instead, the relapse sample harbored a RUNX1 mutation consistent with the Early-Pro subtype ( Fig. 4g and Extended Data Fig. 8e). This suggests a BCR-ABL1 + precursor clone, which lacked PAX5 inactivation and RUNX1 mutation, independently gave rise to both Late-Pro leukemia at diagnosis and Early-Pro leukemia at relapse ( Supplementary  Fig. 19). Together with the colony data, this supports that BCR-ABL1 and cooperating events defining the transcriptomic subtypes occur at different stages of hematopoietic differentiation. Thus, the leukemic cell-of-origin does not define the transcriptomic subtype.

Molecular subtypes impact treatment outcomes in patients
We obtained detailed patient response data for our cohort. Twelve of 93 patients were excluded from this analysis (five CML-LBC, five not given TKI and two died within 30 d). The remaining 81 patients received a mostly uniform frontline regimen of Dana Farber Cancer Institute (DFCI) protocol plus imatinib 35 (Supplementary Tables 1 and 3). Similar proportions of Early-Pro, Inter-Pro and Late-Pro patients (32-39%) received bone marrow transplants (BMTs). Residual disease was monitored by BCR-ABL1 transcript levels with a median follow-up time of 6.3 years (95% confidence interval (CI): 4.9-10.1 years).
A major molecular response (MMR) is defined as ≥3 log reduction (≤0.1% transcript) in residual disease (Fig. 5a). Significantly fewer Early-Pro (10%, n = 2/20) and Inter-Pro patients (17%, n = 2/12) achieved MMR compared with Late-Pro patients (56%, n = 14/25) during induction therapy (P = 0.0021, Fisher's exact test). Median log reductions after induction for Early-Pro, Inter-Pro and Late-Pro patients were 1.4, 2.2 and 3.5, respectively (P = 0.022, Kruskal-Wallis test; Fig. 5b). A deep molecular response (DMR), defined as ≥4 log reduction, is a predictor of good outcome 36 and was restricted to Late-Pro patients (Fig. 5b). Accordingly, overall survival (OS) and event-free survival (EFS) rates were considerably greater in the Late-Pro subtype (P = 0.019 and 0.015, respectively; Fig. 5c). OS rate for Late-Pro patients was 73%, analogous to patients with low-risk B-ALL 37 . By contrast, Early-Pro and Inter-Pro patients had worse OR rates of 33% and 54%, respectively. Notably, Late-Pro patients showed good prognosis despite frequent single-copy losses or mutations in IKZF1 (n = 14/22, 64%), suggesting single-copy Poorer outcomes of Inter-Pro patients are likely due to bi-allelic losses of IKZF1 (ref. 38). Strikingly, the loss of HBS1L, exclusive to Early-Pro leukemias (Fig. 3a), was associated with extremely poor prognosis (Extended Data Fig. 9a). Early-Pro patients without the HBS1L deletion showed similar survival outcome to Inter-Pro patients (Extended Data  Fig. 9b). The role of HBS1L in the normal or malignant lymphoid cell is unknown. Interestingly, alterations specific to Early-and Inter-Pro subtypes showed similar prognostic value as the transcriptomic subtypes (Extended Data Fig. 9c), suggesting DNA-or RNA-based risk stratification may have clinical utility. The Jaccard index measures the degree of similarity between pairs 49 . e, PCA of leukemic transcriptomes using NMF component genes. Diagnosis/relapse pairs are labeled and highlighted. f, PAX5 inactivation (box) and monosomy 7 are private to Ph53-D. Red and blue segments denote copy number gains and losses, respectively. g, RUNX1 mutation is private to Ph53-R. RUNX1 gene is on the minus strand and orange and blue represent G and C bases, respectively. A gray line represents a sequencing read, and a bar at the top represents the number of reads covering each position. Frequencies of reads with the c.445G>C mutation are shown. Lym, lymphoid progenitors; Mye, myeloid progenitors; GM, colony-forming unit (CFU)-granulocyte/macrophage; G, colony-forming unit-granulocyte; M, colony-forming unit-macrophage; E, burst-forming uniterythroid; rel, relapse; D, diagnostic; R, relapse.
Article https://doi.org/10.1038/s41588-023-01429-4 Next, we investigated TKI efficacy among the transcriptomic subtypes. Our cohort spanned diagnoses over two decades (1992-2019), during which multiple generations of TKIs against BCR-ABL1 were approved, some of which are as follows: imatinib, dasatinib, nilotinib and ponatinib 39 . Imatinib, the first-generation TKI, targets the wildtype BCR-ABL1, whereas dasatinib, nilotinib and ponatinib, which are second-or third-generation TKIs, can target both wildtype and various mutant forms. Sixty-nine percent (n = 56/81) of patients received imatinib-only, and 31% (n = 25/81) were switched to secondor third-generation TKIs during their treatment. Early-Pro patients that received imatinib-only showed significantly poorer outcomes compared with those switched to a second-or third-generation TKI (5-year OS: 23% versus 50%; P = 0.044, log-rank test; Fig. 5d). A similar trend was seen for Inter-Pro patients but was not conclusive due to a small sample size. Late-Pro patients showed a trend toward improved outcomes with imatinib-only therapies. Using the multivariable Cox proportional hazards model with interaction, Inter-Pro and Early-Pro subtypes showed a significant decrease in the hazard ratio (HR) when treated with second-or third-generation TKIs ( Fig. 5e; Inter-Pro: HR = 0.05 (95% CI = 0.004-0.67), P = 0.023 and Early-Pro: HR = 0.07 (95% CI = 0.011-0.39), P = 0.003). This suggests that Early-Pro and Inter-Pro patients may specifically benefit from newer-generation TKIs, although this work needs further validation. Overall, the transcriptomic subtypes display differential sensitivity to TKIs.

Innate therapy resistance in Early-Pro and Inter-Pro leukemias
Kinase domain mutations contributed to treatment resistance and poor survival in our cohort, particularly in those with Early-Pro leukemias 40 (Fig. 5a and Supplementary Fig. 22a). Interestingly, Early-Pro and Inter-Pro patients still showed worse survival compared with Late-Pro patients even after patients with kinase domain mutations were removed (Supplementary Fig. 22b). This suggested that these leukemias harbor kinase domain-independent mechanisms of TKI resistance. As noted earlier, Late-Pro leukemias were enriched for cell cycle-related programs. By contrast, the cell quiescence gene set was significantly enriched in Early-Pro leukemias 41,42 (Fig. 6a,b). In scRNA-seq data, Late-Pro leukemias contained more cells in G 1 /S and G 2 /M phases (median 23%) than Early-Pro (8%) and Inter-Pro (2%) leukemias (P = 0.057, Wilcoxon rank-sum test for Early-Pro versus Late-Pro; Fig. 6c,d and Supplementary Fig. 23). Increased cell cycling in Late-Pro patients helps to explain their robust responses to induction therapy. Moreover, effective reduction of the leukemic load during induction likely reduces the opportunity for kinase domain mutations to develop.
In addition to cell quiescence, Early-Pro leukemias were also enriched for BCR-ABL1 signaling and its downstream target STAT5 (Fig. 6e,f) [42][43][44] . Proteomics showed increased expressions of STAT5A and STAT5B, and immunodetection confirmed higher phosphorylation of STAT5 in Early-Pro samples (Supplementary Fig. 24a-c). By contrast, Inter-Pro leukemias showed marked upregulation of the UPR and ERK  signaling pathways ( Fig. 6e and Supplementary Fig. 24d). Increased STAT5 and UPR signaling are known mechanisms of TKI resistance in the absence of kinase domain mutations 45,46 . The interplay between cell quiescence and these kinase domain-independent resistance mechanisms promotes leukemic cell survival, which in turn may foster the development of kinase domain mutations at relapse in Early-Pro and possibly Inter-Pro patients (Extended Data Fig. 10). Four patients (three Early-Pro and one Inter-Pro) in our cohort were switched from imatinib to a second-or third-generation TKI, dasatinib or ponatinib, after 2-6 months due to poor responses at induction. Kinase domain mutations were absent in all four leukemias indicating that they were innately resistant to treatment. Notably, all four patients achieved MMR after switching to a second-or third-generation TKI ( Supplementary Fig. 25). This suggests that primary refractoriness in Early-Pro and Inter-Pro patients can be addressed with upfront use of more potent TKIs.

Discussion
Resolving the transcriptomic subtypes improves our understanding of treatment response heterogeneity in BCR-ABL1 lymphoblastic leukemia. One highlight of our work is that the emergence of therapy resistance is not random but is linked with the transcriptomic subtype of the disease. Identifying Early-Pro and Inter-Pro patients at diagnosis could impact treatment plans for these high-risk patients. As these patients often do not achieve DMRs by standard chemotherapy and imatinib, earlier transition to second-or third-generation TKIs before relapse may result in better responses. Clinically, our work also suggests that kinase domain-independent mechanisms of TKI resistance contribute to the development of kinase domain mutations.
From a tumor evolution perspective, our data have implications for how the transcriptomic subtypes of this disease develop.
Building on previous works 6,31,47 , we propose a model of pathogenesis (Fig. 7). All transcriptomic subtypes are predicted to initiate from a common cell-of-origin that establishes a preleukemic phase. Isolation of preleukemic BCR-ABL1 + clones capable of myeloid cell differentiation but lacking cooperating events suggests that the disease cell-of-origin, at least in some cases, is a multipotent cell. BCR-ABL1 alone cannot drive transformation, and as preleukemic precursors enter lymphoid differentiation, the V(D)J recombination machinery is hijacked to drive the acquisition of cooperating events. Each step in normal B-cell differentiation (for example, early pro-B, pro-B or pre-B) is highly regulated by a unique set of transcription factors and presents a different molecular hurdle for transformation. Thus, the selection pressure changes during differentiation and shapes the genetic route of transformation. This model explains why each subtype is defined by cooperating events in genes that regulate a specific step of B-cell development. We posit that the differentiation stage at which the leukemia transforms, that is the 'cell-of-transformation' and not the cell-of-origin, determines the transcriptomic phenotype of the disease. Overall, our study provides a molecular classification scheme for BCR-ABL1 lymphoblastic leukemia and provides important insights into the origins of this disease. It is possible that molecular programs that define each transcriptomic subtype can be targeted in combination with TKIs to improve the outcomes of patients with this aggressive disease.

Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41588-023-01429-4.

RNA-seq
Total RNA was isolated using the PicoPure RNA Isolation Kit (Life Technologies) and DNA removal was performed using the RNase-free DNase Set (Qiagen). Extracted RNA quantity was measured by Qubit RNA High Sensitivity Assay Kit (Thermo Fisher Scientific) and quality was assessed by running the samples on RNA ScreenTape Assay on the 2200 TapeStation System (Agilent Technologies) or RNA 6000 Nano Kit on the 2100 Bioanalyzer (Agilent Technologies). Only RNA with sufficient RNA integrity numbers (RIN; at least seven and mostly greater than eight) were used to generate libraries. cDNA sequencing libraries were prepared using the TruSeq RNA Access Library Prep Kit (Illumina) according to the protocols provided by the manufacturer. Using the KAPA Illumina Library Quantification Kits, library pools were quantified on the Eco Real-Time PCR Instrument. All libraries were sequenced on the Illumina HiSeq 2500 platform using paired-end cluster generation and 2 × 126 cycles.

NMF and consensus hierarchical clustering
NMF was used to resolve expression signatures. NMF was performed on the top 2,000 variable genes using the NMF R package 57 . A total of 200 replicates were run for each k value between 2 and 10. Based on cophenetic scores, three NMF components with a total of 163 genes were identified. Subsequent consensus clustering was performed on these 163 genes utilizing ConsensusClusterPlus v1.40.0 (ref. 58). Using the Pearson correlation distance, clustering was performed using hierarchical clustering and complete linkage with 5,000 resamplings of 80% of samples per run to give rise to the final consensus. Based on the Silhouette scores, the relative change in cumulative distribution function between k values and the cophenetic coefficient, three main gene expression clusters were identified.

Differential gene expression
Differential gene expression analysis was performed with DESeq2 v1.18.1 (ref. 54) using recommended settings. HTSeq counts were read into R, followed by size factor estimation, dispersion estimate and a negative binomial GLM fitting as implemented in DESeq2. Sex and batch were corrected in the linear mode. Differential expression was run across the three subtypes in a pairwise fashion. Next, genes were ranked based on the P value and the sign of the log 2 fold change. Gene set enrichment was then conducted using GSEA Prerank v4.1.0 with recommended settings 59   For the IGH@, which is rearranged in all samples in our cohort, SV breakpoint positions were also cataloged.

Detection of residual disease
Residual disease levels of patients were assessed by a real-time quantitative PCR using a TaqMan primer-probe set for the BCR-ABL1 fusion. The test was adapted from a published method 63,64 and performed by the Genome Diagnostics Laboratory at the University Health Network. Residual disease level was calculated as log reduction (base 10) relative to baseline level at diagnosis. The lower limit of detection was 5.0-5.5 log reduction.

Reference dataset of hematopoietic cell compartments from human cord blood
Human cord blood samples with informed consent were obtained from Trillium Health, Credit Valley Hospital and William Osler Health System using procedures approved by the University Health Network Research Ethics Board. Freshly sorted populations from three to five independent cord blood pools ( Supplementary Fig. 9a) were directly resuspended and frozen in PicoPure RNA Isolation Kit Extraction Buffer. RNA was extracted using PicoPure RNA Isolation Kit (Thermo Fisher Scientific) according to the manufacturer's protocols. RNA was analyzed on a Bioanalyzer Pico chip (Agilent Technologies), and samples that passed quality control according to integrity (RIN > 8) and concentration were subjected to further processing. Using SMART-Seq v4 Ultra Low Input RNA Kit for Sequencing (Takara), cDNA conversion was performed. Next, RNA-seq libraries were prepared using Nextera XT DNA Library Preparation Kit (Illumina) and equimolar quantities of libraries were pooled. Libraries were sequenced on the Illumina HiSeq 2500 for generating paired-end reads of 125 bp in length and a target sequencing depth of 55-75 million reads per sample.

scRNA-seq and phenotype annotation
scRNA-seq was performed for nine BCR-ABL1 ALL samples. Seven of nine samples were flow-sorted to enrich for leukemic blasts and two samples (Ph2-D and Ph22-D) were not flow-sorted. Using Chromium Single Cell 3′ v2 kit (10× Genomics), 10,000 leukemic cells from each sample were loaded to partition each cell and reagents into a droplet emulsion. scRNA-seq libraries were then prepared using the Single Cell 3′ v2 reagent kit (10× Genomics) according to the manufacturer's protocols. Initial library quantification was done using the Qubit dsDNA High Sensitivity Kit (Thermo Fisher Scientific). To assess library quality, samples were run on the LabChip GXII Touch HT (PerkinElmer) using the DNA High Sensitivity Kit (PerkinElmer). Sequencing libraries were quantified using the KAPA Illumina Library Quantification Kit (Roche) according to the manufacturer's instructions. Cluster generation and sequencing were performed on the Illumina HiSeq 2500 and NovaSeq 6000 platforms. For each sample, at least 40,000 mean reads per cell were acquired. Using CellRanger v3.0.1 (10× Genomics), scRNA-seq data were aligned to hg19. This was followed by further processing using Seurat v3. 1.2 (ref. 65). Cells with high ribosomal expression (>60%), high mitochondrial expression (>30%) or fewer than 1,000 expressed genes were removed from further analysis. A loess curve was fit to the UMI by gene curve, and cells that were found greater than two standard deviations away were removed. Furthermore, genes expressed in fewer than five cells were removed, along with mitochondrial and ribosomal genes. The data were normalized using the sctransform normalization method provided by Seurat and UMI count, percent mitochondria and cell cycle scores were regressed out. Normalized data were clustered using the self-assembling manifolds algorithm 66 with the Leiden clustering method at a resolution of 1. Uniform manifold approximation and projection (UMAP) were then performed 67 . To enrich each sample for leukemic cells, we removed cells that do not express DNTT but express LYZ (mono/gran), GZMH (NK) or IL32 (NK/T) and cells that express CD3E, CD3G or CD8A (T). When >60% of cells in a cluster showed those expression patterns, the whole cluster was removed.
Single cells from each leukemia sample were annotated with normal hematopoietic cell types using SingleR v1.0.6 (ref. 16). scTransform-normalized expression profile of each leukemia cell was iteratively compared to the reference dataset of bulk hematopoietic cell compartments (Kim et al., this study) to identify the cell type with the highest gene expression similarity. Pseudotime analysis was performed using Monocle 3 v0.2.3.0 (ref. 68). Based on the expression of RUNX2, which is a defining marker gene of C1/Early-Pro subtype, the starting point of each trajectory was derived. Cell cycle scores for each of the G1/S and G2/M phases were calculated using UMI counts normalized by library size (total UMI count) where cell cycle scores had not been regressed out. We used G1/S and G2/M genes from ref. 69, excluding M/G1 genes. Cells were scored for the G1/S and G2/M gene sets by taking the average log 2 expression of the genes 70 . Each cell was then assigned to G0, G1/S or G2/M phase using cutoffs, which were determined for each gene set as one standard deviation greater than the mean score across all samples in our dataset. Cells below both the G1/S and G2/M cutoffs were assigned G0, cells exceeding the G1/S cutoff but below the G2/M cutoff were assigned G1/S, while remaining cells exceeding the G2/M cutoff were assigned G2/M.
To scRNA-seq data were normalized through scTransform with 2,000 variable genes, and a UMAP dimensionality reduction was performed with the following parameters: 30 PCs, 50 neighbors and cosine distance metric. Leiden clustering with multilevel refinement was performed at a resolution of 0.5, and clusters were labeled based on prior annotations, RNA marker genes, as well as key protein-level surface markers, including CD133, CD34, CD33, Tim3, CD38, CD45RA, CD10, CD19, CD20, CD21 and IgD. Then, Symphony v0.1.1 was used to project single leukemic cells from the current study onto the normal B-cell development hierarchy 71 . Briefly, Symphony was used to build a reference based on the scTransform-normalized expression of 2,000 variable genes. scRNA-seq data from 9 BCR-ABL1 ALL samples were normalized with scTransform and projected onto the B-cell development reference map, and each query cell was assigned a cell type label based on 30 nearest-neighbors within the reference map.

3′ RNA-seq analysis of a second cohort
We obtained a second cohort of 40 BCR-ABL1 ALL samples from the University Health Network Leukemia Biobank (Supplementary Table 3). The median age at diagnosis was 54 years old (range: 24-88 years old). Thirty-nine samples were from PB and one sample was from BM. The composition of BCR-ABL1 isoforms in the cohort was 26 p190 and 14 p210. Human CD19 MicroBeads (Miltenyi Biotec) were used to enrich for CD19 + cells from 10 million bulk cells. While CD19 − cells would be depleted by this method, we predicted this to have minimal impact as most samples in the main cohort showed blasts with high to dim levels of CD19. Furthermore, we previously observed that when blast populations with different antigen expression levels are taken from a given sample, their RNA profiles are highly consistent. RNA was extracted from CD19 + cells using PicoPure RNA Isolation Kit (Life Technologies) with RNase-free DNase Set (Qiagen). Extracted RNA quantity was measured by Qubit RNA High Sensitivity Assay Kit (Thermo Fisher Scientific) and quality was assessed using RNA ScreenTape Assay on the 2200 TapeStation System (Agilent Technologies).
Article https://doi.org/10.1038/s41588-023-01429-4 3′ RNA-seq libraries were constructed according to the Smart-3SEQ method as previously described 72 . In brief, 10 ng of RNA was fragmented and reverse-transcribed. The resulting double-stranded cDNA libraries were subjected to 17 PCR cycles for library amplification and index addition. Completed libraries were quantified by Qubit DNA High Sensitivity Assay Kit (Thermo Fisher Scientific) and quality was assessed on the LabChip GXII Touch HT using the DNA High Sensitivity Kit (PerkinElmer). Libraries were pooled according to their DNA concentrations and sequenced on the Illumina NextSeq 500 instrument with a Mid Output Reagent Kit v2.5 and 10% PhiX spike-in (Illumina), reading 150 nucleotides for read 1 and 6 nucleotides for the P7 index read. RNA-seq analysis was performed as described above. Consensus hierarchical clustering was performed using 163 genes identified from the original cohort.

Whole-genome sequencing
Genomic DNA was extracted from leukemic blasts and T-cells of 57 samples in the main cohort using the Gentra Puregene Cell Kit (Qiagen

WGS data alignment and variant calling
WGS data were aligned using bwa v0.7.12 against human genome build 19 (hg19) 73 and postprocessing was done using Picard v1.90 (http://broadinstitute.github.io/picard). Using the Genome Analysis Tool Kit (GATK) v3.6.0, germline variants were called according to the 'GATK best practices' for the version 74 80 , and CNVs and SVs were annotated with overlapping genes and repetitive sequence elements to highlight biologically meaningful events, such as gene fusions, and filter out potential false positives.

Nested PCR
SVs and SNVs were validated by a custom touchdown, nested PCR with leukemia-specific primers followed by Sanger sequencing. For SVs, such as translocations and deletions, two pairs of primers were designed to anneal to the opposite sides of the breakpoint junction (Supplementary Table 13). Presence or absence of PCR amplification corresponded to presence or absence of the variant. To enhance specificity, the internal primer pair for each variant was designed to only amplify the PCR amplicon of the external primer pair, and the size difference between the two amplicons was at least 100 bp. PCR was carried out using the Phusion High-Fidelity PCR Master Mix with HF Buffer (New England Biolabs), and custom DNA oligos were acquired from Integrated DNA Technologies. Sanger sequencing was performed at the Centre for Applied Genomics (TCAG) DNA Sequencing Facility at the Hospital for Sick Children, Toronto.

SiMSen-seq for detection of low-frequency mutations in ABL1 kinase domain
SiMSen-seq was performed according to the developer's protocol 81 to detect low-frequency mutations in the kinase domain of ABL1. In brief, standard PCR primers of 18-27 nucleotides in length were designed using BatchPrimer3 (ref. 82) to generate eight amplicons covering frequently mutated regions in the kinase domain (amino acids 221-266, 275-424, 428-473 and 475-504). Standard primers with the highest amplification efficiency and specificity were chosen for each amplicon, and barcoding primers were designed by adding unique molecular barcode and adapter sequences. Final amplicon sizes ranged from 281 to 300 bp. Barcoding primers were pooled to 1 μM each (primer pairs with relatively lower efficiency were pooled at slightly higher concentrations). 70 ng of genomic DNA from 48 samples (44 diagnostics and 4 relapses) were used for each SiMSen-seq reaction. Initial barcoding PCR (three cycles) and adapter amplification PCR (27 cycles) were performed on the C1000 Thermal Cycler (Bio-Rad). Libraries were run on the LabChip GXII Touch HT using the DNA High Sensitivity Kit (PerkinElmer) to assess the quality and quantified using Qubit dsDNA High Sensitivity Kit (Thermo Fisher Scientific). Paired-end sequencing (2 × 150) was performed on the Illumina NextSeq 500 instrument with a Mid Output Reagent Kit v2.5 and 15% PhiX spike-in (Illumina). Debarcer algorithm was used to align sequence data, build sequence consensus and call variants 81 . Reads that map to the same amplicon and have the same molecular barcode are termed family of reads, and those with minimum family size of 20 were used to build consensus reads and remove PCR errors. The threshold for detection of low-frequency mutations was either three or more supporting consensus reads with consensus mutation frequency >0.4% or five or more supporting consensus reads.

Automated, capillary western blot assay
Proteins were extracted from cell pellets (1-2 million cells) by lysing with RIPA buffer (Thermo Fisher Scientific), 1× HALT Protease and Phosphatase Inhibitor Cocktail (Thermo Fisher Scientific) and 5 mM EDTA. Samples were pipette mixed and placed on ice for 5 min, followed by centrifugation at 15,000g at 4 °C for 15 min to pellet debris. The supernatant containing the extracted protein was transferred to a new tube. The protein concentration was determined using the Pierce BCA Protein Assay Kit (Thermo Fisher Scientific) according to the manufacturer's protocols.
Protein analyses were conducted on the western blot assay system (ProteinSimple) according to the manufacturer's protocols. The 12-230 kDa separation module was used for phosphorylated-STAT5 (C11C5; Cell Signaling Technology, 9359; 1:100 dilution) and total protein detection assays. In brief, extracted protein was diluted with 0.1× Sample Buffer and 5× Master Mix and heat-denatured for 5 min at 95 °C before loading. For the phosphorylated-STAT5 and c-Abl assays, the primary antibodies were diluted with Milk-Free Antibody Diluent and Antibody Diluent 2, respectively. All reagents were added to a microplate and a capillary cartridge was loaded on the western blot assay instrument, and run parameters were set using the COMPASS software v4.0.0 (ProteinSimple). The corresponding band of each protein was visualized using the COMPASS software, and protein concentration of the band was normalized to the total protein content.
SUC2 protein (yeast invertase) of 2 pmol was added as a digestion control. Disulfide bonds were reduced with 5 mM dithiothreitol, followed by a 30-min incubation at 60 °C. Free sulfhydryl groups were alkylated Article https://doi.org/10.1038/s41588-023-01429-4 by incubating samples in 25 mM iodoacetamide in the dark at room temperature for 30 min. An additional 5 mM of dithiothreitol (DTT) was added to quench the alkylation reaction and samples were incubated at room temperature for 5 min. The magnetic bead-based SP3 protocol 85 was used to capture proteins before digestion. In brief, magnetic beads were added to proteins in a 10:1 (wt/wt) ratio. Absolute ethanol was added to bring the ethanol concentration to 50%. Samples were shaken at room temperature for 5 min at 1,000 rpm, and supernatant was discarded. The beads were rinsed three times with 80% ethanol and air-dried. Proteins were digested in 25 mM ammonium bicarbonate containing 2 μg of trypsin/Lys-C enzyme mix (Promega) at 37 °C overnight. Samples were centrifuged at 18,500g for 1 min to pellet the beads, and the supernatant containing digested peptides was removed. Hundred microliter of water was used to elute remaining peptides off the beads. Peptides were desalted by C18-based solid phase extraction, then lyophilized in a SpeedVac vacuum concentrator. Peptides were solubilized in mass spectrometry-grade water with 0.1% formic acid.
Synthetic iRT peptides (Biognosis) were spiked into each sample at a 1:10 ratio before data acquisition. LC-MS/MS data were acquired using an Easy nLC 1000 (Thermo Fisher Scientific) nano-flow liquid chromatography system with a 50 cm EasySpray ES803 column (Thermo Fisher Scientific) that is coupled to Orbitrap Fusion tandem mass spectrometer (Thermo Fisher Scientific). Peptides were separated by reverse phase chromatography using a 4-h nonlinear chromatographic gradient of 4-48% buffer B (0.1% formic acid (FA) in acetonitrile (ACN)) using a flow rate of 250 nl min −1 . Column temperature was kept at 45 °C. Mass spectrometry data were acquired in XCalibur 4.2.47 in positive-ion data-dependent mode. MS1 data were acquired at a resolution of 120,000 in the orbitrap, AGC target of 1 × 10 6 and maximum injection time (maxIT) of 50 ms and 40 s dynamic exclusion, while MS2 data were acquired in the ion trap at 'Normal' scan rate, maxIT of 100 ms. Higher energy collisional dissociation (HCD) fragmentation was performed at a normalized collision energy of 31%. Data were searched in MaxQuant v1.6.3.3 (ref. 86) using a merged UniProt protein sequence database containing human protein sequences from UniProt (complete human proteome, 2015-01-27; number of sequences 42,842), Suc2 (yeast) protein sequences from UniProt and iRT synthetic peptide sequences (Biognosis). Searches were performed with a maximum of two missed cleavages, and carbamidomethylation of cysteine as a fixed modification. Variable modifications were set as oxidation at methionine and acetylation (N-term). The false discovery rate for the target-decoy search was set to 1% for protein, peptide and site levels. Intensity-based absolute quantification (iBAQ), label-free quantitation (LFQ) and match between runs (matching and alignment time windows set as 0.7 min and 20 min, respectively) were enabled. The proteinGroups.txt file was used for subsequent analysis. Proteins matching decoy and contaminant sequences were removed, and proteins identified with two or more unique peptides were carried forward. LFQ intensities were used for protein quantitation 87 . For proteins with missing LFQ values, median-adjusted iBAQ values were used as replacement 88 . Protein intensities were log 2 -transformed for further analysis (Supplementary Table 14).

Analysis of SV breakpoints and junctions
SVs (deletions, insertions, inversions, intrachromosomal and interchromosomal translocations) were delineated to nucleotide-level resolution and analyzed for breakpoint and junction characteristics. Cryptic RSS were defined as the canonical heptamer ('CACAGTG'), its reverse complement ('CACTGTG') or variants retaining the first 'CACA' and a subset of other canonical bases at nonantigen receptor loci 28,89 . SVs were categorized as RSS + or RSS − based on the presence of RSS motifs on the inside of both breakpoints. When two breakpoints of an SV separate genomic regions into segments 1A/1B and 2A/2B, where segments 1A and 2B are then joined together, search ranges for RSS motifs consisted of the first 30 bp of 1B and 2A. A few exceptions to the rule were made, which are as follows: (1) If an RSS motif is found at one breakpoint, the other RSS is allowed to be a variant retaining the first 'CAC' and a subset of other canonical bases; (2) If an RSS is found within 30 bp on the inside of one breakpoint, the other RSS is allowed to be located further from the breakpoint (>96% of RSS are found within 30 bp and >99% are found within 50 bp); (3) If a breakpoint is near a breakpoint cluster, such as those in IKZF1, the RSS is allowed to be located further from the breakpoint; (4) A less conserved RSS motif is considered functional if it is near a cluster of breakpoints. Notably, 'CATACTG' was found near a breakpoint cluster in LEMD3, and 'CCCAGTG' was found near a breakpoint cluster in PCMTD1. In total, 796 cryptic RSS motifs were identified from 399 RSS + cooperating event SVs; one breakpoint in IKZF1 and one in CBWD2 did not harbor RSS motifs despite the opposite breakpoints being positioned in RSS + breakpoint clusters. Sequence logos were generated using WebLogo v2.8.2 (ref. 48).
H3K4me3 ChIP-seq signal of GM12878 (B-lymphoblast cell line) from the ENCODE database was used to calculate the distance between each SV breakpoint and the nearest H3K4me3 peak 90 . Chromatin state annotations of GM12878 by ChromHMM were used to assign each SV breakpoint with a chromatin state 29,91 . SV junctions were assessed for NTS insertion (addition of nucleotides not present at either breakpoint), microhomology (identical sequence of nucleotides between two breakpoints) or clean junction (neither NTS insertion nor microhomology) 47 .

Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability
WGS and RNA-seq data are deposited and publicly available on the European Genome-Phenome Archive (EGAS00001007167). Mass spectrometry data can be found in Supplementary Table 14.