Fusion gene map of acute leukemia revealed by transcriptome sequencing of a consecutive cohort of 1000 cases in a single center

Fusion genes (FGs) are important genetic abnormalities in acute leukemias, but their variety and occurrence in acute leukemias remain to be systematically described. Whole transcriptome sequencing (WTS) provides a powerful tool for analyzing FGs. Here we report the FG map revealed by WTS in a consecutive cohort of 1000 acute leukemia cases in a single center, including 539 acute myeloid leukemia (AML), 437 acute lymphoblastic leukemia (ALL), and 24 mixed-phenotype acute leukemia (MPAL) patients. Bioinformatic analysis identified 792 high-confidence in-frame fusion events (296 distinct fusions), which were classified into four tiers. Tier A (pathogenic), B (likely pathogenic), and C (uncertain significance) FGs were identified in 61.8% of cases in the total cohort (59.7% in AML, 64.5% in ALL, and 63.6% in MPAL). FGs involving protein kinase, transcription factor, and epigenetic genes were detected in 10.7%, 48.5%, and 15.1% of cases, respectively. A considerable number of novel FGs (82 in AML, 88 in B-ALL, 13 in T-ALL, and 9 in MPAL) were identified. This comprehensive map of FGs in acute leukemia revealed multiple FGs with clinical relevance that have not been previously recognized. WTS is a valuable tool and should be widely used in the routine diagnostic workup of acute leukemia.


INTRODUCTION
Fusion genes (FGs) are major molecular abnormalities in acute leukemia, and all well-known FGs in leukemias are founder variations that play crucial roles in tumorigenesis. They persist stably in tumor cells and have been used as molecular markers for the diagnosis, classification, risk stratification, and targeted therapy of leukemia. They can also be used as molecular markers for monitoring minimal residual disease (MRD) with high sensitivity. Because of their essential role in leukemogenesis, the WHO classification of neoplastic diseases of the hematopoietic and lymphoid tissues has incorporated dozens of FGs as essential molecular markers since 2001 [1]. Since then, simultaneous screening for multiple common FGs, followed by quantitative monitoring of the positive ones, has been part of the routine clinical diagnostic workup of acute leukemia.
We have previously reported that common FGs were present in 41% of acute myeloid leukemia (AML) and 29% of acute lymphoblastic leukemia (ALL) cases [2,3]. The distribution of FGs in acute leukemia showed a typical long-tail pattern: several FGs with high frequencies were followed by a large number of FGs with low frequencies that gradually tail off asymptotically. The fusion events at the far end of the tail had a very low probability of occurrence. In AML, 23 distinct FGs were detected in 1292 of 3135 patients. Eight FGs with frequencies of more than 1% accounted for 94% of all positive FGs. The other 15 FGs, with frequencies below 1%, constituted the long tail of the distribution. Similarly, in ALL, only 5 FGs had positive rates of more than 1%, and these accounted for 89% of all positive FGs. The individual positive rates of dozens of other FGs were all below 1%, even though they have been frequently reported in the literature [2,3].
The rapid development of sequencing technology and the decline in sequencing costs in recent years have made whole transcriptome sequencing (WTS) more accessible. WTS can detect known FGs and has unique advantages in identifying unknown, rare, and variant FGs. Several groups have discovered numerous novel FGs, such as those involving ZNF384, MEF2D, PAX5, and DUX4 rearrangements, among cases that were once regarded as B-other-ALL with no defining cytogenetic abnormalities [4-9]. To better understand the incidences of FGs and their pathological characteristics, we proposed the concept of the "fusion gene family" (FG-FM) to classify fusions that involve one protagonist gene and various fusion partners [10]. FGs in the same family often share commonalities in pathogenicity, clinical features, and treatment outcomes. Although most newly identified FGs are individually rare, their overall incidence is significant because of their wide variety. To date, the number of pathogenic driver FGs that remain undiscovered in acute leukemias, and their total positive rate, are unknown. FGs with pathological significance, even individually rare ones, still have definite value in clinical diagnosis, treatment guidance, and MRD monitoring. Thus, it is essential to decipher the distribution of FGs in acute leukemias and to investigate effective detection methods.
Considering that the versatility of WTS could uncover otherwise undetected FGs, we have used WTS to analyze FGs for accompanying diagnosis in our hospitalized acute leukemia patients since September 2018. Here, we provide a retrospective overview of the FG map in our patients.

SUBJECTS AND METHODS

Patients
From September 2018 to September 2020, a consecutive cohort of 1000 cases with confirmed diagnoses of acute leukemia in Hebei Yanda Lu Daopei Hospital was enrolled in this study, including 405 children (≤18 years, median age 8 years, range 8 months-18 years; 237 males, 168 females) and 595 adults (>18 years, median age 43 years, range 19-89 years; 320 males, 275 females). Among them, 539 were AML (137 children and 402 adults); 437 were ALL (257 children and 180 adults), including 365 B-ALL and 72 T-ALL; and 24 were mixed-phenotype acute leukemia (MPAL) (11 children and 13 adults). The diagnosis was made according to the 2016 revision to the WHO classification of tumors of hematopoietic and lymphoid tissues [11,12]. Fifty healthy donors in Hebei Yanda Lu Daopei Hospital were included as controls. The study was approved by the medical ethics committees at Hebei Yanda Lu Daopei Hospital. Written informed consent for medical record review was obtained from all patients and healthy controls or their guardians following the Declaration of Helsinki.

Sample preparation
Bone marrow samples were collected. Nucleated cells (1.0 × 10^7) were used for genomic DNA extraction. Nucleated cells (5.0 × 10^6) were used for total RNA extraction by the guanidinium thiocyanate-phenol-chloroform method using TRIZOL reagent according to the manufacturer's recommendations (Invitrogen Corporation, Carlsbad, CA, USA). Complementary DNA (cDNA) was synthesized using M-MLV Reverse Transcriptase (Promega Corporation, Madison, WI, USA) or a Maxima First Strand cDNA Synthesis Kit (Thermo Fisher Scientific, Inc., Waltham, MA, USA) according to the manufacturer's protocol.

FGs screening
All enrolled cases underwent common FGs screening in parallel. A total of 131 mRNA isotypes of 41 FGs (Table S1), which have been reported as recurrent in leukemia patients, were screened by multiplex-nested reverse-transcription PCR (RT-PCR) according to the protocols we previously reported [2,3].

Library preparation, WTS
Total RNA quality was assessed using a NanoPhotometer spectrophotometer, a Qubit 2.0 Fluorometer (Life Technologies), and an Agilent 2100 Bioanalyzer. High-quality RNA was then subjected to library preparation using a NEBNext® Ultra™ Directional RNA Library Prep Kit for Illumina® (New England BioLabs) according to the manufacturer's instructions with input ≥1 μg of total RNA. Paired-end sequencing with a read length of 150 bp was performed on the Illumina HiSeq 2500 platform, yielding at least 50 (71.35 ± 11.10, mean ± s.d.) million sequence reads mapped to 16

FG detection in WTS data
We used the Arriba (v1.0.1) [13] algorithm to detect fusion transcripts in WTS data. Arriba assigns each prediction a confidence of low, medium, or high. The confidence reflects three aspects: the likelihood that the transcript is aberrant (not seen in healthy tissue), that it can be explained by an underlying genomic rearrangement, and that it is not an artifact. The number of supporting reads is one of the most helpful attributes for distinguishing artifacts from true events. Arriba assumes a polynomial relationship between the number of supporting reads and the level of background noise. Only candidates with more supporting reads than the estimated level of background noise are reported. In addition, the statistical model of Arriba incorporates several covariates that correlate with the level of background noise, including the sequencing depth, the breakpoint distance, the library preparation protocol, and the location of the breakpoints.
In our analysis, we defined "positive fusion events" as in-frame FGs which were selected from the high-confidence predictions, unless Arriba annotated the events as 'read-through'. Reciprocal fusion transcripts were counted as one fusion event. FGs were defined as "novel" if they were not reported in previous literature and not found by performing queries against the Atlas of Genetics and Cytogenetics in Oncology and Haematology (http://atlasgeneticsoncology.org/index.html), the Tumor Fusion Gene Data Portal (https://www.tumorfusions.org), and ChimerDB [14]. For the validation of the FGs not detected in FGs screening, RT-PCR was performed, followed by Sanger sequencing.
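The selection rule described above can be sketched in a few lines of Python. This is only an illustration, not the pipeline used in the study: the dictionary keys (`gene1`, `gene2`, `type`, `reading_frame`, `confidence`) are modelled on the column names of Arriba's tab-separated output and should be treated as assumptions to be checked against the header produced by the Arriba version in use. The function name is hypothetical.

```python
def positive_fusion_events(rows):
    """Return the set of positive fusion events among Arriba-style records.

    `rows` is an iterable of dicts; the key names are assumed, not confirmed
    by the text. A record is kept if it is a high-confidence, in-frame
    prediction that is not annotated as read-through.
    """
    events = set()
    for row in rows:
        if row["confidence"] != "high":
            continue  # keep only high-confidence predictions
        if row["reading_frame"] != "in-frame":
            continue  # keep only in-frame fusions
        if "read-through" in row["type"]:
            continue  # drop events annotated as read-through
        # An unordered gene pair makes A-B and its reciprocal B-A
        # count as a single fusion event.
        events.add(frozenset((row["gene1"], row["gene2"])))
    return events
```

Representing each event as an unordered gene pair is one simple way to count reciprocal transcripts once; a real analysis might also need to compare breakpoints before merging two predictions.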

Pathogenicity evaluation of FGs
We classified the final FGs list into four tiers based on our current understanding of their pathogenic impact: (A) pathogenic: well-known FGs or new members of common FG-FMs with definite pathogenicity in hematological malignancies or other tumors, e.g., BCR-ABL1 or new members of ABL1-FM; (B) likely pathogenic: rarely reported FGs or new members of rare FG-FMs in hematological malignancies or other tumors without functional verification, e.g., TBC1D15-RAB21, which was reported in acute promyelocytic leukemia without functional verification [15]; or FGs in which one of the partner genes was reported in hematological malignancies with other forms of abnormalities, such as mutation, e.g., ASXL2-ITSN2 (ASXL2 is frequently mutated in AML patients [16]); (C) uncertain significance: neither fusion partner reported before in hematological malignancies with any form of genomic alteration, e.g., ANAPC7-GPN3; and (D) non-pathogenic: FGs detected in normal samples.
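The four-tier scheme can be read as a small decision rule. The sketch below is a hedged illustration only: in the study, tiers were assigned by curation against the literature and databases, whereas here that evidence is reduced to four hypothetical boolean flags. The order in which the rules are checked (tier D first, then A, B, C) is an assumption, since the text does not state a precedence.

```python
from dataclasses import dataclass


@dataclass
class FusionEvidence:
    # All four flags are hypothetical summaries of curated evidence.
    detected_in_normal_samples: bool   # fusion seen in healthy controls
    definite_pathogenicity: bool       # well-known FG or new member of a common FG-FM
    rarely_reported_unverified: bool   # rare FG / rare FG-FM member, no functional data
    partner_altered_in_heme: bool      # a partner gene is altered (e.g. mutated) in
                                       # hematological malignancies in another form


def assign_tier(ev: FusionEvidence) -> str:
    """Map curated evidence to tier A, B, C, or D (assumed precedence)."""
    if ev.detected_in_normal_samples:
        return "D"  # non-pathogenic
    if ev.definite_pathogenicity:
        return "A"  # pathogenic
    if ev.rarely_reported_unverified or ev.partner_altered_in_heme:
        return "B"  # likely pathogenic
    return "C"      # uncertain significance
```

For example, a fusion with `definite_pathogenicity=True` and no detection in normal samples maps to tier A, while a fusion with no supporting evidence of any kind falls through to tier C.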

Statistical analysis
We used χ2 and Fisher's exact tests to compare differences in frequencies of FGs between different age groups and leukemia subtypes. We performed all of the analyses with SPSS Statistics software, version 20 (IBM Corp., Armonk, New York, US). A two-sided P-value of <0.05 was considered to be statistically significant.

Spectrum and incidence of FGs in all patients
We identified 792 high-confidence in-frame fusion events of 296 distinct FGs in 1000 acute leukemia cases. We further classified these FGs into four tiers based on pathogenicity investigation, and the number of tier A, B, C, and D FGs was 116 (39.2%), 114 (38.5%), 58 (19.6%), and 8 (2.7%), respectively. The 8 tier D FGs (39 total fusion events), which were unlikely to be pathogenic, were not analyzed further.
Tier A, tier B, and tier C FGs were detected in 513 (51.3%), 146 (14.6%), and 59 (5.9%) cases, respectively. Concurrence of tier A and tier B fusions was detected in 50 (5.0%) cases; coexistence of tier A and tier C fusions was identified in 26 (2.6%) cases; cooccurrence of tier B and tier C fusions was found in 8 (0.8%) cases; and the remaining 8 (0.8%) cases had tier A and tier B and tier C fusions simultaneously (Fig. 1b).
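The per-tier counts and the overlaps above can be combined by inclusion-exclusion to recover the number of cases carrying at least one tier A, B, or C fusion. The sketch assumes that each pairwise figure (50, 26, 8) counts cases with exactly those two tiers, and that the "remaining 8" cases with all three tiers are not included in them; the function and parameter names are illustrative.

```python
def cases_with_any_tier(a, b, c, only_ab, only_ac, only_bc, abc):
    """Count cases with at least one tier A/B/C fusion by inclusion-exclusion.

    a, b, c: cases carrying a tier A, B, or C fusion (overlaps included).
    only_ab, only_ac, only_bc: cases with exactly those two tiers (assumed).
    abc: cases carrying all three tiers.
    """
    ab = only_ab + abc  # every case with both A and B
    ac = only_ac + abc  # every case with both A and C
    bc = only_bc + abc  # every case with both B and C
    return a + b + c - ab - ac - bc + abc


total = cases_with_any_tier(513, 146, 59, 50, 26, 8, 8)
print(total, f"{total / 1000:.1%}")  # 618 61.8%
```

Under this reading the counts give 618 of 1000 cases, which agrees with the 61.8% overall positive rate stated in the abstract.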
We found 57 recurrent FGs that occurred at least twice, including 39 tier A, 16 tier B, and 2 tier C FGs. Fourteen FGs with relatively high incidences were: RUNX1-RUNX1T1

FG map of ALL
Within the ALL group, 69 kinds of tier A FGs, 61 kinds of tier B FGs, and 29 distinct tier C FGs were detected in 282 (64.5%) of the 437 ALL cases (Fig. 2b, c). The incidence of FGs in adult ALL was 72.2% (130/180) and significantly higher than 59.1% (152/257) in pediatric ALL cases (P = 0.006). The incidence of FGs in B-ALL cases was 67.4% (246/365) and significantly higher than 50.0% (36/72) in T-ALL cases (P = 0.007).
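As a rough check of the comparisons above, a Pearson χ2 test on a 2 × 2 table can be written with the Python standard library alone. This is a sketch, not a reproduction of the SPSS analysis: the text does not say whether each reported P-value came from the uncorrected χ2 test, a continuity-corrected version, or Fisher's exact test, so small differences from the published values are to be expected. The function name is illustrative.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the table

        group 1: a positive, b negative
        group 2: c positive, d negative

    Returns (chi2, two-sided p) with 1 degree of freedom.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


# Adult ALL (130 of 180 positive) versus pediatric ALL (152 of 257 positive).
chi2_age, p_age = chi_square_2x2(130, 180 - 130, 152, 257 - 152)

# B-ALL (246 of 365 positive) versus T-ALL (36 of 72 positive).
chi2_lineage, p_lineage = chi_square_2x2(246, 365 - 246, 36, 72 - 36)

print(f"age: chi2={chi2_age:.2f}, p={p_age:.4f}")
print(f"lineage: chi2={chi2_lineage:.2f}, p={p_lineage:.4f}")
```

Both comparisons give P-values below 0.01 with this uncorrected test, in line with the significant differences reported in the text.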

FGs detected in MPAL
We detected 22 fusion events (including 15 tier A, 2 tier B, and 5 tier C fusion events, respectively) in 14 (63.6%) of the 24 MPAL cases (Fig. 2d). BCR-ABL1 was detected in 5 patients. Three ETV6

KMT2A fusions
KMT2A-FM is a large FG-FM that has been systematically studied and reported in acute leukemia, with more than 100 partner genes, and their specific breakpoint regions have been identified [18,19]. In this study, fusions of KMT2A with 14 different partner genes were detected in 76 cases (7.6%), including a novel KMT2A-CARS fusion identified in a pediatric T-ALL case (Fig. S1).

Transcription factor gene fusions
Chromosomal translocations involving transcription factors are frequently seen in acute leukemia, and some of them have been used as genetic markers for leukemia classification because of their distinctive clinicopathological features and prognostic significance, such as AML with RUNX1-RUNX1T1 and B-ALL with TCF3-PBX1. AML with RUNX1-RUNX1T1 or CBFB-MYH11 and acute promyelocytic leukemia with PML-RARA are considered to be acute leukemias without regard to blast cell count [12]. In this study, 521 transcription factor-associated FGs (149 distinct fusions) were detected in 485 (48.5%) cases (54.9% in AML, 41.7% in MPAL, 41.6% in B-ALL, and 37.5% in T-ALL). Most of them were core-binding factor (RUNX1 and CBFB) fusions, zinc-finger transcription factor (mainly ZNF292 and ZNF384) fusions, transcriptional coactivator (most of them KMT2A) fusions, and nucleoporin (NUP98 and NUP214) fusions (Fig. 4).

Epigenetic gene fusions
Recent studies have demonstrated that genes controlling chromatin modification and epigenetic programs include drivers of human cancer, leading to an increased awareness of the epigenetic protein families as potential drug targets. Inhibitors of DNA methylation and histone deacetylase (HDAC) inhibitors have been approved for clinical use in hematological malignancies, thus providing proof of concept for epigenetic therapies [23,24]. FGs involving chromatin modifier and epigenetic genes were detected in 151 (15.1%) cases (20.8% in MPAL, 17.6% in AML, 12.6% in B-ALL, and 6.9% in T-ALL). Most of them (79.6%, 121/152) were histone methyltransferase- or histone acetyltransferase-related FGs. KMT2A-related FGs, which may be responsive to DOT1L inhibitors [25], were detected in 76 cases (50.3%). EP300-related FGs, which may be responsive to the potent, selective EP300 inhibitor C646 [26] or the histone deacetylase inhibitor vorinostat [27], were detected in 17 cases (11.3%) (Fig. 5).

Novel FGs
Notably, a considerable number of so-far unreported FGs were detected in this cohort. Among the 187 novel fusions (231 fusion events), 13 FGs were detected in 2 or more cases (ZNF292-PNRC1 in 30 cases; DDX5-GLUL, FOSB-UBC, and PTMA-CXCR4 in 3 cases each; the other 9 FGs in 2 cases each), and the other 174 novel FGs were observed in 1 patient each. The ZNF292-PNRC1 fusion, which has a prevalence of 3.0% in all cases, was observed in both AML and ALL. The ZNF292 gene encodes a zinc-finger transcription factor that functions as a tumor suppressor, and eight fusion partners (B3GAT2, CGA, FIG4, GRM4, EYS, PRSS54, PTPRO, and MAP3K4) have been reported in several solid tumors (invasive breast carcinoma, lower-grade glioma, prostate adenocarcinoma, and sarcoma) in the Tumor Fusion Gene Data Portal. The fusion point of ZNF292 in all these reported cases was restricted to exon 1, the same as in our ZNF292-PNRC1-positive cases, so we deemed this novel fusion a likely pathogenic FG.

Comparison of results between WTS and FGs screening
Multiplex-nested RT-PCR, which was designed to detect 41 common FGs (all belonged to tier A FGs), was performed in all 1000 cases, and only 376 (37.6%) cases were positive. All FGs detected by FGs screening were also observed in WTS.
WTS detected tier A FGs in 513 (51.3%) cases. Therefore, the application of WTS found pathogenic (tier A) FGs in an additional 13.7% of cases. Moreover, tier B FGs without tier A FGs were identified in 88 (8.8%) cases, and tier C FGs without tier A or B FGs were found in 17 (1.7%) cases (Fig. 7). Although the pathogenicity of tier B FGs needs further clinical and functional verification, they are highly likely to be pathogenic. Tier C FGs involve many genes of potential functional relevance, indicating that some of them may contribute to pathogenesis in the respective patients.
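The incremental yield of WTS over FGs screening follows directly from the counts in this section and can be laid out as a short calculation. The sketch relies on two statements from the text: all FGs found by screening were also found by WTS, and all 41 screened FGs belong to tier A, so the screening-positive cases are a subset of the tier A cases. Variable names are illustrative.

```python
TOTAL_CASES = 1000

screening_positive = 376   # multiplex-nested RT-PCR, 41 common tier A FGs
wts_tier_a = 513           # cases with a tier A FG by WTS
tier_b_without_a = 88      # cases with tier B but no tier A FG
tier_c_without_a_or_b = 17 # cases with tier C but no tier A or B FG

# Extra tier A cases found only by WTS.
extra_tier_a = wts_tier_a - screening_positive

# The three groups are mutually exclusive, so they can simply be summed.
any_tier = wts_tier_a + tier_b_without_a + tier_c_without_a_or_b

# Cases with a tier A/B/C FG that screening alone did not flag.
extra_any_tier = any_tier - screening_positive

print(f"extra tier A: {extra_tier_a} ({extra_tier_a / TOTAL_CASES:.1%})")
print(f"any tier: {any_tier} ({any_tier / TOTAL_CASES:.1%})")
print(f"extra any tier: {extra_any_tier} ({extra_any_tier / TOTAL_CASES:.1%})")
```

The difference of 137 cases corresponds to the additional 13.7% quoted above, and the three mutually exclusive groups sum to 618 cases, matching the 61.8% overall positive rate.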

DISCUSSION
Recurrent FGs in hematological malignancies are major genetic variants that contribute to tumorigenesis. In this cohort, we detected FGs in 61.8% of acute leukemia patients, and the real map of FGs was different from what we expected. Some FGs or FG-FMs may actually have a high incidence but have not been effectively identified before because they are cytogenetically cryptic and produce no observable karyotype abnormalities on conventional chromosome banding analysis. Examples of these previously underestimated FGs include NUP98-NSD1, whose partner genes are both located close to the telomeric ends of their chromosomes, 11 (NUP98) and 5 (NSD1) [28], and fusions involving ZNF384, which is located close to the telomeric end of chromosome 12 [4]. In addition, WTS showed that 17.8% of patients harbored more than one FG, compared with 0.3% of patients when detection relied on limited FGs screening [29]. The mechanism underlying the concurrence of multiple FGs and their possible pathogenic cooperation merit further study.
WTS not only reliably detected all FGs revealed by common FGs screening but also found pathogenic (tier A) FGs in an additional 13.7% of cases. Moreover, WTS identified 187 novel FGs in this 1000-case cohort. More than half of the novel FGs arose from intra-chromosomal rearrangements, which are often missed by karyotype analysis. Therefore, WTS proved to be a powerful tool for FGs analysis and has unique advantages for identifying unknown, rare, or variant FGs. For example, more than 100 KMT2A partner genes have been identified, and multiplex-nested RT-PCR methods cannot fully cover them. Moreover, further KMT2A fusions very likely exist that have not yet been identified. Even FGs with relatively high incidence may be missed because of variant breakpoints in KMT2A, as in the 12 cases in the present study that had KMT2A fusions but were negative in FGs screening. Identifying FGs that can be used for sensitive MRD monitoring or that have potential therapeutic relevance has critical clinical significance, particularly in cases so far lacking a respective marker or target. For instance, JAK2 inhibitors might be of use for patients harboring novel JAK2 fusions (ERC1-JAK2, NPHP3-JAK2, RNPC3-JAK2 [30], ROCK1-JAK2). In addition, transcriptome data can provide comprehensive genetic information, including FGs, somatic mutations, tandem duplications, and gene expression. Integrated analysis of this genetic information has the potential to permit precise classification with risk assessment in acute leukemia cases and to improve personalized treatment. Based on the experience of implementing WTS as a diagnostic tool in leukemia, Arindrarto et al. [31] showed that WTS has the potential to replace all genetic tests for classification and risk assessment of AML except for metaphase cytogenetics.
However, it is worth noting that most nonrecurrent fusion transcripts detected by massively parallel sequencing methods, including WTS, are merely stochastic events and non-pathogenic [32]. Effectively distinguishing the key FGs from this large background is a considerable challenge that demands professional expertise and analytical skill from the analyst. Attention should also be paid to avoiding false-negative results caused by methodologies or analysis procedures. To better understand the pathological characteristics of FGs, we put forward the concept of FG-FM and classified the final FGs list into four tiers based on our current understanding of their pathogenic impact. FGs in the same FG-FM generally share commonalities in pathological and clinical features, but attention should also be paid to the influence of partner genes on the pathogenicity of FGs. If both FG partners are protagonist genes of an FG-FM, the FG should be classified mainly according to its major pathogenicity and corresponding clinical features. For example, PAX5-JAK2 should be classified under JAK2-FM rather than PAX5-FM, since the gene expression signature and the clinical manifestations of PAX5-JAK2-positive ALL are similar to those of BCR-ABL1-like ALL.
Taken together, WTS has provided a powerful methodology for comprehensively deciphering the real FG map in acute leukemia. Moreover, the map will become increasingly precise as the ability to analyze the pathological significance of novel FGs improves and the various databases continue to be enriched. We described the map of FGs detected in a large cohort of acute leukemias and revealed a considerable number of FGs that have clinical relevance but have not been previously recognized. Classifying FGs according to FG-FMs can help us better understand their pathological significance and suggest new classification patterns for acute leukemia. WTS is a valuable tool and should