Article | Open | Published:

# GFusion: an Effective Algorithm to Identify Fusion Genes from Cancer RNA-Seq Data

## Abstract

Fusion gene derived from genomic rearrangement plays a key role in cancer initiation. The discovery of novel gene fusions may be of significant importance in cancer diagnosis and treatment. Meanwhile, next generation sequencing technology provide a sensitive and efficient way to identify gene fusions in genomic levels. However, there are still many challenges and limitations remaining in the existing methods which only rely on unmapped reads or discordant alignment fragments. In this work we have developed GFusion, a novel method using RNA-Seq data, to identify the fusion genes. This pipeline performs multiple alignments and strict filtering algorithm to improve sensitivity and reduce the false positive rate. GFusion successfully detected 34 from 43 previously reported fusions in four cancer datasets. We also demonstrated the effectiveness of GFusion using 24 million 76 bp paired-end reads simulation data which contains 42 artificial fusion genes, among which GFusion successfully discovered 37 fusion genes. Compared with existing methods, GFusion presented higher sensitivity and lower false positive rate. The GFusion pipeline can be accessed freely for non-commercial purposes at: https://github.com/xiaofengsong/GFusion.

## Introduction

A fusion gene is a hybrid gene formed from two different genes rejoining through chromosomal translocation, deletion or inversion1. Because of a close link to human cancer, fusion genes have attracted attentions of many researchers2. The first discovered fusion gene is BCR-ABL13,4,5, which is formed by a translocation event involving chromosome 9 and 22 and identified as the predominant factor predisposing to chronic myelogenous leukaemia (CML). Fusion BCR-ABL1 represents one class of fusion genes which impact cancer development through encoding chimeric proteins with carcinogenic biological function. In addition, some gene fusions can lead to activation of oncogenes (for example, R-spondin fusion proteins active the oncogenic Wnt/β-catenin signaling in colon cancer) or inactivation of tumor suppressor genes (for example, LACTB2-NCOA2 disrupts NCOA2 in colorectal cancer)6, 7.

The existence of fusion genes in cancer such as breast, lung, colon, prostate cancers and colorectal lymphoma has been confirmed in numerous researches8,9,10,11. Demichelis and colleagues found that recurrent gene fusions between the androgen-regulated gene TMPRSS2 and the ETS (E26 transformation-specific) family genes ERG, ETV1 or ETV4 are expressed in most prostate cancers12. Tomlins and colleagues found that ERG or ETV1 was markedly overexpressed in 57% prostate cancer cases, whereas overexpression was never observed across benign prostate tissue samples13. Singh and colleagues found that some GBMs (glioblastoma multiforme) involve FGFR-TACC fusion and the fusion protein displays oncogenic activity when introduced into astrocytes in the mouse brain14. The discovery of the EML4-ALK fusion in non-small-cell lung cancer, SYT-SSX4 fusion in synovial sarcoma and many other fusion events in cancers indicated that gene fusions are widespread existed in tumor types15,16,17.

To improve the sensitivity and specificity of fusion detection, and avoid the above limitations, we present a novel pipeline named GFusion, a powerful and efficient fusions detection method for both paired-end and single-end RNA-seq data, to predict fusion genes by comprehensive analyzing alignment split reads and mate reads. GFusion firstly splits unmapped reads into three segments and aligns the first and last segments against reference genome using read aligner Bowtie31. Potential fusion boundaries are confirmed by the split reads in which the segments are from different genes. Finally, GFusion filters out false fusions through a series of filtering steps including analyzing the mate reads of split reads for paired-end read, detecting spanning fragments and constructing fusion reference and realigning fusion fragments (fusion reads for single-end data). We have demonstrated the effectiveness of the GFusion on a normal breast RNA-Seq dataset as the control group and four cancer RNA-Seq datasets involving both paired-end and single-end read data. As a result, GFusion successfully detected 34 from 43 previously reported fusions in four cancer datasets. Further results illustrated that GFusion performed higher sensitivity and have lower false positive rate by comparing with other existing fusion detection pipelines.

## Methods

GFusion is a Perl based software designed to identify fusion genes from single-end or paired-end RNA-Seq read data, which are aligned against the reference genome to find out the genomic locations, followed by integrated filtering steps to determine the best gene fusion candidates (Fig. 1). GFuison involves several steps to eliminate false candidate fusions that are caused by misalignment, random pairing of transcript fragments or artificial errors. We identify fusion gene candidates from selected fragments harboring the fusion boundaries in their sequenced reads or insert segments. In the output file, the GFusion reports a list of information about the fusion result including detected fusion genes, genomic locations of breakpoints, fusion transcript model, as well as the number of supporting split fragments and spanning fragments (split reads for single-end data), that characterize fusion genes comprehensively at transcript expression level.

### Locating fusion boundaries

To locate fusion boundaries, we proposed an evaluation criterion based on the human gene annotation (GTF file) and potential split reads which harbor the fusion boundaries in the gaps. Potential split reads with alignments where two anchors locate near the approximate exon boundaries is selected and exon junction boundaries are considered to be approximate fusion boundaries. Thus, the distance between fusion boundary and anchor alignment coordination should be less than gap length. We defined Cor as an anchor alignment position (the first alignment base coordination) in reference genome. The anchors aligning to forward reference are defined as strand+, while anchors aligning to reverse-complement reference are defined as strand−. We also defined G1 as the start coordinate of an exon, and G2 as the end coordinate of an exon. The abbreviation for anchor length and gap length is expected to be al and gl separately. In addition, we added a parameter (default: p = 3), which is defined as the maximum allowing distance between anchor boundary site and its mapped exon boundary site. The most probable range of anchor location is restricted as follows:

$$\{\begin{array}{ll}-p < Cor-{G}_{1} < gl+p\,, & if\,ancho{r}_{1}strand-or\,ancho{r}_{2}strand+\\ al-p < {G}_{2}-Cor < al+gl+p, & if\,ancho{r}_{1}strand+or\,ancho{r}_{2}strand-\end{array}$$
(1)

### Confirming fusion models

We defined four fusion models to denote the fusion gene strand direction based on supporting split fragment aligning orientations and gene transcribing orientations. f f is one fusion model that upstream gene and downstream genes are both forward transcribed and splicing in the forward orientation. Similarly fused genes both having reversed strand orientations are defined as rr model. Besides, fr is another fusion model for the fusion gene which is generated by upstream gene with forward orientation and downstream gene with reversed orientation, while fusion model rf confronts fr that fusion gene with rf model concatenates the reversed strand sequence of upstream gene to the forward strand sequence of downstream gene.

### Aggregating candidate fusions

For paired-end RNA-seq data, GFusion identifies candidate fusions based on the number of split fragments and spanning fragments as an evidence of gene fusion events. Spanning fragments are special discordant pair alignments harboring fusion boundaries in the insert segments with each read aligning to the different genes. To search spanning fragments, firstly we selected the aligned reads with high mapping quality scores from initial alignment results. If the reads with common query name hit different gene exons on the side of potential fusion boundaries, the read pairs are stored as potential spanning fragments. By default, GFusion only reports fusion candidates supported by at least one split fragment and one spanning fragment to remove false positive fusions.

### Constructing fusion index and realigning

After identifying the potential fusions, we constructed the bowtie reference index using the potential fusion sequences and realign all the supporting fragments against the index for further filtering. This is the last filtering step to eliminate false positives. All sequences of fusion genes on the each side of fusion boundaries are collected and concatenated to produce the pseudo fusion reference indexes. For example, when the fusion model is fr, GFusion extracts upstream sequence from the first base to the fusion boundary base of upstream gene and extracts downstream sequence from the fusion boundary base to the last base of downstream gene using reference genome sequence file (hg19.fa). While the fusion model is rf, GFusion extracts upstream sequence from the fusion boundary base to the stopping base of downstream gene with reverse-complemented transforming and extracts downstream sequence from the fusion boundary base to the stop base of upstream gene. The upstream and downstream sequences are concatenated together to produce potential fusion reference sequences at fusion boundaries. Using fusion reference sequences, GFusion is able to build fusion bowtie index for supporting fragments realigning by Bowtie. Every candidate split fragments and spanning fragments are realigned against constructing index to recount the number of supporting fragment. As described in Step6, GFusion finally confirms fusion genes with fusion boundaries, fusion transcript model supported by at least one split fragment and one spanning fragment.

### Tools selected for comparison

The fusion detection tools that we chose to benchmark together with GFusion were the following: Tophat-Fusion, FusionMap, JAFFA, and ChimPipe. Tophat-Fusion and FusionMap were chosen because they were extensively used by the community. In addition, they were also commonly used to conduct the comparison with other subsequent fusion detection tools. JAFFA and ChimPipe were chosen because they were two latest tools for fusion discovery and showed a better performance than other state-of-the-art tools. Therefore, these four tools were finally selected for comparison.

### Tophat-Fusion and FusionMap parameters

In the Tophat-Fusion, the fusion-min-dist means that for intrachromosomal fusions, Tophat-Fusion tries to find fusions separated by at least this distance, and the num-fusion-reads means that fusions with at least this many supporting reads will be reported. In the FusionMap, α denotes the minimum end length of a seed read, β denotes the maximum hits of a read end, G denotes the non-canonical splice pattern penalty, MinimalHit denotes the minimal distinct fusion reads, and MinimalFusionSpan denotes the minimal distance (bp) between two fusion break points.

## Results

We tested GFusion on three types of RNA-Seq dataset: (1) paired-end RNA-Seq datasets: three breast cancer cell lines BT474, SKBR3, MCF7 and a normal breast cell line described by Edgren et al. and can be available from the NCBI Sequence Read Archive [SRA:SRP003186]35; (2) single-end RNA-Seq datasets: the K562 CML cell line, containing 14 million 76 bp SE reads from work by Levin et al.36; (3) simulated datasets: consisting of approximately 24 million 76 bp paired-end reads and artificial fusion reads. We mapped all reads and short fragments to the human genome (hg19) with TopHat and Bowtie, and identified the genes involved in each fusion using the human gene annotations.

### Testing on Paired-end Datasets

In three cancer breast cell lines, 37 fusion genes have been discovered in the previously research35, 37. GFusion found 34 fusion candidates containing 28 previously reported fusions showed in Table 1 (see details in Table S2). We also ran Tophat-Fusion, FusionMap, JAFFA, and ChimPipe on the same datasets to compare performance with GFusion. Tophat-Fusion found 66 fusion candidates containing 30 known fusion events and FusionMap reported 26 candidate fusion genes, 16 of which were previously known fusion genes. JAFFA found 37 fusion candidates containing 21 known fusion events, while ChimPipe reported 38 candidate fusion genes including 30 previously known fusion genes. It can be seen that Tophat-Fusion and ChimPipe achieves the highest sensitivity (81.08%) following by GFusion (75.67%). For the precision, GFusion has the highest value (82.35%) following by ChimPipe (78.94%), while the Tophat-Fusion has the lowest value (46.45%).

For the MCF7 cell lines, ChimPipe found 5 known fusion genes following by the 4 known fusions respectively detected by GFusion, Tophat-Fusion, and JAFFA. FusionMap only reported 3 known fusion transcripts (BCAS4-BCAS3, ARFGEF2-SULF2, RPS6KB1-VMP1), which were also found by other four tools. In addition, GFusion has the highest precision as it only reported two novel fusions (Table S2), while Tophat-Fusion reported other seven novel fusion genes so it’s precision is the lowest. In the two novel fusion genes detected by GFusion, the novel fusion (PAPOLA-AK7) was also found by other four programs except the ChimPipe. Figure 3 illustrates BCAS4-BCAS3 fusion supporting fragments identified by GFusion. The facts that fusion reads spanned the fusion boundary between the BCAS4 and BCAS3 genes on chromosomes 20 and 17 proved that BCAS4 and BCAS3 gene fused and expressed in the MCF7 breast cancer cell line.

For the BT474 cell line, GFusion detected the 17 fusions out of 21 previously reported fusion transcripts, and predicted only one novel fusion transcript (Table S2). FusionMap found 9 reported fusion genes and predicted 3 novel fusion transcripts. Algthough TopHat-Fusion found 19 known fusions, it predicted 24 novel fusions. JAFFA detected 12 known and 6 novel fusions, while ChimPipe reported 18 known and 4 novel fusions. The above results shows GFusion achieves the highest precision, and its sensitivity differs by only about 1% from that of Tophat-Fusion. Furthermore, GFusion missed one known fusion (LAMP1-MCF2L) because this fusion was supported by only two low mapping quality split fragments and it was filtered out in our initial filtering step. Figure 4 shows MED1-ACSF2 fusion supporting fragments identified by GFusion, these reads clearly span the intra-chromosomal (chromosome 17) fusion boundary of the two genes. The upstream fusion gene MED1 is reversed transcribed, while downstream fusion gene ACSF2 is forward transcribed.

In the SKBR3 breast cancer cell line, GFusion, Tophat-Fusion and ChimPipe all found the same seven fusions out of ten previously reported fusion genes and predicted three (Table S2), five and one novel fusions respectively. While FusionMap missed six known fusions and found just four previously reported fusion transcripts and predicted four novel fusions. JAFFA reported five known fusions and five novel fusions. Among the fusion genes missed by GFusion, one known fusion, CSE1L-ENSG00000236127, was not found because ENSG00000236127 annotation has been removed from the recent Ensembl database. NFS1-PREX1 was filtered out because fusion supporting fragments had low mapping quality. However in the previously researches35, the evidences of fusion NFS1-PREX1 was not enough, as only a short segment of NFS1 was included in the fusion. Figure 5 shows the fusion supporting fragments for one of four fusion isoforms of TATDN1-GSDMB with strong alignment evidence that found by GFusion in SKBR3 cells. The upstream fusion gene TATDN1 and downstream fusion gene GSDMB are both reversed transcribed.

In fusion genes detection, the biggest challenge is to reduce the huge number of false positives resulted from error of sequencing and alignment. In order to decrease the number of false fusions, we have designed a series of strict filtering algorithms during fusion genes detection process. In order to estimate the false positive rate of three pipelines, we ran them on RNA-Seq datasets of normal breast cell line. GFusion and FusionMap both predicted just one novel fusion transcript. JAFFA and ChimPipe both predicted three novel fusion transcripts, and Tophat-Fusion predicted 4 novel fusions. These results indicate that GFusion had far fewer false positives than Tophat-Fusion, JAFFA, and ChimPipe.

In addition, it should be noted that the running time of GFusion was about 30 minutes through × 64 eight-core computer and six running threads and the majority of running time was spent on the read aligning. Using the same computer and threads, Tophat-fusion took an hour and 40 minutes, ChimPipe used an hour and 34 minutes, and FusionMap was the fast one for fusion analysis based on the fast alignment program GSNAP. JAFFA took five hours and 13 minutes with 12 threads.

In conclusion, GFusion can achieve high sensitivity and precision, and should be a reliable tool for fusion gene detection because of a series of filtering steps. Unlike Tophat-Fusion and FusionMap as well as other fusion detection tools, one of powerful features in GFusion is that it focuses more on the mapped mate reads and split reads.

### Testing on Single-end Dataset

Since recent advances in NGS platforms has resulted in vast increases in coverage and read length, gene fusion boundaries are now adequately represented by single-end reads. Fusion detection using single-end reads may be more powerful than the paired-end reads when the discordant read pairs or the short insert segment is used in detection algorithm.

We further applied GFusion on the published K562 CML cell line RNA-seq data containing 14 million 76 bp single-end reads. Two main high expression level gene fusions, BCR-ABL1 and NUP214-XKR3 already validated before were detected by GFusion19, 36. In the result, 281 split reads were found to bridge the fusion boundary of the fusion from upstream gene BCR at chr22:23632600 to downstream gene ABL1 at chr9:133729451 at the transcription level. As showed in Fig. 6, split reads are abundant for the fusion boundary between the chromosome 22 and 9, illustrating the fusion boundary precision achieved by GFusion even on single-end datasets.

In close proximity to BCR and ABL1, another gene fusion between NUP214 and XKR3 resided on chromosome 22 and 9 respectively was also detected. NUP214 have been known to be related in T-cell acute lymphoblastic leukemia38. One isoform of fusion NUP214 and XKR3 was identified by GFusion to be supported by 60 split reads connecting the end of exon 29 in NUP214 (chr9: 134074402) to the start of exon 2 in XKR3 (chr22:17288973). Another two additional isoform of fusion NUP214 and XKR3: exon 29 of NUP214 fused to exon 3 of XKR3, and exon 29 of NUP214 fused to exon 4 of XKR3 only supported by 2 and 8 split reads separately, were also detected by GFusion. The anchors orientation of the split reads detected by GFusion shows that fusion strand is forward to reverse. The fourth isoform of fusion NUP214-XKR3 involving exon 27 of NUP214 and exon 2 of XKR3 could also be found when we reduced the anchor length in GFusion. All of these detected fusion isoforms are in agreement with previous results36. In addition to BCR-ABL1 and NUP214-XKR3, GFusion identified all the other previously reported fusion genes, SNHG3-PICALM, PRIM1-NACA, NCKIPSD-CELSR3, SLC29A1-HSP90AB1. Among these six fusion genes, the two genes of fusion PRIM1-NACA, NCKIPSD-CELSR3, and SLC29A1-HSP90AB1 are close neighbors on the genome separately, and are likely to be fusion transcripts caused by read-through events39. The inner distances between two fused points on the genome are respectively 19 Kb in PRIM1-NACA, 21 Kb in NCKIPSD-CELSR3 and 15 Kb in SLC29A1-HSP90AB1.

To compare the performance of GFusion with FusionMap and Tophat-Fusion, we comprehensively analyzed their sensitivity, false positive rate as well as running time on the same dataset. As described in above results, GFusion detected the six fusion genes that are all previously reported by Levin et al.36 and predicted other 28 novel fusions (Table S3). Tophat-Fusion, by comparison, reported just three known fusion genes and 71 novel fusion genes with the parameters: fusion-min-dist = 10000, num-fusion-reads = 4, and missed three fusion PRIM1-NACA, NCKIPSD-CELSR3, and SLC29A1-HSP90AB1. Tophat-Fusion appears not to consider those read-through transcripts to be gene fusions. FusionMap detected five known fusions and predicted 221 novel fusion genes with the default parameters except: α = 25, β = 1, G = 2, MinimalHit = 2, MinimalFusionSpan = 10000 (Table S4)22. In addition, using the same computer as above-mentioned, GFusion took an hour that was five times faster than TopHat-Fusion. FusionMap is the fastest one which took only 10 minutes. Obviously, the problem concern of FusionMap is its sensitivity and lower false positive rate. GFusion is a superior and powerful tool for fusion gene detection using single-end reads that it shows greater sensitivity and has lower false discovery rate than Tophat-Fusion and FusionMap.

### Testing on Simulated Dataset

To further compare the performance of GFusion with the existing methods, we generated a simulation dataset that consists of a normal background data and artificial fusions. We used paired-end RNA-Seq data of human embryonic stem cells as the normal background dataset in which it was not expected to harbor any fusions. The background data used in our work can be downloaded from NCBI Sequence Read Archive (SRA: SRR521477) with approximately 24 million 76 bp paired-end reads generated by WiCell Research Institute. We firstly bridged sequences from exons of two different genes and obtained 42 artificial fusion transcripts. Secondly, the fusion supporting reads were generated at random to represent the expression level of artificial gene fusions and their insert segment length distribution was similar to the normal paired-end reads. Finally, the simulate dataset was produced by mixing the fusion supporting reads with normal background reads.

After that, we ran GFusion, Tophat-Fusion, FusionMap, JAFFA, and ChimPipe on the simulated dataset, and counted the number of true and false supporting fusion reads to calculate sensitivity and false positive rate of three methods separately. It can be seen from Table 2 and Fig. 7A that GFusion had higher sensitivity and lower false positive rate than other four tools in the fusion detection on the simulated dataset. GFusion achieved a high sensitivity of 88% by identifying 37 out of 42 true fusion genes as well as just predicting seven false fusion genes with parameters: anchor length = 30, max intron length = 5000. Tophat-Fusion found 35 fusion genes with 83% sensitivity and reported 50 false fusion genes with parameters (fusion-min-dist = 5000, num-fusion-reads = 1). FusionMap reported 22 of 42 true fusion genes and 160 false fusions with parameters α = 30, β = 1, G = 2, MinimalHit = 1, MinimalFusionSpan = 5000. JAFFA detected 22 true fusion genes and 77 false fusions with the direct mode, and the hybrid mode was not used because this mode is very time-consuming. ChimPipe has the lowest sensitivity of 21% with the default parameters.

As shown in Table 2, the number of false positives detected by FusionMap were heavily more than that detected by other tools, especially comparing with GFusion. In addition, only 4 fusions were detected in common by all of the five methods as shown in Fig. 7B, and 12 fusions were detected at least four tools. 8 fusions were only found by GFusion, while only 1 fusions was found by other four tools but GFusion. This result also shows that these five tools are really different from each other in the detection of fusion genes.

For fusion detection, the number of supporting reads is another important indicator to confirm the fusions from candidate fusion genes. We calculated the Positive Predictive Values (PPV) with the number of supporting reads to estimate the fusion genes detection ability for these three pipelines.

$${\rm{PPV}}=\frac{TP}{TP+FP}\,$$
(2)

where TP means the number of true fusions; FP means the number of false fusion gene.

GFusion reported the most number of true fusions and the fewest false fusions when supporting fragments number is greater than one. Considering the split fragments containing the fusion boundary in the reads (seed reads of FusionMap, spanning reads of Tophat-Fusion and JAFFA, nbSpanningReads of ChimPipe) (Fig. 8A), GFusion and JAFFA had higher PPV than other three tools. In addition, if we selected three split fragments as a minimum threshold to support the fusions, GFusion and JAFFA would filter out all the false fusion genes successfully. In contrast, Topaht-fusion, FussionMap and ChimPipe still predicted many false fusions. Furthermore, Tophat-fusion and ChimPipe could not filter out all the false fusion genes until the minimum supporting split fragments is set to be 17. By analyzing all the supporting split fragments and spanning fragments (rescued reads denoted in FusionMap, spanning pairs denoted in Tophat-Fusion, nbTotal denoted in ChimPipe) (Fig. 8B), the fusions reported by GFusion and Tophat-Fusion have more supporting fragments than FusionMap and ChimPipe. In addition, compared with Tophat-Fusion, GFusion has a higher PPV value which always kept stable greater than 80% as the supporting fragments increase, indicating that GFusion are more reliable in performance than Tophat-Fusion and FusionMap.

The distribution of supporting fragments for fusion genes can also be seen in Fig. 9, the number of false fusions supporting fragments reported by GFusion is relatively lower than the true fusions, as well as the ChimPipe. However in Tophat-fusion and FusionMap, the distribution of supporting fragments for true fusions is similar to the false fusions, and it is hard for these methods to reduce the false positive rate by setting the threshold of supporting fragments number.

## Conclusions

GFusion is an effective and reliable Perl script tool for fusion gene detection on both single-end and paired-end RNA-Seq data. GFusion can achieve highly sensitive and fewer false positive rate using split-realign protocols: splitting reads, realigning anchors, locating fusion boundaries and constructing fusion transcript reference. In order to remove the false fusion genes and keep the higher sensitivity, GFusion designed several strict filtering steps including locating the reads position, confirming transcribing orientations, constructing fusion reference, and realigning candidate fusion fragments. Compared with discordant pairs in which each reads align against different genes or individual unmapped reads harboring the fusion boundaries, GFusion focuses more on the mapped mate reads and split reads. GFusion requires the anchor length more than 20 bp to minimize impact of short sequence multi-mapping.

We have tested the performance of the GFusion on published RNA-Seq datasets with known fusion genes involving paired-end and single-end read dataset as well as a simulated dataset. Both of these results indicated that GFusion performed higher sensitivity and far lower false positive rate compared with other existing fusion detection methods.

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## References

1. 1.

Edwards, P. A. Fusion genes and chromosome translocations in the common epithelial cancers. The Journal of pathology 220, 244–254 (2010).

2. 2.

Mitelman, F., Johansson, B. & Mertens, F. The impact of translocations and gene fusions on cancer causation. Nature Reviews Cancer 7, 233–245 (2007).

3. 3.

Nowell, P. C. & Hungerford, D. A. Chromosome studies on normal and leukemic human leukocytes. Journal of the National Cancer Institute 25, 85–109 (1960).

4. 4.

Shtivelman, E., Lifshitz, B., Gale, R. P. & Canaani, E. Fused transcript of abl and bcr genes in chronic myelogenous leukaemia. Nature 315, 550–554 (1985).

5. 5.

Ren, R. Mechanisms of BCR–ABL in the pathogenesis of chronic myelogenous leukaemia. Nature Reviews Cancer 5, 172–183 (2005).

6. 6.

Seshagiri, S. et al. Recurrent R-spondin fusions in colon cancer. Nature 488, 660–664 (2012).

7. 7.

Yu, J. et al. Disruption of NCOA2 by recurrent fusion with LACTB2 in colorectal cancer. Oncogene 35, 187–195 (2016).

8. 8.

Sakugawa, S. T. et al. API2-MALT1 fusion gene in colorectal lymphoma. Modern pathology 16, 1232–1241 (2003).

9. 9.

Jun, H. J. et al. The oncogenic lung cancer fusion kinase CD74-ROS activates a novel invasiveness pathway through E-Syt1 phosphorylation. Cancer research 72, 3764–3774 (2012).

10. 10.

Li, Z. et al. ETV6-NTRK3 fusion oncogene initiates breast cancer from committed mammary progenitors via activation of AP1 complex. Cancer cell 12, 542–558 (2007).

11. 11.

Karlsson, J. et al. Activation of human telomerase reverse transcriptase through gene fusion in clear cell sarcoma of the kidney. Cancer letters 357, 498–501 (2015).

12. 12.

Demichelis, F. et al. TMPRSS2: ERG gene fusion associated with lethal prostate cancer in a watchful waiting cohort. Oncogene 26, 4596–4599 (2007).

13. 13.

Tomlins, S. A. et al. Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. science 310, 644–648 (2005).

14. 14.

Singh, D. et al. Transforming fusions of FGFR and TACC genes in human glioblastoma. Science 337, 1231–1235 (2012).

15. 15.

Soda, M. et al. Identification of the transforming EML4–ALK fusion gene in non-small-cell lung cancer. Nature 448, 561–566 (2007).

16. 16.

Skytting, B. et al. A novel fusion gene, SYT-SSX4, in synovial sarcoma. Journal of the National Cancer Institute 91, 974–975 (1999).

17. 17.

Panagopoulos, I., Gorunova, L., Bjerkehagen, B., Boye, K. & Heim, S. Chromosome aberrations and HEY1-NCOA2 fusion gene in a mesenchymal chondrosarcoma. Oncology reports 32, 40–44 (2014).

18. 18.

Wang, Z., Gerstein, M. & Snyder, M. RNA-Seq: a revolutionary tool for transcriptomics. Nature reviews genetics 10, 57–63 (2009).

19. 19.

Maher, C. A. et al. Transcriptome sequencing to detect gene fusions in cancer. Nature 458, 97–101 (2009).

20. 20.

Shah, N. et al. Exploration of the gene fusion landscape of glioblastoma using transcriptome sequencing and copy number data. BMC genomics 14, 818 (2013).

21. 21.

Sboner, A. et al. FusionSeq: a modular framework for finding gene fusions by analyzing paired-end RNA-sequencing data. Genome biology 11, R104 (2010).

22. 22.

Ge, H. et al. FusionMap: detecting fusion genes from next-generation sequencing data at base-pair resolution. Bioinformatics 27, 1922–1928 (2011).

23. 23.

Kim, D. & Salzberg, S. L. TopHat-Fusion: an algorithm for discovery of novel fusion transcripts. Genome biology 12, R72 (2011).

24. 24.

Torres-García, W. et al. PRADA: pipeline for RNA sequencing data analysis. Bioinformatics 30, 2224–2226 (2014).

25. 25.

Jia, W. et al. SOAPfuse: an algorithm for identifying fusion transcripts from paired-end RNA-Seq data. Genome biology 14, R12 (2013).

26. 26.

Nicorici, D. et al. FusionCatcher - a tool for finding somatic fusion genes in paired-end RNA-sequencing data. bioRxiv, 011650 (2014).

27. 27.

Davidson, N. M., Majewski, I. J. & Oshlack, A. JAFFA: High sensitivity transcriptome-focused fusion gene detection. Genome medicine 7, 43 (2015).

28. 28.

Rodríguez-Martín, B. et al. ChimPipe: accurate detection of fusion genes and transcription-induced chimeras from RNA-seq data. BMC Genomics 18, 7 (2017).

29. 29.

Liu, S. et al. Comprehensive evaluation of fusion transcript detection algorithms and a meta-caller to combine top performing methods in paired-end RNA-seq data. Nucleic acids research 44, e47–e47 (2016).

30. 30.

Kumar, S., Vo, A. D., Qin, F. & Li, H. Comparative assessment of methods for the fusion transcripts detection from RNA-Seq data. Scientific reports 6 (2016).

31. 31.

Langmead, B. Aligning short sequencing reads with Bowtie. Current protocols in bioinformatics, 11.17. 11–11.17. 14 (2010).

32. 32.

Trapnell, C., Pachter, L. & Salzberg, S. L. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25, 1105–1111 (2009).

33. 33.

Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

34. 34.

Ameur, A., Wetterbom, A., Feuk, L. & Gyllensten, U. Global and unbiased detection of splice junctions from RNA-seq data. Genome biology 11, R34 (2010).

35. 35.

Edgren, H. et al. Identification of fusion genes in breast cancer by paired-end RNA-sequencing. Genome biology 12, R6 (2011).

36. 36.

Levin, J. Z. et al. Targeted next-generation sequencing of a cancer transcriptome enhances detection of sequence variants and novel fusion transcripts. Genome biology 10, R115 (2009).

37. 37.

Kangaspeska, S. et al. Reanalysis of RNA-sequencing data reveals several additional fusion genes with multiple isoforms. PloS one 7, e48745 (2012).

38. 38.

Graux, C. et al. Fusion of NUP214 to ABL1 on amplified episomes in T-cell acute lymphoblastic leukemia. Nature genetics 36, 1084 (2004).

39. 39.

Varley, K. E. et al. Recurrent read-through fusion transcripts in breast cancer. Breast cancer research and treatment 146, 287–297 (2014).

## Acknowledgements

The study was supported by grants from the National Natural Science Foundation of China (No. 61571223, 61171191, X.S; No. 81270700, P.H), the Specialized Research Fund for the Doctoral Program of Higher Education (No. 20133218110016, X.S), the “Six Talent Peak” Project of Jiangsu Province under Grant No. SWYY-021(X.S).

## Author information

### Affiliations

1. #### Department of Biomedical Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, 210016, China

• Jian Zhao
• , Qi Chen
• , Jing Wu
•  & Xiaofeng Song

• Ping Han

### Contributions

J.Z., Q.C. and J.W. implemented the study and wrote the manuscript. X.S. and P.H. conceived study and revised the manuscript. All authors read and approved the final manuscript.

### Competing Interests

The authors declare that they have no competing interests.

### Corresponding authors

Correspondence to Ping Han or Xiaofeng Song.