Fusion transcripts in normal human cortex increase with age and show distinct genomic features for single cells and tissues

Fusion transcripts can contribute to diversity of molecular networks in the human cortex. In this study, we explored the occurrence of fusion transcripts in normal human cortex along with single neurons and astrocytes. We identified 1305 non-redundant fusion events from 388 transcriptomes representing 59 human cortices and 329 single cells. Our results indicate while the majority of fusion transcripts in human cortex are intra-chromosomal (85%), events found in single neurons and astrocytes were primarily inter-chromosomal (80%). The number of fusions in single neurons was significantly higher than that in single astrocytes (p < 0.05), indicating fusion as a possible contributor towards transcriptome diversity in neuronal cells. The identified fusions were largely private and 4 specific recurring events were found both in cortex and in single neurons but not in astrocytes. We found a significant increase in the number of fusion transcripts in human brain with increasing age both in single cells and whole cortex (p < 0.0005 and < 0.005, respectively). This is likely one of the many possible contributors for the inherent plasticity of the adult brain. The fusion transcripts in fetal brain were enriched for genes for long-term depression; while those in adult brain involved genes enriched for long-term potentiation pathways. Our findings demonstrate fusion transcripts are naturally occurring phenomenon spanning across the health-disease continuum, and likely contribute to the diverse molecular network of human brain.


Results
Fusion transcripts in frontal cortex show a bias for intra-chromosomal events. We analyzed 59 whole-transcriptome datasets comprising of 49 prefrontal cortices (public domain) and 10 frontal cortices (sequenced in-house). By analyzing ~4.2 billion sequencing reads we identified 88 fusion transcripts representing 38 non-redundant events (Supplementary Table S1).
We observed an enrichment of intra-chromosomal fusions amongst these events (87%, 33/38; Fig. 1a). Almost half of the events (47%, 18/38) were found at least twice, and 5 of them were in minimum 4 samples ( Table 1). Analysis of the sequence context around the identified break-points revealed, 82% (31/38) of the events were fused at the exon boundaries from either one or both parental genes (Fig. 1b), and 91% (28/31) of them retained the canonical GT-AG splice sites (Fig. 1c). This indicated that a majority of the fusion transcripts, identified through our analysis, was formed by the classical splicing machinery -albeit joining fragments from 2 different mRNAs.
We used the available frontal cortex (FC) samples for experimental validation of the identified fusions. Based on the recurrence and supporting reads, 3 selected fusion events were validated using fusion specific PCR followed by Sanger sequencing. They include, (i) two recurring events, namely, KANSL1-ARL17A ( Fig. 2a) (also see later) and RP11-572M11.1-C3ORF17 (Fig. 2b) along with (ii) a private event: MTOR-UBIAD1 (Fig. 2c). MTOR-UBIAD1 was further quantified by qPCR and was found to have lower expression (~0.3 fold) with respect to MTOR -one of the parent genes ( Supplementary Fig. S1).

Fusion transcripts in single cells are rich in inter-chromosomal events.
We used single cell transcriptome data from human brain to resolve the tissue level data with increased resolution. For this we considered publically available transcriptomic libraries from 131 neurons and 62 astrocytes collected from 7 different healthy adult brains. Upon analyzing 500 million reads (minimum 1 million reads per cell, Supplementary Fig. S2), a total of 912 non-redundant fusion events were identified (Supplementary Tables S2 and S3).
Interestingly, fusion transcripts identified in single cells demonstrated a clear bias for inter-chromosomal events with 82% (749/912) fusions involving genes from two different chromosomes (Fig. 3a) -which was in contrast to our findings for the cortex samples. Fusion transcripts from single cells revealed a smaller fraction that harbor exonic boundaries from either of the two partner genes (28%, 251/911) and only ~33% (82/251) of them harbor the canonical GT-AG site ( Supplementary Fig. S3).
Three recurring fusions were identified in single neurons which were also found to recur in cortex samples ( Supplementary Fig. S4). In contrast, single astrocytes had all private events and none of them were common with the events found in the cortex samples. Surprisingly, we found significantly higher (one tailed Wilcoxon p value < 0.005) number of fusions in neurons when compared to the astrocytes (Fig. 3b). Simultaneously, expression analysis revealed that 87% of the genes harboring fusion in neurons, have a considerable expression (FPKM > 1) in astrocytes ( Supplementary  Fig. S5A) and only 27% (462/1726) of them were differentially expressed. The expression analysis indicates the difference in fusion load is not due to absence of the expression of genes that are involved in fusion. Adult brain harbor more fusion transcripts than fetal brain. To ascertain any association between the fusion burden and aging brain we considered another data set encompassing brain transcriptomes from two distinct age groups. Towards this we analyzed a total of ~2 billion RNA sequencing reads from Dorso Lateral Frontal Cortices (DLFC) derived from 16 adult and 8 newborns together with 131 and 110 single neurons derived from 7 adult and 5 fetal brains respectively. We identified 63 unique fusions in 16 adult brains while 10 fusions in fetal brain. Similarly, 724 fusion events were observed in adult neurons and 302 in fetal neurons (Supplementary  Tables S3-S6).
Strikingly, fusion transcripts were found to be higher (~2 times) in adult brain compared to the fetal brain, be it a cell or a tissue (one tailed Wilcoxon p value < 0.05) (Fig. 4a,b). Concurrently, expression analysis demonstrated that 88% of the genes harboring fusion in adults also have a considerable expression (FPKM > 1) in fetal brain ( Supplementary Fig. S5B,C), with only 17% (287/1726) of them were differentially expressed between them.
To know the functional significance of these genes that undergo fusion, we also performed their enrichment analysis using GSEA. Genes harboring fusion demonstrated a significant enrichment for the long-term potentiation pathway in adults while in fetal there was a significant representation for long-term depression pathway (both with p value < 0.00001) (Fig. 4c,d).

Translational ability of fusion transcripts.
To test the translational ability of these fusion events we performed in-silico translation around the junction by translating them into six reading frames. These putative fusion peptides were first subjected to in-silico digestion with trypsin followed by MS identification. We used Topology of the chromosomes inside a cell and fusion transcripts. To test if fusions can occur between spatially proximal regions of the genome, we also performed an unbiased chromatin interaction mapping. To test the same, we used ENCODE Hi-C data from 10 different human cell lines. This set of analysis led us to identify 32190 significant genome wide chromatin interactions (Z score > 1.96) reported in at least 2 cell lines at a 40 kb resolution ( Fig. 5a and Supplementary Table S8). This set of analysis demonstrated frequent cis-chromatin interactions within chr 17 and maximum number of trans-interactions between chr 1 and chr 16 (Fig. 5b). By comparing fusion breakpoints with the Hi-C interaction maps of three-dimensional chromosome conformation led us to detect nine fusion partners located in broad chromatin domains that are spatially proximal in normal cell nuclei (Table 3). These events involve an interesting inter-chromosomal ENSG00000227733-HYDIN fusion involving ENSG00000227733 and HYDIN genes from chromosome 1 and 16 respectively. ENSG00000227733-HYDIN fusion was one of the recurring events identified in both single neurons as well as in brain tissues indicating its neuronal preference. Co-occurrence of Hi-C interaction map around its breakpoint suggests spatial proximity could be an important trigger to facilitate this fusion between the two distantly located genomic loci. KANSL1-ARL17A transcript and its status in GBM. KANSL1-ARL17 fusion was one of the most recurring events comprising of first three exons from KANSL1 and last three exons from ARL17. It was identified in ~20% (12/59) of normal brain tissues (Supplementary Table S1). CNV signature around its junction from our analysis signifies its genomic origin and it is also a well-documented DNA level event in multiple cancers (Table 4).
To test the status of this particular fusion in disease condition we used a most malignant brain tumor i.e. Glioblastoma multiforme (GBM). By performing junction specific PCR, we validated the same event in 7 out of www.nature.com/scientificreports www.nature.com/scientificreports/ 9 (78%) GBM samples and further confirmed it by Sanger sequencing (Fig. 6a and Supplementary Fig. S6). The large difference in frequency of this fusion event in GBM and normal brain led us to investigate the 17q21.31 locus, which harbours both the parent genes (KANSL1 and ARL17).
17q21.31 is a cryptic locus and known to have two different haplotypes along with their subtypes 21 . H1β duplication is the only subtype capable to produce KANSL1-ARL17 fusion. We confirmed H1β duplication in our in-house normal brain samples which were observed to have an allele frequency of 70%, (7/10) as it was detected in the case of GBM samples (Fig. 6b). These results highlight the prominence of H1β duplication in south Asian population with an allele frequency close to 70%.
The expression of the KANSL1-ARL17 fusion transcript and its parent genes were also assessed with Real-Time PCR between GBM and our in-house normal brain tissues. Interestingly, KANSL1-ARL17 fusion transcript was found to be down regulated (5 ×) in GBM while it was up-regulated (5 ×) to the same extent in normal brain when compared to its parent gene (KANSL1) (Fig. 6c,d). These results indicate fusion product might have different function compared to its parent genes. Likewise, it's not the presence rather the expression levels of the KANSL1-ARL17 fusion can be linked with cancer etiology.

Discussion
The life span and 'fitness' of an organism is the sum of deleterious changes and counteracting repair and maintenance mechanisms that trigger in response to stimuli 22 . This is not only true for whole organisms but also true at the levels of tissues, organs and individual cells. Recent surge of next generation sequencing data has revealed that in the human brain this diversity is achieved and maintained not only by the inherited set of variations but also by the variability introduced de novo at the somatic level for both DNA and RNA [23][24][25] . Fusion transcripts are one of many possible ways to create and maintain the much needed diversity -especially in post-mitotic cells like human neurons. Fusion transcripts are now widely reported in the literature albeit with a bias towards their presence and function in diseases -especially for their oncogenic role in human cancers. However, these transcripts can have a big influence on the normal phenotypic outcome. It has been shown for fusion transcripts that the fusion event when created at the RNA level provide a growth advantage while the same event resulting from a DNA level translocation may lead to cancer 13 . The underlying explanation for this observation is likely due to the obligatory (and hence more) expression of the fusion RNA when it is formed at the DNA level. This suggests a functional continuum of the fusion transcripts, where the same events in moderation can provide an advantage while an excess is detrimental. The spectrum of fusion transcripts in normal human tissues and to what extent they are functionally relevant is an unanswered question. We embarked on this study in an attempt to address this issue focused to the normal human brain, particularly the cortex.
Our analysis showed the organization and distribution of the fusion transcripts are unique and distinct between the tissue and the single cells. At the tissue level we found more recurrence and majority of the events formed between two genes residing on the same chromosome. Notably, our analysis has identified KANSL1-ARL17 fusion as a recurring event in normal human cortex. Contrasting expression pattern between the KANSL1-ARL17 fusion and its parent gene, KANSL1 in GBM compared to the normal brain warrants further investigation, as it suggests their potential role in cancer.
On the other hand, the spectrum of fusion transcripts occurring from the single neurons and astrocytes show minimal or no recurrence and the vast majority formed by fusion of two transcripts coming from different chromosomes. Recurrence is an important factor to estimate their abundance. In our case, majority of our events were non-recurring between cells which might signify their low abundance or they may be false positive generated by template switching of reverse transcriptase during RT-PCR 26,27 or possibly a mix of both. Further, a complete lack of redundancy in the single astrocytes was striking and cannot be explained. Further research on the same source of tissue and single cells might be able to shed light on these observations. We found significantly higher number of fusions in neurons compared to the astrocytes. This observation, together with a considerably higher number of fusions in adult brain compared to the newborns -indicate a role of fusion transcripts in maintaining required diversity in neurons with restricted regenerative capabilities and age-dependent decline in turnover rate of the cells 28,29 . Genes harboring fusion demonstrated a significant www.nature.com/scientificreports www.nature.com/scientificreports/ enrichment for long-term potentiation (LTP) and long-term depression (LTD) pathways in adult and fetal brain respectively. Both of these pathways in conjunction provide neuronal synaptic plasticity that undergo age-related alterations [30][31][32][33] . Results from our analysis suggest aging leading to a substantial fusion load, which might affect the neuronal synaptic function. Evidence for their corresponding junction peptides further support their biological relevance. Later on we also identified the inter-chromosomal proximity between ENSG00000227733 and HYDIN genes from chromosome 1 and 16, respectively. Co-occurrences of chromatin interaction maps with these fusion breakpoints suggest spatial proximity can be one of the possible triggers.
In summary, our genome-wide analysis establish fusion transcripts are naturally occurring phenomenon that span the health-disease continuum of the human brain.

Materials and Methods
Sample collection. Frontal cortexes (Grey matter) were procured from post-mortem samples of road accident victims collected from NIMHANS Brain Bank, Bangalore, India. GBM samples were obtained from AIIMS, New Delhi, India. The samples were collected according to the Helsinki Declaration and the ethical review board of All India Institute of Medical Sciences, Delhi, India approved the project. Diagnosis and grading of tumor samples were done as per 2007 WHO classification. The details of the in-house samples used in the study are provided in Supplemental Table S9 and S10. Informed consent were obtained from the GBM patients or from their family members (next of kin). For the frontal cortex (post-mortem) samples obtained from the brain bank, these samples were already covered under the ethical approval obtained by the Brain bank.  Similarly, bulk tissue from adult brain (red) also demonstrated higher number of fusions compared to fetal (blue) (one tailed Wilcoxon p value < 10 −3 is marked by **) where horizontal axis denotes brain samples while vertical axis represents number of fusion transcripts normalized with total data generated in a samples. Pathway enrichment analysis of genes harboring fusions has revealed enrichment for genes implicated in processes involved in synaptic plasticity. In the figure the horizontal axis shows the negative logarithm of FDR corrected p-value while vertical axis has different biological processes. Top biological processes that were significantly enriched in adult brain (c) and fetal brain (d) (both from tissue and single cells). www.nature.com/scientificreports www.nature.com/scientificreports/ DNA isolation, library preparation and exome sequencing. DNA isolation was done by using Omniprep Genomic DNA isolation kit (G-Biosciences, USA) as per manufacturer's protocol. Exome capture was done using Illumina TruSeq Exome capture kit. 100 base pair paired end sequencing was done using Illumina HiSeq. 2000 (Ilumina, USA). The exome sequence data is also deposited at the sequence read archive (SRA ID: SRP045655).

Identification of candidate fusion transcripts.
Raw data was checked for per base quality score and reads having 80% bases with phred quality score 30 and greater were only be considered for downstream analysis. Fusion transcripts were identified using the published pipeline 34 with the default parameters. Briefly, the quality filtered reads were aligned using Tophat (version 2.0.5) against transcriptome (UCSC hg19 annotations) and genome (hg19) both with 2 mismatches. Discordant reads were used to identify potential fusion candidates using its fusion-search module. Tophat-fusion-post was further used to filter out events supported by minimum 5 supporting reads. The software and detailed description is available at < https://ccb.jhu.edu/software/tophat/ fusion_tutorial.shtml >. The same pipeline was used for both in-house and publicly available dataset for calling fusion transcripts.

Fusion transcripts associated with CNVs.
To determine if the origin of the detected fusions were genomic, we explored signatures for copy number variation around fusion breakpoint Genome-wide CNV calling was performed using two different platforms: exome sequencing and illumina's 660 quad microarray for 4 out of 10 FC samples (Supplementary Table S9).
Exome sequencing data analysis and CNV calling. Raw data from exome sequencing was checked for per base quality score and reads having 80% bases with phred quality score 30 and greater were carried forward for downstream analysis and rest were discarded. Filtered reads were aligned to the reference genome (hg19)  Genome-wide genotyping and CNV calling. Isolation of the genomic DNA was done using a standard salting-out procedure to perform genome-wide genotyping by using the Infinium Human660W-quad BeadChip method (Illumina, Inc., San Diego, CA, USA). We used 200 ng of genomic DNA for each sample, in accordance with the manufacturer's guidelines. The raw data files were processed by the GenomeStudio software package. To call CNVs, we used the PennCNV algorithm and applied a stringent criterion of at least 10 consecutive probes to show altered intensity to qualify as a CNV call. We used a threshold of 0.35 for the standard deviation for logR ratio of normalized intensity (LRR) and a threshold of 0.05 for the standard deviation for B allele frequency as explained in our earlier study.
In order to confirm their genomic evidences for fusion events, we overlaid CNV coordinates obtained from above analysis with fusion breakpoints ( ± 500 Kb).     . PCR products (307 bp) were run on 2% gel with 100 bp ladder. Relative expression of fusion transcript along with its parent genes was checked in GBM compared to controls. We performed real time PCR using (c) KANSL1-ARL17 fusion, (d) KANSL1 primers with cDNAs from 6 in-house normal brains along with 6 GBM samples. In both the boxplots, vertical axis represents Log 2 ΔCt calculated using the expression of B2M. We also performed the same set of experiment using fusion specific primers for ARL17 and depicted in the Supplementary Fig. S8. Strategy used to design the fusion specific primer is depicted in the Supplementary Fig. S7 www.nature.com/scientificreports www.nature.com/scientificreports/ to check if there is any association of chromosomal proximity with fusion transcripts, we overlaid the identified chromatin map with these fusion breakpoints.
In-silico translation and proteomics data analysis. To check the translational evidence for identified fusions, MS/MS Spectral datasets from human cortex were considered from PRIDE projects PXD000263, PXD004076, PXD004987, PXD005629 and PXD002775. These were downloaded from the PRIDE repository and were searched against the six frame translated database using OMSSA 39 and X!Tandem 40,41 in integrated transcriptomic-proteomic pipeline EuGenoSuite 42 .
We performed in-silico trypsin digestion for our probable peptides with one miss-matched site. Further parameters involve 20 ppm precursor ion tolerance; 0.5 Da product ion tolerance; carbamidomethylation of cysteine as fixed modification; and oxidation of methionine and peptide N-terminal acetylation were used as variable modifications. Stringent FDR threshold of ≤ 1% was applied to the resulting PSMs and junction specific peptides were identified using a customized Perl script. Reading frames for these filtered hits were further confirmed by again mapping them with human proteome (RefSeq GRCh37.p13) using stand-alone BLAST tool 43 . Gene and pathway enrichment analysis. To understand the functional significance of genes harboring fusions, pathway analysis was performed using Gene Set Enrichment Analysis (GSEA) 44,45 . We applied gene ontology enrichment analysis using KEGG pathways to generate gene clusters enriching similar biological processes. Gene cluster and their corresponding pathways with highest enrichment score and FDR q-value < 0.05 were only considered for literature mining.
Statistical analysis. Differential numbers of fusion transcripts between groups of samples were screened out using Wilcoxon rank-sum test. P < 0.05 was considered as statistically significant. Inter-individual differences between cells and tissues were quantified by using Principal Component Analyss and were performed in R platform (http://www.rproject.org) using rlog-transformed read count from whole transcriptome.
Validation and quantification. cDNA conversion of the extracted RNA (1 micro-gram) was done using High-Capacity cDNA Reverse transcription kit (Thermo Fischer Scientific) as per manufacturer's protocol in a reaction volume of 20 µl. To target fusion junction we designed fusion specific primers by using Primer3 software (http://frodo.wi.mit.edu/primer3/) and their genomic locations were confirmed by UCSC's In-silico PCR option. The list of primers used in this study is provided in Supplementary Table S10. These primers were subjected to PCR and to amplify the junction specific PCR product. Quantitative PCR (qPCR) was carried out using SYBR Green master-mix (KAPA) on Roche LC480 system using primers detailed in Supplementary Table S10. Fusion expression was determined using the delta-delta CT method 46 and using B2M as the house-keeping gene.