Characterization of the nuclear and cytosolic transcriptomes in human brain tissue reveals new insights into the subcellular distribution of RNA transcripts

Transcriptome analysis has mainly relied on analyzing RNA sequencing data from whole cells, overlooking the impact of subcellular RNA localization and its influence on our understanding of gene function, and interpretation of gene expression signatures in cells. Here, we performed a comprehensive analysis of cytosolic and nuclear transcriptomes in human fetal and adult brain samples. We show significant differences in RNA expression for protein-coding and lncRNA genes between cytosol and nucleus. Transcripts displaying differential subcellular localization belong to particular functional categories and display tissue-specific localization patterns. We also show that transcripts encoding the nuclear-encoded mitochondrial proteins are significantly enriched in the cytosol compared to the rest of protein-coding genes. Further investigation of the use of the cytosolic or the nuclear transcriptome for differential gene expression analysis indicates important differences in results depending on the cellular compartment. These differences were manifested at the level of transcript types and the number of differentially expressed genes. Our data provide a resource of RNA subcellular localization in the human brain and highlight differences in using the cytosolic or the nuclear transcriptomes for differential expression analysis.

The mammalian transcriptome harbors a myriad of coding and non-coding RNA transcripts, with biogenesis, abundance and subcellular localization tightly regulated to match the physiology and function of the cell. Genomewide technologies such as RNA sequencing have been instrumental in characterizing the transcriptome architecture. However, transcriptome analysis using RNA sequencing has primarily been performed on total or polyA+ RNA from whole cells, overlooking the spatial dimension of gene expression at the subcellular level. Nevertheless, an increasing number of reports are pointing to the importance of investigating the subcellular repertoire of RNA molecules and understanding the mechanisms that govern their distribution inside the cell [1][2][3][4] . Subcellular RNA localization is widespread and conserved from bacteria to mammals, and it is becoming evident that this process plays crucial roles in regulating gene expression 5,6 . For mRNA, subcellular localization provides a means to spatially control protein production and target proteins to their site of function [7][8][9][10][11] . Therefore, the differential distribution of mature mRNA between the nucleus and the cytoplasm may have a direct impact on protein expression levels. In fact, it has been recently shown that nuclear retention of mature mRNA is a mechanism that involves a wide range of protein-coding transcripts. This nuclear retention is believed to ultimately fine-tune mRNA translation in the cytoplasm 12 . Similarly, identifying the subcellular localization of non-coding RNA such as long non-coding RNAs (lncRNAs) can provide substantial insights into their biology and function. Based on the link between localization, function, and regulation it is important to map the subcellular localization of coding and non-coding RNA in different cells and tissues to obtain a more comprehensive understanding of RNA's biological functions.
Early attempts to determine the cellular localization of RNA molecules were performed on one transcript at a time, mostly using RNA fluorescent in situ hybridization (RNA FISH) 13,14 . However, in the last few years, the field has benefited from the use of various high-throughput technologies, such as expression microarrays and RNA-seq in combination with cellular fractionation 11,[15][16][17] . The general aim of these studies was to characterize the contribution of each of the nuclear and cytoplasmic transcriptomes to the global gene expression profiles. Several studies have demonstrated that although the nucleus and cytoplasm contain overlapping populations of RNA transcripts, there are significant differences between the two fractions, and many coding and non-coding transcripts are unique to one fraction 17 . The differentially localized transcripts exhibit particular global characteristics. For example, transcripts localized to the cytoplasm tend to have shorter coding sequences and shorter UTRs compared to the transcripts localized in the nuclear fraction 15,18 .
The subcellular expression patterns of lncRNA have also been the focus of several studies during the last years. Although a large number of lncRNAs have been identified, the biological relevance for the majority of lncRNAs is still elusive 19,20  In this study, we utilized an efficient cytoplasmic and nuclear RNA extraction protocol 28 to purify populations of subcellular RNA fractions from human fetal and adult frontal cortex, as well as fetal cerebellum. We investigate the differences in relative transcript abundance between cytosol and nucleus and highlight genes and gene categories that are over-represented in either compartment. Our results also provide important insight into the subcellular localization of nuclear-encoded-mitochondrial proteins (NEMPs) and noncoding RNA in brain tissue samples.

Results
To characterize the abundance of transcripts in the nuclear and cytosolic RNA fractions from brain tissue we sequenced total RNA extracted from six adult frontal cortex samples, three fetal frontal cortex samples, and cerebellum from the same three fetal samples. The nuclear and cytosolic RNA was also sequenced from the human neuroblastoma cell line (SHSY-5Y) for reference.  Figure 1A). Similarly, principal component analysis (PCA) separated all the samples according to their subcellular fractions and tissue sources along the PC1 and PC2, respectively ( Figure 1B).
To further investigate the distribution of transcripts across the cytosolic and nuclear fractions and identify genes that show differential abundance between nucleus and cytosol, we first investigated how different classes of transcripts are distributed. We found that protein-coding genes were equally distributed between the nuclear and the cytosolic fractions, whereas lncRNAs and small nucleolar RNAs (snoRNAs), as previously shown 29 , were more abundant in the nuclear fraction ( Figure 1C and D). We could confirm the nuclear localization of a number of highly expressed lncRNAs with known nuclear localization (including XIST and MALAT1). We observed similar distribution patterns when each brain tissue was analyzed separately (Supplementary Figure 1). However, the most skewed distribution was exhibited by snoRNAs, which are well-known to have their primary function in the nucleus. Examples of snoRNA with extreme nucleus bias include SNORD114-12 and SNORD113-8 ( Figure 1C). Overall, the PCA and distribution of the different transcripts are in agreement with previous findings and indicate that our analysis readily identifies transcripts with a previously demonstrated cytosolic or nuclear localization 33 .

Protein coding genes
To identify transcripts exhibiting differential distribution in the cytosolic and nuclear fractions, we first analyzed protein-coding transcripts across all tissues. In total, we identified 5,109 transcripts that were significantly more abundant in the cytosol and 5,397 transcripts significantly more abundant in the nucleus (P adj. <0.05, Supplementary   Figure 2A) and that NEMPs were significantly enriched in the cytosol compared to all other protein-coding genes (p= 2.8x10e-76, Figure 2B). In our cellular fractionation protocol, we used a centrifugation power of 10,000 rpm, which results in the mitochondria being pelleted together with the nuclear fraction. Therefore, it is unlikely that the enrichment of NEMPs in the cytosol was due to NEMP RNAs being physically anchored to the mitochondria. We further validated this by showing that the transcripts produced by the mitochondria were clearly nuclear in their localization (p= 6.4x10e-08, Figure   2A and  Figure 2C). To further test if the cytosolic enrichment of NEMPs was consistent across various tissue types, we extended our analysis to cytosolic and nuclear RNA-seq data from 11 cell lines from the ENCODE project 22,37 . In agreement with our brain data, we found that NEMPs exhibited preferential cytosolic localization in all the cell lines (Supplementary Figure 5), indicating that the cytosolic localization of NEMPs is a general phenomenon that involves various human tissues.
One explanation for the increased abundance of a group of transcripts in the cytosol could be an increased mRNA half-life. To examine if the preferential cytosolic localization of NEMPs was due to long half-life, we used published mRNA half-life data 38 and inquired whether these transcripts have a longer half-life than the rest of protein-coding transcripts. We found that NEMPs displayed significantly longer half-life compared to other protein-coding transcripts (p= 7.1x10e-12, Figure 2D), which could explain the preferential enrichment of NEMPs in the cytosol. An analysis of the half-life of all proteincoding transcripts significantly associated with the cytosolic or nuclear compartments showed a significant difference in half-life, with a longer half-life for transcripts enriched in the cytosol (p= 1.6x10e-21, Figure 2E).

Long non-coding RNA (LncRNAs)
We next investigated if lncRNAs also exhibit differential distribution in the nuclear and cytosolic fractions across all samples. Overall, we found a higher representation of lncRNAs in the nucleus ( Figure 3A). Out of all expressed lncRNAs, we identified 1,114 lncRNAs in the nucleus compared to 118 lncRNA in the cytosol (P adj. <0.05; Supplementary Table 4). The analysis in Figure  Although there was an overlap within the cytosolic and within nuclear lncRNAs between the tissues, each tissue type harbored a group of unique lncRNAs. We further investigated these tissue-specific cytosolic and nuclear lncRNAs, which demonstrated differential localization between the cytosol in a single tissue and found that the majority of these were either not expressed or expressed at low levels in the other tissues.
To further understand how subcellular localization of lncRNAs varies between different tissue types, we explored the subcellular distribution of lncRNAs in 10 cell lines from the ENCODE project. In this analysis, we only focused on lncRNAs that showed significant subcellular localization in either the cytosol or the nucleus in the brain samples. Although we found an overlap of lncRNAs between the subcellular compartments, there were many lncRNAs that exhibited differential subcellular localization between the brain samples and the cell lines ( Figure 3C). A range from 35 to 69 lncRNA displayed differential localization between the brain and each of the different cell lines (Supplementary Table 5), such as LINC00672, which was cytosolic in the brain and nuclear in 7 out of 10 cell lines. The lncRNAs RP11-452F19.3 and RP11-449H11.1 were nuclear in the brain and cytosolic in 5 and 3 different cell lines, respectively. We also found that many of the lncRNAs that display differential localization between the brain and cell lines differ within the different cell types (Supplementary Table 5). These results further support that subcellular localization of lncRNAs is often tissue specific. The lncRNAs that displayed similar localization patterns between the brain and cell lines are listed in Supplementary Table 6.

The cytosolic and nuclear transcriptomes as a source for gene expression analysis
There is a longstanding interest in understanding the reliability of using the cytosolic or nuclear RNA as a proxy for whole-cell gene expression analysis.
To obtain a more accurate view of the differences between the cytosolic and nuclear transcriptomes as a source for gene expression analysis, we compared the results of differential gene expression analysis between cytosolic and nuclear transcriptomes from the fetal frontal cortex and the adult frontal cortex (cytosolic vs cytosolic [cyto-cyto], and nuclear vs nuclear [nucnuc]). To control for the differences in the amount of sequencing data produced from the fetal and adult tissues (Supplementary Table 1), we first subsampled the read counts to obtain equal library sizes for the samples.
Differentially expressed genes (DEGs) between the two samples for the cytocyto and nuc-nuc before and after subsampling showed that the subsampling had a marginal effect on the number of DEGs ( Figure 4A). Subsampled data were used for the rest of the analysis in Figure 4. We then compared the number of up-regulated and down-regulated genes between the two tissues in the cyto-cyto and nuc-nuc data. There was a clear difference in the number of DEGs detected in cyto-cyto and nuc-nuc, with a higher number in the cytocyto analysis compared to the nuc-nuc ( Figure 4B). Comparing the lists of DEGs between the two analyses revealed that although there was an overlap between the DEGs identified from the two analyses, a large fraction of the DEGs was unique to cyto-cyto analysis ( Figure 4B). GO analysis for the unique list of DEGs obtained from the cyto-cyto comparison revealed enrichment of transcripts related to translational processes, various metabolic and biosynthetic processes and gene expression and RNA processing.
Meanwhile, the unique DEGs from the nuc-nuc analysis were mainly enriched for transcripts that belong to mitochondrial translation, which as we mentioned was a result of the mitochondria being separated with the nuclear fraction in the extraction protocol (Supplementary Table 7 -10). Dividing the lists of DEGs from Figure 4B into protein-coding and lncRNA revealed that even though the nuc-nuc analysis resulted in a smaller number of protein-coding genes, it harbored a larger number of differentially expressed lncRNAs as compared to the cyto-cyto analysis, which was not surprising given that most lncRNAs were localized to the nucleus ( Figure 4C). Collectively, these results indicate that although the cytosol and the nuclear fractions contain largely overlapping lists of DEGs, each fraction also contains unique groups of DEGs. To our surprise, when we investigated the expression fold change of all DEGs obtained from cyto-cyto, and the nuc-nuc analysis, a group of 61 genes showed opposite differential expression patterns ( Figure 4D). For example, ETV2, which encodes a transcription factor and is overexpressed and critically required during development, was up-regulated in the fetal sample when using cyto-cyto analysis and up-regulated in the adult sample when using the nuc-nuc analysis. Other genes that showed opposite differential expression patterns are highlighted in Figure 4D and listed in Supplementary Table 11 and Supplementary Table 12. These results imply that gene expression analysis using either of these fractions could lead to different conclusions on differential expression between samples. Therefore, we suggest that neither the cytosolic nor the nuclear transcriptomes alone represent an accurate proxy for a complete view of gene expression levels in a cell. In addition, using either the cytosol or the nucleus may lead to conflicting data.
A recent study by Lake et al. revealed a high concordance in the nuclear and whole-cell transcriptomes from mice when profiling cellular diversity using celltype-specific marker gene expression signatures 39 . We compared our list of significant cytosolic and nuclear transcripts obtained from all brain samples with the whole-cell and nuclear data from Lake et al. and found that our analysis resulted in a larger number of cytosolic/nuclear-localized genes (5201 compared to 1518 cytosolic genes, and 5548 compared to 791). This is more likely due to the higher complexity and heterogeneity of samples used in our study. We also found that 65% of their cytosolic-accumulated and 45% nuclear-accumulated genes overlapped with our results, which validates the reliability of our analysis (Supplementary Figure 7). We investigated for overlaps between our list of genes that demonstrated opposite differential expression patterns between the cytosol and nucleus (Supplementary Table   11 and 12) with the list of cell-type-specific genes from Lake et al. We found three cell-type specific markers: Hsd17b10 for Oligodendrocyte, AK3 for Astrocyte, and Chrnb1 for Endothelial cells which showed opposite differential expression patterns between the cytosol and the nucleus i.e. Hsd17b10 and Chrnb1 were upregulated in the fetal sample when using cyto-cyto analysis and upregulated in the adult sample when using the nuc-nuc analysis.
Meanwhile, AK3 was upregulated in the adult sample when using cyto-cyto analysis and upregulated in the fetal sample when using the nuc-nuc analysis.

Discussion
To date, transcriptome analysis has mainly relied on analyzing RNA sequencing data from whole cells, overlooking the impact of subcellular RNA localization and its influence on our understanding and interpretation of gene expression patterns in cells and tissues. We argue that defining gene expression patterns at the subcellular level is crucial to 1) understand the differences between the cytosolic and nuclear transcriptomes and how each fraction contributes to the global gene expression patterns, 2) investigate the dynamics of subcellular RNA localization between different cells and different developmental stages as a regulatory mechanism to control protein expression and influence RNA function, and 3) evaluate the reliability of using nuclear transcriptome in single-cell studies as a proxy for whole-cell analysis.
In this study, we therefore aimed to obtain a comprehensive overview of the subcellular localization of the different RNA transcripts in the human brain.
One group of transcripts strongly associated with the cytosol was the nuclear- where independent extraction protocols were employed. We also showed that the cytosolic localization of NEMP mRNAs was not a result of their association to the mitochondrial membrane since the mitochondria end up in the nuclear fraction in our cellular fractionation procedure. NEMPs have previously been assigned to classes based on their translation location in relation to mitochondria (MLR class) 43 , but we found no correlation between increased cytosolic localization and MLR class (data not shown). We were also unable to find any correlation with the strength of the mitochondrial localization signal or scores based on evidence for mitochondrial localization (TargetP score or MitoCarta2.0 score) from the MitoCarta database 34 (data not shown). We further determine that NEMP mRNAs on average have a longer half-life than the rest of protein-coding genes. Thus, our data suggest that NEMP mRNAs are rapidly transported from the nucleus after transcription and accumulate in the cytoplasm until their translation is required by the mitochondria. In line with our results, a recent study in yeast sought to monitor nuclear and mitochondrial gene expression during mitochondrial biogenesis.
The study demonstrated that the synchronization between the two compartments does not occur on the transcriptional level. Instead, they are rapidly synchronized on protein translation level 44 .
It is becoming increasingly evident that subcellular localization of lncRNAs is strongly associated with their function. In light of this, ongoing efforts such as the LncATLAS and RNALocate are aiming to establish a complete map for subcellular localization of lncRNAs in various tissue types 3,29 . So far, these databases rely on data from cell lines and lack knowledge about the subcellular localization of lncRNA in tissues and during development. Studies that have investigated cellular localization of lncRNAs during development were mainly performed in drosophila and zebrafish 45,46 . Therefore, in this study, we sought to provide global data of lncRNA subcellular localization in the human brain for two developmental stages. In accordance with previous studies, we found that the majority of lncRNAs are more abundant in the nucleus than in the cytosol 22 . We further validated previous findings of tissuespecific expression and subcellular localization of lncRNAs in the adult frontal cortex, fetal frontal cotex, and cerebellum. The higher number of cytosolic and nuclear-enriched lncRNAs detected in the adult frontal lobe compared to the fetal samples is most probably due to the larger number of adult brain samples used in this study, as well as a more homogeneous expression in the adult brain compared to the fetal samples (represented by 14 and 38-week embryos), providing better power to detect significant differences in localization Considering the differences in subcellular localization between the different brain samples, we investigated whether subcellular localization of lncRNAs could vary between different tissue contexts. When we compared the subcellular localization patterns of lncRNAs in the brain samples and ten cell lines from the ENCODE project, we clearly show that some lncRNAs have opposite subcellular localization patterns in different samples. This indicates that the subcellular localization of certain lncRNAs is dynamic, which could be explained by several processes including tissue-specific functionality or regulation, or that some lncRNA are retained in the nuclear compartment to regulate their function, assuming that their site of action is in the cytosol. Our results provide a resource for lncRNAs enriched in cytosol and nucleus in both fetal and adult human brain samples. We also provide evidence for the differential subcellular localization of lncRNAs in tissues. These data further encourage the need to understand the mechanisms that control the subcellular fate of lncRNAs and the biological consequences of this process.
The impact of using transcriptome data obtained from either cytosolic or nuclear RNA as a reliable proxy for gene expression analysis is an active topic of investigation. Recently, single nucleus RNA sequencing has been introduced as an alternative to whole single-cell sequencing, particularly in tissues where it is challenging or impossible to recover intact cells 36,47,48 . In addition to the data presented in this study, several other studies demonstrated significant differences between the nuclear and the cytosolic transcriptome 17,49 . Also, the compartmentalization of mRNA transcripts in the nucleus was shown to play an active role in regulating gene expression in different biological contexts and to prevent fluctuations arising from bursts in gene transcription 12,50 . On the same note, several studies revealed a higher correlation between RNA-seq data from the cytosol and whole-cell RNA-seq compared with RNA-seq data from nuclei 15,50 . On the other hand, several studies comparing data from single-nuclei and single-cell concluded that nuclear RNA sequencing faithfully replicates data from the whole-cell 36,39,47 .
In light of this, we sought to analyze gene expression differences between fetal and adult brain cortex, and compare the results from cytosol vs cytosol RNA-seq to the results from nucleus vs nucleus RNA-seq. We identified almost twice the number of differentially expressed genes in the cyto vs cyto analysis compared to the nuc vs nuc analysis. Our data show that even though there was an overlap between the lists of DEGs between two analyses, almost 50% of the cyto vs cyto and 35% of the nuc vs nuc DEGs were unique for each analysis. We also note that 61 transcripts showed contradictory differential expression patterns between the two analyses.
A recent study by Lake et al. 39 . found high concordance between single nuclei and whole-cell data and concluded the efficiency of single nuclei sequencing to interpret cellular diversity. We found on average more than 50% of cytosolic-enriched and nuclear-enriched transcripts from their study overlap with cytosolic and nuclear-enriched transcripts found in our data. We also found overlaps, particularly for the metabolic processes, between the two data sets for the list of ontology categories enriched in both the cytosol (or whole cell in the Lake et al. study) and nucleus. While we find a large overlap in genes identified as differentially expressed between samples using comparisons of cytosolic vs cytosolic and nuclear vs nuclear RNA, respectively, our data also highlight that certain categories of genes will give rise to differences in results depending on the cellular compartment used.
This includes lncRNA, which are generally overrepresented in the nucleus, but primarily thought to be located in the compartment where they are functional. We further find many genes to be exclusively found in either cyto vs cyto or nuc vs nuc analysis, with e.g. differential expression of genes involved in translational processes and ribosomal function to be better assayed using cytosolic RNA. In single-cell analysis, there is now a range of protocols ranging from those employing very mild lysis that will primarily interrogate cytosolic RNA, to harsh lysis and single nucleus sequencing that will capture the whole cell or only nuclear RNA, respectively. Our results show important differences that may arise depending on the protocol used, and highlight the importance of which subcellular fraction that is analyzed in interpreting and comparing results from single-cell studies.

Conclusions
Our data provide the first resource for the subcellular localization of RNA transcript in the human brain. We show significant differences in RNA expression for both protein-coding and lncRNAs between the cytosol and the nucleus. We also provide novel knowledge about the subcellular localization of NEMP transcripts, which could provide deeper insights into mitochondrial gene expression regulation. Our data suggest that using cytosol or nuclear RNA as a source for gene expression analysis might bias measurements of transcript levels.

Sample preparation
Cytoplasmic and nuclear RNA was purified from brain samples and SHSY-5Y cells using Cytoplasmic and nuclear RNA purification kit (Norgen) with modifications as published in 28 . In short, Cytoplasmic and nuclear RNA was purified from two fetal frontal cortex using Cytoplasmic and nuclear RNA purification kit (Norgen) with modifications as illustrated in Figure 1. In short, 20 mg of frozen tissues were ground in liquid nitrogen using mortar and pestle. Tissue powder was transferred to ice cold 1.

Read mapping and expression quantification
Data was delivered as BAM files from TMAP, but since these were not aligned with an aligner for spliced reads, therefore, the BAM files were converted into FASTQ format. Quality metrics of the raw data were determined with RSeqC 2.6.1 51 . Reads shorter than 36bp were filtered out and adaptor sequences were trimmed from the remaining data using cutadapt v1.9.1. Read alignment was done with STAR v2.4.2 52 , with default parameters using the human reference genome (hg19). Gene expression was quantified from the alignments using HTSeq count (v0.6.1) 53 by providing Ensembl gene annotation (v75) in GTF format. Only uniquely mapped reads overlapping with exonic regions of a single gene were considered for quantification. Gene biotype annotation was extracted from the Ensembl GTF file.

Differential expression analysis
Strand-specific read counts were provided to R package DESeq2 (v1.12.4) 54 to identify differentially expressed genes (adjusted p-value <0.05, Benjamini-Hochberg correction). Gene counts were normalized prior to differential analysis using the normalization method implemented in DESeq2. To identify differentially expressed snoRNA genes, all the rRNA and tRNA genes were removed from the quantified genes prior to performing differential analysis.
For the cyto-cyto and nuc-nuc differential gene expression analysis, we subsampled the read counts from the samples to obtain equal library sizes using metaseqR 55 . Differential expression analysis was performed in the similar way as mentioned earlier.

Samples clustering and PCA analysis
Euclidean distances between the samples were calculated on regularized logtransformed (rlog) counts. The distance matrix was used to compute hierarchical clustering of the samples. The Principal Component Analysis (PCA) was performed with prcomp function in R using rlog values of top 500 genes with high variance. Clustering and PCA were based on protein-coding and lncRNA genes.

Gene Ontology enrichment
The gene ontology (GO) term enrichment of DEGs was analyzed using GOSeq (v.1.24) 56 , controlling for gene-length bias. All genes were used as a background set in order to find over-represented GO terms in the DEGs.
Hypergeometric test (Benjamini and Hochberg correction) was performed to compensate for multiple testing at a significance level set to a value of <0.05 in our analyses.

Gene length distribution
Genes expressed in all fetal and adult tissues (base mean expression >= 10) were sorted by length. We then employed a sliding window

Ethics approval and consent
All samples were collected with informed consent and the study was approved by the regional ethical review board in Uppsala (dnr 2012/082).

Availability of data and material
RNA-seq data are deposited in the NCBI Sequence Read Archive (SRA) under accession GSE110727).

Competing interests
The authors in this study declare no competing interest  (D) Box plot shows mRNA half-life rate comparison between groups of differentially localized NEMP (dark blue) and remaining protein-coding genes.
(E) Comparison of mRNA half-life rate between cytosolic and nuclear DEGs.