Integrative analysis reveals the prognostic value and functions of splicing factors implicated in hepatocellular carcinoma

Splicing factors (SFs) play critical roles in the pathogenesis of various cancers through regulating tumor-associated alternative splicing (AS) events. However, the clinical value and biological functions of SFs in hepatocellular carcinoma (HCC) remain obscure. In this study, we identified 40 dysregulated SFs in HCC and established a prognostic model composed of four SFs (DNAJC6, ZC3H13, IGF2BP3, DDX19B). The predictive efficiency and independence of the prognostic model were confirmed to be satisfactory. Gene Set Enrichment Analysis (GSEA) illustrated the risk score calculated by our prognostic model was significantly associated with multiple cancer-related pathways and metabolic processes. Furthermore, we constructed the SFs-AS events regulatory network and extracted 108 protein-coding genes from the network for following functional explorations. Protein–protein interaction (PPI) network delineated the potential interactions among these 108 protein-coding genes. GO and KEGG pathway analyses investigated ontology gene sets and canonical pathways enriched by these 108 protein-coding genes. Overlapping the results of GSEA and KEGG, seven pathways were identified to be potential pathways regulated by our prognostic model through triggering aberrant AS events in HCC. In conclusion, the present study established an effective prognostic model based on SFs for HCC patients. Functional explorations of SFs and SFs-associated AS events provided directions to explore biological functions and mechanisms of SFs in HCC tumorigenesis.

correlated with poor prognosis of HCC patients 12 . It is of great significance to explore the overall expression abnormalities, prognostic value and corresponding biological functions of SFs in HCC.
In this study, we systemically analyzed the expression alterations of SFs and their prognostic values in HCC using gene expression profile downloaded from liver hepatocellular carcinoma (LIHC) of the Cancer Genome Atlas (TCGA). A prognostic model based on SFs for HCC patients was constructed, and its prognostic capacity was validated to be good. Gene set enrichment analysis (GSEA) was conducted to investigate underlying mechanisms associated with the prognostic model. In addition, we identified aberrant spliced AS events and prognostic AS events in HCC. The correlations between the 4 SFs in the model and prognostic AS events were analyzed to construct SFs-AS events regulatory network. Then functional analysis of protein-coding genes of AS events involved in the SFs-AS regulatory network further indicated the potential biological functions of AS events regulated by SFs in the prognostic model. Taken together, our present study provided a novel prognostic indicator for HCC patients and explored the potential functions of SFs implicated in HCC through regulating AS events.

Materials and methods
Data collection and processing. Gene expression counts data and clinical information of LIHC were downloaded from TCGA data portal (http:// tcgad ata. nci. nih. gov/ tcga/) 13 . The gene counts data were converted and subsequently standardized using R package "DESeq2" 14 , from which mRNA expression profile were obtained and annotated according to gene annotation file (GTF) of human downloaded from "http:// ftp. ensem bl. org/ pub/ relea se-103/ gtf/ homo_ sapie ns/". A total of 404 SF genes were identified through aggregating with the following gene sets: (1) SF-related gene sets (KEGG_SPLICEOSOME, REACTOME_MRNA_SPLICING, and REACTOME_MRNA_SPLICING_MINOR_ PATHWAY) downloaded from version 7.0 of Molecular Signature Database (MSigDB, org/gsea/msigdb/index.jsp); (2) SF-related genes downloaded from SpliceAid 2 (http:// 193. 206. 120. 249/ splic ing_ tissue. html) 15 . The list of 404 SFs were provided in Supplementary Table S1. Then the expression profile of SFs was extracted from mRNA expression profile of HCC.
We downloaded the percent spliced in (PSI) value of splicing events of HCC from TCGA SpliceSeq (https:// bioin forma tics. mdand erson. org/ TCGAS plice Seq), a data portal providing systematic profiles of AS events for all TCGA disease types 16 . PSI value represents the ratio of inclusion/exclusion normalized read counts to the total (both inclusion and exclusion) normalized read counts for a particular splicing pattern. Each AS event was assigned a unique annotation consisting of gene symbol, ID number, and splicing type. To ensure the credibility and universality of the present study, the AS events were filtered according to the following criteria: (1) the AS events with more than 75% effective PSI value; (2) the average of PSI value ≥ 0.05.

Identification of differentially expressed SFs and aberrant spliced AS events in HCC.
The expression of SFs between 50 paired tumor tissues and normal adjacent tissues of HCC were compared using R package of "limma" 17 . SFs with absolute value of log2-fold change (|log2FC|) ≥ 0.5 and adjusted P-values < 0.05 were considered significantly differentially expressed, in which P-values were adjusted using the Benjamini-Hochberg (BH) correction. The PSI value distributions of AS events between 50 normal adjacent tissues and 371 tumor tissues of HCC were compared by Wilcoxon rank-sum test to evaluate the splicing pattern alterations in HCC tissues. AS events with P value < 0.05 were considered differentially spliced. UpSet plot and Venn plot, generated by R package of "UpSetR" and "yyplot" respectively, were used to qualitatively visualize the intersecting gene sets among the types of differentially spliced AS events.

Screening for prognostic SFs and AS events in HCC.
A total of 342 HCC patients with follow-up time ≥ 30 days were included to perform univariate Cox regression analysis for dysregulated SFs and differentially spliced AS events. SFs with P < 0.05 were confirmed as prognosis-associated SFs. The hazard ratios (HRs) and 95% confidence interval (95% CI) of prognosis-associated SFs in HCC were visualized using R package of "forestplot". AS events with P < 0.05 were considered significantly correlated with the overall survival (OS) of HCC. Then UpSet plot and Venn plot, generated using R package of "UpSetR" and "yyplot" respectively, were applied to qualitatively display the intersecting gene sets among the types of prognostic AS events.
Construction of the prognostic risk score model based on SFs for HCC patients. Among the prognosis-associated SFs, the least absolute shrinkage and selection operator (LASSO) regression was conducted by R package "glmnet" to remove highly correlated SFs and prevent overfitting. Then 171 HCC patients were randomly selected to be training set, remaining 171 patients as validating set. The demographic information and clinical characteristics between training set and validating set were compared through χ 2 test or Fisher's exact test to ensure the random distribution between training set and validating set, with P < 0.05 considered statistically significant. Multivariate Cox analysis was applied using R package of "survival" to construct an optimal prognostic risk score model based on expression of SFs in training set, in which the risk scores of HCC patients were computed by the following formula: risk score = (β SF1 × expression level of SF1) + (β SF2 × expression level of SF2) + ⋯ + (β SFn × expression level of SFn). The median of risk scores in training set was set as cut-off value to stratified patients as low-risk and high-risk subgroup.
Identification the efficiency and independence of the prognostic model. We performed the logrank test and Kaplan-Meier survival analysis using "survival" and "survminer" packages in R to explore the statistical difference of OS between HCC patients in low-risk and high-risk subgroups. The sensitivity and specificity of the prognostic model was evaluated by receiver-operating characteristic (ROC) analysis. Then univariate and multivariate Cox regression analyses were performed to investigate the independent predictive value of the prognostic model compared with demographic information and clinical characteristics including age, gender,

GSEA.
To explore the potential pathways and gene sets associated with the constructed prognostic model, GSEA was performed using R package "GSEABase" to find enriched terms in the canonical pathways (C2) collected from the Kyoto Encyclopedia of Genes and Genomes (KEGG); in the ontology gene sets (C5) derived from the gene ontology resource (GO) consisting of biological process (BP), cellular component (CC), and molecular function (MF); and the oncogenic signatures gene sets (C6) which were often dysregulated in cancer. All gene sets above (C2, C5, and C6) were retrieved from Molecular Signature Database (MsigDB v6.2). It was considered significantly enriched when P < 0.01 and false discovery rate (FDR) q < 0.05. The results of GSEA were visualized by R package of "clusterProfiler".

Construction of prognostic SFs-AS events regulatory network. Spearman correlation analysis was
performed to analyze the correlation of expression of SFs in the prognostic model and PSI value of prognostic AS events. It was considered that SFs and AS events were significantly correlated when correlation coefficient r > 0.4 (or < − 0.4) and P < 0.01. Then the potential regulatory network of SFs and AS events was visualized by Cytoscape (version 3.7.2).

Protein-protein interaction (PPI) network analysis and functional enrichment analysis.
According to the human gene annotation file downloaded from http:// asia. ensem bl. org/ index. html, protein-coding genes were screened out from genes of the AS events involved in the prognostic SFs-AS events regulatory network. To explore potential interactions among these protein-coding genes, we uploaded these protein-coding genes to the STRING database (https:// string-db. org/), a biological database presenting functional protein association networks. Then the PPI network was set up with the identified genes by integrating www.nature.com/scientificreports/ the data retrieved from the STRING database. Results of PPI network analysis were visualized and analyzed via Cytoscape (version 3.7.2). Top 10 hub genes were identified through calculating the nodes' scores by cytoHubba. R package of "ClusterProfiler" was used to perform the GO enrichment analysis and KEGG pathway analysis for these protein-coding genes [18][19][20] .

Results
Identification the differentially expressed SFs in HCC. The approach and workflow of this study was illustrated in Fig. 1. To investigate the expression alterations of SFs in HCC, we compared the expression of 404 SF genes between 50 paired normal tissues and HCC tissues and identified 40 differentially expressed SFs in HCC tissues, among which 21 were upregulated and 19 were downregulated (Table 1). Hierarchical clustering analysis confirmed the significant differences in expression patterns of differentially expressed SFs between normal and tumor tissues of HCC (Fig. 2a). In addition, volcano plot displayed the distribution of differentially expressed SFs (Fig. 2b).  (Fig. 2c). Among the 13 prognosis-associated SFs, 5 SFs with hazard ratio (HR) < 1 (CLK1, SRSF5, ZC3H13, C9orf78, DDX19B) were considered protective factors; while the remaining 8 SFs with HR > 1 (FAM50A, THOC5, DNAJC6, PRPF3, DHX34, IGF2BP3. SF3B4, IL2) were considered risk factors. As expected, SFs as protective factors of HCC were significantly downregulated in HCC tissues (Fig. 2d, upper); while SFs as risk factors of HCC were upregulated in HCC tissues (Fig. 2d, lower), indicating their clinical potential as diagnostic, therapeutic, and prognostic biomarkers for HCC patients. Therefore, we applied LASSO regression analysis to the 13 prognostic SFs and identified 8 more valuable prognostic SFs (THOC5, SRSF5, DNAJC6, ZC3H13, IGF2BP3, C9orf78, SF3B4, and DDX19B) (Fig. 3a,b). Following, to easily and reliably stratify outcomes of HCC patients with SFs, we randomly categorized 342 HCC patients into training set and validating set. Except the gender, no clinical parameter was significantly different between training set and validating set, identifying their random distribution ( Table 2). In training test, the stepwise multivariate Cox regression was applied and a total of 4 SFs (DNAJC6, ZC3H13, IGF2BP3, and DDX19B) were selected to construct the final prognostic risk score model The normalized expression of these 4 SFs and their corresponding coefficients, displayed in Table 3, were used to calculate risk scores for HCC patients with the following risk score calculation formula: risk score = (0.28336 × DNAJC6 expression) + (− 0.4438 × ZC3H13 expression) + (0.226331 × IGF2BP3 expression) + (− 0.63347 × DDX19B expression). Then HCC patients were divided into high-risk and low-risk subgroup based on the median value (0.9856) of the risk scores of HCC patients in training set. The distribution of survival status, risk scores, and expression patterns of SFs (DNAJC6, ZC3H13, IGF2BP3, and DDX19B) in training set and validating set were respectively

Identification the efficiency and independence of the prognostic model for HCC patients.
To probe the relationship between the risk score computed by our prognostic model and OS of HCC patients, Kaplan-Meier analysis was performed and confirmed the OS of HCC patients in high-risk group was much shorter than those in low-risk group in both training set and validating set (Fig. 4a,b). In the training set, the area under the curve (AUC) value of ROC curve for 1, 3, 5-year-survival were 0.837, 0.726, and 0.574, respectively. In the validating set, the AUC value for 1, 3, 5-year-survival of ROC curve were 0.735, 0.652, and 0.579, respectively (Fig. 4c,d). These results confirmed the high efficiency of the prognostic model in predicting 1, 3-year survival for HCC patients. To further validate the independent predictive power of the model for HCC patients, the univariate Cox regression analysis was applied and identified that risk score calculated our prognostic model, AJCC stage, tumor size, and metastasis status were risk factors of HCC patients (Fig. 4e). Then, these risk factors were incorporated into multivariate Cox hazard regression analysis, validating risk score and metastasis status as independent prognostic factors for HCC (Fig. 4f). Collectively, these results demonstrated the prognostic signature owned good prognostic performance for HCC.  Supplementary Table S2. In enriched KEGG pathway (C2), a great majority of cancer-related pathways were activated in high-risk group, including DNA replication, cell cycle, bladder cancer, and p53 signaling pathway, etc.; while numerous metabolism-associated pathways were suppressed in high-risk group, including β-alanine metabolism, tryptophan metabolism, retinol metabolism, and pyruvate metabolism, etc. (Fig. 5a). In enriched BP, CC, and MF of GO term (C5), top 12 gene sets activated and suppressed by high-risk group were respectively displayed in Fig. 5b-d. In enriched oncogenic signatures (C6), upregulation of multiple oncogenic genes (E2F3, E2F1, VEGFA, etc.) were activated in high-risk group; whereas downregulation of several oncogenic genes (BMI1, MEL18, and CyclinD1) were suppressed in high-risk group (Fig. 5e). Collectively, these results confirmed that high-risk score calculated by our prognostic model might confer the intense oncogenic phenotype under activation of various oncogenic genes and pathways.

Construction of prognostic SFs-AS events regulatory network in HCC.
SFs exert pro-oncogenic or antitumor effects through inducing aberrant splicing process mainly. It is meaningful to investigate regulatory relationships between SFs and AS events implicated in HCC. According to distinct splicing modes, AS events  www.nature.com/scientificreports/ could be classified into the following seven types: alternative acceptor (AA), alternative donor (AD), alternative promoter (AP), alternative terminator (AT), exon skip (ES), retained intron (RI), and mutually exclusive exons (ME), as presented in Fig. 6. The PSI values of AS events were compared between 50 normal tissues and 371 tumor tissues of HCC. In total, 10,926 AS events from 5243 genes were identified to be altered in HCC tissues (Supplementary Table S3). The interactive gene sets among these seven types of dysregulated AS in HCC were quantitatively showed in Fig. 7a. Then 1757 AS events from 1144 genes were confirmed to be closely associated with the prognosis of HCC patients (Supplementary Table S4). The interactive gene sets among these seven types of prognostic AS in HCC were visualized in Fig. 7b. Following we explored the correlations of expression of SFs in our prognostic model (DNAJC6, ZC3H13, IGF2BP3, DDX19B) and PSI values of prognostic AS events through Spearman correlation analysis, and identified 39 ZC3H13-associated AS events, 53 IGF2BP3-associated AS events, and 106 ZC3H13associated AS events (Supplementary Table S5). However, no DNAJC6-associated AS events was screened out. According to the results of correlation analysis, we established the potential regulatory network of SFs and AS events in HCC (Fig. 7c). From the regulatory network, we concluded the specific transformations of AS events induced by dysregulation of ZC3H13, IGF2BP3, and DDX19B in HCC (Supplementary Table S6).
Functional exploration for the protein-coding genes of AS events in the SFs-AS events regulatory network. In total, there were 180 AS events from 117 genes involved in the SFs-AS events regula- www.nature.com/scientificreports/ tory network. Among these 117 genes, 108 genes were annotated to be protein-coding genes according to the human gene annotation file, which were listed in Supplementary Table S7. To better understand interactions among these 108 protein-coding genes, we established the PPI network by integrating the data retrieved from the STRING database (Fig. 8a). Hub genes ranking top 10 in the PPI network were selected by sorting node degree using cytoHubba in Cytoscape (Fig. 8b). These hub genes, including MELK, KIF4A, CHEK1, NEK2, NEIL3, CDCA3, TROAP, CLSPN, ESR1, and KIF20B, highly interconnected with other proteins in PPI network. Then we explored the potential biological functions of these 108 protein-coding genes by GO enrichment analysis and KEGG pathway analysis. The results of GO terms enriched by these protein-coding genes were presented in Fig. 8c and Supplementary Table S8. In BP, top three enriched terms were organelle fission, nuclear division, and mitotic nuclear division, which were essential for sustaining proliferation of cancer cells. In CC, only kinesin complex and transcriptionally active chromatin were significantly enriched. In MF, top three terms were steroid binding, hydrolase activity, hydrolyzing N-glycosyl compounds, and cholesterol binding. Besides, the results of KEGG pathways enriched by these 108 protein-coding genes were listed in Supplementary Table S8. Especially, top 10 enriched KEGG pathways were displayed in Fig. 8d, among which autophagy, PPAR signaling pathway, AMPK signaling pathway were closely related to tumor progression. Overlapping the C5 of GSEA in Supplementary Table S1 and KEGG pathways in Supplementary Table S5, seven mutual pathways were identified www.nature.com/scientificreports/ including glyoxylate and dicarboxylate metabolism, primary bile acid biosynthesis, complement and coagulation cascades, PPAR signaling pathway, tryptophan metabolism, propanoate metabolism, and prion disease. Therefore, we speculated ZC3H13, IGF2BP3, and DDX19B could trigger aberrant AS events and thus induce dysregulation of these 7 pathways, which might contribute to HCC progression.

Discussion
HCC is a heterogeneous tumor originating from liver parenchymal cells. Over the last few decades, increasing database-based bioinformatics analyses have made great efforts to investigate various molecular alterations, including mRNA, lncRNAs, circular RNAs, and miRNAs, to explore their biological functions and potential key molecular mechanisms involving in the pathogenesis of HCC and screen out targets as index of diagnosis, prognosis, and therapy for HCC patients [21][22][23][24] . Recently, the significance of splicing attracted increasing attention due to its capacity of expanding genomic coding capacity and increasing protein diversity at post-transcriptional level 25 . It is worth mentioning that the choices of AS events are mainly orchestrated by SFs 26 . Increasing evidence have showed expression alterations of SFs can induce the alterations of AS events, thus triggering various oncogenic process 27,28 . It has been confirmed several dysregulated SFs were closely correlated with the prognosis of HCC patients [29][30][31] . However, existing studies were limited to explore the role or molecular mechanism of a single SF gene in tumor progression. It is valuable to systematically analyze the prognostic ability of SFs and establish a novel prognostic model based on SFs for HCC patients.
In present study, we established a prognostic model consisting of four SFs (DNAJC6, ZC3H13, IGF2BP3, and DDX19B), which could classify HCC patients as high-risk and low-risk subgroups. Encouragingly, Kaplan-Meier analysis of training set and validating set revealed HCC patients in low-risk group exhibited better prognoses compared with those in high-risk group. ROC curve analysis in training set and validating set showed that the sensitivity and specificity of the prognostic model were relatively favorable. Univariate and multivariate cox regression analyses confirmed the risk score computed by our prognostic model was an independent prognostic factor for HCC patients. Furthermore, GSEA between high-risk and low-risk group of HCC patients significantly enriched multiple oncological pathways, various biosynthesis and metabolic process, which might explain the biological functions and molecular mechanisms of the prognostic model based on SFs.
In the prognostic model constructed in our study, DNAJC6 and IGF2BP3 were risk factors, while ZC3H13 and DDX19B were protective factors. DNAJC6 (DNA/HSP40 homolog subfamily C member 6) encodes the brain-specific isoform of auxilin. Auxilins is essential for the clathrin-mediated endocytosis (CME), which is crucial for material uptake of cells through clathrin-coated vesicles. Previous study has reported that two uncommon noncoding DNAJC6 variants may regulate RNA splicing, and DNAJC6 mutations is involved in autosomal recessive and early-onset Parkinson's disease 32 . Another study observed DNAJC6 was significantly upregulated in HCC and significantly correlated with tumor progression and poor outcome of HCC patients. Mechanically, DNAJC6 facilitates transforming growth factor β (TGF-β) pathway activation to promote epithelial-mesenchymal transition (EMT), thereby promotes HCC cell proliferation and invasion 33 . IGF2BP3 is a member of the insulin-like growth factor 2 mRNA binding protein family. It has been confirmed that upregulation of IGF2BP3  www.nature.com/scientificreports/ promotes initiation and progression of multiple cancers, such as bladder cancer and colon cancer. In bladder cancer, IGF2BP3 was reported to enhance cell proliferation and inhibit cell apoptosis through activation of JAK/ STAT pathway 34 . In colon cancer, IGF2BP3 binds to the mRNA of CCND1 and VEGFA via recognizing m6A modification of CCND1 and VEGFA, and enhances their mRNA stability, which facilitates cell proliferation and angiogenesis respectively 35 . A recent study has confirmed IGF2BP3 directly regulates alternative splicing of PKM and BTF3 and thus contributes to lung tumorigenesis 36 . ZC3H13 (zinc finger CCCH domain-containing protein 13), a classical CCCH zinc finger protein, inhibits proliferation and invasion of colorectal cancer cells via blocking the Ras-ERK signaling pathway 37 . DDX19B (DEAD-box Helicase 19 B) participates in regulating mRNA export and mRNA translation 38 . To date, the role of DDX19B in cancers remains unclear. Collectively, the roles of DNAJC6, IGF2BP3, and ZC3H13 in regulating cancer progression as mentioned in above studies are consistent with our present study, indicating the results based on our study are reliable. However, there is limited research on the roles of DNAJC6, IGF2BP3, ZC3H13, and DDX19B in the regulation of AS events. Therefore, we explored the correlations between these SFs (DNAJC6, IGF2BP3, ZC3H13, and DDX19B) and prognostic AS events. Then we extracted protein-coding genes from AS events regulated by SFs mentioned above for further functional exploration. seven pathways (glyoxylate and dicarboxylate metabolism, Green triangles represent SFs that were protective factors for HCC; red triangle represents SF that was risk factor for HCC; turquoise squares represent AS events that were protective factors for HCC; orange squares represent AS events that were risk factor for HCC. The red lines represent positive correlations while the blue lines represent negative correlations. www.nature.com/scientificreports/ primary bile acid biosynthesis, complement and coagulation cascades, PPAR signaling pathway, tryptophan metabolism, propanoate metabolism, and prion disease) were enriched by both GSEA of our prognostic model and KEGG pathway analysis of protein-coding genes of AS events associated SFs in the prognostic model. It has been reported dysregulation of glyoxylate and dicarboxylate metabolism is involved in gastric cancer and colorectal cancer 39,40 . Complement and coagulation cascades has been confirmed to be associated with chemosensitivity and overall survival of patients with soft tissue sarcoma 41 . PPAR (peroxisome proliferator-activated receptor) is a canonical pathway involved in lipid metabolism. PPAR family, composed of three transcription factors (PPARα, PPARβ/δ, and PPARγ), controls energy and metabolism balance 42 . The anticancer effect of PPAR has been elucidated in multiple cancer, such as gastric cancer and lung cancer 43,44 . Tryptophan (TRP) is implicated in neuronal function, immunity, and gut homeostasis, etc. The imbalance in the synthesis of TRP metabolites has been demonstrated to be associated with neurologic and psychiatric disorders, chronic immune activation and immune escape of cancers 45 . Thus, we speculated ZC3H13, IGF2BP3, and DDX19B might participate in the occurrence and development of HCC through regulating their correlated AS events and inducing dysregulation of above cancer-related pathways. There were several limitations in this study. Firstly, the prognostic model based on SFs was only verified in the internal data of TCGA but not verified in external independent cohorts. Secondly, the prognostic model based on SFs is not yet clinically validated. Thirdly, the regulatory relationship among SFs and AS events were established through statistical correlations, and further biological experiments are needed to verify the exact AS events regulated by ZC3H13, IGF2BP3, and DDX19B. Forth, the biological functions and molecular mechanisms of the prognostic model implicated in HCC progression are preliminary explored by bioinformatic analysis, which also need large amounts of biological experiments to validate in the future. www.nature.com/scientificreports/ Taken together, we established an independent and robust prognostic model based on prognosis-associated SFs, providing novel targets for diagnosis, prognosis, and therapy of HCC. In addition, we constructed the prognostic SFs-AS events regulatory network, and explored the potential roles of SFs via modulating AS event in HCC, which paved the way for seeking novel biological functions and molecular mechanisms of SFs in HCC tumorigenesis and progression.

Data availability
Gene expression data and clinical information of HCC can be accessed in TCGA. The alternative splicing events data of HCC can be accessed in TCGA SpliceSeq.