Identification of prognostic alternative splicing events in sarcoma

Sarcoma is a rare malignancy with unfavorable prognoses. Accumulating evidence indicates that aberrant alternative splicing (AS) events are generally involved in cancer pathogenesis. The aim of this study was to identify the prognostic value of AS-related survival genes as potential biomarkers, and highlight the functional roles of AS events in sarcoma. RNA-sequencing and AS-event datasets were downloaded from The Cancer Genome Atlas (TCGA) sarcoma cohort and TCGA SpliceSeq, respectively. Survival-related AS events were further assessed using a univariate analysis. A multivariate Cox regression analysis was also performed to establish a survival-gene signature to predict patient survival, and the area-under-the-curve method was used to evaluate prognostic reliability. KOBAS 3.0 and Cytoscape were used to functionally annotate AS-related genes and to assess their network interactions. We detected 9674 AS events in 40,184 genes from 236 sarcoma samples, and the 15 most significant genes were then used to construct a survival regression model. We further validated the involvement of ten potential survival-related genes (TUBB3, TRIM69, ZNFX1, VAV1, KCNN2, VGLL3, AK7, ARMC4, LRRC1, and CRIP1) in the occurrence and development of sarcoma. Multivariate survival model analyses were also performed, and validated that a model using these ten genes provided good classifications for predicting patient outcomes. The present study has increased our understanding of AS events in sarcoma, and the gene-based model using AS-related events may serve as a potential predictor to determine the survival of sarcoma patients.


Scientific Reports
| (2021) 11:14949 | https://doi.org/10.1038/s41598-021-94485-x www.nature.com/scientificreports/ in cancer-related pathways, and may be both biomarkers and targets for sarcoma diagnostics, prognostics, and treatments. The Cancer Genome Atlas (TCGA) database contains a large number of gene profiles that can be used to investigate novel AS-related gene expression and prognosis data for cancer. In this study, we systematically analyzed AS events and constructed an AS-based gene model based on the sarcoma database in TCGA.

Results
The sarcoma landscape of AS events. AS-event profiles were identified and analyzed in-depth from 236 sarcoma samples from the TCGA sarcoma cohort. We identified the seven AS patterns (ES, ME, RI, AP, AT, AD, and AA) listed in Fig. 1A. In total, we detected 9674 AS events in 40,184 genes, as illustrated in Fig. 1B. The ES pattern (15,311) was the most frequent AS-subtype, accounting for almost one-third of all events, followed by AT (8287), AP (7837), AA (3197), AD (2816), RI (2572), and ME events (164). Collectively, our findings demonstrate that these seven AS subtypes are frequently involved in sarcoma.

Prognosis-associated AS events.
To detect AS-event gene frequencies in sarcoma and survival, we integrated the clinical survival data from sarcoma patients. Using a univariate survival analysis, we obtained 2974 significant associations between prognoses and AS events (Fig. 1C), and 2311 significant associations between AS events and survival-related genes (Fig. 1D). We determined that 267 of these genes intersected between the AS-event cohort and the survival cohort (Fig. 1E). These results indicate that the majority of ES, AT, and AP events are significantly associated not only with overall prognoses for survival but also with survival genes.

AS-event subtypes indicate different prognoses.
To determine the prognoses associated with the seven different AS-event subtypes, we calculated the relationships between AS subtypes and the landscape for overall survival genes. The seven AS-event subtypes were closely associated with overall survival prognoses  (Fig. 2C), suggesting that AS-event subtypes associated with overall survival could serve as novel markers for prognostic predictions. The AP subtype also performed best in the variable-splicing associated with these 15 survival genes (Fig. 2D), indicating that analyses of AS events could be used as a new prognostic classification method for sarcoma.

AS event related gene interaction network construction.
To further explore the interactions between survival-related AS events in sarcoma, we constructed a STRING database network using score values > 0.4. The 267 significantly associated overall survival genes, for the seven AS-event subtypes, were illustrated in Fig. 3. Using the protein-protein network construction for the corresponding genes involved in the regulatory network for each AS-event, the analysis showed that the AD and RI subtypes had the majority of interactions.
The results also demonstrated that the prognosis-related genes associated with variable-splicing events had protein-protein interactions, with most of these genes involved in different biological functions.

Gene functional analysis.
To investigate the gene functions involved in the different types of variablesplicing events that were significantly related to patient prognoses, we performed a KEGG pathway-enrichment analysis 22 (Fig. 4A). The ES and ME subtypes were involved in a variety of pathways. The ME subtype was correlated with platelet activation, focal adhesion, arginine and proline metabolism, and axon guidance, among others. The ES subtype was correlated with the regulation of the actin cytoskeleton, nucleotide excision repair, www.nature.com/scientificreports/ mismatch repair, and DNA replication, among others. The RI subtype was involved with ribosomes, the biosynthesis of amino acids, and Alzheimer's disease, and the AP subtype was involved in metabolic pathways. The AA subtype was involved in amino-sugar and nucleotide-sugar metabolisms, and the AT subtype was involved in purine metabolism, cytosolic DNA-sensing pathways, and with RNA polymerase (Fig. 4B). These genes were also enriched in multiple disease-related pathways, suggesting their involvement in a multitude of biological functions through these seven subtypes of AS events that are significantly related to patient survival.

Gene expression profiling and prognosis.
To determine any correlations between gene expression and the prognoses for these variable-splicing events, that were significantly related to patient outcomes, we utilized a univariate survival analysis based on TCGA RNA-seq gene expression data. From this data, we used the 267 survival-related AS genes that were correlated with patient prognoses, and then Pearson correlations were performed to assess significance. We found that 149 (55.81%) of these genes were significantly related to variablesplicing (P < 0.05), indicating that the variable-splicing events of most genes were significantly related to their expression levels.

Hub gene selection.
To determine the hub genes associated with AS events, we identified 21 genes with expression levels and AS-related Pearson correlation coefficients with absolute values > 0.5. A Cox multivariate analysis was then performed to identify independent prognostic factors. Using these methods, we identified ten genes (TUBB3, TRIM69, ZNFX1, VAV1, KCNN2, VGLL3, AK7, ARMC4, LRRC1, and CRIP1). To understand the   www.nature.com/scientificreports/ relationship between potential hub genes and survival status, the risk scores for these 21 genes were calculated (Fig. 5A), and from these scores we established a ten-gene risk model. The resulting heat map revealed that genes ARMC4, LRRC1, and TUBB3 were overexpressed in the high-risk group, and that genes AK7 and CKNN2 were overexpressed in the low-risk group (Fig. 5B). Using the TCGA sarcoma dataset values, a survival status map was plotted to demonstrate the status for each sample (Fig. 5C). The risk scores, gene expression differences, and survival status for the ten hub genes, highlighted that AS events had the potential for prognosis predictions in sarcoma. We then constructed a multi-factor survival model and used it to analyze the variable-splicing events and expression-profile levels of these ten hub genes. The prognosis classifications are shown in Fig. 5D-G. This model demonstrated good prognostic classifications based on the RNA-seq and SpliceSeq data, and the AUC values were very high, suggesting that the ten genes of this model could be used as prognostic biomarkers for sarcoma.
Hub gene expression and overall survival of patients with sarcoma. These ten hub genes also had predictive value for patient survival. As shown in the Fig. 6A-G, the elevated expression levels of KCNN2, CRIP1, AK7, ZNFX1, VAV1, TRIM69, and VGLL3 were significantly associated with a better overall patient survival. However, the Kaplan-Meier analyses showed that the increased expressions of LRRC1, ARMC4, and TUBB3 were significantly associated with unfavorable patient prognoses (Fig. 6H-J). Overall, the findings demonstrate that these hub genes may serve as novel biomarkers for use in prognostic predictions.

Discussion
AS events frequently affect RNA binding, targeting specific RNA sequences or motifs which play important roles in gene expression regulation, cellular growth, development, tissue homeostasis, and RNA-species diversity 8,14 . AS events can be detected at the transcript level, using microarrays and RNA-seq data, and accumulating evidence www.nature.com/scientificreports/ indicates that aberrant RNA splicing patterns are associated with both the growth and progression of tumors 23,24 . Survival-related splicing factors are also important for AS events, as events that are favorable to overall survival were reported to be negatively associated with splicing factors in soft tissue sarcoma 25 . In Ewing sarcomas, hnRNPM-dependent AS promoted drug resistance and drove resistance to the inhibition of the PI3K/AKT/ mTOR pathway 26 . It was reported that the majority of survival-associated AS events were also poor prognostic markers for sarcoma 27 . EWS-FLI1 was reported to play a crucial role in the AS-regulation process of Ewing sarcoma 28 . Consistent with previous studies, our analysis of the relationship between AS events and sarcomapatient prognoses revealed that AS events were involved in sarcoma progression and may be potential predictors for sarcoma prognoses.
In this study we also identified ten prognosis-related genes (TUBB3, TRIM69, ZNFX1, VAV1, KCNN2, VGLL3, AK7, ARMC4, LRRC1, and CRIP1) closely associated with the occurrence and development of sarcoma. TUBB3, which codes for a microtubule protein, was overexpressed and linked to poor prognoses in a variety of cancers 29,30 . The function of TRIM69, a member of the tripartite motif (TRIM) family, has been reported to inhibit virus replication through a transcription-inhibition mechanism that prevented the synthesis of viral messenger RNAs 31 . ZNFX1, a novel lncRNA that has been shown to regulate cell proliferation, the cell cycle, cell migration, and cell invasion, may be a promising biomarker to predict poor prognoses in many cancers [32][33][34] . The ten hub genes were also involved with cancer initiation and progression, correlating with progression 35 . VAV1 has been reported to play a crucial role in the progression of human cancer, and AK7 expression has been positively correlated with malignant-cell proliferation in both acute lymphoblastic leukemia and Burkitt's B cell lymphoma 36 . ARMC4, an axonemal protein necessary for proper targeting, has also been identified among the novel genes associated with tumorigenesis in colorectal cancer 37 . LRRC1,a putative cell-polarity regulator, was significantly upregulated and considered to be a potential oncogene in hepatocellular carcinoma 38 . CRIP1 has been reported to be overexpressed in many cancer tissues and is considered to be an oncogene in tumor development and progression 39 .
The present ten gene-based model for prognostic predictions, focused on changes in AS events and changes in the expression levels of AS-related genes. The multivariate survival model assessed AS-related survival times for these ten hub genes, and successfully classified sarcoma prognoses. All of these hub genes are involved in a variety of oncogenesis and cancer progression functions, which indicates that this model may serve as a good predictor of survival prognoses for patients with sarcoma.
The present study verified that prognosis-associated AS events were ideal to construct a prognosis-prediction model. Using AS-related gene expression levels, we also identified the ten hub genes most relevant for this model. This novel model shows potential to contribute to both clinical and therapeutic approaches to better treat sarcoma.

Materials and methods
Data downloading and preprocessing. RNA-seq data from sarcoma AS events were downloaded from the SpliceSeq database (https:// bioin forma tics. mdand erson. org/ TCGAS plice Seq/) compiled from TCGA. Data from a total of 261 samples were collected from the TCGA database (https:// cance rgeno me. nih. gov/), including two normal tissue samples, and the RNA-seq gene expression profiles were from sarcomas and para-cancerous tissues. All represented patients had well-documented clinical information and follow-up data. The gene expression data in fragments per kilobase of exon model per million reads mapped (FPKM) were transformed into transcripts per million (TPM) values.
The corresponding gene profiles from the human genome, version GRCh38.p2 from GENECODE, were processed using gene ID conversion. In total, 236 protein-coding genes common to both the SpliceSeq and RNA-seq cohorts were identified. Lastly, we obtained the expression data for 19,754 genes for further analyses.
Analyses of RNA-seq data and AS events. SpliceSeq is a java application for the visualization and quantitation of splice junctions and exon proportions included in TCGA data. We defined seven AS subtypes for sarcomas: exon skip (ES), mutually exclusive exons (ME), retained intron (RI), alternate promoter (AP), alternate terminator (AT), alternate donor site (AD), and alternate acceptor site (AA).

Selection of survival-related AS events.
Analyses of patient survival data were made using the survival package in R software (http:// cran.r-proje ct. org/ packa ge= survi val), and the level of significance was set at P < 0.05.

Network analysis for AS-related genes.
For the analysis of gene-network interactions, we used the STRING database (http:// string-db. org/), and applied a threshold score > 0.4. Data visualization was performed using Cytoscape (https:// cytos cape. org/). A prognosis model based on ten hub genes. We performed a multivariate analysis of the gene expression data using the Cox proportional hazard-regression model. We identified the ten AS-related genes that demonstrated the most significant expression-profile changes, and then used these hub genes to create a prognostic model and for survival analyses. In addition, we determined the correlations between the expression profiles of these ten hub genes and overall patient survival based on Kaplan-Meier plots (https:// kmplot. com/).