Development and Validation of a Three-gene Prognostic Signature for Patients with Hepatocellular Carcinoma

Hepatocellular carcinoma (HCC) is the leading cause of cancer-related death worldwide, because recurrence often occurs in most HCC patients undergoing hepatectomy. It is necessary to identify patients with high risk for recurrence and adopt effective therapies. An obstacle to monitor patients at high risk for poor prognosis has been the lack of useful predictive biomarkers. Fortunately, recent progress in system biology allows to screen the biomarkers for HCC prognosis in a high-throughput manner. In this study, we performed systematic Kaplan-Meier survival analysis of the whole mRNA transcriptomics based on the Cancer Genome Atlas project (TCGA) and developed a three-gene prognostic signature composing of three genes UPB1, SOCS2 and RTN3. The model was validated in two independent microarray data sets retrieved from Gene Expression Omnibus (GEO) and the expression pattern of these three predictive genes in HCC was confirmed by western blot and immunohistochemistry with our HCC samples. In conclusion, our results showed that this three-gene signature has prognostic value for HCC patients.


Results
The entire workflow of how the signature was generated and validated. In order to comprehensively analyze the genomic prognostic genes in hepatocellular carcinoma, a workflow outlined the procedure for the analysis was developed (Fig. 1). We acquired the TCGA LIHC level 3 RNA-seq data and clinical datasheet from the TCGA data portal. 10 non-HCC patients and low-abundance genes were filtered, followed by the differential expression analysis. The remaining 360 patients and 9932 differentially expressed genes were enrolled in the prognostic analysis. We divided 360 patients into 3 groups based on T stage, and performed Kaplan-Meier survival analysis in every group individually. Univariate and multivariate cox regression analysis was performed with the 15 common genes, and a three-gene signature was generated. Finally, we validated the signature with microarray data retrieved from GEO (n = 289) and validated the expression pattern of the prognostic genes with our HCC samples (n = 82).

T stage and tumor weight are independent prognostic clinical parameters in patients with HCC.
To rule out the bias caused by the survival-related clinical parameters in the following analysis, we obtained the clinical dataset from TCGA and screened for the survival-related clinical index by univariate and multivariate cox regression analysis (Fig. 1). Univariate cox regression analysis revealed that pathologic stage, T stage, and tumor weight were associated with the survival time (Table 1). Surprisingly, no significant correlation was found between indexes such as Child pugh grade, histologic grade, vascular tumor invasion and clinical outcome according to TCGA clinical dataset. However, only T stage and tumor weight have statistical significance in multivariate cox regression model (Table 1,  Identification of 15 common survival-related genes in different T stages. We then stratified the patients with T stage, resulting in three different groups, with 176 patients in T1 group, 89 patients in T2 group, and 92 patients in T3/4 group. Three patients with unknown T stage were not included in this analysis ( Fig. 2A). The median was used as a cutoff value for classification of patients into high and low expression groups. Subsequently, to identify the potential mRNA with prognostic characteristics, we carried out Kaplan-Meier survival analysis of 9932 differentially expressed genes one by one with our R project algorithm in three groups individually, screening for the survival related genes. We obtained an increasing number of candidate genes from T1 to T3/4 group (617 genes in T1 group, 672 genes in T2 group and 1590 genes in T3/4 group, Supplemental Table s1). The overlaps among three groups were illustrated in the Venn diagram (Fig. 2B), and a total of 15 overlapped genes were found in the common region, indicating these 15 genes were associated with survival time regardless of the T stage ( Table 2).
Generation of a three-gene prognostic signature composing of UPB1, SOCS2 and RTN3. To better understand which genes of the 15 candidates are more representative in evaluation of clinical outcome, we further assessed these 15 genes with univariate and multivariate Cox regression analysis in a total of 360 patients. All genes except HPR and SDHAP3 were of statistically significant correlation with survival time in univariate Cox regression analysis (Table 2). Multivariate Cox's proportional hazards regression analysis indicated that the expression of reticulon 3 (RTN3), suppressor of cytokine signaling 2 (SOCS2) and beta-ureidopropionase 1 (UPB1) were independent predictors for the prognosis of HCC patients (RTN3: hazard ratio [ Table 2). Subsequently, based on the estimated Cox regression weights, we established a prognostic model, whose formula was computed as prognostic index (PI) = 0.384 × RTN3 − 0.561 × SOCS2 − 0.434 × UPB1. To explore the relationship among these survival-related genes, we performed cluster analysis based on the Spearman correlation coefficient matrix. Thirteen prognostic genes were classified into four clusters and the genes in the same cluster had similar expression pattern. Three predictive genes obtained from multivariate Cox regression analysis distributed in three different groups (Fig. 2C), which supported the representativeness of the three prognostic genes.
External validation of the three-gene prognostic model with GEO microarray data. To investigate whether our predictive model was appropriate for other datasets, we acquired two independent microarray datasets (GSE14520 26 and GSE54236 13 ) which incorporated gene expression information and survival time of HCC patients from the GEO database. We applied these two datasets to validate our prognostic model respectively. In GSE54236, the survival time for the poor prognosis patients predicted by our model (n = 40) was significantly worse than that of the good prognosis (n = 40) (Fig. 3A upper panel). In GSE14520, the good prognosis group predicted by the signature (n = 105) showed a substantial advantage in overall survival compared to the poor prognosis group (n = 104) ( Fig. 3B upper panel). We assessed the prognostic accuracy of the three-gene prognostic signature with time-dependent ROC analysis at varying follow-up times of two validation cohort GSE54236 (Fig. 3A lower panel), GSE14520 (Fig. 3B lower panel) and TCGA LIHC dataset (Fig. 3C lower panel). Flowchart describing the process used to generate and validate the prognostic signature in the analysis. The level 3 RNAseqv2 RSEM_genes_normalized files and clinical datasets were downloaded from TCGA. 10 non-HCC patients were ruled out. The gene abundance was filtered by requiring the RSEM normalized count to be ≥3 in ≥60% of 49 paired tumor tissues or nontumor tissues. The remaining 15353 genes were recruited for the differential expression analysis with paired t-test or Wilcoxon signed-rank test. The resulting 9932 differentially expressed genes were candidates for prognostic analysis. Clinical characteristics were selected as variables for univariate and cox regression analysis. T stage and tumor weight were found to be statistically significant in both univariate and multivariate cox regression analysis. 360 patients were stratified by T stage, divided into 3 groups. Kaplan-Meier survival analysis was performed in every group with the median as the cutoff value. We obtained 15 common genes in three groups, and further narrowed this gene list with univariate and cox regression analysis and built a prognostic model that included 3 genes: UPB1, SOCS2 and RTN3. This model was used to calculate risk scores for two independent expression microarrays retrieved from GEO. Finally, we validated the expression pattern of UPB1, SOCS2 and RTN3 by western blot and immunohistochemistry with our HCC samples.
Scientific RepoRts | 7: 5517 | DOI:10.1038/s41598-017-04811-5 The area under the curve(AUC) at different cutoff times indicated an acceptable predictive accuracy of the three-gene signature. These results supported the validation of the prognostic signature.
Internal validation of the three-gene prognostic model with TCGA datasheet. Next, we sought to determine whether the signature composed of UPB1, SOCS2 and RTN3 could be predictive for the clinical outcome of HCC patients. To this end, we tested the model with TCGA LIHC dataset firstly. What stands out in the figure is the difference between the outcome of two groups. Patients with high risk for poor prognosis predicted by the model had a markedly shorter overall survival time compared to patients with low risk, indicated this signature discriminated patients of good prognosis from these of poor outcome effectively ( Fig. 3C upper panel).
We subsequently evaluated the prognostic values of UPB1, SOCS2 and RTN3 with Kaplan-Meier survival analysis. The lower thirds of the sorted mRNA expression values were defined as the low expression, and the upper thirds was set as the cutoff point for high expression, thus the total 360 patients were divided into three groups based on the expression value, with 120 patients in each group. The difference among the survival curves of low, middle and high SOCS2 (Fig. 3D left panel), RTN3 ( Fig. 3D middle panel) and UPB1 ( Fig. 3D right panel) expression groups was statistically significant. As we showed before, T stage is associated with prognosis (Table 1). To ascertain that these three genes distinguishing the survival of HCC patients was not due to a possible correlation with T stage, we depicted the survival curve combined gene expression with T stage. We observed an interesting result when the T stage was taken into consideration. Consistent with the previous result, patients with advanced T stage tended to have poor prognosis overall (Fig. 3E). However, the T1 patients with low SOCS2 expression had a similar survival curve as the T2 patients with high SOCS2 patients. Likewise, patients in T2 stage with low SOCS2 expression had even poorer prognosis than T3 patients with high SOCS2 expression ( We also evaluated the comprehensive prognostic value of pairwise combination of UPB1, SOCS2 and RTN3. Patient with high RTN3 but low UPB1 expression exhibited shortest survival time, whereas patients with low RTN3 and high UPB1 showed a substantial advantage in overall survival. Patients with low RTN3 and low UPB1 expression or with high RTN3 and high UPB1 expression exhibited median survival time (Fig. 3E, left panel). Similar results could be obtained from the combination of RTN3 and SOCS2 (Fig. 3E, middle panel) or UPB1 and SOCS2 (Fig. 3E, right panel), revealing that pairwise combination of these three genes predicted the survival of HCC patients.
Validation the expression pattern of UPB1, SOCS2 and RTN3 in HCC samples. Western blot was employed to examine the expression of UPB1, SOCS2 and RTN3 in 7 pairs of matched HCC samples and adjacent normal tissues. As shown in Fig. 4A and B, UPB1 and SOCS2 exhibited lower expression than the corresponding normal tissues, while RTN3 showed a tendency of increase expression in tumor tissues (p = 0.09). To further confirm the expression pattern of UPB1, SOCS2 and RTN3 in HCC tissues, immunohistochemistry was performed on 82 pairs of HCC and adjacent non-tumorous liver tissues. The baseline clinicopathological characteristics of the HCC patient cohort for IHC was provided in Supplemental Table s3. The results demonstrated that UPB1 and SOCS2 were downregulated in HCC tissues (UPB1: p < 0.0001; SOCS2: p < 0.0001), while the expression of RTN3 was increased in tumor tissues (p < 0.0001) (Fig. 4D). The expression of UPB1 was found to be correlated with tumor capsule invaded, and no other significant association was observed between UPB1, SOCS2 and RTN3 expression and clinicopathological variables perhaps partly because of the relatively small sample size. In addition, the staining pattern indicated that UPB1, SOCS2 and RTN3 were mostly anchored in the cytoplasm (Fig. 4C). Taken together, these data indicated that the expression of SOCS2 and UPB1 was decreased in HCC, while the expression of RTN3 was increased compared with the paired noncancerous tissues. Functional analysis of prognostic genes UPB1, SOCS2, and RTN3. Comparing tumor tissue with the surrounding nontumor tissue is usually the first step in search for biomarkers, aimed to identify biological characteristics unique to tumor tissue [19]. To this end, we performed differential expression analysis in 49 paired tissues at the beginning of the study and obtained 9932 differentially expressed genes (Fig. 1). RTN3 was upregulated in HCC tumor tissue compared to nontumor tissue (p = 0.00001, Fig. 5A right panel), while expression of UPB1 in tumor tissue was significantly lower than that in nontumor tissue (p < 0.00001, Fig. 5A left panel). SOCS2 was also downregulated in tumor tissues (p < 0.00001). Besides, the expression of SOCS2 was observed to be downregulated in every pair of tissues ( Fig. 5A middle panel). We investigated the potential interactions among clinicopathological variables and UPB1, SOCS2 and RTN3 expression. As shown in Supplemental Table s2, UPB1 expression (between low expression group and high expression group) was closely correlated with tumor weight (p < 0.0001), histologic grade (p < 0.0001), Figure 2. Identification of prognostic genes with T stage stratification. (A) Kaplan-Meier curve for overall survival in HCC patients with different T stages from TCGA LIHC dataset. Patients were divided into 3 groups according to the T stages (T1, n = 176; T2, n = 89; T3 and T4, n = 92). Advanced T stage is a significant factor for poor prognosis (p < 0.0001, log-rank test). (B) All HCC patients were divided into three groups based on the T stages, and Kaplan-Meier survival analysis was performed in each group separately. Two subgroup were separated on the basis of the median expression level of each gene. 617 survival-related genes were obtained from T1 group, and 672 genes from T2 group, 1590 genes from T3/4 group. 15 common genes were found in different groups. (C) Cluster analysis of the 13 survival related genes based on the spearman correlation coefficient. The color in the heatmap represents spearman correlation coefficient. Red indicates positive correlation, and blue indicates negative correlation. 13 genes were clustered into 4 clusters, and RTN3, SOCS2, UPB1 (marked in red) were in different groups.
pathological stage (p < 0.0001), T stage (p < 0.0001) and survival status (p < 0.0001). The expression of SOCS2 (between low expression group and high expression group) was correlated with histologic grade (p < 0.0001), family history (p = 0.046), pathological stage (p = 0.001), T stage (p = 0.001), vascular invasion (p = 0.034) and vital status (p < 0.0001). The expression of RTN3 (between low expression group and high expression group) was negatively correlated with the vital status. No significant associations were observed between RTN3 expression and other clinicopathological variables in TCGA LIHC dataset.
Additionally, we evaluated the expression pattern of UPB1, SOCS2 and RTN3 in patients with or without vascular invasion. The expression of UPB1 (p = 0.002) and SOCS2 (p = 0.03) was significantly lower in HCC patients with vascular invasion (n = 105), compared with patients without vascular invasion (n = 200), while no significant differences were found in RTN3 expression (Fig. 5B). Furthermore, we investigated the expression profiles of the three predictive genes with histologic grade (Fig. 5C) and pathologic stage (Fig. 5D). Notably, with the advance of histologic grade and pathologic stage, the expression of UPB1 and SOCS2 were downregulated gradually. No significant differences among different groups were evident of RTN3 expression. These results indicated that these three predictive genes, especially SOCS2 and UPB1 were associated with tumor progression in HCC.

Discussion
HCC is one of the deadliest solid malignancies, because recurrence occurs in most patients even after undergoing potentially curative treatment. It's important to predict clinical outcome and adopt novel interventional measures in time for the patients at high risk for recurrence. Recent progress in integrated "Omics" analysis shed new light on HCC biomarker screening and validation. In this study, we investigated the prognostic clinicopathological characteristics and the predictive genes based on RNA-seq data from the public databases, established and validated of a three-gene prognostic signature containing UPB1, SOCS2 and RTN3.
Recent "Omics" technologies have opened the door to screening biomarkers in a high throughput manner. Microarray and the next generation sequencing technologies have facilitated the study of the expression of multiple genes simultaneously, offering an opportunity to probe prognostic genes on a global genome scale. However, reproducibility of the data is an inevitable problem, especially when the data were generated from different technologies and platforms. One well-known study that was often cited in research on prognostic biomarker screening for HCC is that of Hoshida (2008), which failed to identify an outcome associated signature based on gene-expression profiles of tumor tissue, but discovered and validated a gene-expression signature associated with survival based on the profiles of the surrounding nontumoral liver tissue 18 . However, we succeeded in developing a three-gene signature including UPB1, SOCS2 and RTN3 based on gene-expression profiles of tumor tissue in this study, and fortunately we have validated this model with two independent GEO microarray datasets produced from different platforms. There are several possible explanations for the inconsistency. Firstly, Hoshida's study was accomplished by hybridization-based techniques (microarrays), and only 6000 oligonucleotide probes were included in the analyses, while TCGA datasets were generated by the next-generation sequencing technologies which do not use molecular hybridization to capture transcripts and therefore do not depend on the probe numbers. The limitation of the cross-platform analysis and the limited number of probes of Hoshida's study might partly explained the inconsistency of the two studies. The second possible explanation for the discrepancy was the heterogeneity of HCC patients. The tissue samples of 80 HCC patients in the training set in Hoshida's study were all from a single center in Japan with high hepatitis virus infection rate (HBV: 23.75%, HCV:72.5%), while the  Table 2. Univariate and multivariate cox regression analysis of the survival-related genes. Variables with p < 0.1 in univariate analysis were included in the multivariate analysis using a forward conditional method. Gene symbol is in front of the dot, and the numbers behind the dot is the Entrez Gene ID. Abbreviations: CI, confidence interval; HR, hazard ratio. Statistical significance is marked with the star symbol: **p < 0.01, ***p < 0.001, ****p < 0.0001. patients in TCGA LIHC dataset were from multicenter consisting of different races including Asian, White and Black with low viral hepatitis rate (Supplemental Table s2). Another point which is easy to be neglected but could have an impact is the tissue procurement procedures. The intra-and post-operative cold and warm ischemia were reported to impact the RNA 27 , protein 28 and metabolites 29 of clinical tissue specimens. The tissue procurement  The Kruskal-Wallis test was performed. UPB1 and SOCS2 were found to be associated with pathologic stage, while RTN3 was not.
procedures can have a significant impact on the quality of specimens and might influence the reproducibility of analytical data sets for biomarker. Because we chose death to be the clinical outcome of our study, whether the death of patients was caused by HCC would influence the accuracy of our results. The role of RTN3 in Alzheimer's disease (AD) has been widely investigated. Transgenic mice overexpressing RTN3 developed neuritic abnormalities 30,31 . Additionally, RTN3 is reported to interact directly with a viral nonstructural protein and is important for viral replication 32 . However, RTN3 was found to limit the replication of HCV 33 . It has been observed that RTN3 was overexpressed in astrocytoma compared to non-cancerous tissue 34 . It's reported that RTN3 could interact with the oncogene Ras at endoplasmic reticulum 35 . Further work needs to be done to assess the role of RTN3 in tumorigenesis especially in HCC.
SOCS2 was reported to negatively regulate the growth hormone signaling in vitro and in vivo 36 and was identified as an important regulator of hepatic homeostasis in response to high-fat dietary stress 37 . In addition, SOCS2 was a crucial intracellular mediator of the anti-inflammatory actions. SOCS2 deficient mice had uncontrolled production of proinflammatory cytokines upon infection with an intracellular pathogen 38 . Livers from high-fat-diet-fed SOCS2 −/− mice showed increased Nuclear Factor κB (NF-κB) activity as well as elevated expression of genes for the inflammatory cytokines Interferon (IFN)-γ and Interleukin (IL)-6 37 . Over the last years, SOCS2 has emerged as potential tumor suppressor-like gene in several tumors. Previous studied have revealed that SOCS2 was downregulated in ovarian, breast carcinomas, pulmonary adenocarcinoma and HCC [39][40][41][42] . Consistent with our study, it has been suggested that SOCS2 was an independent predictor for survival of patients with breast cancer 40 , colorectal cancer 43 , and HCC 42 , and patients with low expression of SOCS2 displayed shorter survival time. SOCS2 deficiency promotes spontaneous intestinal tumorigenesis in a specific mice model driven by mutations in the APC/beta-catenin pathway 44 . Although conflicting results were reported that SOCS2 expression was upregulated in prostate cancer 45,46 , these findings provide insights into the role of SOCS2 in tumor suppression and highlight the potential usefulness of SOCS2 as a prognostic prediction biomarker.
UPB1 catalyzes the last step in the pyrimidine degradation pathway. Thus far, few studies have explored the relationship between UPB1 and cancer. In a study aimed to discovery novel circulation biomarkers for HCC by integrating various bioinformatics tools, they identified a seven-protein predictive signature including UPB1 47 . Further experimental investigations are needed to estimate the function of UPB1 in tumors.
The initial aim of our study was to seek out several prognostic genes from the RNA-seq data, and validate the expression pattern and the prognostic value in protein level and then establish a predictive immunohistochemical staining signature and finally apply to clinical research and practice because of the easy accessibility of immunohistochemistry to patients. Though we validated the expression pattern of UPB1, SOCS2 and RTN3 by western blot and immunohistochemistry, we didn't associate the expression of these genes to the survival of patients, because the subsequent validation process is difficult and costly. We acquired three survival-related genes and it was reported that these three genes played important roles in tumors. However, further research should be undertaken to investigate the function of UPB1, SOCS2 and RTN3 in HCC. The three-gene signature is promising, but the patient cohorts from both TCGA and GEO were retrospective, and prospective cohorts were needed to validate this signature. Although some limitations remain in our study, we have performed the first whole mRNA transcriptome survival analysis in HCC to search for predictive biomarkers, and we think our method can be extended to other cancer type.
In conclusion, we identified a three-gene predictive signature comprised of UPB1, SOCS2 and RTN3 for hepatocellular carcinoma and validated these three genes expression pattern in HCC tissues. This study may contribute to provide significant clinical implications for the prognosis prediction of HCC patients. However, what remains unclear is how these genes impact survival and it is beyond the scope of this study. Further investigation and experimentation are needed to elucidate the biological mechanisms of UPB1, SOCS2 and RTN3 in HCC development and progression.

Materials and Methods
Experimental protocol approval. All experimental protocols were approved by the research ethics committee of Drum Tower hospital and research ethics approval for this project was granted from the same institution. Utilization of data was conducted in accordance with TCGA and GEO data access policies. All analyses were performed in accordance with relevant guidelines and regulations.
Data sources and processing. TCGA LIHC level 3 RNAseqv2 data (RSEM_genes_normalized file) and clinical datasheet were obtained from the TCGA data portal (up to Jan 28, 2016). 371 patients containing 49 patients with paired normal liver tissues and 20531 genes were included. 10 patients whose pathological diagnoses were not hepatocellular carcinoma were ruled out. The gene abundance was filtered by requiring the RSEM normalized count to be ≥3 present in no less than 60% samples of the 49-paired tumor or normal tissues. The log2 transformation of RSEM was adopted in all plots. Two independent expression microarray HCC data sets (GSE14520 and GSE54236) were downloaded from the Gene Expression Omnibus (GEO) database (http://www. ncbi.nlm.nih.gov/geo/), in which outcome and survival time of the patients were incorporated. The survival time of HCC patients in microarrays was downloaded from ArrayExpress (http://www.ebi.ac.uk/arrayexpress/). GSE14520 (based on GPL3921 platform) and GSE54236 (based on GPL6480) enrolled 209 and 80 HCC patients with survival time, respectively. Differential gene expression analysis. The differential expression gene (DEG) analysis was performed on 49 paired HCC samples. The Shapiro-Wilk test was applied to determine whether the difference of normal/ tumor RSEM followed a normal distribution. The paired t-test was applied to normally distributed data otherwise the Wilcoxon rank test for paired data was applied to assess whether a gene was differentially expressed between tumor and normal samples. Statistical significance was determined using p < 0.05.
Scientific RepoRts | 7: 5517 | DOI:10.1038/s41598-017-04811-5 Survival analysis. For survival analysis, we used the Kaplan-Meier method to analyze the correlation between overall survival time and gene expression. The statistical significances of overall survival were determined using the Log-Rank test. Survival analysis was performed in R (version 3.3.1) and the survival curve was generated by R (version 3.3.1). The median was used as a cutoff value for classification of patients into high and low expression groups. Likewise, we use two tertiles to split patients into three groups with each gene expression, trisected patients into low expression group, middle expression group, and high expression group. We used R software version 3.3.1 and the "survivalROC" package to do the time-dependent ROC curve analysis.
Signature development. Prognostic index model was created using the three-gene signature based on multivariate Cox survival analysis. This algorithm is based on an importance score assigned to each gene. Using the Cox regression models, we calculated a risk score for each patient based on their individual expression levels of the three genes, where prognostic index (PI) = (0.384 × relative expression value of RTN3) − (0.561 × relative expression value of SOCS2) − (0.434 × relative expression value of UPB1). The median expression was used as a cutoff value, and the relative expression value was defined as follows: score 1, <median; score 2, >median.
Human tissue samples. Liver tissues were collected from patients undergoing hepatectomy with confirmed diagnosis of HCC at Department of Hepatobiliary Surgery of the Affliated Drum Tower Hospital of Nanjing University Medical School. Written informed consent was obtained from all patients or their guardians for the use of the biospecimens for research purposes. The samples were frozen in liquid nitrogen immediately after surgical resection and stored at −80 °C until further analysis.
Statistics. Univariate and multivariate survival analysis were performed using the Cox proportional hazards regression model. Only variables with p < 0.1 on univariate analysis were incorporated into the multivariate Cox regression analysis. The paired t-test or Wilcoxon signed-rank test were used for paired samples to make a statistical comparison between groups. The Kruskal-Wallis test and Mann-Whitney-Wilcoxon test were used for ranked data as appropriate. All testing was carried out using Prism 6.0 (GraphPad, San Diego, USA) or SPSS (version 23.0; Chicago, USA) or R (version 3.3.1, Auckland, NZ). The two-sided p value less than 0.05 was defined as statistically significant for all statistical analyses. The data were plotted as mean ± standard deviation (SD).