A highly expressed mRNA signature for predicting survival in patients with stage I/II non-small-cell lung cancer after operation

There is an urgent need to identify novel biomarkers that predict the prognosis of patients with NSCLC. In this study,we aim to find out mRNA signature closely related to the prognosis of NSCLC by new algorithm of bioinformatics. Identification of highly expressed mRNA in stage I/II patients with NSCLC was performed with the “Limma” package of R software. Survival analysis of patients with different mRNA expression levels was subsequently calculated by Cox regression analysis, and a multi-RNA signature was obtained by using the training set. Kaplan–Meier estimator, log-rank test and receiver operating characteristic (ROC) curves were used to analyse the predictive ability of the multi-RNA signature. RT-PCR used to verify the expression of the multi-RNA signature, and Westernblot used to verify the expression of proteins related to the multi-RNA signature. We identified fifteen survival-related mRNAs in the training set and classified the patients as high risk or low risk. NSCLC patients with low risk scores had longer disease-free survival than patients with high risk scores. The fifteen-mRNA signature was an independent prognostic factor, as shown by the ROC curve. ROC curve also showed that the combined model of the fifteen-mRNA signature and tumour stage had higher precision than stage alone. The expression of fifteen mRNAs and related proteins were higher in stage II NSCLC than in stage I NSCLC. Multi-gene expression profiles provide a moderate prognostic tool for NSCLC patients with stage I/II disease.


Results
Identify survival-related mRNA in the training set. The fifteen-mRNA signature of highly expressed mRNAs associated with survival was developed and validated as shown in Fig. 1. We identified highly expressed mRNAs from GSE31210 by using microarray data. Differentially expressed mRNAs were selected by volcano plot filtering (fold change ≥ 1 and P-value ≤ 0.05, Fig. 2). The relationship between RFS, survival state and high expression genes in NSCLC patients was performed using univariate Cox regression analysis in the training set, "risk score" of highly expressed genes in NSCLC prognosis was calculated. The higher the risk score, the greater the correlation between mRNA and RFS. According to the results, the fifteen-mRNA signature was significantly associated with RFS (n = 226, GSE31210; Table 1). UBE2F, TMSB10 and GAPDH were negative coefficients, indicating that patients with higher levels of expression had better outcomes than patients with lower levels of expression. Twelve mRNAs (UBC, TUBA1B, PPIA, PML, PKM, MESDC2, LDHA, HMOX1, FGFR1OP, CFB, ALDOA and ADAM10) were considered to be positive coefficients, so high levels of these mRNAs were associated with worse outcomes. Heatmap visualized distributions of fifteen-mRNA and risk scores in the training set and two independent GEO cohorts (Fig. 3).

Survival analyses between low-risk and high-risk groups. On the basis of the expression of mRNAs
and their regression coefficients in the multivariate Cox model, we determined individual patient risk scores according to the fifteen-mRNA in the training set, internal validation set and external validation set. As the median value was used as the cutoff value, NSCLC patients in each set were classified into the low-risk group or the high-risk group. Figure 4 shows the distributions of risk score and RFS status in each set, it demonstrated NSCLC patients who had high risk scores had a higher risk of relapse after surgery. The clinical characteristics of low-risk group and high-   . The red points in the volcano plot represents the statistically significant highly expressed mRNAs, and the green points represent mRNAs with significantly low expression.  (Fig. 6). As the figure shows, the AUC of the fifteen-mRNA signature was higher than that of stage alone in the training (P < 0.05) and internal validation sets (P < 0.05). The combined model's (fifteen-mRNA signature and tumour stage) AUC was higher than that of stage alone in each set (P < 0.05). It suggests that fifteen-mRNA signatures have good predictive power to the prognosis of NSCLC patients.    As the result denmonstrated lower scores were associated with longer RFS, and higher scores were associated with shorter RFS in each set (P < 0.05). Table 3. RFS analyses by the log-rank test according to the fifteen-mRNA signature in each set. Our result shown lower scores were associated with longer RFS, and higher scores were associated with shorter RFS in each set (P < 0.05).

Datasets Risk group (n)
Recurrence-free survival Protein-protein interaction analysis and mRNA expression validation. STRING online software was used to analyse the interaction between proteins encoded by the fifteen mRNAs (Fig. 10A), and key genes were analysed according to the number of nodes using R software (version 3.5.1). Nodes were mainly interrelated with GAPDH and UBC, so these two proteins were speculated to be the key proteins in this protein-protein interaction network (Fig. 10B). GEPIA online software was used to verify the expression of the fifteen mRNAs in stage I and II patients with lung adenocarcinoma and lung squamous cell carcinoma. The expression of ALDOA, CFB, GAPDH, LDHA, MESDC2, PPIA, TMSB10, TUBA1B, and UBE2F was higher in stage II patients than in stage I patients (Fig. 11, P < 0.05).
The fifteen-mRNA mRNA expression in patients with NSCLC. We used PCR to verify the expression of the fifteen-mRNA in lung cancer tissues of NSCLC patients, and the results showed that the fifteen-mRNA was significantly higher in stage II NSCLC than in stage I (Fig. 12, P < 0.05).
The proteins related to fifteen-mRNA mRNA expression in patients with NSCLC. We used westernblot to verify the expression of proteins related to the fifteen-mRNA in lung cancer tissues of NSCLC patients, and the results showed that these proteins was significantly higher in stage II NSCLC than in stage I (Fig. 13, P < 0.05).

Discussion
In general, the TNM system is a widely used staging system among clinicians 19,20 , and TNM staging is essential for evaluating outcomes in clinical practice and for providing some indication of prognosis for survival 21 .
Unfortunately, current methods of classification and staging for NSCLC are not completely reliable or sufficiently precise [22][23][24] . The progression and prognosis of tumours are related to the high expression of some genes 25,26 . The aim of this study was to characterize tumour recurrence and analyse genes related to the increased risk of recurrence in NSCLC. Bioinformatics analysis is currently considered to be an important tool for identifying tumour biomarkers. We profiled NSCLC mRNA by analysing the microarray data of the Affymetrix human genome U133 plus 2.0 array downloaded from GEO. High mRNA expression in stage I/II NSCLC was determined by   www.nature.com/scientificreports/ the "Limma" package in the training set (GSE31210). Univariate Cox proportional hazards regression was used to analyse relationship between high expression genes and patient's survival time and prognosis in NSCLC, the "risk score" of highly expressed genes in NSCLC prognosis was calculated, In this algorithm, the high risk socre of mRNA related with the poor prognosis of NSCLC. We selected 15 mRNAs with the highest risk score through this algorithm, and peculate that these 15 mRNAs are closely related to the prognosis of NSCLC. In order to study the predictive ability of these 15 mRNAs, we verify them in the training set and two independent GEO cohorts (GSE30219 and GSE50081), and the mRNA signature showed prognostic significance in three cohorts. Many factors, such as sex, age, and stage, are thought to be possible pathogenesis of NSCLC cancer. We analysed NSCLC patient RFS by multivariate Cox regression, and our results showed that the mRNA signature was associated with patient RFS. The mRNA signature performed better than stage alone, and the combined use of RNA and stage performed the best. The combined set, considering the mRNA signature and stage, was significantly associated with patient RFS; in the same stage, our mRNA signature was still significantly associated with patient RFS, and patients with low risk scores had significantly longer RFS. The fifteen-mRNA classifier has a very high HR and a very broad CI in the training set compared with the two other sets, we speculate it may be the instability of the training set. Furthermore, the ROC curve shownd that AUC of the fifteen-mRNA signature was higher than that of stage alone in the training and internal validation sets. The combined model's (fifteen-mRNA signature and tumour stage) AUC was higher than that of stage alone in each set. Results of ROC curve suggests these fifteen-mRNA signatures as an independent prognostic factor in NSCLC. Finally, our bioinformatics analysis results shown our fifteen-mRNA signature is a novel biomarker with useful applications in predicting NSCLC prognosis.
To determine the biological relationship and signalling pathways among the fifteen mRNAs in the signature, we performed GO and KEGG analyses. Functional categories of the fifteen mRNAs were mainly involved in three GO terms, including the carboxylic acid biosynthetic process (GO: 0,046,394), coenzyme metabolic process (GO: 0,006,732), and purine ribonucleoside triphosphate metabolic process (GO: 0,009,205). All three pathways are considered to be closely related to tumours [27][28][29][30] . The main KEGG pathways involved included glycolysis/ gluconeogenesis (KEGG: 00010) and the HIF-1 signalling pathway (KEGG: 04,066). Glycolysis is a universal Table 6. Predictive value of the fifteen-mRNA signature, sex, age, stage, and survival in the combined set (n = 407, GSE30219 and GSE31210) analysed by multivariate Cox regression. The results showed a significant correlation between our mRNA signature and RFS (HR = 2.30743, 95% CI = 1.7407-3.059, P < 0.001).

Set Risk group (n)
Recurrence-free survival

HR (95% CI) P-value
The combined set (n = 407)  www.nature.com/scientificreports/ pathway in living cells, and the glycolysis rate is 200 times higher in tumour cells than in normal cells 31 . Previous studies have shown that inhibition of HIF-1 represents a novel approach to cancer therapy 32,33 . We analysed the protein-protein interactions between proteins encoded by fifteen mRNAs. GAPDH and UBC were speculated to be the key proteins in this protein-protein interaction network according to their nodes, it suggest these two mRNA may play the key role of 15 mRNA. The expression of the fifteen mRNAs was validated by GEPIA online software, and the expression levels of ALDOA, CFB, GAPDH, LDHA, MESDC2, PPIA, TMSB10, TUBA1B, and UBE2F were higher in stage II patients than in stage I patients with lung adenocarcinoma and lung squamous cell carcinoma, which was consistent with our previous results in the gene sets. Furthermore, we verified the expression of mRNA in NSCLC tumor tissues by RT-PCR and confirmed the expression of 15 mRNA related proteins by Westernblot. Our results showed that the expression of 15 mRNA genes was higher in stage II NSCLC than in stage I NSCLC, and the expression of 15 mRNA gene related proteins also showed the same situation, that is, in stage II is higher than in stage I. The fifteen-mRNA signature included twelve risky genes (UBC, TUBA1B, PPIA, PML, PKM, MESDC2, LDHA, HMOX1, FGFR1OP, CFB, ALDOA and ADAM10) and three protective genes (UBE2F, TMSB10 and GAPDH). Previous research showed that high The innovation of this research is identified mRNAs significantly related to RFS of NSCLC with a risk score via univariate Cox analysis, these 15 mRNAs have shown good predictive ability in the training set, internal validation set and external validation set. However, there were several limitations to our study. For example, further experiments are required to verify the clinical value of the signature. Limited by the clinical information of GEO data sets, we cannot identify the resection status of patients with NSCLC. Additionally, our experimental sample size is small, larger clinical trials may lead to more convincing results.
Our findings demonstrate a multiple-mRNA signature closely relate with tumour prognosis in stage I/II patients with NSCLC. It may aid in the development of novel biomarkers of NSCLC and offer new insights into NSCLC prognosis and may provide a new method for analyzing NSCLC based on Cox analysis.

Data of NSCLC.
Raw microarray data from all data sets were analysed using the Affymetrix human genome U133 plus 2.0 array (GSE31210, GSE30219 and GSE50081), the mRNA expression data were log2 transformed before statistical analysis, and the median value was used when multiple probes existed for a single target. There was a total of 627 stage I/II patients with NSCLC after excluding patients without Recurrence Free Survival(RFS) or clinical data, including 226 from GSE31210, 226 from GSE30219 and 181 from GSE50081. The 226 patients from GSE31210 were used as a training set, 226 patients from GSE30219 were used as an internal validation set, and 181