Introduction

Non-small-cell lung cancer (NSCLC) is a disease with high morbidity and mortality rates, accounting for approximately 85% of lung cancer cases1,2. Although surgery plays a pivotal role in treating NSCLC, 5-year survival rates after surgical resection are commonly accepted to be 60% to 80% for stage I and 30% to 50% for stage II3. The prognosis is more favourable in localized or limited advanced stages. The risk of recurrence peaks within the first 2 years after the operation4. Most postoperative recurrences are found during routine follow-up when patients are asymptomatic. Hence, there is an urgent need for effective NSCLC prognostic biomarkers to help optimize clinical management and ultimately improve clinical outcomes.

Gene expression can be used as a surrogate measurement of cancer disease phenotype5,6. Multiple gene signatures have been identified using bioinformatics technology and are considered to be closely associated with the prognosis of NSCLC7,8,9. High expression of some genes is closely related to cancer progression and can be used to determine patient prognosis10,11,12. As such, numerous highly expressed genes with various inherent and acquired genetic alterations have been shown to influence NSCLC prognosis1,7,8,13,14. Screening tumor markers with bioinformatics technology is an active area of current research, so the aim of this study was to use a new algorithm to screen for highly expressed mRNAs that may have significant prognostic value for recurrence in patients with NSCLC.

Microarray technology and bioinformatic analysis have been increasingly regarded as useful methods to identify biomarkers as diagnostic and prognostic tools15,16. With the help of gene expression databases such as Gene Expression Omnibus (GEO), it is easy to obtain abundant expression data for NSCLC, and these resources have improved our ability to analyse NSCLC at the genetic level. Data on individual patients' mRNA profiles and clinical information can be obtained from Affymetrix human genome U133 plus 2.0 array17,18 and GEO data sets. After screening the mRNA data, we found that the data for stage I and II patients were complete and suitable for bioinformatics analysis. We therefore performed this research to find a multi-RNA prognostic signature of highly expressed mRNAs for predicting relapse in stage I/II patients with NSCLC after surgery by analysing the GEO data.

Results

Identification of survival-related mRNAs in the training set

The fifteen-mRNA signature of highly expressed mRNAs associated with survival was developed and validated as shown in Fig. 1. We identified highly expressed mRNAs from the GSE31210 microarray data. Differentially expressed mRNAs were selected by volcano plot filtering (fold change ≥ 1 and P-value ≤ 0.05, Fig. 2). The relationship between RFS, survival state and highly expressed genes in NSCLC patients was analysed using univariate Cox regression in the training set, and a "risk score" was calculated for each highly expressed gene. The higher the risk score, the stronger the association between the mRNA and RFS. According to the results, the fifteen-mRNA signature was significantly associated with RFS (n = 226, GSE31210; Table 1). UBE2F, TMSB10 and GAPDH had negative coefficients, indicating that patients with higher levels of expression had better outcomes than patients with lower levels of expression. Twelve mRNAs (UBC, TUBA1B, PPIA, PML, PKM, MESDC2, LDHA, HMOX1, FGFR1OP, CFB, ALDOA and ADAM10) had positive coefficients, so high levels of these mRNAs were associated with worse outcomes. A heatmap was used to visualize the distributions of the fifteen mRNAs and risk scores in the training set and two independent GEO cohorts (Fig. 3).
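As an illustration of the volcano-plot filter described above, the following Python sketch labels each mRNA as highly expressed, lowly expressed or not significant. It assumes that the fold-change threshold is applied on the log2 scale and mirrored for low expression, which the text does not state explicitly; the function name, gene names and numbers are hypothetical and are not data from this study.

```python
def classify_volcano(records, lfc_cutoff=1.0, p_cutoff=0.05):
    """Label each gene 'up', 'down' or 'ns' (not significant).

    records: iterable of (gene, log2_fold_change, p_value) tuples.
    Assumption: the fold-change cutoff is on the log2 scale and symmetric.
    """
    labels = {}
    for gene, log2_fc, p_value in records:
        if p_value <= p_cutoff and log2_fc >= lfc_cutoff:
            labels[gene] = "up"      # red points in a volcano plot
        elif p_value <= p_cutoff and log2_fc <= -lfc_cutoff:
            labels[gene] = "down"    # green points in a volcano plot
        else:
            labels[gene] = "ns"
    return labels


# Hypothetical values for illustration only.
toy_records = [
    ("GENE_A", 1.8, 0.001),   # large change, significant
    ("GENE_B", -1.3, 0.020),  # large decrease, significant
    ("GENE_C", 2.1, 0.200),   # large change, not significant
    ("GENE_D", 0.4, 0.001),   # significant, but change too small
]
volcano_labels = classify_volcano(toy_records)
```

Only the mRNAs labelled "up" would be carried forward to the univariate Cox regression under this reading of the filter.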

Figure 1

Development and validation of the fifteen-mRNA signature shown as study flow.

Figure 2

Volcano plot of mRNA expression in the training set (n = 226, GSE31210). The red points in the volcano plot represent the statistically significant highly expressed mRNAs, and the green points represent mRNAs with significantly low expression.

Table 1 The characteristics of the fifteen-mRNA signature related to RFS in the training set (n = 226, GSE31210).
Figure 3

Heatmap of the fifteen-mRNA signature and risk scores in the training set (n = 226, GSE31210). The “pheatmap” package of R software (version 3.5.1) was used to generate the heatmap.

Survival analyses between low-risk and high-risk groups

On the basis of the expression of the mRNAs and their regression coefficients in the multivariate Cox model, we determined individual patient risk scores according to the fifteen-mRNA signature in the training set, internal validation set and external validation set. Using the median value as the cutoff, NSCLC patients in each set were classified into the low-risk group or the high-risk group. Figure 4 shows the distributions of risk score and RFS status in each set; NSCLC patients who had high risk scores had a higher risk of relapse after surgery. The clinical characteristics of low-risk and high-risk patients in these three sets are shown in Table 2. The clinical characteristics (age, sex, stage) did not differ significantly between the low-risk group and the high-risk group. RFS analyses were performed by log-rank test to determine the differences between high-risk and low-risk groups in these three sets (Fig. 5 and Table 3); lower scores were associated with longer RFS, and higher scores were associated with shorter RFS in each set (P < 0.05). These results suggest that the fifteen-mRNA signature can distinguish NSCLC patients with different prognoses and can be used in subsequent studies.
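The risk-score construction described above can be sketched as a weighted sum of expression values followed by a median split. The Python example below is a minimal illustration, assuming the score is the sum of each Cox coefficient multiplied by the corresponding log2 expression value; the coefficients, gene names and patient values are hypothetical and are not taken from Table 1.

```python
from statistics import median


def risk_score(expression, coefficients):
    """Weighted sum of expression values using Cox regression coefficients."""
    return sum(coefficients[gene] * expression[gene] for gene in coefficients)


def split_by_cutoff(scores, cutoff):
    """Assign 'high' to scores above the cutoff and 'low' otherwise."""
    return {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}


# Hypothetical coefficients: positive = risky gene, negative = protective gene.
toy_coefficients = {"GENE_UP": 0.5, "GENE_DOWN": -0.3}

# Hypothetical log2 expression values for four patients.
toy_patients = {
    "P1": {"GENE_UP": 8.0, "GENE_DOWN": 6.0},
    "P2": {"GENE_UP": 6.0, "GENE_DOWN": 8.0},
    "P3": {"GENE_UP": 7.0, "GENE_DOWN": 7.0},
    "P4": {"GENE_UP": 9.0, "GENE_DOWN": 5.0},
}

scores = {pid: risk_score(expr, toy_coefficients) for pid, expr in toy_patients.items()}
cutoff = median(scores.values())        # median of the (toy) training scores
groups = split_by_cutoff(scores, cutoff)
```

According to the Methods, the cutoff is the median of the training-set scores, so in practice the same `cutoff` value would be reused when classifying patients in the validation sets rather than being recomputed for each cohort.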

Figure 4

The distributions of RFS status and risk score in the training (n = 226, GSE31210), internal validation (n = 226, GSE30219) and external validation (n = 181, GSE50081) sets. The results showed that patients with recurrent NSCLC had a high risk score. Abbreviation: RFS, recurrence-free survival.

Table 2 Clinical characteristics of NSCLC patients on the basis of the fifteen-mRNA signature in the training (n = 226, GSE31210), internal validation (n = 226, GSE30219) and external validation (n = 181, GSE50081) sets.
Figure 5

Kaplan–Meier curves of recurrence-free survival (RFS) on the basis of the fifteen-mRNA signature in the training, internal validation, and external validation sets. Lower scores were associated with longer RFS, and higher scores were associated with shorter RFS in each set (P < 0.05).

Table 3 RFS analyses by the log-rank test according to the fifteen-mRNA signature in each set. Lower scores were associated with longer RFS, and higher scores were associated with shorter RFS in each set (P < 0.05).

Multivariate Cox regression analysis of the fifteen-mRNA signature and clinical information in each set

The relationship between the fifteen-mRNA signature, clinical information (sex, age, stage) and RFS in each set was analysed by multivariate Cox regression analysis (Table 4). The fifteen-mRNA signature, like the clinical characteristics, was significantly related to RFS in these datasets of NSCLC patients (all P < 0.05).

Table 4 The relationship of the fifteen-mRNA signature, clinical information (sex, age, stage) and RFS in each set analysed by multivariate Cox regression.

ROC analysis of the fifteen-mRNA signature and stage in each set

The area under the curve (AUC) of the ROC curve was used to compare the ability of the fifteen-mRNA signature and stage to predict RFS in each set (Fig. 6). As the figure shows, the AUC of the fifteen-mRNA signature was higher than that of stage alone in the training (P < 0.05) and internal validation sets (P < 0.05). The combined model’s (fifteen-mRNA signature and tumour stage) AUC was higher than that of stage alone in each set (P < 0.05). This suggests that the fifteen-mRNA signature has good predictive power for the prognosis of NSCLC patients.
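To make the ROC comparison concrete, the sketch below computes an AUC from risk scores after applying the 5-year labelling rule given in the Methods (patients without relapse and with less than 60 months of RFS are excluded; the rest are split at 60 months). It is a simplified Python illustration using the rank-based (Mann–Whitney) form of the AUC, not the “pROC” implementation; the function names and patient values are hypothetical.

```python
def five_year_label(rfs_months, relapsed, horizon=60.0):
    """Return 1 (RFS shorter than horizon), 0 (RFS of at least horizon),
    or None when the patient is excluded (no relapse, follow-up too short)."""
    if rfs_months < horizon:
        return 1 if relapsed else None
    return 0


def auc(scores, labels):
    """Probability that a random positive scores higher than a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are needed to compute an AUC")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical patients: (risk score, RFS in months, relapsed at last follow-up).
toy_cohort = [
    (2.5, 14, True),
    (1.9, 30, True),
    (0.7, 72, False),
    (1.2, 65, False),
    (2.1, 80, True),   # relapse after 5 years counts as the negative class
    (1.5, 24, False),  # censored before 5 years, excluded
]

labelled = [(s, five_year_label(m, r)) for s, m, r in toy_cohort]
kept = [(s, y) for s, y in labelled if y is not None]
toy_auc = auc([s for s, _ in kept], [y for _, y in kept])
```

The same `auc` function could be applied to stage alone or to a combined score, which is how the three curves in Fig. 6 would be compared under this simplified scheme; testing whether two AUCs differ significantly requires an additional procedure that is not shown here.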

Figure 6

ROC curves of the combined model (fifteen-mRNA signature and stage), the fifteen-mRNA signature alone and stage alone for each set. The AUC of the fifteen-mRNA signature was higher than that of stage alone in the training (P < 0.05) and internal validation sets (P < 0.05). The combined model’s (fifteen-mRNA signature and tumour stage) AUC was higher than that of stage alone in each set (P < 0.05). Abbreviation: AUC, area under the curve.

Comparison of RFS in the combined set (training set and internal validation set)

To further verify the efficacy of this fifteen-mRNA signature, we merged the training set and internal validation set into a combined set (n = 407, GSE31210 and GSE30219) and compared the RFS of the low-risk group (n = 204) and high-risk group (n = 203). The results show that the RFS of the high-risk group was significantly shorter than that of the low-risk group (P < 0.001, Table 5). Clinical information (sex, age, stage) and the mRNA signature in the combined set were then analysed by multivariate Cox regression, which showed a significant correlation between our mRNA signature and RFS (HR = 2.30743, 95% CI = 1.7407–3.059, P < 0.001; Table 6). The Kaplan–Meier curve further showed that the RFS of the high-risk group was significantly shorter than that of the low-risk group (P < 0.001, Fig. 7). The results in the combined set are consistent with our previous analysis.
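The hazard ratio and confidence interval reported above can be checked for internal consistency. The Python sketch below assumes the interval is a standard Wald interval on the log-hazard scale, which is a common default but is not stated in the text; under that assumption it recovers the standard error, the Wald z statistic and a two-sided P-value from the reported numbers.

```python
import math

# Values reported for the combined set (Table 6).
hr, ci_low, ci_high = 2.30743, 1.7407, 3.059

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval

beta = math.log(hr)  # Cox regression coefficient
# Assumption: CI = exp(beta +/- Z_95 * se), i.e. a Wald interval.
se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * Z_95)
log_midpoint = (math.log(ci_low) + math.log(ci_high)) / 2.0

z_stat = beta / se
p_value = math.erfc(abs(z_stat) / math.sqrt(2.0))  # two-sided normal P-value
```

Under the Wald assumption the midpoint of the log-scale interval should coincide with the coefficient, and the resulting P-value should agree with the reported P < 0.001.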

Table 5 Comparison of RFS in patients with NSCLC based on the fifteen-mRNA signature in the combined training set and internal validation set (n = 407, GSE31210 and GSE30219).
Table 6 Predictive value of the fifteen-mRNA signature, sex, age, stage, and survival in the combined set (n = 407, GSE30219 and GSE31210) analysed by multivariate Cox regression.
Figure 7

Kaplan–Meier curve analysis of RFS according to the fifteen-mRNA signature for stage I/II patients in the combined training set and internal validation set. The Kaplan–Meier curve further showed that the RFS of the high-risk group was significantly shorter than that of the low-risk group (P < 0.001).

GO and KEGG functional enrichment analysis

We used GO and KEGG enrichment to identify the biological functions and signalling pathways of the fifteen-mRNA signature. The results showed that the fifteen-mRNA signature was significantly associated with 94 GO terms (Fig. 8) and 20 KEGG pathways (Fig. 9). The GO terms mainly fit into three functional categories: carboxylic acid biosynthetic process (GO:0046394), coenzyme metabolic process (GO:0006732), and purine ribonucleoside triphosphate metabolic process (GO:0009205). Glycolysis/gluconeogenesis (KEGG: 00010) and HIF-1 signalling pathway (KEGG: 04066) were the main KEGG pathways involved.

Figure 8

Functional enrichment analysis by GO category (BP: biological process; CC: cell component; MF: molecular function). The GO terms mainly fit into three functional categories: carboxylic acid biosynthetic process (GO:0046394), coenzyme metabolic process (GO:0006732), and purine ribonucleoside triphosphate metabolic process (GO:0009205).

Figure 9

Calculated results of KEGG functional enrichment. Glycolysis/gluconeogenesis (KEGG: 00010) and HIF-1 signalling pathway (KEGG: 04066) were the main KEGG pathways involved.

Protein–protein interaction analysis and mRNA expression validation

STRING online software was used to analyse the interaction between proteins encoded by the fifteen mRNAs (Fig. 10A), and key genes were analysed according to the number of nodes using R software (version 3.5.1). Nodes were mainly interrelated with GAPDH and UBC, so these two proteins were speculated to be the key proteins in this protein–protein interaction network (Fig. 10B). GEPIA online software was used to verify the expression of the fifteen mRNAs in stage I and II patients with lung adenocarcinoma and lung squamous cell carcinoma. The expression of ALDOA, CFB, GAPDH, LDHA, MESDC2, PPIA, TMSB10, TUBA1B, and UBE2F was higher in stage II patients than in stage I patients (Fig. 11, P < 0.05).

Figure 10

Protein–protein interaction network (A) and nodes (B) of proteins encoded by the fifteen mRNAs in the signature.

Figure 11

GEPIA online software verified the expression of the fifteen-mRNA signature in stage I/II patients with lung adenocarcinoma and lung squamous cell carcinoma.

Expression of the fifteen mRNAs in patients with NSCLC

We used PCR to verify the expression of the fifteen mRNAs in lung cancer tissues of NSCLC patients, and the results showed that the expression of the fifteen mRNAs was significantly higher in stage II NSCLC than in stage I (Fig. 12, P < 0.05).

Figure 12

The expression of the 15 mRNAs detected by RT-PCR. The expression of the 15 mRNAs was significantly higher in stage II NSCLC than in stage I (P < 0.05).

Expression of the proteins related to the fifteen mRNAs in patients with NSCLC

We used Western blot to verify the expression of the proteins related to the fifteen mRNAs in lung cancer tissues of NSCLC patients, and the results showed that the expression of these proteins was significantly higher in stage II NSCLC than in stage I (Fig. 13, P < 0.05).

Figure 13

The expression of the proteins related to the 15 mRNAs in tumor tissue detected by Western blot. The expression of these proteins was significantly higher in stage II NSCLC than in stage I (P < 0.05).

Discussion

In general, the TNM system is a widely used staging system among clinicians19,20, and TNM staging is essential for evaluating outcomes in clinical practice and for providing some indication of prognosis for survival21. Unfortunately, current methods of classification and staging for NSCLC are not completely reliable or sufficiently precise22,23,24. The progression and prognosis of tumours are related to the high expression of some genes25,26. The aim of this study was to characterize tumour recurrence and analyse genes related to the increased risk of recurrence in NSCLC. Bioinformatics analysis is currently considered to be an important tool for identifying tumour biomarkers. We profiled NSCLC mRNA by analysing the microarray data of the Affymetrix human genome U133 plus 2.0 array downloaded from GEO. High mRNA expression in stage I/II NSCLC was determined by the “Limma” package in the training set (GSE31210). Univariate Cox proportional hazards regression was used to analyse the relationship between highly expressed genes and patients' survival time and prognosis in NSCLC, and the "risk score" of highly expressed genes in NSCLC prognosis was calculated. In this algorithm, a high mRNA risk score is related to a poor prognosis of NSCLC. We selected the 15 mRNAs with the highest risk scores through this algorithm and speculated that these 15 mRNAs are closely related to the prognosis of NSCLC. To study the predictive ability of these 15 mRNAs, we verified them in the training set and two independent GEO cohorts (GSE30219 and GSE50081), and the mRNA signature showed prognostic significance in all three cohorts. Many factors, such as sex, age, and stage, are thought to be possible contributors to the pathogenesis of NSCLC. We analysed NSCLC patient RFS by multivariate Cox regression, and our results showed that the mRNA signature was associated with patient RFS. The mRNA signature performed better than stage alone, and the combined use of the mRNA signature and stage performed the best.

In the combined set, considering both the mRNA signature and stage, the signature was significantly associated with patient RFS; within the same stage, our mRNA signature was still significantly associated with patient RFS, and patients with low risk scores had significantly longer RFS. The fifteen-mRNA classifier has a very high HR and a very broad CI in the training set compared with the two other sets; we speculate that this may reflect instability of the training set. Furthermore, the ROC curves showed that the AUC of the fifteen-mRNA signature was higher than that of stage alone in the training and internal validation sets. The combined model’s (fifteen-mRNA signature and tumour stage) AUC was higher than that of stage alone in each set. The ROC results suggest that the fifteen-mRNA signature is an independent prognostic factor in NSCLC. Finally, our bioinformatics analysis showed that our fifteen-mRNA signature is a novel biomarker with useful applications in predicting NSCLC prognosis.

To determine the biological relationship and signalling pathways among the fifteen mRNAs in the signature, we performed GO and KEGG analyses. Functional categories of the fifteen mRNAs were mainly involved in three GO terms: the carboxylic acid biosynthetic process (GO:0046394), coenzyme metabolic process (GO:0006732), and purine ribonucleoside triphosphate metabolic process (GO:0009205). All three processes are considered to be closely related to tumours27,28,29,30. The main KEGG pathways involved were glycolysis/gluconeogenesis (KEGG: 00010) and the HIF-1 signalling pathway (KEGG: 04066). Glycolysis is a universal pathway in living cells, and the glycolysis rate is 200 times higher in tumour cells than in normal cells31. Previous studies have shown that inhibition of HIF-1 represents a novel approach to cancer therapy32,33. We analysed the protein–protein interactions between the proteins encoded by the fifteen mRNAs. GAPDH and UBC were speculated to be the key proteins in this protein–protein interaction network according to their nodes, which suggests that these two mRNAs may play key roles among the 15 mRNAs. The expression of the fifteen mRNAs was validated by GEPIA online software, and the expression levels of ALDOA, CFB, GAPDH, LDHA, MESDC2, PPIA, TMSB10, TUBA1B, and UBE2F were higher in stage II patients than in stage I patients with lung adenocarcinoma and lung squamous cell carcinoma, which was consistent with our previous results in the gene sets.

Furthermore, we verified the expression of the mRNAs in NSCLC tumor tissues by RT-PCR and confirmed the expression of the proteins related to the 15 mRNAs by Western blot. Our results showed that the expression of the 15 mRNAs was higher in stage II NSCLC than in stage I NSCLC, and the expression of the related proteins showed the same pattern. The fifteen-mRNA signature included twelve risky genes (UBC, TUBA1B, PPIA, PML, PKM, MESDC2, LDHA, HMOX1, FGFR1OP, CFB, ALDOA and ADAM10) and three protective genes (UBE2F, TMSB10 and GAPDH). Previous research showed that high tissue levels of PKM34,35, LDHA36,37, HMOX138, FGFR1OP39, ADAM1010,40, ALDOA41,42, and GAPDH43,44 were correlated with an increased risk of relapse in NSCLC patients. Low expression of UBC inhibits radiostasis and proliferation of NSCLC tumor cells45, and high expression of UBE2F promotes lung cancer cell survival46. CFB promotes migration and proliferation of cutaneous squamous cell carcinoma47, and overexpression of TMSB10 is related to hepatocellular carcinoma and renal cell carcinoma48,49. Although there are no studies on eight of the mRNAs (TUBA1B, UBC, PPIA, PML, MESDC2, CFB, UBE2F, TMSB10) in the prognosis of NSCLC, our experimental results showed that these 15 mRNAs are involved in the progression of NSCLC. These experimental results provide evidence for the roles of these mRNAs in NSCLC and identify them as candidate biomarkers.

The innovation of this research is the identification of mRNAs significantly related to RFS in NSCLC using a risk score derived from univariate Cox analysis; these 15 mRNAs showed good predictive ability in the training set, internal validation set and external validation set. However, there were several limitations to our study. For example, further experiments are required to verify the clinical value of the signature. Limited by the clinical information of the GEO data sets, we could not identify the resection status of patients with NSCLC. Additionally, our experimental sample size was small, and larger clinical trials may lead to more convincing results.

Our findings demonstrate that a multiple-mRNA signature is closely related to tumour prognosis in stage I/II patients with NSCLC. It may aid in the development of novel biomarkers of NSCLC, offer new insights into NSCLC prognosis and provide a new method for analysing NSCLC based on Cox analysis.

Methods

Data of NSCLC

Raw microarray data from all data sets (GSE31210, GSE30219 and GSE50081), generated with the Affymetrix human genome U133 plus 2.0 array, were analysed. The mRNA expression data were log2 transformed before statistical analysis, and the median value was used when multiple probes existed for a single target. There was a total of 627 stage I/II patients with NSCLC after excluding patients without recurrence-free survival (RFS) or clinical data, including 226 from GSE31210, 226 from GSE30219 and 181 from GSE50081. The 226 patients from GSE31210 were used as the training set, the 226 patients from GSE30219 were used as the internal validation set, and the 181 patients from GSE50081 were used as the external validation set. The training set was used to optimize the parameters of the model, the internal validation set was used to tune hyper-parameters to optimize the model, and the external validation set was used to validate the robustness of the screening method on different data.
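The preprocessing described above (log2 transformation, then taking the median when several probes map to one target) can be sketched in Python as follows. This is a minimal illustration for a single sample; the probe identifiers, gene names and intensities are hypothetical, and dropping unmapped probes is an assumption rather than a step stated in the text.

```python
import math
from collections import defaultdict
from statistics import median


def collapse_probes(probe_intensity, probe_to_gene):
    """Log2-transform probe intensities and take the per-gene median.

    Probes without a gene annotation are dropped (an assumption of this sketch).
    """
    by_gene = defaultdict(list)
    for probe, intensity in probe_intensity.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:
            by_gene[gene].append(math.log2(intensity))
    return {gene: median(values) for gene, values in by_gene.items()}


# Hypothetical raw intensities for one sample.
toy_intensity = {"probe_1": 256.0, "probe_2": 1024.0, "probe_3": 64.0,
                 "probe_4": 512.0, "probe_5": 128.0}
# Hypothetical annotation: three probes for GENE_A, one for GENE_B, one unmapped.
toy_annotation = {"probe_1": "GENE_A", "probe_2": "GENE_A", "probe_3": "GENE_A",
                  "probe_4": "GENE_B"}

gene_expression = collapse_probes(toy_intensity, toy_annotation)
```

Using the median rather than the mean makes the per-gene value less sensitive to a single outlying probe.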

Tissue specimens

Tumor tissues were obtained from 8 patients who underwent resection of NSCLC (mean age of 54.4 ± 2.3 years, six males). The samples were taken during thoracic surgery, and all cancer tissues were identified by HE staining. Four patients were stage I and the rest were stage II; no patient had a history of COPD or other respiratory infectious diseases.

Real-time PCR (RT-PCR)

Total RNA was extracted from tumor tissues using the TRIzol reagent (TaKaRa, Dalian, China). The primer sequences of the 15 mRNAs are listed in Table 7. Qualitative and quantitative analysis of total RNA was performed using a Nanodrop. RNA was reverse transcribed to cDNA, and all samples were run in triplicate by RT-PCR on an ABI/PRISM 7500 according to the reagent manufacturer's instructions. RT-PCR was performed with SYBR Premix Ex Taq™ II (TaKaRa, Dalian, China).

Table 7 The primer sequences of 15 mRNAs.

Western blot

After the tumor tissue was sufficiently ground and crushed, protein was extracted with radioimmunoprecipitation assay (RIPA) buffer, and the expression levels of the proteins related to the 15 mRNAs in cancer tissue were detected by Western blot. The antibodies used to detect the target proteins were as follows: ADAM10 (dilution 1:1000; Abclone, Wuhan, Hubei, China), ALDOA (dilution 1:10,000; Abclone, Wuhan, Hubei, China), CFB (dilution 1:1000; Abclone, Wuhan, Hubei, China), FGFR1OP (dilution 1:1000; Abclone, Wuhan, Hubei, China), GAPDH (dilution 1:1000; Abclone, Wuhan, Hubei, China), HMOX1 (dilution 1:1000; Abclone, Wuhan, Hubei, China), LDHA (dilution 1:1000; Abclone, Wuhan, Hubei, China), MESDC2 (dilution 1:1000; Abclone, Wuhan, Hubei, China), PKM (dilution 1:1000; Abclone, Wuhan, Hubei, China), PML (dilution 1:1000; Abclone, Wuhan, Hubei, China), PPIA (dilution 1:1000; Abclone, Wuhan, Hubei, China), TMSB10 (dilution 1:1000; Sigma-Aldrich Chemicals, St. Louis, MO, USA), TUBA1B (dilution 1:1000; Abclone, Wuhan, Hubei, China), UBC (dilution 1:1000; Abclone, Wuhan, Hubei, China), UBE2F (dilution 1:1000; Abclone, Wuhan, Hubei, China), Beta Actin (dilution 1:1000; Abclone, Wuhan, Hubei, China).

Statistical analysis

The “survival” package of R software (version 3.5.1) was used to perform survival analysis. Univariate Cox regression analysis was used to evaluate the association between the expression level of each mRNA and NSCLC patients’ RFS and survival state in the training set. mRNA expression was considered to be significantly different when the P-value was < 0.05, and multivariate Cox regression analysis of highly expressed mRNAs was used to calculate their risk score regression coefficients in the training set50,51,52. The median value of risk scores in the training set was used as the cutoff point, and NSCLC patients in the training, internal validation, and external validation sets were classified as low risk or high risk according to this cutoff. The Kaplan–Meier estimator and log-rank test were used to assess survival differences between the two groups. Multivariate Cox regression analysis was used to compare the efficacy of the risk score system and the efficacy of clinical characteristics such as stage, age, and sex. ROC curves were used to show the predictive value for RFS of the combined model (risk score combined with stage), the risk score model and stage alone. To generate the ROC curves, patients with NSCLC who had less than 5 years of RFS were excluded if they had not relapsed at the last follow-up. Following a previously described method, we set 60 months as the cutoff value of RFS to study 5-year survival, and the remaining NSCLC patients were divided into two groups by this cutoff value53,54. The “pROC” package of R software was used to generate the ROC curves of RFS. Differences observed in the log-rank test, Cox regression analysis, and ROC analysis were considered to be significant if their P-values were < 0.05.
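For readers unfamiliar with the Kaplan–Meier estimator and the log-rank test used above, the following Python sketch implements both for two groups using only the standard library. It is a simplified illustration and not the R “survival” implementation used in this study; the function names and the follow-up data are hypothetical.

```python
import math


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time (events: 1 = relapse, 0 = censored)."""
    curve, surv = [], 1.0
    for t in sorted({x for x, e in zip(times, events) if e}):
        at_risk = sum(1 for x in times if x >= t)
        relapsed = sum(1 for x, e in zip(times, events) if e and x == t)
        surv *= 1.0 - relapsed / at_risk
        curve.append((t, surv))
    return curve


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square with 1 df, P-value)."""
    all_times, all_events = times_a + times_b, events_a + events_b
    observed_a = expected_a = variance = 0.0
    for t in sorted({x for x, e in zip(all_times, all_events) if e}):
        n_a = sum(1 for x in times_a if x >= t)          # at risk in group A
        n_b = sum(1 for x in times_b if x >= t)          # at risk in group B
        d_a = sum(1 for x, e in zip(times_a, events_a) if e and x == t)
        d_b = sum(1 for x, e in zip(times_b, events_b) if e and x == t)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:  # hypergeometric variance of d_a at this event time
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi2 = (observed_a - expected_a) ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-square, 1 df
    return chi2, p_value


# Hypothetical RFS in months and relapse indicators for two small groups.
high_times, high_events = [6, 10, 14, 20, 30], [1, 1, 1, 1, 0]
low_times, low_events = [25, 40, 55, 60, 72], [1, 0, 0, 0, 0]

km_high = kaplan_meier(high_times, high_events)
chi2, p_value = logrank_test(high_times, high_events, low_times, low_events)
```

In this toy example the high-risk group relapses earlier, so its Kaplan–Meier curve falls faster and the log-rank P-value is below 0.05, mirroring the kind of comparison reported in Fig. 5 and Table 3.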

Results of RT-PCR and Western blot are presented as means ± SD. Statistical analyses were performed with SPSS (version 16.0.0; SPSS, Chicago, IL, USA). One-way ANOVA, Bonferroni post hoc correction (α = 0.0167), and Tukey test were conducted to evaluate significant differences in the data. Statistical significance was set at P < 0.05.

Functional enrichment analysis

Gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) functional enrichment analyses of the fifteen-mRNA signature were performed with the GeneCodis web tool (http://genecodis.cnb.csic.es/) and the KOBAS web tool (http://kobas.cbi.pku.edu.cn/kobas3/?t=1). GO and KEGG category enrichment analyses had cutoff thresholds of P-value < 0.05. R software (version 3.5.1) was used to display significant enrichment results in graphical format.

Protein–protein interaction analysis and mRNA expression validation

STRING (https://string-db.org/) online software was used to analyse the interaction between the proteins encoded by the fifteen mRNAs and to screen key genes. Gene Expression Profiling Interactive Analysis (GEPIA, http://gepia.cancer-pku.cn/) online software was used to verify the expression of the fifteen mRNAs in stage I and II patients with lung adenocarcinoma and lung squamous cell carcinoma.

Ethics approval and consent to participate

All samples were obtained with written informed consent from all patients who participated in our research, and all protocols were approved by the First Affiliated Hospital of Guangxi Medical University (Scientific and Research Ethics Committee, No. 2020(KY-E-142)). This study follows the ethical guidelines of the Declaration of Helsinki 1975.