Introduction

Lung cancer is the commonest cancer, and is the main cause of global tumor morbidity and mortality1. Non-small cell lung cancer (NSCLC) is a common type of lung cancer, including lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC) and large cell lung carcinoma (LCLC). Among NSCLC, LUAD is the most common followed by LUSC which constitutes approximately 30% of all lung cancer cases2. With the discovery of target genes and the development of targeted therapy and immunotherapy, the survival of patients with NSCLC has been extended3, especially patients with LUAD4. However, due to the lack of effective targeted treatments, the progress of LUSC therapy is very slow and the clinical outcome of LUSC remains unsatisfactory.

TMB is the number of mutations per million bases in tumor tissue, including base substitutions, gene insertion, and gene coding and deletion errors. TMB has a vital role in tumor occurrence and development, and affects the immune response and survival prognosis of LUSC patients5. And many studies suggest that TMB is a potential and emerging biomarker for selecting NSCLC patients suitable for immunotherapy, even better than PD-1/PDL-1 expression6,7. In resected NSCLC patients, decreased TMB is believed to be associated with poor prognosis 8.

NSCLC patient prognosis is most often evaluated in light of the American Joint Committee on Cancer (AJCC) staging system (8th edition), with the stage of the cancer being used to guide treatment decision making9. However, in many cases this system may fail to accurately predict the prognosis of a given patient, as a number of other factors can influence such outcomes. Several studies have suggested that TMB might be a potentially-useful clinical predictor in NSCLC patients undergoing immunotherapy10 and patients with resected NSCLC8. However, there are few studies focusing on the relationship between TMB and LUSC. However, few studies have focused on the relationship between TMB and LUSC. Therefore, using data from the TCGA database, we built a novel predictive model able to predict the survival probability of LUSC patients.

Method

Data acquisition

The RNA transcriptome sequences and corresponding clinicopathological data for LUSC patients were collected from The Cancer Genome Atlas (TCGA, https://portal.gdc.cancer.gov/) and used for model training; this included 502 tumors and 49 paracancerous tissues. Information obtained from the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/) was used for external validation; this included 69 lung squamous carcinoma samples (GSE73403). Data was obtained solely from public databases, obviating any ethical conflict.

TMB calculation and differential expressed genes (DEGs) screening

TMB is the sum of mutations per megabase in tumor tissue. TMB for each organization can be detected using the VarScan method, as calculated by the R package "maftools". The R package "limma" was used to identify differentially expressed genes (DEGs) between a high TMB population and a low TMB population. Classification used the median of the TMB score and a |log2 fold change |≥ 1.0. A P value < 0.05 and a False Discovery Rate (FDR) < 0.05 were the screening criteria. TMB evaluation of different analysis pipelines (MuTect and Muse detected Mutation data) was undertaken using R package ‘maftools’. Kaplan–Meier analysis was used to show the survival difference between high and low TMB expression groups.

Protein–protein interactions (PPI) network construction

We construction a PPI network of DEGs using the STRING database (https://string-db.org/) and visualized using Cytoscape software. A confidence score of C ≥ 0.15 was defined as the threshold criterion.

Functional enrichment analysis

We used Gene Set Enrichment Analysis (GSEA) to analyze the signal path for two different TMB expression groups. Gene Ontology (GO) enrichment analysis was used to reveal potential biological processes of TMB-associated DEGs in LUSC.

Construct and validate a Cox proportional hazards model

Univariate and multivariate Cox analysis were used to identify factors affecting survival and prognosis of LUSC. The risk associated with TMB-related genes was found from multivariate Cox regression analysis. From the median calculated score, patients were classified into high- and low-risk groups. Kaplan–Meier curves showed the survival difference between groups for both training and validation groups. A nomogram utilizing the risk score and baseline clinical information was able to predict 1-, 3- and 5-year overall survival (OS) in LUSC patients. The predictive accuracy of the model was found using ROC curves. RNA transcriptome profiles of LUSC patients were downloaded from the GEO database and used for external validation.

Statistical analysis

DEGs were screened using the R package "Limma". GO was conducted using “clusterProfiler”, “ggplot2”, “enrichplot”, “stringi”, and “DOSE” packages. By using the TMB value, GSEA was able to analyze the signal path. We choose the "c2.cp.kegg.v6.2.symbols.gmt” gene set downloaded from the MSigDB database (http://software.broadinstitute.org/gsea/msigdb/) as the reference. To assess prognostic value, Cox regression analysis was used to estimate the hazard ratio and 95% confidence interval (95% CI) for each variable in the LUSC cohort. Statistically significant variables (P < 0.05) in the initial univariate Cox regression analysis were used in the subsequent multivariate analysis. A predictive risk model was created using multivariate Cox regression analysis implemented in the “survival” package. The resulting model and clinical information were incorporated into a final predictive nomogram, able to predict 1-, 3-year and 5-year LUSC patient OS. Nomogram prediction was undertaken using the “survival” package. Predictive accuracy was evaluated using ROC curves. All statistical analysis was performed using R (version 3.5.2), Cytoscape, GSEA, and perl. A P value < 0.05 was considered significant.

Result

DEGs screening and PPI network construction

30 DEGs (including CCL19, BPIFB1, SCGB1A1, PIGR, SELE, NR5A1, PIP) were sorted into low and high expression TMB groups: threshold |Log2 FC|> 1.0, P value < 0.05 and FDR < 0.05 (in Table1; Fig. 1a). Kaplan–Meier analysis indicated patient OS in the low expression group was significantly lower than in the high expression group. When evaluating TMBs generated by different analysis pipelines this remained true (Fig. 5). Figure 2 shows the distribution of DEGs in the two TMB expression groups. By using a STRING database comprising 25 nodes and 44 edges, we established a PPI network between DEGs, using a confidence threshold of C ≥ 0.15 (Fig. 1b). CD22, TLR10, PIGR, and SELE were identified as hub genes and visualized using Cytoscape (https://cytoscape.org/) (Fig. 1c). DEGs included in the model and hub genes are verified in TMB evaluation of different analysis pipelines (Table 2).

Table 1 Differential expressed genes between low TMB and high TMB groups.
Figure 1
figure 1

Identification of DEGs in LUSC between tumor and normal tissues. (a) The heatmap of DEGs between the high‐TMB and low‐TMB groups in LUSC by analysis of the TCGA datasets. Each column represents a sample, and each row represents one of DEGs. The levels of DEGs are shown in different colors, which transition from green to red with increasing proportions. The lines before the heat map indicated the dendrogram of DEGs cluster analysis. (b) The protein–protein interaction network (PPI) analysis was constructed by all the 30 DEGs using STRING database. (c) Four hub genes (PIGR, TLR10, SELE and CD22) in the PPI were screened by Cytoscape based on their connectivity degree. Red circles indicated four hub genes.

Figure 2
figure 2

The expression of DEGs distributed in high-TMB and low-TMB groups.

Table 2 DEGs are verified in TMB evaluation of different analysis pipelines.

Enrichment analysis

GO analysis indicated that the identified DEGs participated in regulation of lymphocyte activation, lymphocyte, mononuclear cell, and leukocyte proliferation (Table 3; Fig. 3). MS4A1, TNFSF8, SCGB1A1, CD22, and CCL19 were associated with lymphocyte, Leukocyte and mononuclear cell proliferation. TNFSF8, SCGB1A1, CD22, and CCL19 were involved with regulating lymphocyte proliferation and leukocyte proliferation. GSEA analysis indicated that high expression of TMB correlated with DNA replication, cell cycle, oocyte meiosis, spliceosome, RNA degradation, base excision repair, and pyrimidine metabolism. Low-TMB was mainly associated with B cell receptor, T cell receptor, and chemokine signaling pathways (Fig. 4).

Table 3 GO enrichment analysis of TMB associated DEGs in LUSC.
Figure 3
figure 3

GO enrichment analysis of the DEGs between high- and low-TMB groups.

Figure 4
figure 4

GSEA analysis was performed to further screen the significant pathway between high TMB group and low TMB group. The q‐value < 0.05 was considered as significance. (a) Significant pathway identified in the high-TMB group. (b) Significant pathway identified in the low-TMB group.

Cox proportional hazards model construction

Univariant analysis suggested that survival rate was correlated with FAM107A, IGLL1, SELE and T stage. Multivariate Cox analysis indicated that FAM107A, IGLL1, SELE and T stage were independent prognostic factors of LUSC (Table 4). The regression model we constructed using multivariate Cox analysis can be used to predict LUSC patient OS. According to the median score, patients were divided between high- and low-risk groups. In training, Kaplan–Meier indicated patient OS in the low-risk group was significantly lower than in the high-risk group, as confirmed by external validation using data from the GEO database (GSE73403) (Fig. 5).

Table 4 Cox proportional hazards model analysis of prognostic factors.
Figure 5
figure 5

Kaplan–Meier survival curves. (a/b) Patients from the TCGA and GSE73403 dataset are stratified into two groups according to median values for the risk scores calculated by three gene based on risk score signature. (a) Kaplan–Meier survival curves of the signature in TCGA dataset. (b) Kaplan–Meier survival curves of the signature in GSE73403 dataset. (c) Kaplan–Meier survival curves of different TMB groups calculated by VarScan. (d) Kaplan–Meier survival curves of different TMB groups calculated by MuTect. (Red means high-TMB group and blue means low-TMB group).

Development of a nomogram

A nomogram able to predict LUSC patient survival at 1 year, 3 years, 5 years was constructed, incorporating the following information: T stage, M stage, N stage, age, gender and risk score model. In this nomogram, lines under each independent prognostic factor correspond to a score, with a combined score produced by summing all individual scores. This overall score allows prediction of patient prognosis after 1 years (“sur1year”), 3 years (“sur3year”) and 5 years (“sur5year”) (Fig. 6).

Figure 6
figure 6

A prognostic nomogram predicting 1-, 2-, and 3-year overall survival of LUSC patient.

Nomogram validation

We then validated this nomogram using Receiver Operator Characteristic (ROC) curves (Fig. 7). We found that in training the area under the ROC curve was 0.672 (95%CI: 0.632–0.684) for 1-year survival, 0.659 (95%CI: 0.621–0.665) for 3-year survival, and 0.645 (95%CI: 0.621–0.665) for 5-year survival (Fig. 7). In external validation the corresponding areas were 0.648 (95%CI: 0.632–0.684) for 1-year survival, 0.681 (95%CI: 0.661–0.695) for 3-year survival, and 0.652 (95%CI: 0.621–0.665) for 5-year survival.

Figure 7
figure 7

ROC for 1-, 3-, and 5-year overall survival predictions for the nomogram. (a) In the training cohort, ROC curve for 1-year, 3-year and 5-year overall survival. (b) In the validation cohort, ROC curve for 1-year, 3-year and 5-year overall survival.

Discussion

LUSC is a common type of lung cancer with high incidence and mortality rate. Despite the improved diagnosis and therapy of LUSC, the prognosis is still very poor11. As the survival of NSCLC patients is influenced by many factors beyond just tumor stage, the use of the TNM staging system to estimate patient prognosis can often lead to inaccurate survival estimates. Many other studies have generated prognostic models aimed at more accurately estimating the survival of patients with NSCLC in a comprehensive manner12.

TMB can alter responses to immunotherapy and affect the prognosis of many cancers, including breast cancer13, lung cancer14 and colon cancer15. We identified 30 genes differentially expressed between high- and low-expression TMB groups. This included 4 up-regulated genes and 26 down-regulated genes. Univariate and multivariate analyses were used to examine the influence of each gene and clinical characteristic on OS, allowing a combined model and nomogram to be developed to predict LUSC patient OS.

GO enrichment analysis indicated these DEGs were mainly involved in lymphocyte activation and lymphocyte/leukocyte/monocyte proliferation. These genes are likely to be closely related to the functioning of immune checkpoint inhibitors in the cancer microenvironment. The proliferation, activation, and differentiation of lymphocytes are critical to the immune system16. Many studies have indicated TMB to be an independent predictor of immune checkpoint inhibitor therapy influencing the immune microenvironment17. Our work supports the view that TMB is closely related to immunity.

Univariant analysis and multivariate Cox analysis revealed that FAM107A, IGLL1 and SELE were independent prognostic factors of LUSC. These genes reportedly participate in the pathological processes of the TMB microenvironment. FAM107A, a putative tumor suppressor, was originally identified within common missing area of 3p21 in renal cell carcinoma18. FAM107A plays a vital role in lung carcinogenesis19. Katarzyna Kiwerska et al. found that the expression of FAM107A was reduced significantly in larynx squamous cell carcinoma (LSCC)20, and the recurrent inactivation of FAM107A may be involved LSCC development. FAM107A expression was low or even absent in Hodgkin Reed-Sternberg (HRS) cells21. IGLL1 is part of the immunoglobulin gene superfamily, and its expression is closely related to humoral immunity22. It is associated with immune cell progression23. IGLL1 is thought to be a component of pre-BCR (precursor B cell antigen receptor)24. IGLL1 Mutations can induce blood system disease25. SELE, a selectin, is a cell adhesion molecule that contributes significantly to tumorigenesis and tumor progression26. By studying TCGA data, we found that low SELE expression was associated with worse OS in female lung cancer patients who never smoked27. Kang et al. found that E-selectin could act as a circulating signaling molecule and facilitate tumor progression and metastasis28. SELE was associated with several cancers, including lung cancer29, prostate cancer30 and colon cancer31.

The PPI network indicated that CD22, TLR10, PIGR, and SELE were hub genes. CD22, a transmembrane glycoprotein, is thought to be a regulator of autoimmunity and B cell responses32,33. It is also expressed in human lung tumors34,35. CD22 is an important drug target for ameliorating autoimmune diseases and acute lymphoblastic leukemia36,37. TLR-10 is a pattern recognition receptor with anti-inflammatory properties38. Kopp et al. found that TLRs are associated with colorectal cancer via nuclear transcription factor-κB-initiated transcription of inflammatory genes39. TLR-10 is particularly important in asthma genetics40. It is an anti-inflammatory factor affecting the risk of tuberculosis41. Polymeric immunoglobulin receptor (PIGR) is a component of the mucosal immune system correlated with several cancers, such as pancreatic cancer42, colon cancer43, hepatocellular carcinoma44 and bladder cancer45.

TMB has a clear role in tumorigenesis and development. It is associated with the immune microenvironment and inflammation. In this study, using TMB-associated genes we developed a three gene risk model to predict patient survival. Patient mortality increases with increasing risk score. We also constructed a nomogram based on the TMB-associated genes and clinical characteristic to predict LUSD patient OS.

This study has shortcomings. First, there is no direct experimental verification of the identified DEG in LUSC. Secondly, there is not a large enough number of clinical samples to confirm the risk model and nomogram. Larger clinical trials will be needed to verify this in the future.

Conclusion

TMB correlated with tumourigenesis and the development of LUSC. We constructed a TMB-associated risk score and nomogram to predict LUSC patient survival.