Tumor mutation burden (TMB)-associated signature constructed to predict survival of lung squamous cell carcinoma patients

Lung squamous cell carcinoma (LUSC) is a common type of lung cancer with high incidence and mortality rate. Tumor mutational burden (TMB) is an emerging biomarker for selecting patients with non-small cell lung cancer (NSCLC) for immunotherapy. This study aimed to reveal TMB involved in the mechanisms of LUSC and develop a model to predict the overall survival of LUSC patients. The information of patients with LUSC were obtained from the cancer genome atlas database (TCGA). Differentially expressed genes (DEGs) between low- and the high-TMB groups were identified and taken as nodes for the protein–protein interaction (PPI) network construction. Gene oncology (GO) enrichment analysis and gene set enrichment analysis (GSEA) were used to investigate the potential molecular mechanism. Then, we identified the factors affecting the prognosis of LUSC through cox analysis, and developed a risk score signature. Kaplan–Meier method was conducted to analyze the difference in survival between the high- and low-risk groups. We constructed a nomogram based on the risk score model and clinical characteristics to predict the overall survival of patients with LUSC. Finally, the signature and nomogram were further validated by using the gene expression data downloaded from the Gene Expression Omnibus (GEO) database. 30 DEGs between high- and low-TMB groups were identified. PPI analysis identified CD22, TLR10, PIGR and SELE as the hub genes. Cox analysis indicated that FAM107A, IGLL1, SELE and T stage were independent prognostic factors of LUSC. Low-risk scores group lived longer than that of patients with high-risk scores in LUSC. Finally, we built a nomogram that integrated the clinical characteristics (TMN stage, age, gender) with the three-gene signature to predict the survival probability of LUSC patients. Further verification in the GEO dataset. TMB might contribute to the pathogenesis of LUSC. TMB-associated genes can be used to develope a model to predict the OS of lung squamous cell carcinoma patients.

www.nature.com/scientificreports/ NSCLC patient prognosis is most often evaluated in light of the American Joint Committee on Cancer (AJCC) staging system (8th edition), with the stage of the cancer being used to guide treatment decision making 9 . However, in many cases this system may fail to accurately predict the prognosis of a given patient, as a number of other factors can influence such outcomes. Several studies have suggested that TMB might be a potentiallyuseful clinical predictor in NSCLC patients undergoing immunotherapy 10 and patients with resected NSCLC 8 . However, there are few studies focusing on the relationship between TMB and LUSC. However, few studies have focused on the relationship between TMB and LUSC. Therefore, using data from the TCGA database, we built a novel predictive model able to predict the survival probability of LUSC patients.

Method
Data acquisition. The RNA transcriptome sequences and corresponding clinicopathological data for LUSC patients were collected from The Cancer Genome Atlas (TCGA, https:// portal. gdc. cancer. gov/) and used for model training; this included 502 tumors and 49 paracancerous tissues. Information obtained from the Gene Expression Omnibus (GEO) database (https:// www. ncbi. nlm. nih. gov/ geo/) was used for external validation; this included 69 lung squamous carcinoma samples (GSE73403). Data was obtained solely from public databases, obviating any ethical conflict. TMB calculation and differential expressed genes (DEGs) screening. TMB is the sum of mutations per megabase in tumor tissue. TMB for each organization can be detected using the VarScan method, as calculated by the R package "maftools". The R package "limma" was used to identify differentially expressed genes (DEGs) between a high TMB population and a low TMB population. Classification used the median of the TMB score and a |log2 fold change |≥ 1.0. A P value < 0.05 and a False Discovery Rate (FDR) < 0.05 were the screening criteria. TMB evaluation of different analysis pipelines (MuTect and Muse detected Mutation data) was undertaken using R package 'maftools' . Kaplan-Meier analysis was used to show the survival difference between high and low TMB expression groups.

Protein-protein interactions (PPI) network construction.
We construction a PPI network of DEGs using the STRING database (https:// string-db. org/) and visualized using Cytoscape software. A confidence score of C ≥ 0.15 was defined as the threshold criterion.
Functional enrichment analysis. We used Gene Set Enrichment Analysis (GSEA) to analyze the signal path for two different TMB expression groups. Gene Ontology (GO) enrichment analysis was used to reveal potential biological processes of TMB-associated DEGs in LUSC.

Construct and validate a Cox proportional hazards model. Univariate and multivariate Cox analysis
were used to identify factors affecting survival and prognosis of LUSC. The risk associated with TMB-related genes was found from multivariate Cox regression analysis. From the median calculated score, patients were classified into high-and low-risk groups. Kaplan-Meier curves showed the survival difference between groups for both training and validation groups. A nomogram utilizing the risk score and baseline clinical information was able to predict 1-, 3-and 5-year overall survival (OS) in LUSC patients. The predictive accuracy of the model was found using ROC curves. RNA transcriptome profiles of LUSC patients were downloaded from the GEO database and used for external validation.
Statistical analysis. DEGs were screened using the R package "Limma". GO was conducted using "clus-terProfiler", "ggplot2", "enrichplot", "stringi", and "DOSE" packages. By using the TMB value, GSEA was able to analyze the signal path. We choose the "c2.cp.kegg.v6.2.symbols.gmt" gene set downloaded from the MSigDB database (http:// softw are. broad insti tute. org/ gsea/ msigdb/) as the reference. To assess prognostic value, Cox regression analysis was used to estimate the hazard ratio and 95% confidence interval (95% CI) for each variable in the LUSC cohort. Statistically significant variables (P < 0.05) in the initial univariate Cox regression analysis were used in the subsequent multivariate analysis. A predictive risk model was created using multivariate Cox regression analysis implemented in the "survival" package. The resulting model and clinical information were incorporated into a final predictive nomogram, able to predict 1-, 3-year and 5-year LUSC patient OS. Nomogram prediction was undertaken using the "survival" package. Predictive accuracy was evaluated using ROC curves. All statistical analysis was performed using R (version 3.5.2), Cytoscape, GSEA, and perl. A P value < 0.05 was considered significant.

Result
DEGs screening and PPI network construction. 30 DEGs (including CCL19, BPIFB1, SCGB1A1, PIGR, SELE, NR5A1, PIP) were sorted into low and high expression TMB groups: threshold |Log2 FC|> 1.0, P value < 0.05 and FDR < 0.05 (in Table1; Fig. 1a). Kaplan-Meier analysis indicated patient OS in the low expression group was significantly lower than in the high expression group. When evaluating TMBs generated by different analysis pipelines this remained true (Fig. 5). Figure 2 shows the distribution of DEGs in the two TMB expression groups. By using a STRING database comprising 25 nodes and 44 edges, we established a PPI network between DEGs, using a confidence threshold of C ≥ 0.15 (Fig. 1b). CD22, TLR10, PIGR, and SELE were identified as hub genes and visualized using Cytoscape (https:// cytos cape. org/) (Fig. 1c). DEGs included in the model and hub genes are verified in TMB evaluation of different analysis pipelines ( www.nature.com/scientificreports/ Enrichment analysis. GO analysis indicated that the identified DEGs participated in regulation of lymphocyte activation, lymphocyte, mononuclear cell, and leukocyte proliferation (Table 3; Fig. 3). MS4A1, TNFSF8, SCGB1A1, CD22, and CCL19 were associated with lymphocyte, Leukocyte and mononuclear cell proliferation. TNFSF8, SCGB1A1, CD22, and CCL19 were involved with regulating lymphocyte proliferation and leukocyte proliferation. GSEA analysis indicated that high expression of TMB correlated with DNA replication, cell cycle, oocyte meiosis, spliceosome, RNA degradation, base excision repair, and pyrimidine metabolism. Low-TMB was mainly associated with B cell receptor, T cell receptor, and chemokine signaling pathways (Fig. 4).

Cox proportional hazards model construction.
Univariant analysis suggested that survival rate was correlated with FAM107A, IGLL1, SELE and T stage. Multivariate Cox analysis indicated that FAM107A, IGLL1, SELE and T stage were independent prognostic factors of LUSC (Table 4). The regression model we constructed using multivariate Cox analysis can be used to predict LUSC patient OS. According to the median score, patients were divided between high-and low-risk groups. In training, Kaplan-Meier indicated patient OS in the low-risk group was significantly lower than in the high-risk group, as confirmed by external validation using data from the GEO database (GSE73403) (Fig. 5).

Development of a nomogram.
A nomogram able to predict LUSC patient survival at 1 year, 3 years, 5 years was constructed, incorporating the following information: T stage, M stage, N stage, age, gender and risk score model. In this nomogram, lines under each independent prognostic factor correspond to a score, with a combined score produced by summing all individual scores. This overall score allows prediction of patient prognosis after 1 years ("sur1year"), 3 years ("sur3year") and 5 years ("sur5year") ( Fig. 6).

Nomogram validation.
We then validated this nomogram using Receiver Operator Characteristic (ROC) curves (Fig. 7). We found that in

Discussion
LUSC is a common type of lung cancer with high incidence and mortality rate. Despite the improved diagnosis and therapy of LUSC, the prognosis is still very poor 11 . As the survival of NSCLC patients is influenced by many factors beyond just tumor stage, the use of the TNM staging system to estimate patient prognosis can often lead to inaccurate survival estimates. Many other studies have generated prognostic models aimed at more accurately estimating the survival of patients with NSCLC in a comprehensive manner 12 . TMB can alter responses to immunotherapy and affect the prognosis of many cancers, including breast cancer 13 , lung cancer 14 and colon cancer 15 . We identified 30 genes differentially expressed between high-and low-expression TMB groups. This included 4 up-regulated genes and 26 down-regulated genes. Univariate and multivariate analyses were used to examine the influence of each gene and clinical characteristic on OS, allowing a combined model and nomogram to be developed to predict LUSC patient OS.   www.nature.com/scientificreports/ GO enrichment analysis indicated these DEGs were mainly involved in lymphocyte activation and lymphocyte/leukocyte/monocyte proliferation. These genes are likely to be closely related to the functioning of immune checkpoint inhibitors in the cancer microenvironment. The proliferation, activation, and differentiation of lymphocytes are critical to the immune system 16 . Many studies have indicated TMB to be an independent predictor of immune checkpoint inhibitor therapy influencing the immune microenvironment 17 . Our work supports the view that TMB is closely related to immunity.
Univariant analysis and multivariate Cox analysis revealed that FAM107A, IGLL1 and SELE were independent prognostic factors of LUSC. These genes reportedly participate in the pathological processes of the TMB  www.nature.com/scientificreports/    20 , and the recurrent inactivation of FAM107A may be involved LSCC development. FAM107A expression was low or even absent in Hodgkin Reed-Sternberg (HRS) cells 21 . IGLL1 is part of the immunoglobulin gene superfamily, and its expression is closely related to humoral immunity 22 . It is associated with immune cell progression 23 . IGLL1 is thought to be a component of pre-BCR (precursor B cell antigen receptor) 24 . IGLL1 Mutations can induce blood system disease 25 . SELE, a selectin, is a cell adhesion molecule that contributes significantly to tumorigenesis and tumor progression 26 . By studying TCGA data, we found that low SELE expression was associated with worse OS in female lung cancer patients who never smoked 27 . Kang et al. found that E-selectin could act as a circulating signaling molecule and facilitate tumor progression and metastasis 28 . SELE was associated with several cancers, including lung cancer 29 , prostate cancer 30 and colon cancer 31 . The PPI network indicated that CD22, TLR10, PIGR, and SELE were hub genes. CD22, a transmembrane glycoprotein, is thought to be a regulator of autoimmunity and B cell responses 32,33 . It is also expressed in human lung tumors 34,35 . CD22 is an important drug target for ameliorating autoimmune diseases and acute lymphoblastic leukemia 36,37 . TLR-10 is a pattern recognition receptor with anti-inflammatory properties 38 . Kopp et al. www.nature.com/scientificreports/ found that TLRs are associated with colorectal cancer via nuclear transcription factor-κB-initiated transcription of inflammatory genes 39 . TLR-10 is particularly important in asthma genetics 40 . It is an anti-inflammatory factor affecting the risk of tuberculosis 41 . Polymeric immunoglobulin receptor (PIGR) is a component of the mucosal immune system correlated with several cancers, such as pancreatic cancer 42 , colon cancer 43 , hepatocellular carcinoma 44 and bladder cancer 45 . TMB has a clear role in tumorigenesis and development. It is associated with the immune microenvironment and inflammation. In this study, using TMB-associated genes we developed a three gene risk model to predict patient survival. Patient mortality increases with increasing risk score. We also constructed a nomogram based on the TMB-associated genes and clinical characteristic to predict LUSD patient OS.
This study has shortcomings. First, there is no direct experimental verification of the identified DEG in LUSC. Secondly, there is not a large enough number of clinical samples to confirm the risk model and nomogram. Larger clinical trials will be needed to verify this in the future.

Conclusion
TMB correlated with tumourigenesis and the development of LUSC. We constructed a TMB-associated risk score and nomogram to predict LUSC patient survival.  www.nature.com/scientificreports/