Analysis of prognostic model based on immunotherapy related genes in lung adenocarcinoma

Lung cancer is one of the most common malignant tumors, and ranks high in the list of mortality due to cancers. Lung adenocarcinoma (LUAD) is the most common subtype of lung cancer. Despite progress in the diagnosis and treatment of lung cancer, the prognosis of these patients remains dismal. Therefore, it is crucial to identify the predictors and treatment targets of lung cancer to provide appropriate treatments and improve patient prognosis. In this study, the gene modules related to immunotherapy were screened by weighted gene co-expression network analysis (WGCNA). Using unsupervised clustering, patients in The Cancer Genome Atlas (TCGA) were divided into three clusters based on the gene expression. Next, gene clustering was performed on the prognosis-related differential genes, and a six-gene prognosis model (comprising PLK1, HMMR, ANLN, SLC2A1, SFTPB, and CYP4B1) was constructed using least absolute shrinkage and selection operator (LASSO) analysis. Patients with LUAD were divided into two groups: high-risk and low-risk. Significant differences were found in the survival, immune cell infiltration, Tumor mutational burden (TMB), immune checkpoints, and immune microenvironment between the high- and low-risk groups. Finally, the accuracy of the prognostic model was verified in the Gene Expression Omnibus (GEO) dataset in patients with LUAD (GSE30219, GSE31210, GSE50081, GSE72094).

Extraction of differential genes and prognosis related genes. By comparing the differential expression of magenta module genes in normal tissues and LUAD tissues, 48 differential expression genes were identified. The heat map shows the expression of each differential gene in each sample (Fig. 2a). The volcano map shows the up regulation and down regulation of differential genes (Fig. 2b). Univariate Cox regression analysis was used to screen 21 prognostically related DEGs (Table S2), as shown in the forest diagram (Fig. 2c). Gene mutation (Fig. S1) shows that among 561 samples, 75 had mutations in central regulatory factors, with a frequency of 13.37%. It was found that IL16 had the highest mutation frequency, followed by FCRLA, FLI1, RASSF2, GIMAP7, EVI2B, PAPLIN, S19R4, RASGRP2. The rest of the regulatory factors did not show any mutations in the sample. The investigation of Copy number variation(CNV) frequency (Fig. 2d) showed that 19 central regulatory factors had copy number variation, FCRLA, PTPN7, TAP1, LTA, GIMAP7, EVI2B, FAM53B, IL16, CD28 focused on the amplification of copy number, and FCRLA had the highest amplification frequency. RASSF2, CD69, PAPPIN, S1PR4, FLI1, GNG7, STAMBPL1, CLECL1, ZC3H12D focused on the deletion of copy number. The deletion frequencies of CLECL1 and CD69 were the highest. In addition, the altered position of the central regulator CNV on the chromosome is also shown (Fig. 2e).

Consensus clustering based on prognostic related genes. Unsupervised clustering of LUAD patients
with different expression patterns of 21 immune prognosis related genes was carried out using the R package of consensusclusterplus. In order to ensure the stability of classification, 1000 iterations were carried out, and the resampling rate was 80%. The cumulative distribution function (CDF) curve is used to determine the number of clusters and determine that k = 3 has the best cluster stability from k = 2 to 9 according to the s imilarity ( Fig. 3ac). Finally, three different clusters (A, B, C) were identified, and the OS curve indicated the significant survival advantage of cluster B in the three main clusters (P = 0.003, Fig. 3d). Then Principal component analysis (PCA) was used to determine the sample distribution of the three clusters (Fig. 3e). The Heatmap showed high expression of prognosis related genes in cluster B and low expression in cluster A (Fig. 3f). ssGSEA analysis showed that there were significant differences in the degree of immune cell infiltration among the three clusters (Fig. 3g). Except for the unintentional expression of cd56dim.natural.killer.cellna, the expression of the other 22 immune cells was the lowest in cluster A and the highest in cluster B, such as activated B. cellna (P < 0.001), Activated. CD4. T. cellna (P < 0.001), Activated. CD8. T. cellna (P < 0.001), Eosinophilna (P < 0.001), MDSCna (P < 0.001), Macrophagena (P < 0.001), Mast. cellna (P < 0.001), Monocytena (P < 0.001), Natural. killer. Cellna (P < 0.001), neutrophilna (P < 0.001), among others. The immune cell infiltration level of cluster A was the lowest, indicating that the immune response of cluster A was the lowest, which is consistent with the poor survival results. The immune cell infiltration level of cluster B was the highest, indicating that the immune response of cluster B was the highest, which is consistent with the better survival results. To explore the differences in biological behavior among different clusters, we performed KEGG gene set variation analysis (GSVA) (Fig. S2a- Consensus clustering based on DEG among different clusters. Based on 125 DEGs (Fig. 4a, Table S3) of the intersection of three clusters, 78 prognosis related genes (Table S4) were screened out through univariate analysis for unsupervised cluster analysis. In order to ensure the stability of classification, 1000 iterations are carried out, and the resampling rate is 80%. The cumulative distribution function (CDF) curve is used to determine the number of clusters and determine that k = 2 has the best cluster stability from k = 2 to 9 according to the s imilarity (Fig. 4b, c). Finally, two different clusters (A, B) were identified. Kaplan Meier OS curves for both clusters showed that patients with gene cluster B had better prognosis (P < 0.001) (Fig. 4d). Then the PCA algorithm is used to confirm that the samples of the two risk groups are distributed separately (Fig. 4e). The Heatmap shows the clinicopathological features of prognostically relevant DEGs (Fig. 4f). ssGSEA analysis showed that there were significant differences in the degree of immune cell infiltration between the two clusters ( Fig. 4g). Activated. CD4. T. cellna (P < 0.001) , CD56d im. natural. killer. cellna (P < 0.001), Natural. killer. T.  Construction of prognosis model. In order to avoid over fitting, Lasso Cox regression analysis was performed on 78 differential genes related to prognosis, and Lasso coefficient spectra of 6 potential prognostic genes www.nature.com/scientificreports/ related to immunity were established (Fig. 5a). Then the optimal penalty parameters of lasso model were determined through ten-fold cross validation (λ) (Fig. 5b), find the key genes with the strongest correlation through dimension reduction, and calculate the relative coefficient of genes (Table S5). Finally, six genes, Plk1,HMMR, ANLN,SLC2A1, SFTPB,CYP4B1 were established to construct the prognosis model and score. We named it "IMscore". Risk scoring formula = (Plk1mRNA level *0.05682) + (HMMRmRNA level *0.00878) + (ANLNm-RNA level *0.10474) + SLC2A1mRNA level *0.01988) + (SFTPBmRNA level *− 0.00501) + CYP4B1mRNA level *−0.00608). Among them, 2 genes are protective factors (SFTPB,CYP4B1), and 4 genes are risk factors (Plk1,HMMR, ANLN,SLC2A1). Calculate the risk score for each patient according to the formula. According to the optimal threshold, patients were divided into high-risk and low-risk groups (Table S6). PCA showed (Fig. 5c) that patients with different risks could be divided into two groups. There were differences in IMscore among different subtypes. IMcluster A has the highest risk value and IMcluster B has the lowest risk value. The prognosis of high scores is poor, which is consistent with the previous studies (Fig. 5d). In genecluster, there were also differences in IMscore. The risk value of genecluster A was greater than that of cluster B, and the prognosis of cluster A is worse, which is consistent with the previous studies (Fig. 5e). Combining the IMscore with the clinical survival status, it was found that the IMscore of the dead patients was much larger than that of the living patients, and the patient mortality increased with the increase of the risk value (Fig. 5f). Survival analysis showed that there were significant differences between the high-risk group and low-risk group, and the survival of the high-risk group was worse (P < 0.001, Fig. 5g). Finally, the IMscore, genecluster, high-risk, low-risk and survival status were connected through the Sankey diagram. Most of the clusterB with the best prognosis in IMscore belong to geneclusterB with better prognosis in genotyping, and most of them belong to the low-risk group with better prognosis (Fig. 5h).
Evaluation of correlation between risk score and clinical characteristics. The risk curve (Fig. 6a) shows that LUAD patients are divided into high-risk and low-risk groups according to the median value of the risk score. The IMscore of the high-risk group is higher than that of the low-risk group. With the increase of the risk value, the number of dead patients increases. The progression free survival showed that the high-risk group was lower than the low-risk group (P < 0.001, Fig. 6b). The predictive effect of OS prognostic characteristics in LUAD patients was evaluated by time-dependent receiver operating characteristic (ROC) curve. The areas under the curve were (AUC) 0.675 in 1 year, 0.668 in 3 years and 0.607 in 5 years (Fig. 6c), indicating that the model has high sensitivity and specificity in predicting the prognosis of LUAD patients. Subsequently, we performed univariate and multivariable Cox analysis based on the risk scores obtained from immune related prognostic characteristics and the main clinical characteristics of LUAD patients in TCGA database. Univariate Cox analysis confirmed that higher stage and risk score were risk factors for HRS > 1 in LUAD patients, P < 0.001 (Fig. 6d). After removing other factors, a further multivariable Cox analysis (Fig. 6e) showed that higher stage and risk score were proved to be independent prognostic factors for OS in LUAD patients (stage HR = 1.571, 95% CI: 1.352-1.824, P < 0.001; risk score HR = 5.029, 95% CI: 2.722-9.290, P < 0.001). Stage stage shows that the risk score increases with the increase of stage, and the risk value of stage IV is the highest (Fig. 6f); T stage indicates that the risk score increases with the increase of stage, and the risk value in T4 stage is the highest (Fig. 6g). Clinical staging showed that the prognostic risk characteristics were closely related to the degree of malignancy.
Nomograph modeling using clinical characteristics and risk scores. In order to make better use of the prognosis model we constructed, nomograms of 1, 3 and 5-year overall survival of LUAD patients in TCGA database were established based on multivariate Cox analysis (Fig. 7a, Table S7). Calibration charts for the 1-, 3-, and 5-year OS are used to visualize the performance of nomograms (Fig. 7b). The sensitivity of the nomogram model was evaluated by ROC curve. The AUC result of the risk scoring model was 0.714 ( Fig. 7c), indicating that the nomogram was the best in predicting the survival of LUAD patients compared with other individual prognostic factors. Then, by univariate Cox analysis, the risk score was a risk factor for HRS > 1 in LUAD patients, P < 0.001 (Fig. 7d). Multivariate Cox analysis showed that the risk score proved to be an independent prognostic factor for OS in LUAD patients (risk score HR = 1.913, 95% CI:1.370-2.672, P < 0.001, Fig. 7e).

Functional analysis between different risk groups.
In order to study the potential difference of biological function between different risk groups, we conducted GO, KEGG pathway, GSEA and GSVA. GO analysis showed that DEGs between high-risk and low-risk groups were mainly enriched in nuclear division, organelle fission, chromosome segregation. (Fig. 8a, Table S8). KEGG analysis showed that DEGs were mainly enriched in CELL_CYCLE, DNA_REPLICATION and P53_SIGNAL_PATHWAY (Fig. 8b, Table S9). GSEA analysis showed that the high-risk score group was mainly enriched in CELL_CYCLE, DNA_REPLICATION and P53_SIGNAL_ PATHWAY, OOCYTE_MEIOSIS, SPLICEOSOME, etc., while the low-risk score group was mainly enriched in ALPHA_LINOLENIC_ACID_METABOLISM, ARACHIDONIC_ACID_METABOLISM, ASTHMA, INTES-TINAL_IMMUNE_NETWORK_FOR_IGA_PRODUCTIC,COMPLEMENT_AND_COAGULATION _CAS-CADE, etc. (Fig. 8c, Table S10). GSVA analysis prompted P53_SIGNALING_PATHWAY, CELL_CYCLE, DNA_ REPLICATION, RNA_DEGRADATION, HOMOLOGOUS_RECOMBINATION were mainly enriched in high-risk groups, ASTHMA, PPAR_SIGNALING_PATHWAY, ALPHA_LINOLENIC_ACID_METABOLISM, LINOLEIC_ACID_METABOLISM, COMPLEMENT_AND_COAGULATION_CASCADES and others were mainly enriched in low-risk groups (Fig. 8d, Table S11). Biological function between high and low risk groups in TCGA cohort.
Correlation analysis between risk score and tumor mutational burden. There is a significant difference in tumor mutation load between high and low risk scores. The tumor mutation load in the high risk score www.nature.com/scientificreports/ group is significantly higher than that in the low risk score group (Fig. 9a), and there is a significant positive correlation between tumor mutation load and risk score (Fig. 9b). Survival analysis showed that it was meaningless to study the relationship between high and low tumor mutation load and patient survival alone (Fig. 9c). However, after giving high and low risk scores, OS showed patients with high scores had poor prognosis in both high tumor mutation load group and low tumor mutation load group. Among them, patients with high tumor mutation and low IMscore had the best survival, while patients with low tumor mutation and high IMscore had the worst survival (Fig. 9d). There were differences in gene mutation frequency between high and low IMscore groups. The gene mutation frequency in high IMscore group was higher than that in low IMscore group. The top 20 most significantly mutated genes in the high and low risk score groups were TP53, TTN (Fig. 9e, f).

Correlation analysis of risk score with tumor immune microenvironment and immune cell infiltration.
In order to study the relationship between risk score and immune microenvironment, the estimate algorithm was used to quantify the matrix score, immune score, estimate score and tumor purity. The stromal score, immune score and estimate score of the low-risk group were higher than those of the high-risk group (P < 0.05, Fig. 10a). Therefore, the tumor purity of high-risk group was higher than that of low-risk group, it was associated with poor prognosis (Fig. S3a-3d) . There was significant difference between risk score and immune subtype (P < 0.05), and the risk value was the highest in C1 (Fig. 10b). Using the CIBERPORT algorithm, we calculated the proportion of 22 immune cells in each LUAD sample. Then, the difference of the proportion of immune cells between the high and low risk groups was compared. The results showed that the proportion of plasma cells, T cells CD4 memory reacting, NK cells activated, monocytes, dendritic cells reacting and mast cells resting was significantly higher in the low-risk group, and the proportion of M0 macrophases (P < 0.001), M1 macrophases (P < 0.001), T cells CD4 memory activated (P < 0.001) and NK cells resting (P < 0.001) in the highrisk group were significantly higher (Fig. 10c). They were associated with poor prognosis (Fig. S4a-4b).
Correlation analysis between risk score, immune checkpoint and drug sensitivity. Immune checkpoint inhibitor is a new strategy for the treatment of lung cancer in recent years. The correlation analysis of immune checkpoints showed that CD274, PDCD1LG2, PDCD1, IDO1 were positively correlated with risk scores (Fig. 11a). The difference analysis of immune checkpoints showed that CD40LG (P < 0.001), TNFSF14 (P < 0.001), TNFSF15 (P < 0.001), CD48 (P < 0.001), CD27 (P < 0.001) were highly expressed in the low-risk group, TNFRSF9 (P < 0.001), CD276 (P < 0.001), PDCD1LG2 (P < 0.001), CD274 (P < 0.001) and TNFSF4 (P < 0.001) were highly expressed in the high-risk group (Fig. 11b). CD274 was highly expressed in the high-risk group, thus, the highrisk group was more suitable for anti-PD-L1 treatment. Semi-inhibitory concentration (IC50) is an important index to evaluate the efficacy or response of drugs. We studied the risk score and the sensitivity of anticancer drugs, and found that the risk score is related to many anticancer drugs, such as gemcitabine, paclitaxel, etoposide, vinorelbine, imatinib, sorafenib, among others, which are more suitable for high-risk patients. These results suggest that the risk score can be used as a potential predictor of chemotherapy sensitivity, providing new insights for the treatment of tumors and the prevention of drug resistance (Fig. 11c-h).

Validate model accuracy in GEO datasets.
To determine the predictive power of the six gene prognostic model in other datasets, four LUAD patient datasets (GSE30219, GSE31210, GSE50081, GSE72094) as external validation. The same formula was used to calculate the risk score of patients in the GEO cohort. According to the optimal threshold, LUAD patients were divided into high-risk group and low-risk group. The survival curve showed that patients in the high-risk group had a shorter survival time (Fig. 12a-d). ROC curve was used to evaluate the sensitivity of prognostic model (Fig. 12e-h). Therefore, through these four datasets, the correctness and feasibility of the prognosis model are verified. Our model was helpful to predict the prognosis of the LUAD patients.

Discussion
The risk model we constructed shows that there is a significant difference in prognosis between high-risk and low-risk groups. In order to further study the potential causes of poor survival outcomes in high-risk patients, we compared the immune cell infiltration, immune checkpoint gene expression and TMB in high-risk and lowrisk patients, and found that the degree of tumor immune cell infiltration, the difference in immune checkpoint gene expression, and tumor mutation load may be the potential mechanisms that affect the prognosis of patients. Tumor associated macrophages (TAMs) are important components of the tumor microenvironment (TME) 12 and are potential targets for tumor immunotherapy 13 . We found that M0 and M1 macrophages were heavily A close relationship has been reported between the degree of macrophage infiltration and poor prognosis of patients 15 , and with accelerated angiogenesis, tumor cell invasion, infiltration, and distant metastasis 16 . Macrophages can be polarized into a tumor-promoting phenotype during lung tumor progression 17 . The progression of most tumors from benign to malignant is accompanied by a significant increase in vascular density, a process known as "angiogenesis transition" 18 . Macrophages play an important role in this complex vascular remodeling 19,20 . Macrophages can produce vascular endothelial growth factor (VEGF) in human and mouse breast tumors 19,20 . When macrophages are exposed to interleukin-4 (IL-4), they express VEGF and epidermal growth factor (EGF), thus accelerating tumor angiogenesis and breast cancer metastasis 21 , leading to poor prognoses. The activation of PD-1 and its ligand programmed cell death ligand-1 (PD-L1 or CD274) axis mediates T-cell dysfunction and failure 22 , causing tumor cells to escape immune surveillance, thus promoting tumor cell proliferation 23 . Our study showed that PDL-1 was highly expressed in the high-risk group. A previous study demonstrated that the high expression of (PD-L1) was closely related to prognosis in patients with Non-small-cell lung cancer(NSCLC) 24 , Similar conclusions were also reported for liver cancer 25 . The high expression of PD-L1 can also enhance immune checkpoint blockade (ICB) in the treatment of NSCLC 26 , urothelial carcinoma 27 .
Studies have shown that TMB can predict the efficacy of PD-1 combined with CTLA-4 blockade in patients with NSCLC 28 . In our study, the high-risk group had higher TMB. TMB was also shown to be positively correlated with response to ICB in 27 cancers 29 , and is gradually emerging as a potential marker for the same. Patients with high TMB in NSCLC are more likely to benefit from ICB therapy 30 . In our study, TP53 mutations were significantly more frequent in the high-risk group, and are generally associated with poor prognoses 31 , Meanwhile, patients with TP53 mutations also reportedly respond better to ICB therapy 32 . This supports our results in that the higher the expression of PDL-1, tumor mutation load, and frequency of TP53 mutation, the greater is the sensitivity of the high-risk group to immune checkpoint inhibitors. Moreover, these results may also partly explain the underlying mechanism of poor prognosis in high-risk groups.
Among the six genes (PLK1, HMMR, ANLN, SLC2A1, SFTPB, and CYP4B1) in the prognosis model, four genes (PLK1, HMMR, ANLN, SLC2A1) were risk factors and two genes (CYP4B1 and SFTPB) were protective factors. PLK1 (polo-like kinase) is a member of a new serine/threonine protein kinase family 33 , and has been shown to be highly expressed in human cancers. Its overexpression is related to poor prognoses in cancers such as neuroblastoma 34 , rectal cancer 35 , and epithelial ovarian cancer 36 . Research showed that inhibition of PLK1 can up regulate the expression of PD-L1. The combination of PD-L1 blocker and PLK1 inhibitor can produce synergistic effect in mice, significantly reduce the tumor burden and prolong the survival period of mice 37 .The proliferation of tumor cells can be inhibited by inhibiting the expression of PLK1, which may thus be a potential target for cancer therapy 38 . Hyaluronic acid mediated motor receptor (HMMR) is an extracellular matrix component that is closely related to cell proliferation 39 . It is associated with poor prognoses and is overexpressed in various cancers such as pancreatic cancer 40 , bladder cancer 41 , and glioblastoma 42 ,among others. HMMR was associated with the reduction of the overall survival of lung cancer patients. In addition, it can pass HCG18/miR-34a-5p/HMMR axis that accelerate the progression of lung adenocarcinoma 43 . ANLN is an actin binding protein that is associated with poor prognosis and is highly expressed in many malignant tumors such as pancreatic cancer 44 ,LUAD 45 , and nasopharyngeal carcinoma 46 ,among others. ANLN played a key role in human lung cancer by participating in www.nature.com/scientificreports/ phosphoinositide 3-kinase/AKT pathway. Selective inhibition of ANLN may be a new strategy for the treatment of lung cancer 47 .Solute carrier family 2 member 1 (SLC2A1), also known as glucose transporter 1 (GLUT1), is a glucose transporter coding gene related to the growth and proliferation of tumor cells 48 . Its overexpression is s imilarly related to poor prognosis in cancers such as colorectal cancer 49 ,breast cancer 50 ,and pancreatic cancer 51 , among others. It has a particularly essential role in the occurrence and progression of tumors, and may be one of the driver genes of lung cancer 52 . Surfactant protein B (SFTPB), secreted by type II alveolar epithelial cells, is the main component of pulmonary surfactant 53 , and its precursor form can predict the risk of lung cancer 54 . CYP4B1 is a cytochrome P450 monooxygenase. The loss of CYP4B1 gene expression is related to bladder urothelial carcinoma 55 , and its low expression is related to the poor prognosis of LUAD patients. Therefore, it can be used as an independent prognostic marker and a potential therapeutic target for patients with LUAD 56 . All in all, this study used WGCNA to identify the module genes related to immunotherapy, and screened out the genes related to prognosis through differential analysis and univariate Cox regression. Through consensus classification, patients were divided into three clusters. Subsequently, 125 DEGs were identified after the intersection of the three clusters. Six key genes were determined to construct a prognosis model through univariate Cox regression analysis and LASSO analysis. Patients were divided into high-risk and low-risk groups. Through analysis and comparison, patients in high-risk and low-risk groups had significant differences in prognosis, tumor immune microenvironment, tumor mutation burden, immunotherapy and immune checkpoints. Finally, the validity of the prediction model was successfully verified in the dataset of four external queues (GSE30219, GSE31210, GSE50081, GSE72094).These findings may provide new ideas for the treatment of lung cancer. However, this study still has some limitations. Our research was only based on the public database, which requires a larger sample size and further experiments to verify the predictive ability of the prognosis model. In addition, the role of key genes in the model also needs to be verified by a large number of experiments.

Conclusion
In conclusion, Our study has constructed a prediction model based on 6 genes, which divided LUAD patients into high-risk and low-risk groups. The IMscore played an important role in predicting clinical prognosis and sensitivity to anti-tumor drug treatment, which may help us to provide new strategies for personalized treatment of LUAD patients.

Data availability
All data were publicly available from TCGA (https:// portal. gdc. cancer. gov/) and GEO (https:// www. ncbi. nlm. nih. gov/ geo/) datasets. These data are available from the corresponding author upon reasonable request. www.nature.com/scientificreports/ Correspondence and requests for materials should be addressed to Y.Z. or J.S.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.