Identification of a methylomics-associated nomogram for predicting overall survival of stage I–II lung adenocarcinoma

The aim of this paper was to identify DNA methylation based biomarkers for predicting overall survival (OS) of stage I–II lung adenocarcinoma (LUAD) patients. Methylation profile data of patients with stage I–II LUAD from The Cancer Genome Atlas (TCGA) database was used to determine methylation sites-based hallmark for stage I–II LUAD patients’ OS. The patients were separated into training and validation datasets by using median risk score as cutoff. Univariate Cox, least absolute shrinkage and selection operator (LASSO) and multivariate Cox analyses were employed to develop a DNA methylation signature for OS of patients with stage I–II LUAD. As a result, an 11-DNA methylation signature was determined to be critically associated with the OS of patients with stage I–II LUAD. Analysis of receiver operating characteristics (ROC) suggested a high prognostic effectiveness of the 11-DNA methylation signature in patients with stage I–II LUAD (AUC at 1, 3, 5 years in training set were (0.849, 0.879, 0.831, respectively), validation set (0.742, 0.807, 0.904, respectively), entire TCGA dataset (0.747, 0.818, 0.870, respectively). Kaplan–Meier survival analyses exhibited that survival was significantly longer in the low-risk cohort compared to the high-risk cohort in the training dataset (P = 7e − 07), in the validation dataset (P = 1e − 08), and in the all-cohort dataset (P = 6e − 14). In addition, a nomogram was developed based on molecular factor (methylation risk score) as well as clinical factors (age and cancer status) (AUC at 1, 3, 5 years entire TCGA dataset were 0.770, 0.849, 0.979, respectively). The result verified that our methylomics-associated nomogram had a strong robustness for predicting stage I–II LUAD patients’ OS. Furthermore, the nomogram combined clinical and molecular factors to determine an individualized probability of recurrence for patients with stage I–II LUAD, which stood for a major advance in the field of personalized medicine for pulmonary oncology. Collectively, we successfully identified a DNA methylation biomarker and a DNA methylation-based nomogram to predict the OS of patients with stage I–II LUAD.

www.nature.com/scientificreports/ NSCLC 6 . Liu et al. revealed lncRNA SLC16A1-AS1 as a novel prognostic hallmark in NSCLC 7 . Zhang et al. identified six metabolic genes as potential biomarkers for LUAD 8 . Meanwhile, emerging studies have demonstrated that epigenetics plays a critical role in the initiation, progression, therapeutic response, and result of human tumors 9,10 . DNA methylation serves as a significant epigenetic modification of cancer cells, which may be the main mechanisms of the inactivation of cancer-associated suppressor genes in the process of tumorigenesis 11 . Various studies suggested that DNA methylation was closely related to the development, invasion, and metastasis of carcinoma, and may act as a hallmark of prognosis 12,13 . For example, Guo et al. revealed a five-DNA methylation signature as an effective prognostic hallmark in patients with ovarian serous cystadenocarcinoma 14 . Li et al. suggested that a four-DNA methylation signature served as a robust prognostic biomarker for survival of patients with gastric cancer 15 . In addition, a previous study reported that methylation was more likely to occur for DNA in tumor cells than that in normal cells 16 . Nielsen et al. indicated that aberrant DNA methylation was a relatively stable as well as potentially reversible therapeutic signal 17 . Therefore, DNA methylation has great potential as biomarker of prognosis for cancer patients. However, few studies have investigated the value of DNA methylation for prognostic prediction in patients with stage I-II LUAD. The identification of novel prognostic DNA methylation hallmark for stage I-II LUAD patients was highly desired. We analyzed the genome-wide methylation map of stage I-II LUAD patients in TCGA database to investigate a DNA methylation-based predicator and a DNA methylation-associated nomogram. The Kaplan-Meier survival curve and ROC analyses were employed to test the robustness of the 11-DNA methylation signature and the results suggested a good value of our nomogram.

Results
Clinical characteristics of the study populations. A total of 393 stage I-II LUAD samples were enrolled in the present study. The demographics and clinical features of the included patents are summarized in Table 1. Workflow of model generation and subject enrolment was summarized in Fig. 1.

Methylation markers associated with OS of stage I-II LUAD patients.
We identified 2332 differentially expressed methylation sites between dead and alive cohorts that were adopted for univariate Cox proportional hazard regression analysis. Consequently, 84 DNA methylation sites were revealed to be closely involved in stage I-II LUAD patients' OS (P < 0.05). Following this, the identified 84 DNA methylation sites were used for LASSO Cox regression model (Supplementary Table S1) and 25 methylation sites were determined as the candidate methylation sites (P < 0.01) ( Fig. 2A,B) that were then used for multivariate Cox proportional hazard regression. Of these, 11 methylation sites were screened as optimal model sites for the prognostic evaluation of patients with stage I-II LUAD based on the multivariate Cox regression analysis. The risk scoring formula was calculated as follows: Risk score = 1.31688*cg00237391 + 8.85449*cg04529955 − 3.3827*cg06393879 − 3.34536*c g11539066 + 2.62432*cg12133048 − 6.18862*cg13600632 + 1.50514*cg13643814 + 2.3872*cg17186803 − 3.74192 *cg20546263 + 2.57421*cg24311704 + 1.19749*cg27468419. The scores showed strong associations of poor prognosis with hypermethylation of cg00237391, cg04529955, cg12133048, cg13643814, cg17186803, cg24311704, cg27468419 and the hypomethylation of cg06393879, cg11539066, cg13600632 and cg20546263 sites (Fig. 3).
Association between the 11-DNA methylation signature and stage I-II LUAD patients' OS. The prognostic value of the selected 11-DNA methylation hallmark was assessed on the basis of Kaplan-Meier survival analyses in the training, validation and the all-cohort datasets. The median risk scores were employed as cutoff value to divide the patients into low and high risk groups. As exhibited in Fig. 4A, OS was significantly longer in the low-risk cohort in comprison to the high-risk cohort in the training dataset (P = 7e − 07). A similar result was found in the validation dataset (P = 1e − 08) (Fig. 4C) and in the all-cohort dataset (P = 6e − 14) (Fig. 4E). The results implied a great robustness of the predicator as prognostic indicator in patients with stage I-II LUAD.
Evaluation of the predictive ability of the 11-DNA methylation biomarker by ROC analysis. We performed ROC analysis to further evaluate the value of the 11-DNA methylation biomarker. The AUC of the 11-DNA methylation hallmark at 1, 3, 5 years in training set were 0.849, 0.879, 0.831, respectively (Fig. 4B). A good value was also found in validation set (0.742, 0.807, 0.904, respectively) ( Fig. 4D) and the whole validation set (0.747, 0.818, 0.870, respectively) ( Fig. 4F), the results showed that the 11-DNA methylation biomarker had a significant capacity for predicting OS of stage I-II LUAD patients.
Then, the patients enrolled in this study were ranked based on their risk scores (Fig. 5A), and the dotplot was drawn in based on their recurrence status (Fig. 5B). The outcomes showed that the low-risk cohort had a longer OS than that in the high-risk cohort. Heatmap of 11 methylation sites distribution on the basis of risk score was observed in Fig. 5C Implementation of the 11-DNA methylation biomarker-related biological pathways. Singlesample Gene Sets Enrichment Analysis (ssGSEA) was carried out in accordance to TCGA LUAD mRNA dataset by employing GSVA package 18 for exploring the 11-DNA methylation signature-based signaling pathways. The median score was employed as the cutoff to divide the patients into low and high risk groups. Top 20 pathways that were more activated in the high-risk samples than that in low-risk samples were exhibited in Fig. 6A Table S2). The exact Pearson correlations between enriched pathways and risk score was presented in Fig. 6B.

Nomogram development.
To assess whether the 11-DNA methylation biomarker was an independent predictor for OS of stage I-II LUAD patients, we carried out univariate and multivariate Cox model via risk score and a few clinicopathological factors. Hazard ratios (HRs) showed that the 11-DNA methylation hallmark was significantly involved in the OS of stage I-II LUAD patients (P < 0.01, HR 2.60, 95% CI 1.91-3.53) ( Table 2), which implied that the 11-DNA methylation hallmark was an independent prognosis classifier. To further pre-  Comparison of our nomogram with several known prognostic hallmarks. We compared the nomogram of this study with other known prognostic biomarkers to confirm whether the nomogram hallmark had the advantage of stable and great performance. The biomarkers for predicting early stage LUAD patients' OS were employed for the comparison [19][20][21][22][23][24][25] . As indicated in Fig. 9, the result manifested that the nomogram outperformed a few known prognostic hallmarks. The AUC of the nomogram at 5 years was 0.979, which was obviously larger than that in other biomarkers, suggestive of a greater value for the nomogram in comparison to other signatures in predicting LUAD patients' prognosis.

Discussion
Despite advances in diagnosis and therapy of NSCLC over the past few decades, the 5-year survival rate is still poor 3 . Thus, it is urgently needed to identify potential hallmarks correlated with the prognosis of NSCLC and develop optimum targeted therapy. With the deepening study of epigenetics, increasing studies have suggested that DNA methylation is crucial to gene regulation and acts as early events of a few tumors. DNA methylation serves as one of the earliest detectable neoplastic alterations, which give it a significant superiority as carcinoma diagnosis and prognosis hallmarks [26][27][28] 30 .
In the present study, TCGA database were applied to analyze the methylation of stage I-II LUAD. We identified a signature which contained 11 methylation sites (cg00237391 (DEF6: 1stExon), cg04529955 (SLC10A7: Body), cg06393879 (MYT1L: Body), cg11539066 (MIR596: TSS1500), cg12133048, cg13600632 (FAM125B: Body), cg13643814 (CHRNA7: TSS1500), cg17186803 (CN4B: Body), cg20546263 (C3orf33: TSS1500),   38 . The result showed that 8 of these 10 genes were involved in cancer. We speculated the above 10 genes may be related to OS of LUAD patients. In addition, the result of Fig. 3 suggested that hypermethylation of cg00237391, cg04529955, cg12133048, cg13643814, cg17186803, cg24311704, cg27468419 sites was involved in poor OS and the hypomethylation of cg06393879, cg11539066, cg13600632 and cg20546263 sites was associated with a shorter OS. The prognosis predictive ability of a single methylation site was limited while the combination of multiple methylation sites had a stronger robustness for predicting OS of LUAD patients, which was further verified by Fig. 4B,D,F. To better explore whether different clinical characteristics have different effects on the 11-DNA methylation signature, we divided the patients based on site, age, stage, gender and smoking history and used the subgroup analysis to detect whether there were differences between ROC curves of different clinical characteristics cohorts. An 11-DNA methylation signature was determined in the present study. The result indicated that the biomarker had a high diagnostic ability for prognosis of stage I-II LUAD patients. We hope that it can assist monitor stage I-II LUAD patients' OS and also offer some assistance for the in-depth study of stage I-II LUAD. In addition, a few significant superiorities were required to be elaborated in the study. We developed a nomogram based on risk score, age and cancer status to predict 3-and 5-year stage I-II LUAD patients' OS. The result suggested that AUC at 1, 3, 5 years entire TCGA dataset were 0.770, 0.849, 0.979, respectively, indicating good value of the 11-DNA methylation biomarker in the clinical application, which made our study more valuable. On the other hand, we performed LASSO Cox regression analysis to explore the candidate methylation sites significantly involved in stage I-II LUAD patients' OS, which can filter the variables between univariate and multivariate Cox analysis. www.nature.com/scientificreports/ That is to say, the application of LASSO Cox regression model can elevate the predictive value of the 11-DNA methylation biomarker. Following that, a comparison of the nomogram with other known prognostic predicators exhibited that our nomogram had apparently higher value in the result prediction of LUAD. In addition, the AUC of the nomogram was larger than the 11-DNA methylation biomarker in this study, demonstrating that the combination of the risk score with clinical variables was more significant in comparison to the methylation-correlated signature alone in the prediction of LUAD patients' prognosis. In addition, based on our nomogram, clinicians will be able combine clinical and molecular factors to determine an individualized probability of recurrence for patients with stage I-II LUAD, which represents a major advance in the field of personalized medicine for pulmonary oncology, suggesting that this tool could help clinicians overcome one of the most challenging limitations we face when treating patients with stage I-II LUAD.
Whereas, there were a few limitations that needed to be emphasized in this study. Firstly, an independent external validation set was required to improve the predictive accuracy of DNA methylation biomarker in our study. Secondly, it takes a relatively long time for this predictive marker to be used clinically. Thirdly, our study was a retrospective study from TCGA database, which might produce some amount of selection bias. In addition, the samples of our study contained a mixture of methylomes in a tumor-lymphocytes/subsets, macrophage/ subsets, fibroblasts, as well as tumor cells, which may make reproducibility challenge. Furthermore, Illumina's 450 K Platform chip measured 450,000 methylation sites, which was less than 1% of all methylation sites, and this may yield some bias in our study.

Conclusions
In this study, we successfully developed a new DNA methylation prognostic predicator. In addition, a nomogram was developed based on methylation risk score, age and cancer status by a comprehensive analysis of bioinformatics methods, which could be used for the prediction of stage I-II LUAD patients' OS. Figure 3. Boxplots of methylation β values against risk group in the entire TCGA dataset. "High Risk" and "Low Risk" represent the high-risk and low-risk samples, respectively. The median risk score was applied as a cutoff. Vertical coordinates represent the β-value of 11-DNA methylation sites respectively. Mann-Whitney U test was employed to assess the diferences between the high-risk score and low-risk score groups.

DNA methylation data of stage I-II LUAD patients. DNA methylation data of patients with stage I-II
LUAD from TCGA database that was processed using the Illumina HumanMethylation450 BeadChip (Illumina Inc., CA, USA) and the associated clinical information was retrieved from TCGA database by R TCGAbiolinks package 39 , notably, Illumina's 450 K Platform chip analyzed 450,000 methylation sites, which was less than 1% of all methylation sites, and this may yield some bias in our study. The DNA methylation were evaluated via β-values and calculated as the proportion of M and M + U + 100, in which M represented the signal from methylated beads, and U referred to the signal from unmethylated beads targeting CpG site. Data for a total of 393 stage I-II LUAD patients containing 485,577 DNA methylation sites were enrolled after removing patients for whom clinical survival information was not available. We analyzed the correlation between DNA methylation levels and the related OS of patients with stage I-II LUAD. We randomly divided the total stage I-II LUAD patients into two parts: training group (70%) and testing group (30%). The training group was used for model construction and the testing group for assessing the value of the model. LASSO is a critical regularization in many various analysis approaches. Here, LASSO regression model was used to screen core methylation sites involved in stage I-II LUAD patients' OS. LASSO COX regression analysis was conducted by using a publicly available R package 'glmnet' 40 for 1000 iterations.
Data processing, normalization, and determination of differentially expressed methylation sites. The preprocessing of the raw data was performed to analyze the prognosis prediction classifier. The methylation site whose beta value was not available (NA) in any sample was deleted. Following this, we performed the normalization of the data by 'betaqn' function of wateRmelon package 41 . Moreover, the whole samples were divided into dead and alive groups based on recurrent status. The normalized beta was transformed to M value with the formulation: M = log(β/(1 − β)). M value was employed for the elimination of the bias caused by different probes. Finally, M value was exploited for the determination of the differentially expressed methylation sites between recurrent cluster and no recurrent cluster with 'dmpFinder' function of minfi package 42 . conducted to screen the methylation sites associated with the OS of stage I-II LUAD patients. Then, the LASSO Cox regression analysis was performed via the screened methylation sites for the selection of the candidate methylation sites correlated with OS of stage I-II LUAD patients. Following this, the candidate methylation sites were analyzed by multivariate Cox regression analysis to determine the independent prognostic hallmark for OS of stage I-II LUAD patients. Eventually, an 11-DNA methylation signature was determined for the prediction of OS stage I-II LUAD patients. Risk score models were developed on the basis of the 11-DNA methylation signature to calculate the risk score of each sample. The median risk score was set as the cutoff point. Patients with stage I-II LUAD were categorized into "high-risk" or "low-risk" cohorts based on a high and low score, respectively. Log-rank testing of the Kaplan-Meier curve was performed via the "survival" package 43 to assess the difference in OS of the two cohorts. ROC analysis was performed via the 'survivalROC' package 44 and the area under the ROC curve (AUC) was employed to evaluate the predictive performance of the hallmark. The greater the AUC value of a hallmark, the better the predictive capacity of the marker. Unless otherwise noted, all curves were plotted using R (version 4.0.2$2020).

Gene set variation analysis (GSVA).
To reveal the established biomarker-associated signaling pathways, we used a GSVA package 18 to evaluate DNA methylation risk score and enriched pathways activity conditions. The median risk score was set as the cutoff point to divide the patients into "high-risk" or "low-risk" cohorts. The exact pearson was drawn to analyze the correlations between enriched pathways and risk score. Significance was set as P < 0.05.

Construction of the nomogram.
To improve the predictive value of established predicator for stage I-II LUAD patients' OS, a nomogram was built via the 'rms' R package 45 . The univariate and multivariate Cox proportional hazard analysis were conducted on the basis of the methylation risk score and other clinical factors. Cox proportional hazard models was adopted to measure hazard ratios (HR) and corresponding 95% confidence interval (CI). We used the factors (P ≤ 0.05) from the multivariate Cox proportional hazard analysis to construct the nomogram. The nomogram was assessed based on C-index, ROC, DCA. The calibrate curve described the outcome of the nomogram, and the 45° line suggested the best prediction.  www.nature.com/scientificreports/  Figure 7. In order to quantify the risk assessment and survival probability for individual stage I-II LUAD patients, a nomogram was developed in the entire TCGA dataset according to the 11 DNA methylation signature-based risk score, age and cancer status. www.nature.com/scientificreports/   www.nature.com/scientificreports/ Ethics approval and consent to participate. Data obtained from the TCGA open-access database was collected from tumors of patients who provided informed consent based on the guidelines from the TCGA Ethics, Law and Policy Group.