Prognostic and therapeutic prediction by screening signature combinations from transcriptome–methylome interactions in oral squamous cell carcinoma

DNA methylation pattern in oral squamous cell carcinoma (OSCC) remains poorly described. This study aimed to perform a genome-wide integrated analysis of the transcriptome and methylome and assess the efficacy of their prognostic signature model in patients with OSCC. We analyzed transcriptome and methylome data from 391 OSCC samples and 41 adjacent normal samples. A total of 8074 differentially expressed genes (DEGs) and 10,084 differentially expressed CpGs (DMCpGs) were identified. Then 241 DEGs with DMCpGs were identified. According to the prognostic analysis, the prognostic signature of methylation-related differentially expressed genes (mrDEGPS) was established. mrDEGPS consisted of seven prognostic methylation-related genes, including ESRRG, CCNA1, SLC20A1, COL6A6, FCGBP, CDKN2A, and ZNF43. mrDEGPS was a significant stratification factor of survival (P < 0.00001) irrespective of the clinical stage. The immune effector components, including B cells, CD4+ T cells, and CD8+ T cells, were decreased in the tumor environment of patients with high mrDEGPS. Immune checkpoint expressions, including CTLA-4, PD-1, LAG3, LGALS9, HAVCR2, and TIGHT, were comprehensively elevated (P < 0.001). The estimated half-maximal inhibitory concentration difference between low- and high-risk patients was inconsistent among chemotherapeutic drugs. In conclusion, the transcriptome–methylome interaction pattern in OSCC is complex. mrDEGPS can predict patient survival and responses to immunotherapy and chemotherapy and facilitate clinical decision-making in patients with OSCC.


Patterns of DMCpGs and differential gene expression (DEG). A flowchart of the study is shown in
. A total of 8074 DEGs and 10,084 DMCpGs were identified. Clustering based on the DEGs or DMCpGs revealed two distinctive sample clusters, indicating the possibility to distinguish between OSCC samples and adjacent tissue samples ( Fig. 2A,B). Of the 8074 DEGs, 4327 were upregulated and 3747 were downregulated (Fig. 2C). Of the 10,084 DMCpGs, 5937 were hypermethylated DMCpGs (HyperCpGs) and 4147 were hypomethylated DMCpGs (HypoCpGs). Both HyperCpGs and HypoCpGs were correlated with the upregulation and downregulation of genes (Fig. 2D). These results indicate that the regulation of gene expression by methylation may be complex and multimodal.
Relationship between distribution pattern of methylation alterations and gene expression. DMCpGs across different genomic regions were not randomly distributed across the genome. Most DMCpGs are found in the gene body and intergenic regions (IGR). HyperCpGs had a higher proportion located 200 bp upstream of transcriptional start sites (TSS200), 5′-untranslated regions (UTR), and 1st exon, whereas HypoCpGs were located in the gene body and intergenic regions (Fig. 3A). The results indicated that there was a high density of methylated CpGs in several kb regions upstream and downstream of the transcriptional start sites (TSS). The distribution pattern around the CGI differed significantly between HyperCpGs and HypoCpGs. The distribution of HyperCpGs was significantly enriched within CGI (62.9%), whereas HypoCpGs were mostly enriched in the open sea regions (72.6%) (Fig. 3B), indicating that CpG methylation in CGI and non-CGI was potentially functional in gene expression. To further integrate the distribution pattern around TSS, CGI, and gene expression, we plotted the CGI and non-CGI methylation levels of every gene within four expression quartiles grouped by distance to the TSS (Figs. 3C and S1). Then, to check the potential bias owning to unpaired samples, we divided the samples into paired normal samples (pNormal), paired Tumor samples (pTumor) and unpaired single Tumor samples (sTumor) and found that the methylation pattern of pTumor was similar to that of sTumor instead of pNormal, showing the robustness of results (Fig. S2). HypoCpGs proximal to the TSS (approximately ± 1 kb) were observed in both OSCC and adjacent samples in the highly expressed CGI genes. CGI CpGs with low gene expression levels exhibited a higher number of HyperCpGs around TSSs in OSCC samples than in adjacent tissues, indicating that hypermethylation-induced silencing of tumor suppressor genes was more evident in transcriptionally silent genes with CGI. Moreover, HypoCpGs were observed in non-CGI CpGs with low gene expression and in those with high gene expression away from the TSS, suggesting a potential role of non-CGI HypoCpGs in the regulation of oncogenes and tumor-suppressor genes.
Correlation between methylation levels and gene expressions. We recognized that HyperCpGs and HypoCpGs might function differently in genes with CGI and non-CGI CpGs in OSCC samples. To further understand the correlations between CpG alterations and gene expression, we defined CpGs as positively-correlated CpGs (PosCpGs), wherein HyperCpGs and HypoCpGs were positively correlated with upregulation and downregulation of gene expression, respectively, and vice versa, as negatively-correlated CpGs (NegCpGs). We plotted the correlation between methylation and gene expression levels grouped by the genomic distribution of methylated CpGs. NegCpGs were highly concentrated around CGIs and within the promoter, while PosCpGs were highly distributed within the gene body, and the majority of CpGs were open sea (Fig. 4A). The correlation modes were gradient from CGI to the open sea. The results indicated that the correlation between methylation and gene expression levels was diverse but organized across the genomic distribution. However, not all correlations between methylation and gene expression levels were statistically significant. We further divided the www.nature.com/scientificreports/ correlation into significantly positive, significantly negative, and non-significant. Significant associations were observed more frequently in the DMCpGs than in the non-DMCpGs. CGI CpGs located in the promoter tended to have significantly negative associations with gene expression, whereas CpGs away from CGI and located in the gene body had significantly positive associations (Fig. 4B). www.nature.com/scientificreports/ obtained 522 and 384 DMCpGs that exhibited significant positive and negative correlations with gene expression, respectively (Fig. 5A). We then intersected DEGs with genes containing selected PosCpGs or NegCpGs and found 121 DEGs with positively related DMCpGs, 96 DEGs with negatively related DMCpGs, and 24 DEGs with both positively and negatively related DMCpGs (Fig. 5B). The intersection results further indicate the complexity of methylation regulation. It was difficult to predict the final gene expression, particularly when multiple methylation sites were altered. Here, we defined DEGs with significantly correlated DMCpGs as methylation related DEGs (mrDEGs). To determine the prognostic value of the 241 mrDEGs in OSCC, a multivariate Cox regression analysis was performed with a cutoff of P < 0.01. We found that three mrDEGs (ESRRG , CCNA1, and SLC20A1) were associated with a poor prognosis and significantly increased hazard ratio (HR), whereas five mrDEGs (COL6A6, FCGBP, CDKN2A, MEI1, and ZNF43) served as protective genes with HR < 1 (Fig. 5C). However, MEI1 did not reach a significant level in the Least Absolute Shrinkage and Selection Operator (LASSO) and was thus abandoned.

Screening of prognostic
Establishment and validation of the prognostic signature. We then built an mrDEGs predictive signature (mrDEGPS) using seven survival-relevant mrDEGs. The mrDEGPS for each patient was calculated using the following formula: www.nature.com/scientificreports/ We divided the patients with OSCC into low-and high-risk groups (Fig. 5D). The expressions of ESRRG , CCNA1, and SLC20A1 were comparatively higher in the high-risk group than in the other groups (Fig. 5E). Prognosis comparison showed that low-risk patients had significantly higher overall survival (OS) (P = 4.587e−10), disease-specific survival (DSS) (P = 1.28e−07), and progression-free survival (PFS) (P = 9.588e−6) than that in high-risk patients (Fig. 5F). The validation analysis in a small sample size (n = 97) demonstrated that patients in www.nature.com/scientificreports/ the high-risk group had poorer overall survival (P = 3.098e−2) than those patients in the low-risk group (Fig. 5G).
The area under the curve (AUC) of 1-year survival was 0.714 and 0.702 in the training and validation datasets, respectively. Compared with the clinical stage, mrDEGPS displayed superior predictive performance.

Association between mrDEGPS with clinical features and human papillomavirus (HPV). Next,
we explored the association between the mrDEGPS and clinical features (Fig. 5H). We found that the mrDEGPS did not differ between patients with different age groups, gender, alcohol consumption, and surgical margin status (P > 0.05). Notably, there was no difference in mrDEGPS in the different clinical stages and histopathological grades (P > 0.05), implying the absence of linear correlation between mrDEGPS and these traditional survival stratification factors; thus, mrDEGPS was a potentially independent prognostic factor. We also found that the mrDEGPS was significantly associated with lymphovascular and perineural invasion. As both lymphovascular and perineural invasion were independent prognostic factors 24 , this result was consistent with mrDEGPS as a www.nature.com/scientificreports/ prognostic factor. Moreover, the human papillomavirus (HPV) infection was associated with higher mrDEGPS (P < 0.001), indicating that HPV might partially promote epigenetic alterations. We further included patients with HPV status not available (NA) in the analysis and found no differences in mrDEGPS between HPV (NA), HPV ( +), or HPV (−) (Fig. 5I), which indicated that the mixed population neutralized the difference between HPV ( +) and HPV (−).
Association between mrDEGPS and TIME. Since the selected mrDEGs were relevant to the immune response, we hypothesized that mrDEGPS might have the capacity to identify alterations in TIME. The results showed that the high-risk group samples had reduced proportions of exhausted T cells, type 1 regulatory T cells, follicular T-helper cells, dendritic cells, B cells, CD4 + T cells, and CD8 + T cells, and increased proportions of naive CD4 + T cells, naïve CD8 + T cells, induced regulatory T cells, T-helper 2 cells, effector memory T cells, natural killer T cells, natural killer cells, and neutrophils. (Figs. 6A and S3). It has been reported that DCs, CD8 + T cells, and CD4 + T cells display a beneficial effect on survival, while neutrophils, NK cells, and Tem cells display harmful effects in breast cancer 25 . Our results implied that high-risk patients might have an altered survivalharmful TIME. To further characterize the potential signaling pathways involved in the influence of mrDEGPS, gene set enrichment analysis (GSEA) was performed to enrich the Kyoto Encyclopedia of Genes and Genomes 26 pathways ranked by gene correlation values with mrDEGPS. Half of the enriched pathways were associated with immune processes like "primary immunodeficiency, " "allograft rejection, " "autoimmune thyroid disease, " "intestinal immune network for IgA production, " "T cell receptor signaling pathway, " "antigen processing and presentation, " "natural killer cell-mediated cytotoxicity, " and "B cell receptor signaling pathway" (adi. P < 0.01) (Fig. 6B).

Prediction of immunotherapy and chemotherapy by mrDEGPS.
T-cell receptor signaling is the essential basis for immunotherapy and may participate in chemotherapy resistance 27,28 . Current immunotherapy is mainly achieved by antibody blocking of CTLA-4 or PD-1 pathway 29 . Studies indicate that LAG3, LGALS9, HAVCR2, and TIGHT could be the next-generation immunotherapy checkpoints [30][31][32][33] . We found that all these immunotherapy checkpoints were downregulated in high-risk patients (P < 0.001) (Fig. 6C). However, there was no significant difference in PD-L1, indicating that mrDEGPS was correlated with TIME rather than with tumor cells (Fig. S4). The mrDEGPS was higher in patients with a response (P = 0.00005) and the response rate was significantly lower in high-risk patients (P = 0.0024) (Fig. 6D). These results indicate that mrDEGPS low-risk patients might benefit from immunotherapy. For chemotherapy, the estimated half-maximal inhibitory concentration difference between low-and high-risk patients was inconsistent among drugs, which was not significant for gefitinib, lower in low-risk patients for rapamycin, and lower in high-risk patients for cisplatin, docetaxel, sorafenib, erlotinib, and gemcitabine (Figs. 6E and S5). We observed that the result might provide a reference for chemotherapeutic drug selection for individuals, and thus facilitated the survival of patients with OSCC. In summary, mrDEGPS screened from complex transcriptome-methylome interactions could facilitate the prediction of survival and immunotherapeutic efficacy in patients with OSCC (Fig. 7).

Discussion
Several studies have demonstrated a strong relationship between epigenetic and genetic aberrations in tumorigenesis 34 . It is commonly believed that epigenetic changes, such as DNA methylation, can drive abnormal gene expression of crucial genes involved in the development and progression of cancer, including head and neck cancer 35 . Hypermethylation of tumor suppressor genes and hypomethylation of proto-oncogenes at the promoter sites are associated with carcinogenesis and progression of OSCC 36,37 . For example, several studies have suggested that hypermethylation of PAX1 and ZNF582 genes is associated with aggressive progression and poor survival [38][39][40] .
Although the effect of promoter methylation changes has been studied extensively, increasing evidence from genome-wide methylome studies suggests that the methylation patterns are complex and cancer-type-specific [41][42][43] . The term "CpG island methylator phenotype" has been repeatedly used over decades to describe widespread CpG island promoter methylation 44,45 . However, only around 4-8% CGIs exhibit tissue-specific methylation, while approximately 70% of annotated gene promoters are associated with a specific CGI 46 . Therefore, there must be undisclosed cancer-type-specific methylation outside CGIs. Our results showed that CpGs away from islands, denoted as open sea CpGs, may facilitate gene expression of oncogenes in OSCC.
Despite CGI features, cumulative evidence indicates that the transcriptome-methylome interaction is not restricted to promoters and TSS 47 . In contrast to the repression of promoter methylation on expression, gene body methylation orchestrates transcription in a complex pattern, which is in contrast to the repression of promoter methylation. Approximately half of CGIs in mammalian genomes are not associated with a known gene promoter and are referred to as orphan CGIs 48 . Orphan CGIs that do not map to promoters of any protein-coding or non-coding transcripts but possess chromatin and transcriptional markers may reflect enhancer activity 49 . Orphan CGIs display most of the evolutionarily conserved methylation differences among tissues, indicating their possible role in tissue specification 50 . Our results indicated that CpGs located in the gene body might have a positive association with gene expression, suggesting that most tissue-specific methylation CGIs are not located at promoter regions in OSCC 46 . These findings may potentially facilitate research on aberrant methylation of these rarely investigated regions.
Another hypothesis regarding the non-promoter DMCpGs is that most genes have two or more TSSs; therefore downstream TSS are probably within the bodies of the transcriptional units of the alternative upstream promoters 51 . A large-scale full-length cDNA analysis to explore the budding yeast transcriptome showed that alternative promoters could be located at CGIs or non-CGIs, and combinations of an upstream non-CGI and a www.nature.com/scientificreports/ downstream CGI, or vice versa, may also occur. It is well known that alternative promoters contribute to contextspecific isoform expression and the regulation of isoform diversity 52 . However, it is difficult to interpret the link between expression and methylation in genes with multiple TSSs. Probes that are used to measure expression detect the output of all alternative promoters; however, only a single isoform might be dominant in OSCC cells.
Methylation of a downstream promoter blocks transcription from that promoter, which would allow the elongation of a transcript emanating from an upstream promoter, thereby leading to an apparent discordance between methylation and expression.
Currently, most methylome studies use the term "methylation-driven gene" 53,54 . However, we believe that the expression "driven" may be misleading. Considering the complex methylation patterns of alternative promoters, we conservatively referred to methylation-driven DEGs as mrDEGs. The mrDEGs were simultaneously hypermethylated and hypomethylated at different sites. There is a high possibility that CGI HyperCpGs function synergistically with open sea HypoCpGs. However, due to the limitation of the technique, it was challenging to deduce whether the gene expression difference was the counterbalance effect of antagonistic methylation alterations. Therefore, mrDEGs may be a suitable terminology for DEGs with differentiated methylation.
The mrDEGPS model was established using seven prognostic mrDEGs (ESRRG , CCNA1, SLC20A1, COL6A6, FCGBP, CDKN2A, and ZNF43). Methylation of ESRRG and CCNA has been investigated in head and neck cancers and is associated with poor prognosis [55][56][57] . The prognostic significance of CDKN2A and ZNF43 promoter hypermethylation has also been confirmed in colorectal cancers 58,59 . To the best of our knowledge, polymorphisms of COL6A6 and SLC201A have been preliminarily investigated in other diseases, and their methylation, possibly related to the polymorphism, is unknown 60,61 .
In our study, we found that mrDEGPS could serve as an independent prognostic predictor. MrDEGs, including FCGBP, COL6A6, CDKN2A, and CCNA1 correlated with immune cell response, T-cell phenotype modulation, and treatment response to doxorubicin 62-65 , whereas ESRRG expression was negatively correlated with CD4 + T cell activation. The role of mrDEGPS in TIME, immunotherapy, and chemotherapy responses was further investigated. Activated immune cell clusters, including CD8 + effector cells, exhibited lower infiltration in highrisk OSCC samples, indicating that the mechanism of prognostic function of mrDEGPS might be attributed to TIME alterations.
In addition, HPV may significantly affect the methylation levels of prognostic mrDEGs. Differences in methylation in HPV-driven cases were revealed in a previous study of the OSCC epigenetic landscape 66 . For example, DNA methylation of IDO1 in head and neck squamous cell carcinomas (HNSC) correlates with HPV status and survival 67 . Methylation markers for diagnosis of OSCC were independent of HPV infection 68 . It could be speculated that HPV correlated with partially methylated gene expression, whereas mrDEGs were fitted in the prognosis model rather than a diagnosis model. However, it is paradoxical that low-risk samples have a higher HPV infection rate since HPV is a well-known poor prognostic factor. Further studies are required to elucidate the role of HPV in DNA methylation.
However, this study had some limitations. Firstly, the validation set was limited. However, considering the cancer-specific property of epigenetic modulation, it might not be reasonable to validate the model in other datasets (for example, a dataset of HNSC included samples other than OSCC). Second, the prediction reliability of the score model was not verified in our clinical observations owing to limited resources. In addition, the cohorts lacked partial clinicopathological data, and ethnic differences existed among the groups. However, we collected the best available matched data. These results may be further confirmed by genomic big data. As mrDEGPS is  www.nature.com/scientificreports/ a prognostic model irrespective of the clinical stage, an improved novel staging system could be explored by combining genomic and clinical risk factors. Third, this study selected adjacent normal tissue as control. The control may be affected by field cancerization, leading to phenotypes different from true healthy tissue. However, we could not change the TCGA and other dataset as authors of secondary analysis. Lastly, immunotherapy and chemotherapy were more frequently prescribed for patients with late-stage OSCC. Further validation should be performed using data from patients with stage III/IV disease. Big data has evolved as the ubiquitous watchword of medical innovation 69 . We believe that the prevalence of medical big data can mitigate these limitations.
In conclusion, the transcriptome-methylome interaction pattern in OSCC is complex. Moreover, mrDE-GPS could predict patient survival and responses to immunotherapy and chemotherapy and facilitate clinical decision-making in patients with OSCC. Analysis of DNA methylation pattern in OSCC and its regulation on mRNA transcription. Gene expression in TCGA-OSCC cohort was correlated with the normalized beta value of HM450K probes using Spearman's rank correlation followed by false discovery rate (FDR) correction using the FDR method. Methylation-gene relations with |correlation value|> 0.3 and FDR < 0.05 were considered significant, and kernel density estimation was used to plot the distribution of significant probes around the TSS ± 5 kb. The methylation level of each probe was measured by the beta value, which ranged from 0 to 1 (representing unmethylated to fully methylated levels, respectively). Similarly, we removed probes with missing beta values in ≥ 5% samples. The remaining probes with missing beta values were imputed using the k-nearest neighbors algorithm.

Materials and methods
We mapped the probes to the promoter regions of the genes, which were defined as − 1.5 to 0 kb regions around the TSS. Next, the DNA methylation level of a gene was defined as the average beta value of the probes that mapped to its promoter region. Finally, samples with paired mRNA expression and DNA methylation profiles were analyzed, which involved 17,481 genes and 384 samples, to determine DMCpGs and DEGs. The Spearman rank correlation of CpG methylation levels with related gene expression was analyzed, and significant associations were based on the criteria of |correlation value|> 0.3 and FDR < 0.05.

Identification and validation of mrDEGPS.
Kaplan-Meier analysis was used to evaluate the relationship between the genes and survival of patients with OSCC. The LASSO binary logistic regression model and multivariate Cox regression were adopted after primary filtration. The linear combination of the regression coefficient β derived from the multivariate Cox regression model and multiplied by the corresponding mRNA levels generated a prognostic signature. The mrDEGPS for each patient was calculated using the following risk score formula: Exp(Gene n) · β n www.nature.com/scientificreports/ Patients were divided into high-and low-risk groups by setting the median risk score as the cut-off value. The OS, DSS, and PFS of the two groups were calculated using the Kaplan-Meier method with the log-rank test. Receiver operating characteristic (ROC) curves were generated to assess the predictive performance of the prognostic model. The expression patterns of genes in this prognostic model were visualized using the "pheatmap" package. In the validation analysis, we verified the Kaplan-Meier plot and ROC test in GEO using another OSCC cohort, GSE41613. The clinical features of the patients with low or high mrDEGPS were analyzed and compared.

Analysis of correlation between the survival risk model and tumor immune infiltration.
Immu-CellAI (http:// bioin fo. life. hust. edu. cn/ ImmuC ellAI# !/) was utilized to analyze the fraction of 24 types of immune cells in high-and low-risk samples with OSCC 70 . These immune cells included 18 subtypes of T cells, namely CD4 + T cells; CD8 + T cells; naive CD4 + T cells; naive CD8 + T cells; cytotoxic T cells; exhausted T cells; type 1 regulatory T cells; natural regulatory T cells; induced regulatory T cells; T-helper 1, 2, and 17 cells; follicular T-helper cells; central memory T cells; effector memory T cells; natural killer T cells; mucosal-associated invariant T cells; and gamma-delta T cells, as well as six other types of immune cells including B cells, natural killer cells, monocytes, macrophages, neutrophils, and dendritic cells. Immune Cell Abundance Identifier (ImmuCel-lAI) built an immune cell-based support vector machine model for the prediction of immunotherapy response (AUC: 0.80-0.91), and the model was used to estimate the response results of OSCC.
GSEA was performed using the R package "clusterprofiler" to determine the enrichment of previously defined biological processes in the ranked correlated gene with risk score using RNA-seq data from TCGA-OSCC cohort. The raw count data of gene expression from the TCGA-OSCC cohort were normalized using the variance stabilizing transformation function in R package "DESeq2, " and then submitted to ImmuCellAI (http:// bioin fo. life. hust. edu. cn/ web/ ImmuC ellAI/) to estimate the abundance of immune cells, particularly the T-cell proportions, and to predict the response of immune-checkpoint inhibitor treatments.  www.nature.com/scientificreports/