Gastrointestinal adenocarcinoma analysis identifies promoter methylation-based cancer subtypes and signatures

Gastric adenocarcinoma (GAC) and colon adenocarcinoma (CAC) are the most common gastrointestinal cancer subtypes, with a high incidence and mortality. Numerous studies have shown that its occurrence and progression are significantly related to abnormal DNA methylation, especially CpG island methylation. However, little is known about the application of DNA methylation in GAC and CAC. The methylation profiles were accessed from the Cancer Genome Atlas database to identify promoter methylation-based cancer subtypes and signatures for GAC and CAC. Six hypo-methylated clusters for GAC and six hyper-methylated clusters for CAC were separately generated with different OS profiles, tumor progression became worse as the methylation level decreased in GAC or increased in CAC, and hypomethylation in GAC and hypermethylation in CAC were negatively correlated with microsatellite instability. Additionally, the hypo- and hyper-methylated site-based signatures with high accuracy, high efficiency and strong independence can separately predict the OS of GAC and CAC patients. By integrating the methylation-based signatures with prognosis-related clinicopathologic characteristics, two clinicopathologic-epigenetic nomograms were cautiously established with strong predictive performance and high accuracy. Our research indicates that methylation mechanisms differ between GAC and CAC, and provides novel clinical biomarkers for the diagnosis and treatment of GAC and CAC.


Scientific Reports
| (2020) 10:21234 | https://doi.org/10.1038/s41598-020-78228-y www.nature.com/scientificreports/ recent years, mutations in transcriptional profiles have been used to classify cancers into different subtypes and explore novel molecular markers that are related to different biological characteristics and survival outcomes [3][4][5][6][7] . Therefore, molecular-based pathogenic mechanisms and diagnostic markers of gastrointestinal adenocarcinoma subtypes have received extensive attention. Epigenetics refers to heritable modifications manifested as changes in gene expression but not the DNA sequence, and these modifications play important roles in embryonic development, gene imprinting, cell differentiation and tumorigenesis 8 . The effects of epigenetic modifications in the tumor context include abnormal DNA methylation, histone modifications, activity of noncoding RNAs (such as microRNAs and lncRNAs), etc. 9 . Currently, abnormal DNA methylation, which is closely related to tumorigenesis and progression via the regulation of tumor suppressor genes, is the most thoroughly studied epigenetic modification 10 . More importantly, DNA methylation in gastrointestinal cancers has been extensively studied, especially CpG island methylation, which occurs in 56% of the protein-coding genes in the human genome 10,11 , and the diagnostic potential of certain CpG methylation sites for gastrointestinal cancer has been effectively assessed 12,13 . Moreover, several markers for overall survival (OS) of patients with gastrointestinal cancer have been developed according to transcription and methylation profiles, but these studies were based mostly on analysis of candidate genes and focused mainly on CpG island methylation and its relationship with gene expression [14][15][16][17][18] . However, the DNA methylation-based cancer subtypes and prognostic signatures for gastrointestinal adenocarcinoma have not been fully investigated.
In this study, the promoter methylation profiles were accessed from open public databases to classify gastric adenocarcinoma (GAC) and colon adenocarcinoma (CAC) subtypes. The prognostic risk scoring signatures based on promoter hypo-or hyper-methylated sites were constructed for GAC and CAC. Additionally, we explored the relationship between gene expression and methylation levels, and investigated the signaling pathways involving genes containing independent prognostic methylation sites (Fig. 1).

Results
Promoter methylation patterns reveal cancer subtypes. Multivariate analysis revealed a total of 131 independent prognostic methylation sites [68 for GAC (Supplementary Table 1) and 63 for CAC (Supplementary Table 2)], which were used to identify GAC and CAC subtypes. According to promoter methylation-based consensus clustering, six clusters that contained all the GAC samples were identified at a clustering threshold of maxK = 6 ( Fig. 2A-C). In addition, six major CAC clusters that contained 98.98% of the CAC samples were identified for maxK = 7 ( Fig. 2D-F), cluster 7 was excluded because it contains only 2 samples. The Kaplan-Meier survival curves (Fig. 2G,H) determined the statistical significance of the consensus clustering results for GAC and CAC (P = 0.002 and 1.914e−11, respectively). In GAC, clusters 4 and 1 showed the best and worst OS profiles, while in CAC, clusters 5 and 4 revealed the best and worst OS profiles, respectively. Differential methylation sites across clusters. Considering the remarkable impact of the classification scheme on patients' prognosis, the differential methylation sites among clusters were investigated. Difference analysis confirmed the significant differences of promoter methylation across clusters for both GAC and CAC (Fig. 3A,B). In GAC, cluster 1 was hypo-methylated, while cluster 4 was more methylated than others (Fig. 3C).   (Fig. 2G), the OS of patients became worse with the decrease of promoter methylation level. Therefore, promoter hypomethylation is unfavorable to the OS of GAC patients. In CAC, cluster 4 was hyper-methylated, while no significant difference was found among other clusters (Fig. 3D). More importantly, cluster 4 showed the worst OS compared with other clusters (Fig. 2H), which indicated that promoter hypermethylation is detrimental to the OS of CAC patients.
Co-methylation pattern and methylation levels of independent prognostic sites. In this study, the co-methylation patterns of 68 independent prognostic methylation sites in GAC and 63 methylation sites in CAC were discussed by correlation analysis. In GAC, 5 methylation sites were negatively correlated with the other 63 methylation sites, while the other 63 methylation sites were positively correlated or had no co-methylation relationship (Fig. 3E). In CAC, 63 methylation sites showed significant positive correlation or no methylation relationship (Fig. 3F). Additionally, according to the differential methylation analysis, 65 differentially   www.nature.com/scientificreports/ methylated sites were obtained in GAC, including 63 hypo-methylated sites and 2 hyper-methylated sites (Supplementary Table 3). 19 differential methylation sites were identified in CAC, consisting of 18 hyper-methylated sites and 1 hypo-methylated site (Supplementary table 4). Therefore, this study mainly captured hypomethylation sites in GAC, while hypermethylation sites in CAC.

Microsatellite instability levels across clusters.
In GAC, cluster 4 has the highest microsatellite instability (MSI), followed by cluster 5, while the other clusters showed a low MSI, including the hypo-methylated cluster 1 ( Supplementary Fig. 1A). In CAC, the mean of MSI in the hyper-methylated cluster 4 was lower than the other five clusters, even if the difference in MSI between six clusters was not obvious (Supplementary Fig. 1B). These results indicated that hypomethylation in GAC and hypermethylation in CAC are negatively correlated with MSI.
Proportion of clinicopathologic variables in cancer subtypes. The proportion of clinicopathological variables (including age, sex, grade, and pathological stage) was investigated in each cluster. In GAC, there was no significant difference in the proportion of age, sex, T stage and N stage across six cluster. M1 and stage IV were absent in cluster 4, while increased in other hypo-methylated clusters, which indicated that promoter hypomethylation can accelerate GAC progression (Fig. 4A). In CAC, there was no significant difference in the proportion of sex and M stage in six clusters; the patients were older in cluster 6; in cluster 4, stage I and T1/2 were absent, and N1 was less, while stage II/III/IV, T3/4 and N1/2 accounted for more ( Fig. 4B), which proved that promoter hypermethylation can promote the malignant progression of GAC.
Gene expression and molecular pathways. The methylation profiles of the GAC and CAC clusters were shown in Fig. 5A,B, respectively. To elucidate the effect of methylation on gene expression, the genes with independent prognostic methylation sites were explored, and their expression levels were visualized via unsupervised hierarchical clustering (Fig. 5C,D), which showed negative correlation between gene expression and promoter methylation. Additionally, molecular functional analysis showed that hypomethylation in GAC was associated with substance metabolism (e.g. hydrogen peroxide metabolic process, cytosolic calcium ion transport and amino-acid betaine metabolic process), ferroptosis (e.g. fatty acid oxidation, lipid oxidation and fatty acid degradation) and well-known cancer-related pathways (e.g. Ras signaling pathway, Rap1 signaling pathway and calcium signaling pathway) (Fig. 5E,F). Hypermethylation in CAC involved various types of pathways, including carcinogenic pathway (e.g. p53 signaling pathway), cell cycle (e.g. regulation of mitotic cell cycle phase transition and regulation of cell cycle phase transition), ferroptosis (e.g. peroxisome and fatty acid catabolic process), anion transport (e.g. positive regulation of ion transmembrane transport), cell senescence (e.g. cellular senescence, regulation of cellular senescence and regulation of cell aging), catabolism (e.g. fatty acid catabolic process and monocarboxylic acid catabolic process), inflammation (e.g. negative regulation of inflammatory response) and apoptosis (e.g. regulation of intrinsic apoptotic signaling pathway), etc., (Fig. 5G,H).
Generation and validation of hypo-methylated site-based signature for GAC . According to the difference analysis, since GAC is dominated by hypomethylation, 63 hypo-methylated sites were enrolled into Lasso Cox regression analysis to construct a prognostic risk scoring signature for GAC. And a 16-hypomethylation site-based signature for OS was identified, and the RS for each patient could be calculated on the basis of the methylation levels of 16 hypo-methylated sites and the relative coefficients (Table 1) Generation and validation of hyper-methylated site-based signature for CAC . The differential methylation analysis showed that CAC was dominated by hypermethylation, 18 hyper-methylated sites were included in the Lasso Cox regression to generate a hyper-methylated site-based prognostic risk scoring signature. And a 12-hypermethylated site-based signature was generated for predicting the OS of CAC patients ( Table 2). RS = (3.02 * cg03017653) + (22.84 * cg03977782) + (3.00 * cg05417950) + (36.54 * cg06250108) + (2.18 * c g09893305) + (3.12 * cg10414946) + (3.06 * cg15170424) + (5.77 * cg15639045) + (-9.51 * cg15786837) + (22.02 * cg 17329249) + (8.12 * cg21212956) + (− 23.31 * cg24206256) ( Table 2). Considering 10 coefficients are greater than 0, which indicated that 10 hyper-methylated sites are hazardous markers and the remaining 2 are protective factors. The Kaplan-Meier survival curves ( Fig. 6E-F) revealed more favorable OS in the low-risk group than in the high-risk group (training cohort: P = 6.078e−7, validation cohort: P = 4.992e−2). Besides, the AUC of ROC curves was 0.874 in the training cohort ( Fig. 6G), while AUC was 0.681 in the validation cohort ( Fig. 6H), indicating the high accuracy and efficiency of the hyper-methylated site-based signature.  7A, all P < 0.05). The older the age, the later the stage; and the higher the RS, the worse the prognosis of patients. Additionally, multivariate analysis indicated that age, TNM stage and RS are independent prognostic factors for GAC patients (Fig. 7B, all P < 0.05). Subsequently, four prognosis-related factors were combined, and a nomogram was constructed to predict OS (Fig. 7C). The AUCs of ROC curves for predicting 3-year and 5-year OS are 0.788 and 0.775, respec-  (Fig. 8D), and calibration curves showed that the nomogram prediction effect is excellent (Fig. 8E).

Discussion
Gastrointestinal adenocarcinoma is the most common pathological type of gastrointestinal cancer, with high incidence and mortality; this disease seriously endangers human health, and its pathogenesis involves a complex process of accumulation of classical DNA sequence changes and epigenetic modifications 10 . DNA methylation can affect gene transcription and expression via various mechanisms, such as interfering with transcription factors, recruiting histones, and altering chromatin structure. Abnormal DNA methylation manifests mainly as a decrease in the genome-wide methylation level and promoter hypermethylation; the former can lead to protooncogene activation, loss of imprinting and chromosome instability, while the latter can silence the expression of tumor suppressor genes, cell cycle regulatory genes and apoptotic genes 11,19,20 . Promoter DNA methylation refers to the selective addition of a methyl group to cytosines in CpG sequences to form 5-methylcytosine via  www.nature.com/scientificreports/ a reaction catalyzed by methyltransferases 21 . This event plays a vital role in epigenetic modification and can regulate gene expression without changing the DNA sequence 21 . CpG sequences are distributed unevenly across the human genome and are often clustered in CpG islands, which are located mostly in the promoter and first exon regions and are found in 56% of human protein-coding genes 10 . Currently, numerous genes with promoter methylation have been identified in gastrointestinal cancer and are related to tumorigenesis, progression and prognosis 22 . Therefore, the therapeutic targets and prognostic biomarkers for gastrointestinal adenocarcinoma at the epigenetic level (e.g., DNA methylation level) remain to be investigated. Several studies have identified gene-specific DNA methylation signatures for OS in gastrointestinal cancers [14][15][16][17] , but no study has classified cancer subtypes or constructed prognostic signatures for gastrointestinal adenocarcinoma based on promoter DNA methylation sites. Here, we assessed the DNA methylation profile and corresponding clinical information of patients with gastrointestinal adenocarcinoma from publicly available databases and identified promoter methylation-based cancer subtypes, as well as hypo-and hyper-methylated site-based signatures for predicting OS of GAC and CAC patients.
In recent years, with the rapid development of gene arrays and sequencing technique, the idea of molecular typing has emerged quietly. Molecule-based cancer subtypes carry unique genomic characteristics, which provides accurate diagnosis and treatment of gastrointestinal cancers. Previously well-known molecular subtyping in GAC is TCGA subtypes 23 and Asian Cancer Research Group (ACRG) subtypes 24 , in which TCGA typing is composed of Epstein Barr Vims positive, MSI, genomically stable (GS) and chromosomal instability (CIN), while ACRG contains MSI, MSS/EMT, MSS/TP53 + and MSS/TP53-. The detailed clinicopathological and molecular characteristics of each subtype within the two molecular typing are shown in reference 25 Both molecular typing identified MSI characterized by high-frequency mutations and the best prognosis, but the other 3 subtypes in the two molecular typing were partially overlaping. For instence, GS and CIN of TCGA exist in all ACRG subtypes; GS of TCGA is not equal to MSS/EMT of ACRG based on the mutation frequency of CDH1 and RHOA; TCGA typing does not involve hypo-methylated sites, and the samples of ACRG typing are all Asian populations. Therefore, the existing molecular typing in GAC is not perfect, and the idea of individualized treatment of GAC based on molecular subtyping is just emerging and worthy of further exploration. In this study, six hypo-methylated clusters for GAC were generated with different OS profiles, the patients' OS and tumor progression become worse as the methylation level decreases, and MSI was inversely proportional to hypomethylation. Hypomethylation can promote the up-regulation of proto-oncogenes or tumor progression-related genes through various pathways, thereby changing the biological behavior of cancer [26][27][28] , which is consistent with our results. Currently, people are generally keen to study the phenomenon of hypermethylation in cancer. However, the effect of hypomethylation www.nature.com/scientificreports/ on tumor biological behavior and prognosis is rarely explored. Therefore, our results lay a strong foundation for the research of hypomethylation in GAC. In 2012, CAC was divided into three subtypes (namely TCGA typing: CIN, high mutation and ultra-high mutation) based on whole exome sequencing 29 . Afterwards, a variety of molecular typing was proposed based on gene mutation, copy number, non-coding RNA, and proteomics, etc. [30][31][32][33][34][35] . Different molecular typing was related to the clinicopathologic characteristics and prognosis of patients, but the pattern was not observed. Until 2015, the Cancer Subtyping Consortium (CRCSC) put forward a new classification method (consensus molecular subtype, CMS), classifying CAC into CMS1, CMS2, CMS3 and CMS4, with different clinicopathologic and molecular characteristics 36 . CMS typing is the most recognized molecular typing in the world, integrating the most data. In view of the multiple factors affecting tumorigenesis and malignant progression, further research is needed to improve the existing molecular typing. In this study, six hyper-methylated clusters were generated for CAC, with distinct OS profiles, clinicopathologic features and MSI level, revealing that promoter hypermethylation is closely associated with the malignant progression of CAC, which is consistent with the previous studies 37 . Therefore, our results can be used as an important supplement to CAC molecular typing.
To explore the molecular mechanism of independent prognostic methylation sites involved in tumorigenesis, genes containing methylation sites were extracted. Their expression was negatively correlated with their promoter methylation, consistent with the general view that promoter hypomethylation usually causes gene up-regulated, while hypermethylation silences gene expression [38][39][40] . Additionally, molecular functional analysis revealed that hypomethylation in GAC was closely related to substance metabolism, ferroptosis, Ras signaling pathway, while hypermethylation in CAC involved in p53 signaling pathway, cell cycle, ferroptosis, anion transport, cell senescence, catabolism, inflammation and apoptosis, etc. These results indicated that genes in the these pathways may be potential therapeutic or prognostic targets for GAC and CAC by regulating their activity. A scan of the published literature revealed that some results support our observations. For instance, approximately 40%-50% of sporadic colorectal cancer cases exhibit P53 mutations 7,41 , which play a decisive role in tumor biological behavior. The mutation of P53 is related to lymphatic invasion of proximal colon cancer, and to lymphatic and vascular invasion of distal colon cancer 42 . Additionally, the mutant showed stronger drug resistance and poorer prognosis than the wild type 43 . Although the roles of these enrichment pathways or genes in the pathways in GAC and CAC have not been fully confirmed, there is evidence that they are associated with GAC and CAC carcinogenesis.
Since the promoter methylation-based cancer subtypes show distinct OS profiles, suggesting that the classification based on promoter methylated sites can be used to predict patients' OS. Therefore, the hypo-and hypermethylated site-based signatures with high accuracy, high efficiency and strong independence were established, which can separately predict the OS of GAC and CAC patients. Two promoter methylation-based predictive signatures involving 791 samples were identified from two cohorts and validated though two independent cohorts though Lasso Cox regression, which can identify the combination of promoter methylation sites with the best predictive power. Moreover, two nomograms combining RS and prognosis-related clinicopathologic variables provide a visual method to predict the OS of patients, which is more accurate and effective than using signature alone, which can guide individualized treatment of clinical decision-making.
However, the study had limitations. Firstly, robust clinical and experimental research is necessary to gain more insight into the modulatory roles of promoter methylation on gene activity and the crucial effects of regulated genes in the corresponding pathways. Secondly, despite the high accuracy and predictive performance of the nomogram, other prognostic clinical parameters of the patients can not be obtained from databases, so the variables involved in the nomogram are limited, and further improvement is needed in the later stage. Thirdly, the sample size of rectal adenocarcinoma in the TCGA database is limited, so this research has not been downloaded and analyzed yet.

Conclusion
The OS profiles and tumor progression became worse as the methylation level decreased in GAC or increased in CAC, and hypomethylation in GAC and hypermethylation in CAC were negatively correlated with MSI. The hypo-and hyper-methylated site-based signatures with high accuracy, high efficiency and strong independence can separately predict the OS of GAC and CAC patients, and two nomograms combined RS and prognosisrelated clinicopathologic characteristics provide an intuitive and accurate method for predicting patients' OS. Our research indicated that methylation mechanisms differ between GAC and CAC, and provided novel clinical biomarkers for the diagnosis and treatment of GAC and CAC. Considering the limitations of our study, future experimental studies will facilitate the extension of these findings.

Materials and methods
Acquisition and processing of publicly available data from open public databases. The detailed process of this study is shown in Fig. 1 Table 3, were enrolled in the study after processing of the original data with Perl software. The abovementioned samples were matched with RNA sequencing (RNA-seq) data, which quantified the gene expression values. With a standard deviation (SD) threshold of greater than 0.2, Perl software was used to extract the matrix of methylation sites located within − 2000 bp ~ + 500 bp of transcription start sites (TSSs), which covered 26,574 promoter region loci (11,247

Consensus clustering analysis.
With a criterion of P < 0.001, 177 prognosis-related promoter methylation sites (74 for GAC and 103 for CAC) and 131 independent prognostic methylation sites (68 for GAC and 63 for CAC) were identified via univariate and multivariate Cox regression analysis, respectively. The co-methylation patterns of these independent prognostic methylation sites were explored by correlation analysis, and visualized through the 'corrplot' package. Based on the independent prognostic methylation sites, the R package 'ConsensusClusterPlus' , which provides quantitative and visual evidence of stability for measuring the number of unsupervised clusters in the dataset, was used for promoter methylation-based consensus clustering of GAC and CAC 44 . In addition, the K-means algorithm and cumulative distribution function (CDF) curve were applied to determine the best number of clusters. According to the CDF curve, the best and most stable number of clusters is the K value at which significant changes no longer occur. Moreover, 50 iterations (with 80% of the samples per iteration) with a variable of maxK = 9 were conducted for stable clusters. We plotted survival curves for each cluster according to promoter methylation-based clusters and compared site methylation levels in each subtype with heatmaps. To determine the methylation level between clusters, differential analysis of methylation sites in samples among the clusters was performed in R with a false discovery rate (FDR) threshold of < 0.05. The differential methylation sites were visualized with heatmaps (combining clinical parameters) and box plots with the packages 'ComplexHeatmap' and 'reshape2′, respectively. Additionally, we accessed the MSI of each sample from TCGA database, and compared the MSI level between clusters. The R package 'ggplot2′ was utilized to display the distribution of clinicopathologic features in clusters. www.nature.com/scientificreports/ Genes containing prognostic sites and molecular pathway enrichment analysis. Initially, genes containing independent prognostic methylation sites were identified by Perl, and their expression levels were visualized via unsupervised hierarchical clustering. To predict the potential functions of genes containing independent prognostic methylation sites, which were subjected to Gene Ontology (GO) biological process (BP) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis with R packages ('colorspace' , 'stringi' , 'ggplot2′, 'clusterProfiler' , 'org.Hs.eg.db' and 'enrichplot').

Identification of promoter methylation-based signatures.
To search for hypo-or hyper-methylated sites to construct prognostic signatures, the hypo-or hyper-methylated sites from TCGA database were included in Lasso Cox regression analysis to generate prognostic scoring signatures,which could divide patients into highrisk and low-risk groups based on the mean risk score (RS) value. The RS was calculated as the sum of the products of locus methylation levels and coefficients, via the following formula: where 'i' and 'k' represent the 'i'th methylation locus and the number of methylation sites, respectively. To verify the efficiency, accuracy and independence of the signatures, Kaplan-Meier analysis and receiver operating characteristic (ROC) curves were used to evaluate the accuracy and prediction efficiency of signature. Univariate and multivariate analysis were used to explore the prognostic value of risk scoring signatures.
Validation in separate cohorts. The DNA methylation profiles (Illumina Human Methylation 27 platform) of GAC and CAC were separately downloaded from the UCSC Xena platform (http://xena.ucsc.edu/). These two data sets were used as validation cohorts to verify the stability and mobility of methylation-based signatures.
Nomogram construction. According to univariate analysis, the prognosis-related variables were enrolled into the training cohort with the 'rms' package as a medium, so as to participate in the construction of the nomogram for predicting 3-year and 5-year OS of patients. Afterwards, the ROC curve and calibration curve were used to evaluate the predictive performance and accuracy of the nomogram.
Statistical analysis. Continuous variables were represented as mean ± standard deviation. χ 2 test and T-test/variance analysis were separately used to compare the difference distribution of dichotomous variables and continuous variables. Survival analyse was conducted using Kaplan-Meier statistics and Log-rank tests. All statistical analyses were carried out in R and Perl softwares, and P < 0.05 was considered to be statistically significant.
Scientific Reports | (2020) 10:21234 | https://doi.org/10.1038/s41598-020-78228-y www.nature.com/scientificreports/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creat iveco mmons .org/licen ses/by/4.0/.