Integrative analysis with expanded DNA methylation data reveals common key regulators and pathways in cancers

The integration of genomic and DNA methylation data has been demonstrated as a powerful strategy in understanding cancer mechanisms and identifying therapeutic targets. The TCGA consortium has mapped DNA methylation in thousands of cancer samples using Illumina Infinium Human Methylation 450 K BeadChip (Illumina 450 K array) that only covers about 1.5% of CpGs in the human genome. Therefore, increasing the coverage of the DNA methylome would significantly leverage the usage of the TCGA data. Here, we present a new model called EAGLING that can expand the Illumina 450 K array data 18 times to cover about 30% of the CpGs in the human genome. We applied it to analyze 13 cancers in TCGA. By integrating the expanded methylation, gene expression, and somatic mutation data, we identified the genes showing differential patterns in each of the 13 cancers. Many of the triple-evidenced genes identified in majority of the cancers are biomarkers or potential biomarkers. Pan-cancer analysis also revealed the pathways in which the triple-evidenced genes are enriched, which include well known ones as well as new ones, such as axonal guidance signaling pathway and pathways related to inflammatory processing or inflammation response. Triple-evidenced genes, particularly TNXB, RRM2, CELSR3, SLC16A3, FANCI, MMP9, MMP11, SIK1, and TRIM59 showed superior predictive power in both tumor diagnosis and prognosis. These results have demonstrated that the integrative analysis using the expanded methylation data is powerful in identifying critical genes/pathways that may serve as new therapeutic targets.


Introduction
The Cancer Genome Atlas (TCGA, https://tcga-data.nci.nih.gov/tcga/) has profiled the genomic and epigenomic variations of thousands of samples for several dozens of cancers 1 . These multiomics data include genetic variation, gene expression and DNA methylation that provide an invaluable resource for understanding the cancer mechanisms and identifying new therapeutic targets. A limitation of the TCGA DNA methylation data is that it was generated using Illumina Infinium Human Methylation 450K BeadChip (referred to as Illumina 450K array hereinafter), which only covers about 1.5% of the CpGs in the human genome. This poor coverage restricts epigenomic analysis and many differentially modified loci are likely missed. While Whole Genome Bisulfite Sequencing (WGBS) and other technologies are available to measure DNA methylation with much higher coverage, it is unlikely to repeat the DNA methylation analysis in the large number of TCGA samples considering the expense and effort in the near future.
Therefore, there is an urgent need to develop new analysis strategy to better use these data.
Previously, we developed a method to expand the Illumina 450K array data by considering sequence features and local methylation profile in the neighboring CpGs 2, 3 . Despite the promising results provided by these methods, their speed is slow and applying them to expand the thousands of TCGA data is infeasible. Here, we present an improved model called EAGLING (Expanding the 450K methylation Array with neiGhboring methylation value and Local methylation profilING) with a more than 10 times faster speed compared to our previous methods. Furthermore, the location distribution of the expanded CpG sites is less biased towards CpG rich regions, and the hyper/hypo-methylated ratio is also more similar to the ratio from the WGBS data. Importantly, the coverage of CpGs is significantly increased from about 1.5% of all CpGs in the human genome in the original Illumina 450K data to about 30% after expansion.
This new model allows integrative analysis of genetic variation, gene expression and expanded DNA methylation to identify genes and pathways that are important for diagnosis and therapeutic treatment. We identified the triple-evidenced genes in each of the 13 TCGA cancers that have sufficient samples. The triple-evidenced genes represent the genes that are differentially methylated, differentially expressed and associated with somatic mutation. We found that the triple-evidenced genes shared by a majority of the 13 cancers include many previously identified biomarkers or therapeutic targets [4][5][6][7]

Results
We propose here an integrative analysis strategy to identify key regulators and pathways in cancers from the TCGA data. By comparing gene expression, genetic variation and DNA methylation data between normal and cancer samples, we extracted the triple-evidenced genes for 13 cancers and analyzed the characteristics of these genes ( Figure 1A).

The EAGLING model expands the 450K array methylation data 18 times
In order to expand the Illumina 450K array DNA methylation data, we previously developed prediction models based on local methylation patterns and sequence features. In this work, we proposed a new model dubbed as EAGLING that has two steps to predict methylation levels of CpGs based on the Illumina 450K array data. First, it finds a WGBS methylome that shares the most similar local methylation profile around the CpG L under consideration, and the methylation value of the CpG in the selected WGBS methylome is taken as an input feature; second, the methylation value of the closest CpG from Illumina 450K array is taken as another input feature. A logistic regression model was built on these two features to predict the methylation level at L ( Figure 1B, see details in Materials and Methods). Note that this procedure is repeated for each CpG so that different CpGs may be predicted from different WGBS methylomes. Thirty-three tissues/cell lines in which both 450K array and WGBS data were available were used to optimize the parameters. There are three major improvements over our previous models 2, 3 . First, DNA sequence features are not included in EAGLING, which significantly improves the speed without deteriorating the performance; second, the methylation value of the closest CpG is used because of its higher precision compared to the weighted neighbor CpGs used in our previous models 2, 3 ; third, more training data are included (33 versus 14 tissues/cell lines), which is expected to improve the model.
We have searched for the optimal number of CpGs to represent the local methylation pattern in step 1, which is crucial to identify the WGBS methylome from which to take the methylation level for the CpG under consideration. We performed the leave-one-tissue-out cross validations using 1-10 neighbor CpGs and 5Kbp-50Kbp flanking regions (Figure 2A). The flanking regions confine the CpGs we considered. Only the CpGs with the required number of neighbor CpGs in the specified flanking regions were included for expansion because their local methylation profiles could be accurately represented. We chose 4 CpGs each on upstream and downstream sides to represent the local methylation profile as there was no performance improvement by including more than 4 flanking CpGs and 30Kbp for the flanking region size considering the balance between satisfactory performance and genome coverage ( Figure 2B). Using these parameters, the leave-one-tissue-out cross validations achieved superior performance ( Figure   2C). To show the impact of the training size on the model performance, we trained the EAGLING model using 14 (the training sample size for our previous model in reference 3) and 23 WGBS data sets (the 14 WGBS data plus another 9 randomly chosen WGBS data) separately.
We compared their prediction results on another 10 WGBS samples not included in the training set. We repeated this cross validation 10 times and the results are shown in Figure S1a Particularly, the intergenic coverage was significantly increased from 39.02% in 450K array to 50.94% in the expanded data and the non-CpG island coverage also increased from 69.03% to 79.98%, which is important to identify functional enhancers ( Figure 2D and 2G). The location distribution of the expanded data is much closer to that of all the CpGs in human genome ( Figure   2J) than the original 450K array. Furthermore, we identified the hyper-methylation (>0.7) and hypo-methylation (<0.3) CpGs from the original 450K array data and calculated their percentages among all the CpGs for the tumor and normal samples of the 13 cancers ( Figure 2D and 2E). Obviously, the ratio distributions of the expanded CpGs are much closer to those of the WGBS data ( Figure 2K and 2L), indicating that the analysis results based on the expanded methylation data would be less biased compared to the results from the 450K array data.

Identification of triple-evidenced genes
Using the expanded methylation data in the 13 cancers, we identified the differentially methylated genes (DMGs) between the tumor and normal samples (see Methods for detail). We also identified the differentially expressed genes (DEGs) using the RNA-seq data and genes containing somatic mutations (see details in Methods and Figure S2). As an example, the overlap between DMGs, DEGs and genes with somatic mutation of LUSC is shown in Figure 3A. In the 13 cancers, the number of triple-evidenced genes ranges from 396 in PRAD to 1438 in LUSC ( Figure S2).
Only a small portion of the triple-evidenced genes were found in more than 6 cancers ( Figure   3B). The top 5 triple-evidenced genes found most often in the 13 cancers are listed in Table S2.
They were CELSR3, TNXB, TRPM2, KCNAB1 and TRIP13 that were identified as tripleevidenced genes in 11, 11, 11,10 and 10 types of cancers, respectively. We first examined the difference of their methylation and expression levels between tumor and normal samples ( Figure   3C) (difference = normal value -tumor value). CELSR3, TRPM2 and TRIP13 are overexpressed in all the 13 cancers, TNXB is under-expressed in all the 13 cancers, KCNAB1 is over-expressed in KIRC but under-expressed in the other 12 cancers. These genes show abnormal but consistent expression patterns across different cancers. The methylation level does not show clear trend though, indicating that the relationship between gene expression and their promoter methylation is complex, which is consistent with the previous studies 8, 9 . Four of the 5 genes have been reported as biomarkers, potential biomarkers or therapeutic targets. CELSR3 was found to be highly expressed in ovarian cancer 4 , and hypermethylated in primary oral squamous cell carcinoma, and might be used as a biomarker in OSCC prognostication 10 and small intestinal neuroendocrine tumor 5 . TNXB was reported to be important for the tumorigenesis of lung adenocarcinoma 6 , and was validated as a promising biomarker for early metastasis of breast cancer 7 .TRPM2 was reported to be a potential target of the selective treatment of prostate cancer 11 and was suggested to be a potential therapeutic target in breast cancer 12 . TRIP13 promoted early steps of the DNA double-strand break repair and its presence was associated with progression in prostate cancer and squamous cell carcinoma of the head and neck 13,14 . For KCNAB1, there were few reports about its function in cancer, but it was downregulated in follicular carcinoma and could be combined with other genes for the classifier construction 15 .
As a comparison, we also identified the triple-evidenced genes using the Illumina 450K data ( Figure S3). There are two advantages using the expanded methylation data. First, the triple-evidenced genes can be identified in more cancers. For example, the CELSR3 gene was found as the triple-evidenced gene in two cancers using the original 450K array data but in 11 cancers using the expanded data; second, consistently, more triple-evidenced genes can be identified in a particular cancer by the expanded data than the original 450K array data. For example, 5 genes (FANCI, RECQL4, TACC3, CLU and SIK1) were reported to function in different cancers 10, 16-19 but they could not be identified using the Illumina 450K array data in any of the 13 cancers; in contrast, all of them were found as triple-evidenced genes in more than 6 cancers using the expanded data (Table S3).

Triple-evidenced genes are enriched in particular pathways
For each of the 13 cancers, we checked the enriched pathways among the triple-evidenced genes using Ingenuity Pathway Analysis (IPA) (with Benjamini-hochberg adjusted p-value < 0.05).  Table S4, including expression of 13 genes and methylation levels of 34 CpGs. As shown in Figure 4A, most of the test tumor samples could be correctly predicted as cancers while about 75% of the test normal samples were predicted as normal samples. Obviously, using triple-evidenced genes derived from the expanded methylation data outperformed using the original 450K array data (t-test), particularly the specificity. The AUC of the cancer diagnosis with the triple-evidenced genes is about 0.85, which indicates that there are common differential gene expression and methylation features of the triple-evidenced genes that distinguish tumors from the normal samples.
To validate whether the triple-evidenced genes could also perform well in other datasets. Firstly, we extracted gene expression data of normal and tumor tissues of liver, breast, uterus, bladder, esophageal and colon from GENT 30 , and tested the classification performances based on the triple-evidenced genes (Table 1); secondly, we extracted the expression data of 5 other cancers (STAD, READ, CHOL, GBM and PAAD) not included in the 13 cancers studied here from TCGA, and investigated the prediction performances based on the triple-evidenced genes ( Table   2). We used the random forest model constructed from all the tumor samples with 47 features THCA and PRAD were with the highest accuracies (99.32% and 99.16%), while the LUSC was with the lowest accuracy (87.41%). When looking into the misclassification results, KIRP is prone to be predicted as KIRC and vice versa, LUAD is prone to be predicted as LUSC and vice versa, which is reasonable because they belong to the same tumor category. Also we found that the majority of mis-classified HNSC were predicted as LUSC, and many of the mis-classified LUSC were predicted as HNSC. The interesting results were consistent with the previous reports that patients treated for head and neck squamous cell carcinoma frequently developed second primary tumors in the lung, and they shared many common patterns 31,32 .
Among the top 20 features (Table S5), 13 features are gene expression and 7 are CpG methylation values of the triple-evidenced genes. Some of these features have literature evidences to support their importance in discriminating cancers. For example, the gene expression of SUSD2 is the second most important feature, which is consistent with its reported variable expression in cancers, e.g. down-regulation in colon cancer 33 and hepatocellular carcinoma 34 , and highly expressed in breast cancer 35 . Another example is CYGB expression, the fifth most important feature. CYGB shows variable expression in cancers: it is down-regulated in many cancers 36 but up-regulated in lung and brain metastases, and head and neck cancer 37,38 .
The methylation level at a LAMA4 promoter CpG was found as the seventh most important feature; previously, the aberrant methylation at the LAMA4 promoter was observed in breast carcinoma 39 and low methylation was associated with poor progression-free survival 40 .

Prognostic value of the triple-evidenced genes
We also investigated whether the triple-evidenced genes are useful in predicting survival rate.
For the survival data of each cancer (11 cancers with sufficient samples were analyzed, see details in Materials and Methods), we applied the LASSO cox proportional hazards regression for feature selection. The candidate features include gene expression and expanded methylation data (the methylation level of all the CpGs in the promoters) of triple-evidenced genes identified in each cancer. The performance was assessed using 10-fold cross validation for 100 times. We compared the concordances (C-indexes) based on four different candidate features (expanded methylation and expression levels, only expanded methylation, only expression level, and the original 450K methylation and expression levels of the triple-evidenced genes). As an example, the boxplots of the concordances of COAD in cross validation are shown in Figure 4B. The concordance based on the features selected from using both expression and expanded methylation data is superior to using either data alone, or the combination of the original 450K methylation and expression levels: the p-values are 0.03 (compared with expanded methylation data), 1.2e-11 (compared with expression data) and 9.1e-6 (compared with the combination of the original 450K methylation and expression levels).
Furthermore, we focused on the gene features frequently selected among the cross validations to cluster the tumor samples. For example, using the 19 features selected in more than 20% of the cross validations in the COAD samples (as shown in Figure 4C), hierarchical clustering identified two obvious subgroups, which shows significantly different survival times in the Kaplan-Meier survival plot in Figure 4D (Table S6 and see discussion below).

Triple-evidenced genes important for both diagnosis and prognosis
Nine genes are most often selected in both diagnosis and prognosis analyses: TNXB, RRM2, CELSR3, SLC16A3, FANCI, MMP9, MMP11, SIK1, TRIM59. Their differential expression and methylation levels between normal and cancer samples as well as the somatic mutations in the 13 cancers are shown in Figure 5. The expression and somatic mutation patterns of the 9 genes are quite consistent in the 13 cancers but the methylation patterns of their promoters vary.
These 9 genes are currently considered as biomarkers or potential biomarkers for diagnosis or prognosis in specific cancers. However, our analyses suggested that they are likely general biomarkers for at least the 13 cancers analyzed here.
TNXB was reported as a potential marker for prognosis in patients with stage III serous ovarian cancer 41 . RRM2 was reported as independent negative prognostic marker for survival in patients with resected pancreas cancer 42 and a promising prognostic biomarker and therapeutic target for ER-negative breast cancer patients 43 . CELSR3 was suggested as a biomarker in OSCC prognostication 10 , and prognostic marker in small intestinal neuroendocrine tumor 5 . MMP11 and MMP9 were reported as breast tumor biomarkers and associated with tumor survival 28,29 . For SLC16A3, there is no report on its prognostic power but studies showed that it might be an epigenetic marker for clinical outcome in clear cell renal cell carcinoma 44 .
It is worth of noting that FANCI and SIK1 genes could not be identified as the triple-evidenced genes using the original 450K array data. FANCI belongs to Fanconi anemia complementation group and it was a negative regulator of Akt activation that connects with the oncogenic PI3K-Akt pathway and the tumor suppressing FA pathway 45 . This gene has also been linked to drug resistance in cancer treatment 46 . SIK1 is stimulated by a cancer suppressor LKB1, which leads to metastatic spread and invasiveness, as well as apoptosis resistance 47 . Loss of SIK1 has been found in epithelial ovarian cancer and pancreatic cancer 48 , and decreased SIK1 expression is correlated with poor outcome of breast cancer treatment 49 , indicating the potential application in prognosis. Our results further support the potential of using FACNI and SIK1 as prognosis marker and provide insight in broadening its application in other cancer types.
Among the 9 genes, RRM2, MMP9, MMP11 and SIK1 are known drug targets. For example, they are inhibited by gemcitabine (RRM2), marimastat (MMP9 and MMP11) for treating several cancers. Our analyses suggested these inhibitors may be effective for the majority of the 13 cancers, which suggests possible broader applications of these inhibitors. For example, gemcitabine targeting RRM2 are validated for the treatment of non small cell lung cancer 50 , ovarian cancer 51 , pancreatic cancer 52 , adrenocortical cancer 53 , and oral squamous cell carcinoma 54 . We speculate that gemcitabine can be used to treat bladder, colon, kidney, liver and prostate cancers. Furthermore, HG-9-91-01(SIK1) was reported to induce anti-inflammatory phenotype and could be used to treat certain autoimmune diseases 55, 56 , we speculate that it can be repurposed to treat cancers as 4 of the top 5 enriched pathways in the pan-cancers analysis are closely related to inflammatory processing or inflammation response.

Discussion
We present here a method EAGLING to significantly expand the Illumina 450K array data with a fast speed and better precision than the previous models. We have performed pan-cancer analysis on 13 TCGA cancers to identify genes with differential methylation and gene expression between cancer and normal samples as well as containing somatic mutations. These tripleevidenced genes, particularly TNXB, RRM2, CELSR3, SLC16A3, FANCI, MMP9, MMP11, SIK1 and TRIM59 show diagnostic and prognostic power. Note that FANCI and SIK1 could only be identified as triple-evidenced genes using the expanded methylation data. The pathways in which they are enriched also suggest new therapeutic targets or repurposing the existing drugs.
We focused on discussing the common features among the 13 cancers but it is worth of noting that the triple-evidenced genes in individual cancers can also be potential biomarkers or drug targets.

Conclusion
We showed that the expanded methylation data allowed identification of more cancer-related genes, which led to better performance in both diagnosis and prognosis. The common patterns shared among the 13 cancers suggest that some drugs (such as gemcitabine) currently aiming to specific cancers might be useful to treat other cancers, and drugs aiming to immune diseases (such as HG-9-91-01) might be repurposed for cancer therapy.

DNA methylation data for model construction
In total, 33 tissues or cell lines with both WGBS and 450K array data were retrieved from the NIH Roadmap Epigenomics project 57 and TCGA (Table S1). We downloaded the methylation proportion values of WGBS data and beta values of 450K array from the GEO Database and TCGA data portal directly. Both WGBS data and 450K array data were quantile normalized. We used the quantile normalization between the 450K arrays, and between the WGBS data, to reduce the batch effect, as the quantile normalization strategy was reported to be efficient for the intra-and inter-arrays normalization 58,59 .
In the EAGLING model, the CpG site to be predicted is denoted as L. The WGBS methylation value at L in the tissue that shows the most similar local methylation profile among the 33 tissues For performance validation, the 450K array and WGBS data of K562 and HepG2, two independent cancer cell lines from ENCODE project were retrieved from GEO database. The expanded methylation levels from EAGLING model were compared with the real WGBS data for performance validation.

Expanding the DNA methylation data in the TCGA samples
We downloaded the 450K array data from TCGA. We only included 13 cancers with at least 10 normal samples of 450K array and RNA-seq data for the integrative analysis (Table S7) In calculating the ratio of hyper/hypo-methylated CpGs of the tumor and normal samples, the methylation value(ranging from 0(totally unmethylated) to 1(totally methylated)) larger than 0.7 was defined as hyper-methylation, and the value less than 0.3 was defined as hypo-methylation.
For comparison, the ratios of hyper/hypo-methylated CpGs of WGBS data of lung cancers were calculated.

Identification of triple-evidenced genes in cancers
We compared the methylation levels of each CpG between tumor samples and the corresponding normal sample data, and defined a CpG site to be differentially methylated (DML) if the q-value of t-test <0.05 and the absolute difference of methylation value > 0.1. We considered genes whose promoters contain any CpG covered by the original or the expanded methylation data. To identify the genes differentially methylated in each cancer, the methylation status of all the CpG sites covered in promoters were considered. For each promoter, the Fisher's combined test was used to get the q-value to evaluate whether a gene is differentially methylated. Similar to call DMLs, genes with q-value < 0.05 and mean difference of DNA methylation > 0.1 were selected as differentially methylated genes (DMGs).
The RNA-seq data were downloaded for the cancers listed in Table S7 from TCGA. The sample sizes are also shown in Table S7. The gene expression data were log2-transformed and normalized. Differentially expressed genes (DEGs) were defined if the fold change > 2 and the q-value of t-test is < 0.05.
To collect genes with mutation related to the 13 cancers, we downloaded the somatic mutation level 2 data from TCGA. For each cancer, a gene that was annotated with curated somatic mutation in TCGA was considered. We extracted the gene lists with somatic mutation of all the 13 cancers for integrative analysis.

Diagnostic and prognostic analysis
To search for common features in pan-cancer analysis, all the 13 cancer samples and normal samples were combined together, respectively. A random forest model was constructed using cross validation. In each of the cross validation, the cancer and normal samples of 10 cancers were randomly selected for model training, and the remaining samples of 3 cancers were used for test. The cross validation was repeated for 100 times. We chose the triple-evidenced genes that appear in more than half of the 10 training cancer types as the candidate genes. Their associated gene expression and DNA methylation levels of individual CpGs in the promoters of the selected genes in the expanded data were used as input features; Both of the gene expression and the methylation levels of CpGs in the promoters were candidate features for feature selection with LASSO.
To indicate whether these potential common features also reflect some differences between the cancers, a multi-class regression model was constructed with the tumor samples of the 13 cancers. The candidate features were the combination of all the features selected in any of the 100 cross validations in the above diagnosis analysis. The gene expression and the methylation levels of promoter CpGs were further selected using LASSO.
In the prognostic analysis, the PRAD and THCA were not analyzed due to their limited samples with the expression, methylation and survival data. The sample sizes of the 11 tumors are listed in Table S8.