Integrative modeling of tumor DNA methylation identifies a role for metabolism

DNA methylation varies across genomic regions, tissues and individuals in a population. Altered DNA methylation is common in cancer and often considered an early event in tumorigenesis. However, the sources of heterogeneity of DNA methylation among tumors remain poorly defined. Here, we capitalize on the availability of multi-platform data on thousands of molecularly-and clinically-annotated human tumors to build integrative models that identify the determinants of DNA methylation. We quantify the relative contribution of clinical and molecular factors in explaining within-cancer (inter-individual) variability in DNA methylation. We show that the levels of a set of metabolic genes involved in the methionine cycle that are constituents of one-carbon metabolism are predictive of several features of DNA methylation status in tumors including the methylation of genes that are known to drive oncogenesis. Finally, we demonstrate that patients whose DNA methylation status can be predicted from the genes in one-carbon metabolism exhibited improved survival over cases where this regulation is disrupted. To our knowledge, this study is the first comprehensive analysis of the determinants of methylation and demonstrates the surprisingly large contribution of metabolism in explaining epigenetic variation among individual tumors of the same cancer type. Together, our results illustrate links between tumor metabolism and epigenetics and outline future clinical implications.


Introduction
DNA methylation is a major epigenetic mechanism that determines cellular outcome by regulating gene expression and chromatin organization 1 in a fashion more dynamic than previously appreciated 2 . Altered DNA methylation is frequently observed in cancers compared to corresponding normal cells [3][4][5][6][7] . For example, global DNA hypomethylation 8 and tumor suppressor silencing by DNA hypermethylation are two of the most well characterized cancer associated alterations common across many human malignancies [9][10][11] .
In addition to hypo and hyper methylation, cancer cells exhibit increased variability in DNA methylation across large portions of the genome compared to their corresponding normal tissues 12 . A previous study showed that, for several cancer types, variation in methylation levels among tumor samples is significantly higher than normal samples of the same tissue of origin 4 , possibly indicating that deregulated epigenetics provides tumor cells with potential proliferative advantages 6 . While inter-tissue variability in DNA methylation is mainly explained by differentiation and tissue-specific regulatory mechanisms 13,14 , very little is known about the functions and determinants of the high inter-individual variation among tumors of the same tissue type. Notably, a recent twin study on the determinants of inter-individual variability in DNA methylation reported that genetic difference among individuals account for only 20% of total variance with the remaining variance explained by environmental and stochastic factors that are yet to be identified 15 .
The source of the methyl group for methylation is S-adenosylmethionine (SAM) which is generated from the methionine (met) cycle and is coupled to serine, glycine, one -carbon (SGOC) metabolism 16 . A large body of evidence indicates numerous roles for one-carbon metabolism in proliferation and survival of tumor cells through its roles in biosynthesis and redox metabolism [16][17][18][19] . The met cycle also mediates histone and DNA methylation in physiological conditions and provides a link between intermediary metabolism and epigenetics [20][21][22] . Although the network contributes methyl units to DNA, whether and to what extent this interaction is apparent in tumors and may contribute to cancer biology is unknown.
We set out to comprehensively quantify the contribution of various factors in explaining variation in DNA methylation. The advent of standardized genomics and other high-dimensional multi-platform 'omics' data through The Cancer Genome Atlas (TCGA) allows for systematic assessments of molecular features across cancers 23 . With combined statistical analysis, computational modeling, and machine-learning approaches, we directly evaluated the quantitative contributions of molecular and clinical variables that lead to DNA methylation. We found a surprisingly large contribution for the expression of the methionine cycle and related SGOC network genes in determining DNA methylation and identified numerous contexts where this interaction may contribute to cancer pathology.

Integrative modeling allows for quantitation of determinants of variation in DNA methylation
It has been previously proposed that factors normally regulating the epigenome are disrupted in cancer, leading to increased variability of the cancer epigenome 6 . However, the nature and contributions of such factors is largely unknown. Upon analysis of global and local DNA methylation in tumors as measured by the Illumina Infinium HumanMethylation450K BeadChip arrays, we indeed found higher variation among tumors from the same tissue vs. between different tissue types (Supplementary Information; Supplementary Fig. 1a-d; Online Methods). Arrays were used over bisulfite sequencing because of the higher availability of these data in a standardized format allowing for an integrative analysis. To establish quantitative relationships between DNA methylation and molecular and clinical features of tumors, we developed an integrative statistical modeling and machine-learning approach with the goal of identifying the relative contributions to within-cancer DNA methylation variation (Online Methods). We incorporated hundreds of variables into comprehensive statistical models of DNA methylation (Fig. 1a). Factors with a known role in DNA methylation machinery (chromatin remodeling enzymes and transcription factors), as well as factors with a potential biochemical link to DNA methylation (SAM-metabolizing enzymes, met cycle enzymes, and other serine, glycine, one-carbon (other SGOC) enzymes that are connected to the met cycle 24 ) were together considered (Fig. 1a). We also curated available clinical information such as age, gender, and cancer stage in the calculations where appropriate. Furthermore, since mutations are known to affect the cancer methylome 25 , we included all recurrent genetic lesions (somatic mutations and copy number alterations) for each cancer type in our models. Together, over 200 variables were collectively analyzed for each cancer type (see Supplementary Table 1). Our models are therefore not completely agnostic as we pre-select classes of biological variables that are known to affect DNA methylation to avoid loss of statistical power by including too many features (e.g. expression of all genes in the genome). Therefore, to test for potential bias, we also considered the expression levels of sets of random genes with functions non-related to DNA methylation as additional variables in our models (see Online Methods).
Subsequently, we incorporated all variables into unbiased selection algorithms suitable for dealing with large numbers of prediction variables. For this task, we considered two independent approaches: a generalized linear model (Elastic Net) 26, 27 and a machinelearning algorithm (Random Forest) 28, 29 . A distinct computation was carried out for each 10kilobase (kb) genomic region with variable methylation (sd > 0.2) in each cancer type.
Samples of each cancer type were divided into three independent test subsets and three training subsets and separate models were generated using each subset. The models were then combined resulting in a single final model for each 10kb region of DNA methylation in each cancer. Model performance was evaluated by measuring mean squared prediction error of test samples from Elastic Net and Random Forest separately (Online Methods).
We observed that our models predicted test set DNA methylation with small mean squared error (MSE <0.04) in many regions across the genome (Fig. 1b). Comparison of the performances of the two methods showed that Random Forest and Elastic Net algorithms were able to predict DNA methylation with comparable MSEs on average ( Fig. 1c; Supplementary Fig. 2a). In general, predictability of local DNA methylation was largely dependent on cancer type as well as chromatin region in each model. For example, we observed that local DNA methylation was most predictable in prostate and lung cancers and least predictable in liver and bladder cancers ( Fig. 1c; Supplementary   Fig. 2a). Together with the high variation in local DNA methylation levels seen in liver and bladder cancers ( Supplementary Fig. 1d), these results suggest a higher stochasticity in the epigenetic signatures for these two cancer types compared to others in this study.
Upon annotating genomic regions where local DNA methylation could be predicted with a low error (MSE < 0.04) in each cancer, we found that the majority of the predictable regions lie within 20kb of the transcriptional start site (TSS) of a gene ( Supplementary   Fig. 2b), suggesting that regulation of DNA methylation by the factors included in our models is stronger at genic regions.
We next performed a set of tests to evaluate the robustness of our modeling approach. To this end, we compared the original gene expression variables included in our models, with a group of variance-matched randomly selected genes from the genome (see Online Methods) in their ability to predict DNA methylation. In the presence of both groups of gene expression variables (original and random), both Elastic Net and Random Forest models selected our original variables significantly more frequently than random genes (Mann-Whitney p-value= 0.0007 for Elastic Net and <0.0001 for Random Forest) ( Fig. 1d; see Online Methods). These results validate our models and confirm that the Elastic Net and Random Forest algorithms are suitable for quantitation of variable contributions in determining DNA methylation.

Metabolism is a major predictor of DNA methylation in human cancers
Using the results of the integrative modeling, we next quantified the relative contribution of different functional classes of variables in explaining DNA methylation variation within each cancer type. For this, we measured two independent metrics, one using the Random Forest variable importance scores, and the other using a binary score for whether or not a variable was selected by the Elastic Net models (non-zero co-efficient). For each variable, an overall importance score was calculated by averaging its relative importance across all models of 10kb DNA methylations, and an overall usage score was calculated by measuring the fraction of 10kb regions in which Elastic Net models selected the variable (Online Methods). To estimate the contribution of each functional class of variables in explaining total variation in DNA methylation, we pooled all variables in the same functional category and averaged across their importance and usage scores separately ( Supplementary Fig. 2c,d).
Results from both Random Forest and Elastic Net algorithms identified a considerable contribution from the variables within the SGOC metabolic network relative to other classes of variables ("Other SGOC enzymes" was the 2 nd highest scoring among all classes, closely following "Transcription factors" according to both methods. "Methionine cycle enzymes" was the 3 rd and 4 th according to Random Forest and Elastic Net, respectively) (Fig. 2a). Previous studies have shown that transcription factor abundance and occupancy strongly mediate dynamic DNA methylation turnover in regulatory regions 30,31 . Consistent with this observation, our results confirm the "Transcription factors" class has the highest contribution to predicting DNA methylation levels across human tumors. Notably, even in the presence of most if not all known variables that are thought to mediate the status of DNA methylation, metabolic factors still uniquely explained a large part of the variability in methylation (Fig. 2a).
Given the contribution of the methionine cycle and its biochemical link to DNA methylation, we further explored the variables within the met cycle class compared to all other variables in their ability to predict DNA methylation (Fig. 2b). Within the met cycle class, methionine adenosyltransferase 2 beta (MAT2B) and betaine-homocysteine S-methyltransferase 2 (BHMT2) exhibited higher predictive values than methionine synthase (MTR) and adenosylhomocysteinase (AHCY) on average ( Supplementary Fig.   2e,f). Notably, in the presence of the nearly 200 other variables in the computations, the met cycle -especially MAT2B-still contributed substantially to DNA methylation prediction (MAT2B was ranked among the top 5% of highly selected variables in prostate, breast, liver, lung, and brain cancers) (Fig. 2c). We observed that the levels of MAT2B contribute to DNA methylation in nearly half of the variable regions across the genome even after accounting for various factors related to DNA methylation (MAT2B was selected by 42% of all Elastic Net models with MSE <0.04 on average) ( Supplementary Fig. 2e). Together, our results confirm that metabolism contributes to DNA methylation in many cases of human cancer and the association between metabolism and DNA methylation is stronger in some genomic regions than others.

Functional annotation of metabolically regulated regions of DNA methylation
Results of the integrative modeling across cancers indicate that defined regulation of DNA methylation happens in regions where gene expression may be affected, thereby suggesting that this regulation could drive essential cancer biology. We next set out to characterize all regions across the genome where the association between DNA methylation and the met cycle activity is particularly strong. To identify such regions, we designed a scanning algorithm to locate genomic regions spanning multiple CpGs with significant peaks of correlation of methylation with expression of met cycle enzymes ( Fig. 3a; Online Methods). We performed this analysis on each of the eight cancer types separately and identified distinct peak sets across the genome (Supplementary Table 2).
To assess potential bias toward highly methylated regions and regions where there is higher probe density, we analyzed the relationship between average absolute methylation of individual CpGs and their correlation with met cycle expression, and found no significant association (p-value of correlation=0.62), confirming that the identified peaks are distinct from highly methylated regions ( Fig. 3b; Online methods).
Density plots of peak distributions relative to the TSS of the nearest gene were concentrated around the TSS in all cancers ( Supplementary Fig. 3a), as expected given the higher density of probes in gene regulatory regions in the Illumina arrays ( Supplementary Fig. 3b). However, by further visualizing the distribution of the peaks immediately surrounding the TSS, we observed that peak distributions are more diffuse around the TSS (Supplementary Fig. 3c) compared to the probe density distribution control ( Supplementary Fig. 3d). This suggests potential enrichment in areas of the genome overlapping with gene body regions and CpG island shores where dysregulated DNA methylation has previously been observed in human cancers 6 . The peak distribution density plots extended up to a few hundred kilobases in distance from the nearest TSS, suggesting that DNA methylation at inter-genic parts of the genome may also be affected by the activity of met cycle.
We next tested the met cycle specificity of the identified peaks by correlating them with expression of randomly selected genes in the genome (Online Methods; Supplementary Fig. 4a). For the majority (>83%) of the identified peaks, the met cycle's correlation with DNA methylation was significantly non-random (p-value<0.05) ( Supplementary Fig. 4b). These results show that our approach was able to identify genomic regions where DNA methylation levels are specifically affected by the met cycle activity.
We next set out to identify genes that overlap with the identified peaks in each cancer type. Functional annotation of genes overlapping these peaks by means of pathway enrichment analyses across a comprehensive collection of more than 70 gene-set libraries 32 showed enrichment of epigenetic features in these regions consistently across all cancers. Strikingly, many of our peaks overlapped with peaks of histone-3 lysine-27 tri-methylation (H3K27me3) ( In addition to histone marks, tissue-specific and cell identity gene sets were also enriched in relevant cancer types, including "breast and ovarian cancer genes" in breast cancer ( Supplementary Fig. 5a); "abnormal nervous system" and "abnormal neuron morphology" in brain cancer (Fig. 3d); "asthma" and "lung carcinoma" gene sets in lung cancer (Fig. 3f); "kidney-specific" gene set in kidney cancer ( Supplementary Fig. 5b); and "large intestinal genes", "inflammatory bowl disease", and "colorectal carcinoma" gene sets in colon cancer (Fig. 3e). Finally, a number of developmental and signaling pathways were among the enriched pathways including "TGF-beta signaling" in kidney ( Supplementary Fig. 5b), "cell communication" pathway in liver (Fig. 3c), and "Gprotein coupled signaling" in bladder cancer ( Supplementary Fig. 5c). Organ and embryonic morphogenesis pathways were enriched in breast ( Supplementary Fig. 5a), bladder ( Supplementary Fig. 5c), and prostate ( Supplementary Fig. 5d), all of which are hormonally driven cancers. Interestingly, a previous study in breast cancer showed that embryonic developmental genes are enriched in regions of DNA hypomethylation compared to normal breast 36 . Together, these results illustrate the functional importance of the relationship between met cycle and DNA methylation across cancers.

Contribution of metabolism to DNA methylation at cancer gene loci
So far, we have shown that there is a surprising enrichment of peak regions of metabolically regulated DNA methylation at loci that link to essential aspects of cell identity and chromatin structure. We next questioned whether cancer-specific loci may also exhibit this interaction. We chose 19 well-characterized cancer-related genes such as TP53, PTEN, and ESR1, as well as 4 genes frequently differentially methylated in cancer APC, RASSF1, GSTP1, and MGMT (Online Methods). A recent study showed that DNA methylation for any given gene has two major principal components: one representing the promoter region and the other representing the coding sequence 37 . Furthermore, CpG methylation at promoter regions of genes is typically associated with repression, while gene body methylation is thought to increase expression 38 . We therefore applied our integrative modeling to DNA methylation at promoter and gene body regions of each cancer gene separately. In addition to the integrative approach, we also generated models using only the met cycle genes as prediction variables to quantify the predictive ability of met cycle in the absence of other factors. Thus, each cancer gene locus was analyzed once using the integrative approach and once using met cycle alone and 3-fold cross validation was performed in each case as previously described (Online Methods). We independently evaluated these findings by applying the same models to both permuted cancer gene methylation values and also randomly generated methylation values ( Supplementary Fig. 7a). In all tests, met cycle contribution was significantly (p-value< 10e-16) higher when applied to cancer gene methylation vs. permutations or random numbers (Online Methods; Supplementary Fig. 7b), confirming the specificity of signals contained in the true DNA methylation values at cancer loci. Furthermore, we tested the performance of the machine-learning algorithm using randomly generated variables for prediction of cancer gene methylation (Online Methods) and found in each We also simulated a dataset where prediction variables and the response are related via linear relationships and compared the accuracy of predictions in this simulated linear dataset with our original dataset (Online Methods). We saw in all cases that the improvement in MSE from our dataset (MSE 1.4-2 fold smaller than random MSE) is even more than what we observed with data of the same dimension that have a linear relationship (MSE 1.3 fold smaller than random MSE) ( Supplementary Fig. 8e,f). These independent tests confirm that machine-learning using the Random Forest algorithm is able to identity non-random signals in the data, and also that it can detect non-linear relationships between prediction variables and the response.
Next, we ranked all of the variables based on their overall usage according to the integrative models of cancer gene promoter and body methylations (Fig. 4e). Notably, many SGOC (including met cycle) enzymes were among the most frequently selected variables in all cancers ( Previous work has shown that expression of enzymes across different regions of the SGOC network is predictive of metabolic flux through the network 24 . Notably, we observed that several SGOC genes are consistently among the highly ranked variables by both Random Forest and Elastic Net models in multiple cancer types. Therefore, to understand which features of SGOC metabolism contribute to the interaction with methylation, we defined a sub-network that was commonly highly ranked by the models in multiple cancer types ( Fig. 4h; Online Methods). This SGOC sub-network comprises the MAT enzymes in the met cycle (MAT2B and MAT2A), as well as enzymes within serine-glycine metabolism such as phosphoglycerate dehydrogenase (PHGDH) and glycine amidinotransferase (GATM) (Fig. 4h). We generally observed negative associations between DNA methylation and expression of PHGDH and GATM, but positive associations with expression of MAT enzymes. A cautionary note however is that in many disease states, levels of particular metabolites in the methionine cycle substantially deviate from physiological ranges, thus activating compensatory mechanism and leading to correlation with DNA methylation in directions opposite of what would be expected from the biochemistry of the reactions 49 . Therefore, when interpreting the direction of correlations between metabolic enzyme levels and DNA methylation, it is important to note that they not only depend on the stoichiometry of the corresponding enzymatic reactions, but also on endogenous abundance of the related metabolites.
Together, our results suggest that a particular flux configuration through the SGOC metabolic network-which previous studies have shown to be predictable from gene expression patterns 24 -may be important for regulation of DNA methylation.

Cancer pathogenesis of metabolically regulated DNA methylation
Involvement of the met cycle in promoter and gene body methylation at cancer genes suggests a potential implication for this metabolic pathway in explaining part of the variability in cancer pathogenesis and patient outcome. To further assess this relationship, we divided patients in each cancer type into two groups based on overall predictability of their cancer loci methylation by the met cycle (see Online Methods). We then compared survival rates between the two groups ("predictable" by met cycle vs. "not predictable" by met cycle) in each cancer type using the Kaplan-Meier estimator 50 (Fig. 5a-h). An improved overall survival for the "predictable" group was observed, although the magnitude of this trend varied depending on cancer type with brain, kidney, liver, and colon cancers showing statistically significant differences (log-rank test p-values: brain= 3.92e-05, liver=0.0048, kidney=0.0085, colon=0.04) (Fig. 5a-d). The difference in survival between the predictable and non-predictable groups was not significant in the rest of the cancers studied here (Fig. 5e-h), possibly explained by limited power due to data censoring at later time points. The overall patterns however suggest that the regulation of DNA methylation by the met cycle may be important in maintaining a normal epigenome, and disruption of this relationship in specific subtypes of tumors can lead to high epigenetic stochasticity in those tumors that correspond to poor clinical outcomes. This is consistent with a previous study that showed DNA methylation stochasticity increased across samples with increasing malignancy (from normal to adenoma to carcinoma) 6 .
To validate the results of our survival analyses, we applied multivariate cox regression models to account for covariates such as mutations and clinical factors that are know to be associated with survival rates (Online Methods). We performed this test in the cases of brain, liver, and kidney cancers were the univariate analyses found highly significant differences between the predictable and non-predictable groups (Fig. 5a-c).
The models including covariates still showed a significant difference (p<0.05) between the predictable and non-predictable groups of patients even after taking mutational and clinical factors into account (see Online Methods for the list of covariates considered in each cancer), suggesting that a unique part of variation in survival may be explained by epigenetic regulation. We next tried to further validate our results through comparison with independent analyses of the TCGA data by the cBioPortal for Cancer Genomics (cBioPortal) 51 and Prediction of Clinical Outcome from Genomic profiles (PRECOG) 52 .
These analyses found lower survival in prostate cancer patients harboring tumors with deep deletions in the met cycle genes ( Supplementary Fig. 10a), and higher survival in kidney cancer patients where the met cycle enzymes are over-expressed ( Supplementary   Fig. 10b), respectively. These results confirm a relationship between met cycle and survival in the same direction as predicted by our hypothesis.

Discussion
In this study, we conducted a pan-cancer TCGA analysis of the molecular and clinical contributions to within-cancer (inter-individual) variation in DNA methylation. Through several lines of integrated analysis, we found the overall expression of both the methionine cycle and SGOC network to be strong predictors of multiple aspects of DNA methylation and consistently ranked as one of the highest contributing factors to cancerassociated DNA methylation such as methylation of numerous cancer genes. Within the methionine cycle, we consistently observed a more significant contribution from MAT2B and BHMT2, suggesting that the regulation may be occurring at these enzymatic steps.
MAT2B is the enzyme that converts methionine to SAM, therefore it is expected that this enzyme affects SAM levels more directly than other metabolic enzymes. The significance of BHMT2 but not MTR suggests that metabolism of choline and betaine may be more prevalent than folates in cases where one-carbon metabolism fuels DNA methylation.
We introduced a novel approach to identify chromatin regions with strong correlations between DNA methylation and metabolic enzyme levels. The identified regions for the met cycle enzymes significantly overlapped with histone modifications, consistent with enzymatic cross-talk between the two epigenetic processes 35

Data curation
Publically available genome-wide mRNA expression and DNA methylation data were

Assessment of batch effects
We used the TCGA Batch Effects online tool (http://bioinformatics.mdanderson.org/tcgabatcheffects) to check for the existence of batch effects in the data used in our study. For each cancer types in our study, both the DNA methylation and the RNA-seq batch effects were negligible (Dispersion Separability Criterion (DSC) score < 0.5 for all sample batches included in the study).

DNA methylation
The  Fig. 1a). Sex chromosomes were also excluded from all subsequent analyses of DNA methylation. In order to assess local DNA methylation, we divided the genome into 10kb intervals and calculated the average beta value across all probes within each bin. We then filtered regions where variation in methylation was modest (standard deviation < 0.2 across each dataset). The average beta-value across all remaining 10kb regions was then calculated for each sample individually and plotted in Supplementary Fig. 1c. In order to study DNA methylation at cancer loci, probes that mapped to each gene according to Illumina annotations were identified. Promoter DNA methylation was then defined as the average beta value across all probes mapping to a given gene and falling within one of the following positional categories based on Illumina chip annotation information: "TSS1500", "TSS200", or "5'UTR". Gene body methylation for each gene was defined as the average beta value across all probes mapping to a given gene and falling in "1 st exon", "Body", or "3'UTR" based on the annotation. Promoter and gene body methylation were separately modeled for each of the cancer genes in the study (Fig. 4).

Gene expression
Log-transformed gene normalized RSEM values were used as expression levels and lowexpression genes in each dataset were defined as having less than 70% of the samples with a count value larger than 3. Such genes were removed from further analysis.

Gene expression variables included in the integrative models
In addition to the major enzymes in the met cycle (MAT2B, MTR, BHMT2, AHCY), four classes of expression variables with potential links to DNA methylation were also included in the integrative models (see Supplementary Table 1 for the complete list of variables). The four classes are described in the following: "Other SGOC enzymes": Serine, glycine, one-carbon (SGOC) metabolic genes from our previous network reconstruction were included 24 . In order to separately assess the effect of the met cycle from the rest of the network, we excluded the met cycle enzymes from this class and treated them as a separate class ("Methionine cycle enzymes").
"Chromatin Remodelers": A list of human chromatin remodelers and DNA methylation machinery was constructed by combining the Gene Ontology (GO) chromatin modifiers list, GO chromatin remodelers list 58 , and methylated DNA binding proteins and demethylases 59 .
"Transcription factors": For each cancer type, transcription factors important in the pathogenesis or subtype specification based on previous literature were included 60 .
"SAM-metabolizing enzymes": DNA methyltransferases and other SAM-consuming enzymes (except for MAT enzymes already included in the class "Methionine cycle enzymes") according to Human Cyc 61 were included in this class.

Mutations included in the integrative models
For each cancer type, genes with frequent somatic mutations (minimum frequency of 5%) among the TCGA cohort according to the cBioPortal 51

Predictive modeling and variable ranking using the Random Forest algorithm
The Random Forest is a machine-learning algorithm that generates predictions by averaging over a collection of randomized decision trees. Since successive trees are built with bootstrap samples, the algorithm is robust to over-fitting, and also those samples that are left out (the out-of-bag (OOB) samples) can be used to quantify the contribution that prediction variables make to the overall response. The Random Forest method is designed to accommodate nonlinearities between the response and prediction variables as well as unknown interactions among the variables 29,63,64 . We used the R package "randomForest" 65 and performed 3-fold cross validation by manually dividing the samples in each cancer type into 3 training and test subsets. To build each forest, tree size was set to 500 and the "importance" parameter was set to "TRUE" in the R function "randomForest" so as to provide estimates for the importance of prediction variables.
Missing data were imputed using the "na.roughfix" function in the "randomForest" package. We obtained separate measures of importance for each variable from each Random Forest run. These importance scores are calculated as the percent increase in the mean squared prediction error on the OOB samples when a given variable is permutated.
Variables were ranked based on average importance scores across all cross validation folds. Prediction errors were calculated as the mean squared difference between the predicted vs. the observed methylation values for the test set samples. The square root of the mean squared error (MSE) has the same scale as the response (DNA methylation beta values in this case), and is therefore a direct measure of the accuracy with which predictions were made. (Fig. 1c; Supplementary Fig. 6a,b).

Predictive modeling and variable selection using the Elastic Net algorithm
Elastic Net is a penalized regression approach for variable selection and quantitative inference that identifies linear combinations of unique variables that contribute to a response variable such as the amount of DNA methylation. The algorithm was developed and benchmarked to avoid over-fitting in statistical modeling of high-dimensional data containing collinearity 27 . We applied the Elastic Net algorithm using the R package "glmnet" 66 . Elastic Net performs variable selection by minimizing a regularized cost function using the following equation where lambda is the tuning parameter and alpha is the Elastic Net penalty term. For each cancer type, the samples were divided into 3 independent test subsets (3-fold cross validation), and separate models were generated using each training subset. Using a grid of different tuning parameter values, we found the lambda that minimized the mean squared error using 5-fold cross validation within each training set for each model separately. The value of alpha was set to α=0.5 to handle potential correlated variables.
Finally, for each variable, average coefficient across the 3 independent models was calculated for each region and each cancer type. Due to the existence of categorical factors among our variables (for which scaling is not appropriate), we also calculated the selection rate as an alternative measure of variable importance referred to as "variable usage" in the manuscript. Variable usage was measured as the fraction of times across all cross validation folds that a variable was selected by the Elastic Net to be included in the final model (Supplementary Fig. 6c; Fig.4 f,g; Supplementary Fig. 9a-f). Finally, prediction errors were calculated as the squared difference (mean squared error (MSE)) between the predicted and measured DNA methylation values for the test sets ( Fig. 1c; Supplementary Fig. 6a,b).

Variable class contributions to DNA methylation
Variables were functionally categorized into the following 8 classes: "Methionine cycle enzymes", "Other SGOC enzymes", "Chromatin remodeling factors", "Transcription factors", "SAM metabolizing enzymes", "Clinical factors", "Copy number variations", and "Mutations". Results of the integrative modeling were summarized and reported in terms of the average contribution from each of the above functional classes in explaining DNA methylation variation. Variable importance scores from Random Forest models were averaged across all variables within a given class, and an overall class importance score was calculated. In the case of Elastic Net models, variable usage as described in the previous section, was averaged across variables in each class and an average percentage showing selection rate was calculated. Finally, classes were ranked in each cancer type according to their average contribution and the overall class ranks were plotted in Fig. 2a, Supplementary Fig. 2c,d, and Supplementary Fig. 6d,e.

Comparison to variance-matched random gene expression controls
A set of 100 randomly selected genes from the genome with similar cross-sample variation in expression as our original gene expression variables (TFs, SGOC, MET-C, SAM, and RMs) were considered. We performed this test on local DNA methylations (all variable 10kb regions) in brain cancer (LGG) as an example and repeated the integrative modeling using this set of randomly selected genes in addition to all other variables present in the original models. All gene expression variables were then ranked using a similar approach as described above. To compare our original gene expression variables with the variance-matched random genes, the ranks across all models were averaged (Fig.   1d), and p-values were obtained from one-tailed Mann-Whitney non-parametric test between the two groups from Elastic Net and Random Forest.

Distance to nearest gene transcriptional start site (TSS)
Selected 10kb regions were converted to genomic range objects using the R package "GenomicRanges" 67 . The distance to single nearest gene's transcription start site (TSS) was found using Genomic Regions Enrichment of Annotations Tools (GREAT) 68 .
Genomic regions are associated with nearby genes by first assigning a regulatory domain to every gene in the genome, and then finding genes whose regulatory domains overlap with a given genomic region. We set the association rule parameter in GREAT to "Single nearest gene" with a maximum extension of 1000kb for definition of regulatory domains.
Density plots of distance to TSS are depicted in Supplementary Fig. 2b. The same approach was used for annotating peaks obtained from Fig. 3 (density plots shown in Supplementary Fig. 3a,c). To obtain the distribution of Illumina probe densities around the TSS, we randomly selected 10000 probes across the arrays and applied the abovedescribed approach to measure the distance to nearest gene's TSS for each probe. Density plots were obtained for the purpose of comparison with the distribution of metabolically regulated peaks ( Supplementary Fig. 3b,d).

Identification of metabolically regulated genomic regions
To find peaks of strong association between the met cycle and DNA methylation, we designed a novel scanning method by applying the idea of Manhattan plots from e-QTL analyses 69 to DNA methylation data. In each cancer type, we first selected one of the major enzymes in the met cycle with the highest overall Spearman correlation with global and local DNA methylations (BHMT2 in brain, breast, prostate, and liver; MAT2B in lung and bladder; and AHCY in colon and kidney cancers), and calculated the Spearman correlation between its expression and the beta value of each individual probe across the genome. We then sorted the probes according to genomic coordinates and aligned the -log10 of the p-values obtained from the Spearman correlations along the chromosomes.
Next, we applied a sliding window scan for regions of strong association across the genome separately in each cancer type (Fig. 3a). For this, probes with the highest correlations (top 10% across the genome) were located and a 6kb window (+3kb and -3kb) flanking the genomic coordinate of the original probe was scanned. A region was reported as a "peak" if the following criteria were met: 1-Region included at least 3 probes with a correlation in the same direction as the original probe (positive or negative); 2-At least 80% of all probes within the region had a significant (p<0.00001) correlations with met cycle expression. After applying these filters, the selected regions were annotated and genes overlapping with each of the peaks were used for subsequent pathway enrichment analyses. Given the window size and the above criteria, the majority of the identified peaks only overlapped with one unique gene (see Supplementary Table 2 for a complete list of all identified peaks).
To assess potential bias toward highly methylated regions in the identified regions where correlation of methylation with met cycle expression peaks, we tested 2000 randomly selected probes across the genome. We then evaluated the association between methylation of each probe with the value of its Spearman correlation rho with met cycle expression-we used AHCY in colon cancer as an example in this test (Fig. 3b).
Finally, an additional filter was applied to rank the identified peaks according to peak shape. For this, the aligned correlation coefficients in each region were assessed with respect to whether they formed a peak according to an information theory score calculated by the R function "turnpoints" (refer to R package "pastecs" 70 ). This function finds all turning points (peaks and pits) in a series of points (in this case, aligned correlation coefficients), and calculates the information quantity of each turning point using Kendall's information theory. Finally, it measures a p-value against a random distribution of the turning points in a given series, with smaller p-values corresponding to less random shape and a higher probability of a turning point corresponding to a real peak or pit. We selected regions containing turning points with the most significant p-values (lowest 20%) in each cancer type and subsequently tested them for specificity for the met cycle as described in the following section.

Test of specificity of peaks for the met cycle
Each of the selected peaks was tested for specificity of their correlations with the met cycle expression (vs. gene expression in general). For this, 500 genes were randomly selected from the genome in each cancer type, and the Spearman correlation coefficient was measured between their expression and the methylation of every probe within a given peak. The fraction of significant correlations was calculated for all of the 500 genes as well as for the met cycle gene. A randomization q-value was calculated for the met cycle gene by comparing it to the distribution of the correlations calculated for the 500 random genes. This procedure was repeated separately for each peak in each cancer type and the results are summarized in Supplementary Fig. 6a,b.

Pathway enrichment analyses
Peaks were annotated according to Illumina information and UCSC Ref gene names for genes overlapping with the identified peaks were extracted. Pathway enrichment analysis was performed on the resulting gene list for each cancer type using Enrichr 32 . Combined scores from Enrichr were used to rank pathways. The Combined score "c" is defined as c=log(p)*z where p refers to the p-value from the Fisher's exact test and z is the z-score indicating the deviation from the expected rank. Enrichr first calculates Fisher's exact pvalues for many random gene sets to generate a distribution of expected p-values for each pathway in their pathway library. The z-score for deviation from this expected rank is therefore an alternative ranking score and the combined score is considered a corrected form of the enrichment score and p-value, which we used to sort pathways in Fig. 3c-f and Supplementary Fig. 5a-d. All gene sets in Fig. 3c-f and Supplementary Fig. 5 had Fisher's exact p-values <0.05, and the most highly enriched sets are shown ranked by the combined enrichment scores. Gene set names used in Fig. 3c-f and Supplementary Fig. 5 follow the convention used and described by Enrichr  Supplementary Fig. 7b.

Evaluation of modeling performance using randomized prediction variables
Using prostate cancer as an example, we performed simulation tests to determine whether the Random Forest as a methodology is able to utilize the information in the prediction variables beyond what could be expected if the predictors were only random noise and unrelated to the response. To investigate this, we modeled methylation in the prostate cancer dataset at 3 example cancer loci (GSTP1, RASSF1, and PITX2). These genes were selected based on previous evidence indicating the critical importance of their aberrant methylation in prostate cancer 42,72 . As controls, we generated 3 additional datasets. For the first dataset, we copied the exact response as the GSTP1 methylation, but randomly generated a predictor variable set of the same dimensions as the original variable set by sampling from a standard normal distribution. That is, each observation on each variable is a sample from a normal distribution of unit variance and should therefore have no relationship to the response. The other two datasets were generated in the same fashion, using RASSF1 and PITX2 methylation as responses and randomly generated variable sets as predictors. To assess the performance of the Random Forest computations, we compared the mean squared error (MSE) from predictions made using the original data with those made by the datasets consisting of random variables unrelated to the responses. For each of the three responses, we randomly divided the data into training and test sets, generated a total of 100 simulations consisting of 500 decision trees, and compared the resulting MSEs of the predictions made on the test points. Results are summarized in Supplementary Fig. 8a-d.
To quantify the improvement in the Random Forest algorithm by using the original variables over the randomly simulated variables, we defined an improvement metric (MSE-Imp), describing the relative improvement in prediction accuracy: MSE-Imp := !"#!!"#$ !"#!!"#$ where MSE-rand is the average MSE calculated using the random simulated variables and MSE-orig is the average MSE calculated using the original variables.
In this test, another simulated dataset of the same dimensions as the original dataset was generated where the variables and response were linearly related via the following equation: Y= ! ! ! !!! + To generate this linear dataset, we sampled the value of each prediction variable Xi from a standard normal distribution and the noise ε from a normal distribution with mean 0 and standard deviation 0.05. The values of the coefficients βi were selected uniformly at random from the interval [0,1]. We then measured the MSE improvement (MSE-Imp) for the linear dataset using the same approach as MSE improvements for the original datasets were calculated (explained in the previous paragraph). This allowed us to compare a linearly simulated dataset to our real dataset. Results are shown in Supplementary Fig.   8e-f.

Network construction
Genes in the serine, glycine, one-carbon (SGOC) network (including the met cycle genes) that were among the most highly ranked variables (top 15%) in at least 4 of the cancer datasets according to the Elastic Net models and at least 3 of the cancer datasets according to the Random Forest models were selected. A metabolic network consisting of these enzymes was then constructed using MetScape 73 where nodes represent genes and metabolites, and edges represent biochemical links. We fixed the node size for metabolites but adjusted node sizes for genes to correspond to the number of cancers in which each variable was highly ranked (among the top 15% of all variables) (Fig. 4g).
For nodes not directly connected to the rest of the network, we manually added dashed lines where appropriate.

Survival analyses
In each cancer type, the average error of prediction of DNA methylation at cancer loci was measured for each patient across all Elastic Net models using only met cycle variables for prediction. Patients were then divided into 2 groups based on predictability of their methylation by the met cycle activity ("predictable"= below-median prediction error, "not predictable"= above-median prediction error). To estimate overall survival time, "days-to-death" was used with vital status information and last follow-up date used to right-censor subjects (subjects alive at last follow-up were censored from the analysis beyond their last follow up date). The relationship between survival and predictability was then analyzed using the "survfit" function in the R package "survival" 74 and visualized by Kaplan-Meier curves. Log rank test p-values were calculated by fitting models of overall survival to patients' "predictability" group assignments using the "survdiff" function in the survival package for each cancer type separately. Results are depicted in Fig. 5.

Multivariate cox regression
In the three cancer types (brain, liver, and kidney) where univariate analysis showed a highly significant difference in survival between the predictable and non-predictable groups as described above, and also the sample size allowed for sufficient power to perform multivariate analysis, we used relevant clinical and mutational factors as covariates and repeated the survival analysis. The following factors were individually tested as covariates in separate models of overall survival along with "predictability" status as the fixed effect: Brian cancer: all frequent somatic mutations (see Supplementary Table 1 for the complete list), histological diagnosis, age, gender, and initial weight; Liver cancer: all frequent somatic mutations (see Supplementary Table 1 for the complete list), tumor stage, history of other malignancies, and residual tumor; Kidney cancer: all frequent somatic mutations (see Supplementary Table 1 for the complete list), age, and race. In each case, the results of regression using the "coxph()" function in R provided the p-value for the significance of the predictability status when modeling overall survival in the presence of covariates.