Collective effects of long-range DNA methylations predict gene expressions and estimate phenotypes in cancer

DNA methylation of various genomic regions has been found to be associated with gene expression in diverse biological contexts. However, most genome-wide studies have focused on the effect of (1) methylation in cis, not in trans and (2) a single CpG, not the collective effects of multiple CpGs, on gene expression. In this study, we developed a statistical machine learning model, geneEXPLORE (gene expression prediction by long-range epigenetics), that quantifies the collective effects of both cis- and trans- methylations on gene expression. By applying geneEXPLORE to The Cancer Genome Atlas (TCGA) breast and 10 other types of cancer data, we found that most genes are associated with methylations of as much as 10 Mb from the promoters or more, and the long-range methylation explains 50% of the variation in gene expression on average, far greater than cis-methylation. geneEXPLORE outperforms competing methods such as BioMethyl and MethylXcan. Further, the predicted gene expressions could predict clinical phenotypes such as breast tumor status and estrogen receptor status (AUC = 0.999, 0.94 respectively) as accurately as the measured gene expression levels. These results suggest that geneEXPLORE provides a means for accurate imputation of gene expression, which can be further used to predict clinical phenotypes.


Cancer specificity of geneEXPLORER
Since enhancers are cancer specific, we expected that geneEXPLORER models are cancer specific as well. To demonstrate this, we applied geneEXPLORER trained in TCGA breast cancer data to predict 13,823 genes of TCGA lung cancer data. The results demonstrated that the model did not work in lung cancer (mean R2=0.02), confirming our hypothesis ( Figure S4).

Predicting additional phenotypes using predicted gene expression
In addition to cancer status and ER status, 5 years survival and breast cancer sub-type were also predicted. For survival data, since 732 samples (83.8%) were censored among 873 patients, there were only 298 samples whose 5 year-survival data are available. 207 patients died before 5 years and 91 patients lived more than 5 years. If censoring occurred before 5 year of follow-up, 5-year survival is indicated as NA. If censoring occurred after 5 year of follow-up, 5-year survival is indicated as Yes. The model predicted 5 years survival with lower accuracy (AUC=0.71) than cancer status or ER status, possibly due to the high portion of censored data ( Figure S5).
The breast cancer sub-type status was available for 620 samples. The subtypes are Luminal A (LumA), Luminal B (LumB), Triple-negative/basal-like (Basal), HER2enriched (Her2), and Normal-like (Normal). Using the predicted gene expression, breast cancer subtypes were accurately predicted with 0.174 mis-classification error. Using observed (true) gene expression, breast cancer subtype was predicted with similar prediction accuracy with 0.134 mis-classification error (Table S1).

Training sample size calculation
We investigate gene expression prediction accuracies using geneEXPLORE with various training sample sizes (50, 100, 150, .., 650) and tested on 221 samples from TCGA breast cancer data. As we tested 3 genes, we found that the model with n=250 reaches saturation point in terms of prediction accuracy as shown in Figure S6. We also conducted a survey that shows many samples are available in TCGA data in various cancer types for methylation (array) and Expression (sequencing) ( Table S2). We found that among 21 cancer types, 16 cancer types have samples around n=250 or more for both methylation and expression. These various types of TCGA cancer data can be served as training data sets. This survey shows that geneEXPLORE is widely applicable for various cancer types Figure S1. Prediction accuracy of gene expression (Cross-validated R2) with various dist ance from promoter regions: We included probes within promoter regions, 2Mb, 5Mb,10Mb, 20Mb,30Mb,40Mb, and 50Mb from promoter regions as inputs for the model. Finally, we inclu ded all probes in the entire chromosome on which the gene located (referred as Entire.Chr). F urther, we selected the distance which maximized prediction accuracy by CV for each gene a nd included all probes within the distance as inputs for the model (referred as Selected.By CV ). The number of genes included in the boxplot is 13,910.    is saturated around n=250. Probes within 10Mb from promoter regions were used to build the models.

Figure S 6. Comparison of cross-validated R2 between Lasso vs Elastic Net to
predict gene expression using methylation probes within 10Mb from promoter regions for 13982 genes from TCGA breast cancer data. Pearson's correlation between two is 0.99991. The prediction accuracy using the elastic net is better than the lasso for 86.18% of the genes. Figure S 7. Prediction performance of geneEXPLORE tested on a separate cohort. Using geneEXPLORE trained in TCGA Lung cancer data set using methylation array and RNA seq, we tested the model in another lung cancer data (GSE60645). The data consists of 117 samples for which both methylation array data and gene expression array data available. Methylation probes within 10Mb from the promoter region of each gene were used to predict gene expression. Gene expression data were only used to measure prediction accuracy. The data are publicly available in gene expression omnibus. In the qqplot, R2, between predicted and observed expression levels plotted against the null distribution of R2.