Gene-set Analysis with CGI Information for Differential DNA Methylation Profiling

DNA methylation is a well-established epigenetic biomarker for many diseases. Studying the relationships among a group of genes and their methylations may help to unravel the etiology of diseases. Since CpG-islands (CGIs) play a crucial role in the regulation of transcription during methylation, including them in the analysis may provide further information in understanding the pathogenesis of cancers. Such CGI information, however, has usually been overlooked in existing gene-set analyses. Here we aimed to include both pathway information and CGI status to rank competing gene-sets and identify among them the genes most likely contributing to DNA methylation changes. To accomplish this, we devised a Bayesian model for matched case-control studies with parameters for CGI status and pathway associations, while incorporating intra-gene-set information. Three cancer studies with candidate pathways were analyzed to illustrate this approach. The strength of association for each candidate pathway and the influence of each gene were evaluated. Results show that, based on probabilities, the importance of pathways and genes can be determined. The findings confirm that some of these genes are cancer-related and may hold the potential to be targeted in drug development.

 indicates hypermethylation in cases compared to controls; while a large negative value implies hypomethylation. The indices i and j assume that the difference is probe-specific and pair-specific. The property of pair-specificity has been observed in earlier studies (Martino et al. 2013). That is, the difference in DNAm between each case-control pair per probe varies among probes. This difference follows a normal distribution 2 ( , ) ij j N  where both mean and variance parameters are probe specific. This assumption takes account of the heterogeneity across genes (indexed by j ) and between individuals (indexed by i ). , where each k f represents the information from each individual pathway as well as its constituents.
If any cross-talk exists among pathways, then this can be incorporated in the function  . Several examples of the specification as well as the contribution from its gene members will be described in the following sections. The complete Bayesian hierarchical model for the differential DNA methylation profiling is Here k  stands for the pathway effect and j  is used to evaluate and select influential genes. In addition, the impact of CGI on the DNAm is assessed through

Specification of Pathway Information
Suppose there are K pathways to be examined for possible association with the disease under study. Then, to incorporate such biological information, we propose the following functions for the following different scenarios. Suppose the probe j P locates in gene j G , which may or may not be in the k th pathway, where k =1,…, K . To simplify the notation, we assume for now that the joint-effect  can be partitioned as shown below. We now define the functions   kj fP in the following ways to account for the given type of pathway information.
(a) Null effect: The first way is for the "null effect". This represents the case where each of the K pathways has no impact on the phenotype. This can occur when each pathway shows no relationship with the level of DNAm. In other words,   0 kj fP for every k under the independence assumption for the K pathways. The null information can also stand for the case when the gene is not a member of the pathway.
(b1) CGI-independent constant pathway effect: (b2) CGI-dependent constant pathway effect: (c2) CGI-dependent degree effect: Similar to the case in (b2) for CGI-dependence, the degree effect may differ according to CGI status. That is, the quantity C k  is for genes whose probes locate in CGIs and nC k  for genes whose probes are not in CGIs.

Data management, matching, and normalization:
The ovarian cancer study from the United Kingdom Ovarian Cancer Population Study (UKOPS) recruited 266 postmenopausal women with ovarian cancer (131 pre-treatment and 135 post-treatment cases) and 274 age-matched healthy controls. The DNA methylation data are available from the GEO database (accession number GSE19711). The Illumina HumanMethylation27 BeadChip was used to obtain DNA methylation profiles across 27,578 CpGs in 540 whole blood samples. DNA was bisulphate converted, amplified, fragmented and hybridized to the BeadChip arrays. The methylation level of each specific CpG site was calculated from the intensity values of methylated and unmethylated DNA beads as a ratio of fluorescent signals. This ratio is 7 usually called the  value. It is continuous between 0, standing for completely unmethylated, and 1 for completely methylated.
The procedures for data management include outlier detection, removal of batch effect, matching, and normalization. First, the quality of data was evaluated based on methylation level, defined as the ratio of signal from a methylated probe relative to the sum of both methylated and unmethylated probes. As the first step, the outliers were detected with the boxplot of 540 ratio values per each CpG site. Any value larger or smaller than 1.5 times the IQR (inter-quartile range) was denoted as an outlier value.
Next, for each sample subject, the number of CpG sites detected as outlier values among the total 27,578 sites was counted, and this number was defined as the "outlying number" for this subject sample. A new boxplot of such numbers of outliers for all 540 subjects was then created. Any extreme value in this new boxplot corresponds to a subject whose outlying number is extreme and thus this subject was considered as an outlying subject sample and removed from analysis. This excluded a total of sixty-two subjects, 11 of which were pre-treatment patients, 21 post-treatment patients, and 30 healthy controls.
After the quality control procedure, 478 subject samples remained. The quantile normalization was next performed with the preprocessCore package in R to avoid a potential batch effect, following the removal of three batches with an overwhelming portion of healthy samples (47 controls and 0 case in batches numbered 10 and 11; 19 controls and 4 cases in batch 12). For the remaining batches, the number of controls was between 15 and 21; while the number of cases was between 26 and 32.
In addition, to avoid the interference of a treatment effect on methylation, we removed the 112 post-treatment cases for the remainder of the analysis. When the data 8 management procedure was finished, we were left with 118 pre-treatment cases and 135 controls. We then matched case-control pairs by restricting the difference in age to be less than three, obtaining 104 case-control pairs for the following analysis.

Computational details:
The final methylation data included 104 case-control pairs and 1,675 probes. The    Figure S1a : Boxplots of 200 randomly selected probes ( ij  ) locating in CGIs.
15 Figure S2: To visualize the difference between the DMGs identified in model (b2) and the disease-related genes reported in other studies, we colored the genes in the plot of Pathways in cancer in Figure S2. If the DMGs were also reported in more than one earlier study, they were colored in purple. If the DMGs were not covered by previous reports or were reported in only one article, then they were colored in blue. For genes not identified in model (b2) but in other studies, they were colored in darker pink if identified in methylation-related studies, or lighter pink if in non-methylation-related studies. Notice that, as compared with other genes in this pathway, the genes colored in blue are highly connected to the genes in purple or darker pink, and are more heavily linked among themselves. This pattern suggests not only significant association among blue genes, purple genes and the disease, but also a strong relationship between genes of these three colors (purple, blue and darker pink). These results can be considered as supplementary findings to earlier research and studied for further evaluation of their molecular interactions.