Introduction

In recent years, epigenetic changes have been extensively studied and many studies have demonstrated their association with biological phenomena such as genomic imprinting, immune response regulation and developmental programming1,2,3,4. Epigenetics is the study of the connections between genotype and phenotype and one of its unique revelations is that gene expression patterns can be regulated without altering DNA sequences5,6. Different types of epigenetic changes, such as DNA methylation, microRNA expression and chromatin modification, have been reported as important players in many physiological functions6,7. Among them, DNA methylation is the most studied mechanism and participates in the pathogenic processes of many diseases, such as cancers, neurodevelopmental disabilities and allergic diseases1,8,9. Thus, a growing body of research has been devoted to dissecting the methylation profiles in patients and trying to identify potential methylation biomarkers.

In the mammalian genome, DNA methylation usually occurs at a cytosine within a CpG dinucleotide and is occasionally found outside of CpG10. With the advancement in experimental technologies, several methods, including Illumina Infinium microarray and whole genome shotgun bisulfite sequencing, can be used to investigate genome-wide methylation profiles in tissue samples11. An important feature of these methods is that most of them require bisulfite conversion of DNA samples in order to distinguish methylated from unmethylated nucleotides. Bisulfite conversion transforms cytosine residues into uracil residues but leaves 5-methylcytosine residues unchanged, which allows researchers to quantify the methylation levels. Challenges arise, however, when treating DNA samples with bisulfite. A critical question is how to determine whether input DNA samples are completely converted by bisulfite. Although Illumina methylation microarrays do have quality control probes for assessing the efficiency of bisulfite conversion, such information is usually not available in public datasets. An arbitrary threshold between the intensity ratios of bisulfite-treated and untreated DNAs was used to indicate whether the bisulfite conversion was complete, which cannot fully and quantitatively reflect the level of bisulfite conversion. Incomplete bisulfite conversion leads to overestimation of the methylation levels, since only a portion of the unmethylated cytosines is converted. Conversely, over-treatment with bisulfite causes degradation of DNA samples and increases the probability of converting a methylated cytosine to a thymine11. Consequently, identification of gene markers associated with the efficiency of bisulfite conversion may help to overcome this challenge.
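
The overestimation caused by incomplete conversion can be illustrated with a short calculation. The sketch below assumes a simple model that is not taken from this study: methylated cytosines are never converted, and each unmethylated cytosine is converted independently with a fixed efficiency. The function name and the example values are hypothetical.

```python
def apparent_methylation(true_level, conversion_efficiency):
    """Fraction of cytosines read as methylated after bisulfite treatment.

    Assumed model: methylated cytosines always survive, and an unmethylated
    cytosine escapes conversion with probability (1 - conversion_efficiency).
    """
    if not 0.0 <= true_level <= 1.0:
        raise ValueError("true_level must be in [0, 1]")
    if not 0.0 <= conversion_efficiency <= 1.0:
        raise ValueError("conversion_efficiency must be in [0, 1]")
    # Unmethylated cytosines that were not converted are misread as methylated.
    unconverted = (1.0 - true_level) * (1.0 - conversion_efficiency)
    return true_level + unconverted


if __name__ == "__main__":
    # A locus that is truly 20% methylated, at three conversion efficiencies.
    for efficiency in (1.0, 0.95, 0.80):
        print(efficiency, round(apparent_methylation(0.20, efficiency), 3))
```

Under this model, a locus that is truly 20% methylated would be read as 24% methylated at 95% conversion efficiency, and the bias grows as efficiency falls.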

High-throughput technologies, such as microarrays and next-generation sequencing, facilitate the identification of genes with altered methylation levels and other experimental methods are usually performed to validate the results. For example, mass spectrometry has been widely used in methylation analyses12,13,14. However, few studies have explored genes with consistent methylation levels across different samples. Similar to the concept of “housekeeping” genes showing consistent and stable gene expression levels15, appropriate internal controls for methylation studies can not only help to reduce the experimental bias from artificial effects, but also provide a better baseline to compare data from distinct biological samples. Therefore, we aimed to perform a large-scale analysis of methylation data in order to identify potential housekeeping genes with stable methylation levels across multiple human tissues.

In this study, we analyzed a total of 682 methylation microarrays generated from Illumina Infinium HumanMethylation27 BeadChips and used a bioinformatics approach to identify 27 genes showing consistent methylation levels across all samples. The top five genes were validated using mass spectrometry in 24 human cell lines and a linear association between detected methylation levels and methyl concentrations of DNA samples was demonstrated in three genes, suggesting their potential role as markers for the efficiency of bisulfite conversion.

Results

Identification of consistently methylated probes

After the quality checks of samples and probes, a total of 668 samples (Table 1) containing 7,829 probes remained for further analyses. These 668 samples comprised more than 10 different cell types from 8 independent experimental batches and ethnicities. The following analysis procedures were all carried out using R version 2.9. For each gene, the coefficient of variation (CV) and stability score16 were calculated to estimate the consistency of methylation levels among all samples and the top 100 probes with the lowest CVs and highest stability scores were recorded as list A and A′, respectively. To remove false-positive probes identified by coincidence, resampling tests were performed by randomly splitting the 668 samples into halves with equal sample sizes, i.e., 334 samples each. Similarly, the top 100 probes with lowest CVs were recorded as list B and C and the top 100 probes with highest stability scores were recorded as list B′ and C′. Detailed information about the resampling test is described under Methods. The results of the random trials are summarized in Table S1, which shows that the mean CVs and stability scores of the top 100 probes were generally larger than 80 and even approached or attained 90 among the six lists. Among the 10,000 trials, the number of probes identified in list B at least once was 224 and the number of probes identified at least once in any of the 6 lists was only 295. Such high concordance suggested that both the CV and stability score approaches were stable and their findings were generally very similar. In addition, these two approaches identified 69 common probes out of the top 100 probes in lists A and A′, which further demonstrated the consistency of the results. Therefore, we focused on the intersecting set of probes (n = 27) among all six lists for the following analyses.

Table 1 Characteristics of analyzed Illumina Infinium Human Methylation27 microarray datasets

Methylation levels of the selected 27 probes across different datasets

The 27 candidate probes consistently appearing in all six lists are shown in Table 2. As shown in Figure S1, all of these 27 probes (red dots) displayed high methylation levels and very low CVs. For example, the highest CV among the 27 probes was only 0.1347, which ranked 36th among the 7,829 probes. To be more specific, we further examined the methylation levels of the 27 probes in all samples from different datasets (Table 2). In general, their β values of methylation were very stable across all 668 samples, independent of different experimental batches, and all of them were higher than 0.8 and even 0.9. For instance, as shown in Figure S2, the M-values of N4BP2 and EGFL8 in distinct datasets did not vary much. Therefore, these results suggested that our approach was able to successfully identify probes with consistent methylation levels. The 27 selected probes showed consistent methylation levels across samples with different diseases, tissue types and ethnicities.

Table 2 Information on the 27 probes commonly identified in the six lists

Validation of selected probes using mass spectrometry

To narrow down the target probes for validation, we repeated the same procedures shown in Figure 1, except that only the top 20 probes were tallied. Among the 10,000 resampling trials, only 1.09% of probes (n = 85) were identified at least once in all six lists, indicating that our proposed approach to identify probes with stable methylation levels is not sensitive to a change in the number of probes selected. Next, the probes were ranked by their average number of appearances in lists B–C′ to select candidates for experimental validation. The top 5 probes (N4BP2, EGFL8, CTRB1, TSPAN3 and ZNF690) were selected (Table S2) and all of them were identified more than 9,885 times out of the 10,000 trials, suggesting they had stable methylation levels across different biological samples.

Figure 1

Flowchart for identification of genes with consistent methylation levels across different samples.

Twenty-four cell lines derived from 13 different cell types were investigated using mass spectrometry (Table 3). After DNA was extracted and bisulfite converted, mass spectrometry was performed according to the standard protocols provided by the manufacturer (Sequenom, San Diego, CA). The results of the mass spectrometry are illustrated in Figure 2 and all of the five genes generally showed consistent and stable methylation levels among all cell lines. N4BP2 and EGFL8, for example, had methylation levels higher than 0.75 in all cell lines, which demonstrated that these two genes were highly methylated independent of tissue type (Figure 2A–B). In addition, CTRB1, ZNF690 and TSPAN3 showed high β values (>0.75) in 24 (96%), 23 (92%) and 19 (76%) cell lines, respectively. Thus, the results indicated that our approach can successfully identify genes that are stably and highly methylated across different cell types.

Table 3 Characteristics of the 24 cell lines investigated using the MassARRAY system
Figure 2

Methylation levels of the five genes detected by mass spectrometry across 24 cell lines.

The X-axis denotes the names of the different cell lines and the Y-axis represents the average beta value of the methylation level.

Lastly, we evaluated the sensitivity of detecting methylation levels in the top three genes showing stable methylation levels, namely N4BP2, EGFL8 and CTRB1, using different concentrations of methylated samples. Two standard DNA samples, one fully methylated (100%) and one unmethylated (0%), were purchased from Qiagen (Valencia, CA) and used to make DNA samples with 0%, 25%, 50%, 75% and 100% methylation levels. Subsequently, these samples were investigated in the MassARRAY system and the methylation levels of N4BP2, EGFL8 and CTRB1 are shown in Figure 3. For each gene, a linear relationship (R² ≥ 0.98) was observed between its methylation level and the methyl concentration of DNA samples. In addition, these three genes all showed low (<0.2) and high (>0.8) methylation levels in the 0% and 100% methylated samples, respectively. This suggested that the methylation levels of these three genes were highly associated with the methylated concentrations of DNA samples. Therefore, these genes can serve as potential methylation markers for bisulfite conversion.
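
The linearity check described here can be sketched as an ordinary least-squares fit of measured methylation against the expected methylated fraction. The sketch is only an illustration: the function name is hypothetical, and the measured values are made-up numbers chosen for the example, not measurements from this study.

```python
def linear_fit_r2(x, y):
    """Ordinary least-squares line through (x, y); returns slope, intercept, R^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Expected methylated fractions of the mixed standards (from the text).
EXPECTED = [0.0, 0.25, 0.50, 0.75, 1.0]
# Hypothetical measured methylation levels, for illustration only.
MEASURED = [0.08, 0.30, 0.52, 0.71, 0.93]

if __name__ == "__main__":
    slope, intercept, r2 = linear_fit_r2(EXPECTED, MEASURED)
    print(f"slope={slope:.3f} intercept={intercept:.3f} R2={r2:.4f}")
```

A gene would pass the criterion used in the text when the fitted R² is at least 0.98.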

Figure 3

Correlation between concentration and methylation levels of EGFL8, N4BP2 and CTRB1.

Five concentrations, including 0%, 25%, 50%, 75% and 100%, of the methylated DNA samples were investigated by mass spectrometry.

Discussion

Changes in methylation have been shown to be an important player in regulating cell growth, normal cellular functions and even the development of diseases17,18. Thus, how to effectively and accurately measure the methylation levels of multiple genes simultaneously has become a critical issue. Although some experimental technologies, such as enzyme-based gel electrophoresis and affinity-enrichment methods, can be used in methylation studies without performing bisulfite conversions, most of the popular techniques still require treating samples with bisulfite in advance11. However, inappropriate bisulfite conversion may easily introduce systematic errors and lead to incorrect conclusions. A previous study has demonstrated that the rate of cytosine deamination to uracil highly depends on temperature and incubation time19. Therefore, identification of internal controls for assessing the conversion efficiency of DNA samples is necessary. In addition, internal controls can provide a baseline for comparison of the quality of input DNA samples and provide a stable reference line to normalize methylation data among different samples. For example, the delta-delta cycle threshold (ddCt) method has been widely used in analyzing PCR data for mRNA expression values20,21 and internal controls, such as ACTB and 18S rRNA, which have high and stable expression values in different tissue types, are essential for interpreting the results. In this study, we demonstrated that N4BP2, EGFL8 and CTRB1 were highly methylated not only in samples detected by microarrays (β values > 0.9, Table 2), but also in 24 cell lines across 13 tissue types examined by mass spectrometry (β values > 0.75, Figure 2). Therefore, the results of two independent techniques both showed that these genes had high methylation levels in several tissue types.
In addition, a linear relationship (R² ≥ 0.98) was demonstrated between the methylation levels of three identified genes and the methyl concentration of DNA samples (Figure 3). These data further suggested their capability for serving as internal controls because their methylation levels can be used to reflect the efficiency of bisulfite conversion in input samples. In conclusion, N4BP2, EGFL8 and CTRB1 are possible internal controls for methylation studies since their methylation levels were not only consistent in many different human tissues but also proportional to the methyl concentration of DNA samples.

Two approaches, CVs and stability scores, were performed in this study to identify probes showing consistent methylation levels (Figure 1). For a given gene, the CV was used to evaluate consistency across different samples, whereas the stability score approach16 utilized a rank product method to estimate its suitability in serving as a control in distinct datasets. Interestingly, the results of these two approaches were very similar and identified 69 probes in common out of the top 100 probes, motivating us to use both approaches. Also, moderate to high Pearson correlation coefficients (r = 0.62–0.76) were observed between the rankings of genes obtained from CV and stability score approaches, further suggesting their concordance.

Resampling tests were used to exclude probes identified by random chance and high similarities were observed in the results (Table S1). In addition, although selecting the top 100 probes is an arbitrary threshold, the results showed minimal variation when the threshold number was changed to 20. To summarize, the results suggest that our procedures were not sensitive to the chosen parameters and were able to reproducibly identify probes by integrating two different approaches.

The expression levels of hypermethylated genes are down-regulated if these genes are subject to regulation by DNA methylation18. Such an epigenetic regulation mechanism is observed in several genes related to embryonic development22. For instance, DAZL, one of the top 27 probes, is an important regulator participating in spermatogenesis and oogenesis and its demethylation is only observed in germ cells but not somatic cells23. GDF11 is a growth factor involved in the formation of mesoderm and neurogenesis24 and its gene expression level can be induced by a histone deacetylase (HDAC) inhibitor and inhibited by HDAC325. Accordingly, the results suggest that these identified genes have biological relevance.

Although we used gene symbols to denote the CpG islands showing high methylation, readers should keep in mind that methylation levels are dependent on the specific chromosome coordinates (Table S2), because different methylation statuses of distinct CpG loci in the same gene could be observed. For example, methylation changes were observed in the first exon of HTRA3 in smoking-related lung cancer, but such alterations were not detected in its promoter region26. However, gene symbols were chosen to represent the CpG islands in this study, since such methylation changes in CpG islands may affect the overall function of the corresponding gene. To date, the literature has rarely reported methylation changes in the top five genes identified by our analysis (N4BP2, EGFL8, CTRB1, TSPAN3 and ZNF690). A single study has shown that TSPAN3 was down-regulated in relapsed Wilms tumor; however, such gene expression changes were not controlled by methylation27. Therefore, additional studies of the methylation status of these five genes are required to evaluate their functional roles in relationship to methylation.

In this study, we have demonstrated the consistent methylation levels of N4BP2, EGFL8 and CTRB1 in many human tissues and cell lines; however, one caveat is that methylation profiles in each cell line may be affected by in vitro cell culture procedures28,29. Epigenetic changes in cells are sensitive to their growth conditions and thus subtle differences in environment may lead to large differences in methylation profiles. Two previous studies showed that some variations in the methylation profiles existed between cell lines and tissues, even if they were from the same organ28,29. Therefore, a preliminary test in different cell lines is a prerequisite before utilizing the methylation markers identified in this study.

In conclusion, we have identified five genes with stable hypermethylation across different human tissue types. Among them, N4BP2, EGFL8 and CTRB1 not only can serve as internal controls for methylation studies, but also are markers for the efficiency of bisulfite conversion.

Methods

Sample collection

All methylation microarrays analyzed in this study were generated using Illumina Infinium HumanMethylation27 BeadChips, which contain probes interrogating 27,578 CpG loci covering more than 14,000 genes. Methylation levels in Illumina methylation assays were quantified by the β value, the ratio of methylated alleles over all alleles for a given CpG locus. Most of the microarray samples were retrieved from the Gene Expression Omnibus website30, with the accession numbers GSE17648, GSE1776931, GSE2006732, GSE2008032, GSE24087, GSE2613333 and GSE2728434; the other microarrays were collected from our in-house studies. The details of the analyzed microarrays are summarized in Table 1.

Processing and filtering of microarray data

The protocol used to identify probes with high and consistent methylation levels is illustrated in Figure 1. First, to remove microarray samples with low quality and intensity, the mean signal of every probe within each slide was calculated in all 682 samples. Samples were excluded for subsequent analyses if the following condition was met: the mean of average β value across all probes was ≤0.335. In addition, individual probes were filtered out if they displayed a missing value in any one of the samples. Consequently, 14 samples were excluded and approximately 20,000 probes were filtered out, which resulted in 7,829 probes as potential targets in the following approaches.
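
The two filtering rules can be sketched as follows. This is a minimal illustration under stated assumptions, not the original R code: the cutoff is read as a mean β value of 0.3 (matching the 0.3 cutoff given for the stability score analysis), the missing-value rule is applied to the retained samples only, and the function name, sample names and probe identifiers are hypothetical.

```python
def filter_samples_and_probes(beta, min_mean_beta=0.3):
    """Return (kept_samples, kept_probes).

    beta maps sample id -> {probe id -> beta value, or None when missing}.
    """
    kept_samples = []
    for sample, values in beta.items():
        observed = [v for v in values.values() if v is not None]
        # Exclude low-quality arrays: mean beta at or below the cutoff.
        if observed and sum(observed) / len(observed) > min_mean_beta:
            kept_samples.append(sample)

    all_probes = set()
    for values in beta.values():
        all_probes.update(values)

    # Keep a probe only if every retained sample has a value for it.
    kept_probes = sorted(
        probe for probe in all_probes
        if all(beta[s].get(probe) is not None for s in kept_samples)
    )
    return kept_samples, kept_probes


# Toy input (hypothetical): s3 has a low mean beta, cg02 is missing in s2.
TOY_ARRAYS = {
    "s1": {"cg01": 0.91, "cg02": 0.40, "cg03": 0.85},
    "s2": {"cg01": 0.88, "cg02": None, "cg03": 0.80},
    "s3": {"cg01": 0.10, "cg02": 0.05, "cg03": 0.20},
}

if __name__ == "__main__":
    print(filter_samples_and_probes(TOY_ARRAYS))
```

In the toy input, sample s3 is dropped for its low mean β value and probe cg02 is dropped because it is missing in a retained sample.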

Identification of probes with stable methylation levels across different samples

Prior to performing subsequent statistical approaches, the average β values in all microarrays were transformed into “M-values” based on the following equation.
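
The M-value is conventionally defined as the base-2 logit of the β value. Assuming that this standard definition is the one intended here, the transformation is:

```latex
M = \log_{2}\!\left(\frac{\beta}{1-\beta}\right)
```

Under this relation, β = 0.5 corresponds to M = 0 and β = 0.9 corresponds to M = log2(9) ≈ 3.17.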

Du et al. reported that this M-value transformation is able to improve the determination of methylation levels in statistical analyses by showing greater consistency and robustness36. After the M-transformation, the coefficient of variation (CV) was utilized to rank the investigated probes for suitability as “housekeeping” probes. Specifically, the CVs of the 7,829 probes were calculated over 668 samples and sorted in ascending order. Based on the results, the top 100 probes having the smallest CV values were reported as possible “housekeeping” candidates (list A). To establish a null baseline for comparison, a resampling test was performed 10,000 times through the following steps. First, the 668 methylation samples were randomly divided in half (334 samples each in lists B and C) and the CV values were calculated. Similar to the approach in identifying list A, the top 100 probes with smallest CV values were recorded and compared with the members of list A. In addition, the top 100 probes identified in list B were compared with the members in list C. Lastly, the matching probes between list A and lists B and C created from 10,000 random trials were recorded and the common members in list B and C were also tallied for further comparisons.
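
The CV ranking and the resampling test can be sketched as below. This is a simplified illustration, not the original R implementation. It assumes the base-2 logit form of the M-value and the sample standard deviation for the CV. It also counts only the probes shared by lists A, B and C in each trial, rather than recording every pairwise comparison. All function names and the toy data are hypothetical, and the toy run uses far fewer probes and trials than the 100 probes and 10,000 trials described above.

```python
import math
import random
import statistics
from collections import Counter


def m_value(beta):
    """Base-2 logit of a beta value (assumed M-value definition)."""
    return math.log2(beta / (1.0 - beta))


def coefficient_of_variation(values):
    """Sample standard deviation divided by the absolute mean."""
    mean = statistics.mean(values)
    if mean == 0:
        return math.inf  # CV is undefined when the mean is zero
    return statistics.stdev(values) / abs(mean)


def top_stable_probes(m_values, samples, n_top):
    """Return the n_top probes with the smallest CV over the given samples."""
    cv = {
        probe: coefficient_of_variation([per_sample[s] for s in samples])
        for probe, per_sample in m_values.items()
    }
    return set(sorted(cv, key=cv.get)[:n_top])


def resampling_test(m_values, n_top, n_trials, seed=0):
    """Count how often each list A probe is also found in both random halves."""
    rng = random.Random(seed)
    samples = sorted(next(iter(m_values.values())))
    list_a = top_stable_probes(m_values, samples, n_top)
    half = len(samples) // 2
    hits = Counter()
    for _ in range(n_trials):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        list_b = top_stable_probes(m_values, shuffled[:half], n_top)
        list_c = top_stable_probes(m_values, shuffled[half:], n_top)
        hits.update(list_a & list_b & list_c)
    return list_a, hits


# Toy beta values (hypothetical): one stable probe and one variable probe.
TOY_BETA = {
    "cg_stable": [0.90, 0.91, 0.90, 0.91, 0.90, 0.91, 0.90, 0.91],
    "cg_variable": [0.55, 0.95, 0.60, 0.90, 0.70, 0.97, 0.65, 0.85],
}
TOY_M = {
    probe: {f"s{i}": m_value(b) for i, b in enumerate(betas)}
    for probe, betas in TOY_BETA.items()
}

if __name__ == "__main__":
    list_a, hits = resampling_test(TOY_M, n_top=1, n_trials=100)
    print(sorted(list_a), dict(hits))
```

A probe whose hit count is close to the number of trials is selected regardless of how the samples are split, which is the property used to shortlist candidates.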

Verification of possible probes with consistent and stable methylation levels

To evaluate the reliability of identifying housekeeping methylation probes by using CV values, another established algorithm was utilized16. Briefly, this approach estimated the stability score of each probe based on its methylation level. The formula to calculate the stability score was:

The symbols μi and σi denote the mean methylation level of probe i and its standard deviation across all 668 samples, respectively. The coefficient α was set to its default value, 0.25, as suggested by the authors16. Similar to the CV value approach, a gene was excluded from further analyses if its mean β value was smaller than 0.3. This criterion was applied in order to yield the same number of probes investigated in both methods and thereby establish a fair baseline for comparison. Moreover, since all samples used in this study were Illumina Infinium HumanMethylation27 BeadChips, the rank product score considering platform-independence, which was outlined by the original authors, was not performed here16. The scoring scheme in this approach was similar to the previous method implementing CV values, that is, lists of candidate probes over 668 samples were examined and ranked by the stability score in descending order. Likewise, 10,000 random trials were carried out and three gene lists were obtained for each trial. Meanwhile, the three lists were also compared to each other and the numbers of times that each gene was identified in the lists B′ and C′ were also tallied. Lastly, the candidate probes with stable methylation levels were narrowed down to those consistently found in all six lists after 10,000 random trials.

Validation of possible gene targets using the MassARRAY system

A total of 24 cell lines were analyzed using the MassARRAY system to validate the methylation levels of selected gene targets. The characteristics of the cell lines are summarized in Table 3. First, genomic DNA was isolated from the cells by proteinase K-phenol/chloroform extraction following standard protocols with 0.5% SDS and 200 μg/ml proteinase K. The DNA concentration of each sample was adjusted to 50 ng/ml and total genomic DNA (500 ng) underwent DNA bisulfite conversion using an EZ DNA Methylation™ kit (Zymo Research, Orange, CA). Of the bisulfite-treated DNA products, 200 ng were used for PCR amplification. The primers were designed by using the program EpiDesigner β (http://www.epidesigner.com/start3.html). PCR conditions were optimized to preferentially amplify fragments within a size range of 300 to 500 bp. Subsequently, 2 μL of Shrimp Alkaline Phosphatase (SAP) enzyme was added into 5 μL PCR products to dephosphorylate unincorporated dNTPs. Lastly, in vitro transcription and RNase A cleavage were carried out and the mass spectrum was obtained from the PCR reactions. Quantitative methylation analysis software provided by the manufacturer (Sequenom, San Diego, CA) was used to analyze the results.