DNA methylation (DNAm) has a critical role in regulating gene expression. Recent epigenome-wide association studies in humans have revealed that locus-specific DNAm signatures are associated with susceptibility to different environmental exposures, intermediate phenotypes, and diseases.1,2 Hence, locus-specific DNAm signatures are potential biomarkers in the era of precision medicine.3 We recently found that CpG sites with large interindividual DNAm variation are more likely to be potential biomarkers,4 suggesting that a database of interindividual DNAm variation would be useful to determine target regions for future epigenome-wide association studies.

Several studies have surveyed interindividual DNAm variation5 using peripheral blood, which contains many different cell types, but they did not investigate cell-type-specific signatures.6 Only a few studies have reported interindividual DNAm variation using purified cells, such as neutrophils7 and monocytes.8,9 Because differences in DNAm profiles among cell types are greater than those among individuals,4 profiling of DNAm variation using purified cells is essential to revealing interindividual DNAm variation within a cell type. In addition, the DNAm profiling methods frequently used in previous studies (e.g., array-based and targeted bisulfite sequencing) cover a limited number of human autosomal CpG sites (2–13%).4 Accordingly, whole-genome bisulfite sequencing, which provides the highest coverage (~90%) of human CpG sites among currently available methods, is desirable for compiling an interindividual DNAm variation database.4

Here we report the development and release of “iMETHYL” (http://imethyl.iwate-megabank.org), an integrative database (methylome, transcriptome, and genome) featuring interindividual DNAm variation. iMETHYL provides summarized open data calculated in our previous study, which characterized interindividual DNAm variation in two principal blood cell types, CD4+ T-lymphocytes (CD4T) and monocytes, which were collected from a cohort of healthy subjects (102 CD4T subjects and 102 monocyte subjects; Table 1) by whole-genome bisulfite sequencing.4 In addition to DNAm analysis, we performed whole-genome sequencing and whole-transcriptome sequencing to comprehensively profile genomic variation and gene expression, respectively. Briefly, sequence reads were aligned to the human reference genome GRCh37/hg19 using BWA-MEM (ver. 0.7.5a-r405), and single-nucleotide variant (SNV) calling was conducted using the Genome Analysis Toolkit (GATK version 2.5-2). Gene annotation was performed using GENCODE release 19.10 Details regarding the methods of quality-control filtering, DNAm profiling, gene expression profiling, and variant calling were described by Hachiya et al.4 In addition to CD4T and monocytes, we isolated neutrophils from 94 subjects and performed whole-genome bisulfite sequencing, whole-genome sequencing, and whole-transcriptome sequencing (Table 1). All subjects were recruited as part of the Tohoku Medical Megabank Project, and they provided written informed consent to participate in our study. All subjects belonged to a single large cluster on a PCA plot that consisted of Japanese subjects of the 1000 Genomes Project and the Tohoku Medical Megabank Project (Supplementary Figure 1). The study was approved by the Ethics Committee of Iwate Medical University (HG H5-558 19). iMETHYL was implemented on a UNIX server with CentOS, Apache HTTP Server, and JBrowse 1.12.1.11

Table 1 Demographic and profile statistics of iMETHYL

Based on the DNAm profiles, we estimated the average DNAm levels and variation for ~24 million autosomal CpG sites. iMETHYL reports interindividual DNAm variation using two measures: the standard deviation (SD) and the reference interval (RI), the latter defined as the difference between the 95th and 5th percentiles of the DNAm level among individuals.4 In addition, iMETHYL includes the average and SD of gene expression levels for >14,000 genes and allele frequencies for ~9 million autosomal SNVs (Table 1). Statistics regarding age, sex, and database profiles used in iMETHYL are presented in Table 1. Furthermore, genomic annotation tracks, such as gene models, repetitive elements, CpG islands, and microarray probes, are available in the iMETHYL browser (Table 2).
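The two variation measures can be illustrated with a minimal sketch. The function names, the linear-interpolation percentile rule, and the use of the sample SD (n − 1 denominator) are assumptions made for illustration; the exact estimators used to build iMETHYL are described by Hachiya et al.4 and may differ.

```python
import statistics


def percentile(values, q):
    """Percentile (q in [0, 100]) by linear interpolation between ranks.

    The interpolation rule is an assumption of this sketch.
    """
    ordered = sorted(values)
    if len(ordered) == 1:
        return float(ordered[0])
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


def summarize_site(levels):
    """Summarize DNAm levels (one value per individual) at a single CpG site.

    Requires at least two individuals, because the sample SD is undefined
    for a single value.
    """
    return {
        "mean": statistics.mean(levels),
        "sd": statistics.stdev(levels),  # sample SD; an assumption
        "ri": percentile(levels, 95) - percentile(levels, 5),  # RI = P95 - P5
    }
```

Because the RI ignores the most extreme 5% of values at each tail, it is less sensitive to a few outlying individuals than the SD, which is one reason to report both.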

Table 2 List of available tracks in iMETHYL

iMETHYL was developed to provide an informative, easy-to-use resource that enables investigators to explore the DNAm levels and variability of potential biomarkers identified by epigenome-wide association studies or candidate-gene studies. In the iMETHYL browser, regions of interest can be specified using gene symbols (GENCODE release 19), dbSNP IDs, DNA methylation array probe IDs, or genomic positions. The genome browser provides graphical views of genomic annotations and the average methylation level and variability (SD and RI) of each CpG site in each of the three human cell types (Figure 1a). In addition, tracks are provided for the average expression level and SD of each gene in each cell type and for the allele frequencies of each SNV among the 102 (CD4T), 102 (monocytes), and 94 (neutrophils) subjects.

Figure 1

Graphical view of iMETHYL. (a) Three-layer omics data are provided as browser tracks. The browser displays several tracks, which are shown for the region surrounding the DNAm biomarker for tobacco smoking, cg05575921. Users can select tracks that provide information from external sources on gene structure, expression, and SNVs and cell-type-specific original tracks (e.g., CD4T, monocytes, and neutrophils) that show average DNAm levels and different measures of variation (SD and RI). (b–d) Detailed information on CpG tracks for CD4T, monocytes, and neutrophils. The frequencies of the three DNAm categories among individuals are shown as Mlf_high (≥ 67%), Mlf_mid (34–66%), and Mlf_low (≤ 33%). CD4T, CD4+ T-lymphocytes; DNAm, DNA methylation; Mlf_high, frequency of hypermethylated DNA; Mlf_low, frequency of hypomethylated DNA; Mlf_mid, frequency of intermediate methylation DNA; RI, reference interval; SD, standard deviation; SNV, single-nucleotide variant.

In the example shown in Figure 1a, the iMETHYL genome browser displays several tracks for the region flanking cg05575921, a DNAm biomarker for tobacco smoking12,13 located in the aryl-hydrocarbon receptor repressor (AHRR) gene. This CpG site is markedly demethylated in current smokers.12,13 The browser shows the average methylation level and variability of each CpG site in the three cell types (CD4T, monocytes, and neutrophils); selecting a bar in a CpG track opens a pop-up window with a histogram of DNAm levels at that CpG site for the corresponding cell type (Figure 1b–d). iMETHYL is also useful for investigating cell-type-specific DNAm variability. At the CpG site shown in Figure 1, DNAm levels in CD4T were high and narrowly distributed (Figure 1b), whereas broader distributions were found in monocytes and neutrophils (Figure 1c and d).
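The three category frequencies shown in the CpG pop-up windows (Mlf_high, Mlf_mid, and Mlf_low; thresholds from the Figure 1 legend) can be sketched as follows. The function name is hypothetical, DNAm levels are assumed to be expressed in percent, and values falling between the stated thresholds (e.g., 33.5%) are assigned to the intermediate category, which is an assumption of this sketch rather than a documented iMETHYL rule.

```python
def methylation_category_frequencies(levels_percent):
    """Fraction of individuals in each DNAm category at one CpG site.

    levels_percent: DNAm levels in percent (0-100), one per individual.
    Thresholds follow the Figure 1 legend: high >= 67%, low <= 33%;
    everything in between is counted as intermediate (an assumption for
    non-integer values such as 33.5%).
    """
    n = len(levels_percent)
    if n == 0:
        raise ValueError("at least one DNAm level is required")
    high = sum(1 for v in levels_percent if v >= 67)
    low = sum(1 for v in levels_percent if v <= 33)
    mid = n - high - low
    return {
        "Mlf_high": high / n,
        "Mlf_mid": mid / n,
        "Mlf_low": low / n,
    }
```

A site with a narrow, high distribution, like the CD4T example, would have Mlf_high close to 1, whereas a broader distribution spreads individuals across two or three categories.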

Furthermore, investigators can use the browser to explore variability in gene expression and SNVs. For example, selecting a bar in the FPKM (fragments per kilobase of exon per million mapped fragments) tracks opens a pop-up window with a histogram of gene expression levels, together with the average expression level and SD for that gene. This information provides important clues to the functional relevance of known or putative DNAm biomarkers.
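For readers unfamiliar with the expression unit, the following minimal sketch applies the standard definition of fragments per kilobase of exon per million mapped fragments and then summarizes one gene across individuals. The function names are hypothetical, and the use of the sample SD is an assumption; the actual expression pipeline is described by Hachiya et al.4

```python
import statistics


def fpkm(fragment_count, exon_length_bp, total_mapped_fragments):
    """Standard FPKM: fragments per kilobase of exon per million mapped fragments."""
    if exon_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("exon length and library size must be positive")
    return fragment_count * 1e9 / (exon_length_bp * total_mapped_fragments)


def summarize_expression(fpkm_values):
    """Mean and sample SD of one gene's FPKM across individuals (>= 2 values)."""
    return {
        "mean": statistics.mean(fpkm_values),
        "sd": statistics.stdev(fpkm_values),  # sample SD; an assumption
    }
```

For example, 500 fragments mapped to a gene with 2,000 bp of exonic sequence in a library of 25 million mapped fragments corresponds to an FPKM of 10.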

Data on the mean and variation of the DNAm level of each CpG site for each of the three cell types can be downloaded from the iMETHYL website so that users can find CpG sites of their own interest based on the DNAm level and variation or differences between cell types.
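As an illustration of how the downloaded summaries might be screened, the sketch below selects CpG sites by variability within one cell type or by the difference in mean DNAm between two cell types. The record layout, field names, cell-type labels, and example values are all hypothetical; they do not reflect the actual format of the iMETHYL download files, which should be checked on the website.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CpGSummary:
    """Hypothetical per-site record; not the actual iMETHYL file layout."""
    chrom: str
    pos: int
    mean: Dict[str, float]  # cell type -> mean DNAm level (percent)
    sd: Dict[str, float]    # cell type -> SD of DNAm level (percent)


def variable_sites(sites: List[CpGSummary], cell_type: str,
                   min_sd: float) -> List[CpGSummary]:
    """Sites whose interindividual SD in one cell type is at least min_sd."""
    return [s for s in sites if s.sd[cell_type] >= min_sd]


def cell_type_specific_sites(sites: List[CpGSummary], cell_a: str, cell_b: str,
                             min_diff: float) -> List[CpGSummary]:
    """Sites whose mean DNAm differs between two cell types by at least min_diff."""
    return [s for s in sites
            if abs(s.mean[cell_a] - s.mean[cell_b]) >= min_diff]


# Made-up toy values for demonstration only.
EXAMPLE_SITES = [
    CpGSummary("chr1", 1000,
               {"CD4T": 92.0, "Monocyte": 70.0},
               {"CD4T": 2.1, "Monocyte": 11.5}),
    CpGSummary("chr1", 2000,
               {"CD4T": 50.0, "Monocyte": 52.0},
               {"CD4T": 3.0, "Monocyte": 2.5}),
]
```

Thresholds such as `min_sd` and `min_diff` are study-specific choices; the same pattern applies if the RI is used in place of the SD.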

In summary, we constructed a public database, iMETHYL, that provides a reference for human DNAm variation. iMETHYL is the first database featuring interindividual DNAm variation based on high-coverage whole-genome bisulfite sequencing using purified CD4T, monocytes, and neutrophils. Because the data were obtained from apparently healthy subjects, the multi-omics genomic data provided by iMETHYL can be used as a reference control. Investigators can examine DNAm variation, gene expression, and SNVs at any specific region of the human genome, which can enable the identification of variable regions in the population to design assay probes for microarrays or targeted sequencing. iMETHYL provides multi-omics data for three different cell types to the scientific community. The iMETHYL browser will be a useful resource not only for researchers specializing in epigenomics but also for those interested in the interactive analysis of DNA methylation, gene expression, and genomic variation.
