Expression quantitative trait methylation analysis elucidates gene regulatory effects of DNA methylation: the Framingham Heart Study

Expression quantitative trait methylation (eQTM) analysis identifies DNA CpG sites at which methylation is associated with gene expression. The present study describes an eQTM resource of CpG-transcript pairs derived from whole blood DNA methylation and RNA sequencing gene expression data in 2115 Framingham Heart Study participants. We identified 70,047 significant cis CpG-transcript pairs at p < 1E−7 where the top most significant eGenes (i.e., gene transcripts associated with a CpG) were enriched in biological pathways related to cell signaling, and for 1208 clinical traits (enrichment false discovery rate [FDR] ≤ 0.05). We also identified 246,667 significant trans CpG-transcript pairs at p < 1E−14 where the top most significant eGenes were enriched in biological pathways related to activation of the immune response, and for 1191 clinical traits (enrichment FDR ≤ 0.05). Independent and external replication of the top 1000 significant cis and trans CpG-transcript pairs was completed in the Women’s Health Initiative and Jackson Heart Study cohorts. Using significant cis CpG-transcript pairs, we identified significant mediation of the association between CpG sites and cardiometabolic traits through gene expression and identified shared genetic regulation between CpGs and transcripts associated with cardiometabolic traits. In conclusion, we developed a robust and powerful resource of whole blood eQTM CpG-transcript pairs that can help inform future functional studies that seek to understand the molecular basis of disease.


Results
An overview of the study design is presented in Fig. 1.Of 2115 FHS participants (48% women, age 54 ± 15 years) included in the eQTM analysis, 686 were from the FHS Offspring cohort and 1,429 were from the FHS Third Generation cohort (Table 1).The average BMI was 28.0 ± 5.6 kg/m 2 , while the average serum triglycerides concentration was 114 ± 78 mg/dL.Average fasting blood glucose concentration was 100 ± 21 mg/dL.

Gene-level eQTM results.
We defined cis CpG-transcript pairs as those where the CpG site and transcription start site were within 1 Mb of one another.We identified 70,047 significant cis CpG-transcript pairs (33,385  unique CpGs; 8534 unique eGenes) at p ≤ 1E−7; the top 10,000 most significant cis CpG transcript pairs are presented in Supplemental Table 1.The majority (49,280; 70%) of the significant cis CpG-transcript pairs involved transcripts of protein-coding genes, while seven percent (4971) were annotated to lncRNAs.The 8534 unique eGenes corresponded to 6693 unique lead CpGs, i.e. the CpG site that is most significantly associated with that eGene in an eQTM CpG-transcript pair.DNA methylation of the lead CpG site was associated with increased cis expression of 2909 (34%) eGenes, and with decreased cis expression of 5625 (66%) eGenes.
We additionally defined trans CpG-transcript pairs as those where the CpG site and transcription start site were > 1 Mb from one another, and we identified 246,667 significant trans CpG-transcript pairs (12,637  unique CpGs, 4300 unique eGenes) at p ≤ 1E−14; the top 10,000 most significant trans CpG-transcript pairs are presented in Supplemental Table 2.The majority (210,944; 86%) of the significant trans CpG-transcript pairs involved transcripts of protein-coding genes, and 11,271 (5%) involved expression of lncRNAs.The 4300 unique eGenes identified corresponded with 1139 lead CpGs.DNA methylation of the lead CpG site was associated with increased trans expression of 1566 (36%) eGenes, and with decreased trans expression of 2734 (64%) eGenes.

Replication of CpG-transcript pairs.
A recent eQTM analysis of nasal epithelial cells by Kim et al. 1 identified 16,867 significant CpG-transcript pairs using the Illumina HumanMethylation450K methylation data and RNA-seq gene expression data.We mapped 8481 CpG sites and 3331 gene transcripts to those included in our eQTM analyses, which corresponded to 15,158 CpG-transcript pairs that we subsequently compared with our significant cis and trans eQTM results.Of these, 1192 of 2703 (44%) cis CpG-transcript pairs replicated in our study at p ≤ 1E−7 with matching effect directionality.In contrast, only 41 of 6,165 (0.67%) trans CpG-transcript pairs identified in Kim et al. replicated in our study at p ≤ 1E−14 with matching effect directionality.
For the most significant cis and trans CpG-transcript pairs, we identified the top 1000 most significant unique eGenes and their lead CpG.We evaluated these pairs for replication in two independent external cohorts: a cohort of 1248 participants with European ancestry (EA), African ancestry (AA), and Hispanic ancestry (HA) from the Women's Health Initiative (WHI), and a cohort of 521 Black participants from the Jackson Heart Study (JHS).Replication results for cis CpG-transcript pairs in both cohorts are presented in Supplemental Table 3, while replication results for trans CpG-transcript pairs in both cohorts are presented in Supplemental Table 4.
In the WHI, all but one of the 1000 cis CpG-transcript pairs from discovery in FHS replicated at a nominally significant p-value of 0.05 with matching effect directionality, while 997 replicated at p < 1E−7 with matching effect directionality.Furthermore, 973 of 1000 trans CpG-transcript pairs from FHS discovery replicated at a nominally significant p-value of 0.05 with matching effect directionality, while 926 trans CpG-transcript pairs replicated at p < 1E−14 with matching effect directionality.
In the JHS, 697 of 820 (85%) cis CpG-transcript pairs from FHS discovery replicated at a nominally significant p-value of 0.05 with matching effect directionality, while 455 (55%) replicated at p < 1E−7 with matching effect directionality.Furthermore, 722 of 880 (82%) trans CpG-transcript pairs from FHS discovery replicated at a nominally significant p-value of 0.05 with matching effect directionality, while 228 (26%) replicated at p < 1E−14 with matching effect directionality.www.nature.com/scientificreports/Gene ontology.We evaluated gene ontology terms to identify specific biological and cellular pathways that were enriched among the top 1000 unique cis and trans eGenes 13,14 .The top 1000 unique cis eGenes were enriched in 33 pathways related to cell signaling and adhesion at a false discovery rate (FDR) ≤ 0.05 (Supplemental Table 5).The top 1000 unique trans eGenes were enriched in 582 pathways related to the activation of the immune response (Supplemental Table 6).

GWAS enrichment.
We examined the overlap of the significant eQTM CpG-transcript pairs with genetrait associations in the National Human Genome Research Institute (NHGRI)-European Bioinformatics Institute (EBI) GWAS catalog 15 .Using lead eQTL genetic variants for eQTM transcripts in enrichment analysis, the cis eGenes were enriched for genes associated with 1208 traits in the GWAS catalog at an enrichment FDR ≤ 0.05 (Supplemental Table 7).The trans eGenes were enriched for genes associated with 1191 traits in the GWAS catalog at an enrichment FDR ≤ 0.05 (Supplemental Table 8).
Genomic feature enrichment.We evaluated enrichment of the CpG sites included in significant CpGtranscript pairs for 17 specific genomic features, including CpG islands, shores, bodies, and differentially methylated regions (DMRs).Significant cis CpG-transcript pairs were significantly enriched for 15 of 17 genomic features at an enrichment FDR ≤ 0.05 (Supplemental Table 9), with the top most significant features being upstream CpG shores, CpG islands, downstream CpG shores, DNase hypersensitivity sites (DHS), and gene bodies.Significant trans CpG-transcript pairs were significantly enriched for 13 of 17 genomic features at an enrichment FDR ≤ 0.05 (Supplemental Table 10), with the top most significant features being CpG islands, DHS, gene bodies, regions within 200 bases upstream of the transcription start site (TSS), and enhancers.
eQTM CpG-transcript associations.To illustrate the association between DNA methylation at specific CpG sites and gene expression, we identified the top five most significant cis CpG-transcript pairs and the top five most significant trans CpG-transcript pairs and plotted the associations for each pair (Supplemental Figs. 2 through 11).In general, these figures show a negative association between DNA methylation and gene expression; the only CpG-transcript pair to show a positive association is the trans relationship between methylation at cg13704117 (KIAA1267) and expression of ENSG00000214425.7 (LRRC37A4P) (Supplemental Fig. 11).The CpG-transcript associations in Supplemental Figs. 6, 6, and 8 exhibit bimodal distributions that were not attributable to any nearby SNPs (i.e., within 100 bp) in conditional analysis.rs115687047, located near cg04071440, did not contribute to the bimodality of its cis association with ENSG00000204644.9 (ZFP57) shown in Supplemental Fig. 4 (p = 0.15).rs72989301, located near cg17901463 (GSTM1), was significantly associated with expression of ENSG00000134184.12(GSTM1) (p = 1.2E−26), but a significant cis association between the CpG and transcript remained (p < 1E−308) as shown in Supplemental Fig. 6.Finally, the SNPs rs1619178 and rs3113782 are located near cg20262684 and have a linkage disequlibrium of 1; they did not contribute to the bimodality of its trans association with ENSG00000185304.15(RGPD2) shown in Supplemental Fig. 8 (p = 0.75).
eQTM CpG-transcript pairs in mediation analysis of cardiometabolic traits.To illustrate the application of the eQTM resource we generated, we provide examples from the associations of significant cis CpG-transcript pairs with three cardiometabolic traits: serum triglycerides, fasting blood glucose, and body mass index (BMI).Specifically, we first evaluated the mediated effect of DNA methylation on the clinical phenotype through the expression of the linked eGene.To identify candidate CpG-transcript pairs to include in mediation analysis, we identified (1) the overlap of significant cis CpG sites with published CpG-trait associations from the MRC Integrative Epidemiology Unit (IEU) EWAS catalog (http:// www.ewasc atalog.org/) 16 , and (2) significant results from de novo transcriptome-wide association studies (TWAS) of RNA-seq gene expression data and log-transformed serum triglycerides, fasting blood glucose, and BMI in the FHS.The strategy of aligning trait-associated CpGs with corresponding trait-associated transcripts identified 14 CpG-transcript pairs associated with serum triglycerides, 36 associated with fasting glucose, and 153 associated with BMI.These pairs and their corresponding traits were then tested for the following mediation effects: (1) the total effect of DNA methylation on the cardiometabolic trait, (2) the direct effect of DNA methylation on the cardiometabolic trait independent of gene transcript expression, and (3) the mediated effect of DNA methylation on the cardiometabolic trait through gene transcript expression.Results are summarized in Supplemental Table 11.As an example, Fig. 2 illustrates significant mediation of the effects of DNA methylation at the CpG site cg11024682 on serum triglycerides, fasting blood glucose, and BMI via expression of the SREBF1 gene.The proportion of the association between DNA methylation and the three cardiometabolic traits mediated by SREBF1 expression ranged from 0.16 to 0.36.Gene transcript expression mediated the effect of DNA methylation on serum triglycerides (log-transformed) for ten of the 14 CpG-transcript pairs.Significant mediation of DNA methylation on gene expression was observed for the expression of RNASET2, SREBF1, ABCG1, and TOM1L2 at p ≤ 0.05.
Gene transcript expression significantly mediated the effect of DNA methylation on fasting blood glucose for 24 of the 36 CpG-transcript pairs, of which six had direct and mediated effects in opposite directions (e.g., a negative direct effect and a positive mediated effect).The greatest proportions of the CpG-fasting glucose associations mediated by gene expression (p < 2E−16 for all) were observed for the expression of KLRF1 (proportion mediated = 0.87), TRAPPC2B (proportion mediated = 0.82), NKIRAS2 (proportion mediated = 0.74), and RRP12 (proportion mediated = 0.52).
Finally, gene transcript expression significantly mediated the effect of DNA methylation on BMI for 117 of the 153 CpG-transcript pairs, of which 55 had direct and mediated effects in opposite directions.For the CpG-transcript pairs where the direct and mediated effects were in the same direction, the greatest proportions of the CpG-BMI associations mediated by gene expression (p < 2E−16 for all) were observed for the expression of TSPYL1 (proportion mediated = 0.98), PLD3 (proportion mediated = 0.94), and FAS (proportion mediated = 0.80).Additionally, two CpG-transcript pairs showed significant mediation of the CpG-BMI association through expression of lncRNA genes BX284668.5 and LINC00996.
eQTM CpG-transcript pairs in colocalization analysis of cardiometabolic traits.We conducted a Bayesian colocalization analysis to evaluate whether the cis-CpG-transcript eQTM pairs associated with serum log triglycerides, fasting blood glucose, and BMI were regulated by shared genetic variants.The MRC IEU EWAS catalog was used to identify CpG sites associated with each of the clinical phenotypes of interest, and three sources of data were used for colocalization analysis (Fig. 3).Among significant serum triglycerides-associated CpG sites identified in the MRC IEU EWAS catalog, 15 overlapped with CpGs from significant cis CpG-transcript pairs where the CpG and gene transcript also shared at least one SNP (i.e., the eQTL variant for the transcript matched the mQTL variant for its paired CpG).Colocalization results for the 15 cis eQTM CpG-transcript pairs, which included 12 unique CpG sites associated with 4876 significant cis-mQTL variants and 11 unique gene transcripts associated with 6,327 significant cis-eQTL variants, are presented in Supplemental Table 12.
The probability of colocalization H4 ≥ 80% was observed between the CpG and transcript SNPs for ten eQTM CpG-transcript pairs.The SNP rs7215055 was associated with serum triglycerides 17 and was also associated with methylation of CpG cg11024682 and expression of gene ATPAF2 (probability of colocalization H4 = 100%).Among significant fasting glucose-associated CpG sites identified in the MRC IEU EWAS catalog, 138 overlapped with CpGs from significant cis-CpG-transcript pairs where the CpG and gene transcript also shared at least one SNP.Colocalization results for the 138 cis-eQTM CpG-transcript pairs, which included 109 unique CpG sites associated with 38,758 significant cis-mQTL variants and 172 unique gene transcripts associated with 100,546 significant cis-eQTL variants, are presented in Supplemental Table 13.Probability of colocalization ≥ 80% was observed between the CpG and transcript SNPs for 51 eQTM CpG-transcript pairs.There was no overlap of SNPs identified in a GWAS of fasting glucose 18 with both mQTL and eQTL variants for CpG-transcript pairs, thus further colocalization analyses were not carried out for this trait.
Finally, among significant BMI-associated CpG sites identified in the MRC IEU EWAS catalog, 207 overlapped with CpGs from significant cis CpG-transcript pairs where the CpG and gene transcript also shared at least one SNP.Colocalization results for the 207 cis-eQTM CpG-transcript pairs, which included 134 unique CpG sites associated with 82,823 significant cis-mQTL variants and 182 unique gene transcripts associated with 232,511 significant cis-eQTL variants, are presented in Supplemental Table 14.Probability of colocalization ≥ 80% was observed between the CpG and transcript for 67 eQTM CpG-transcript pairs.The SNPs rs3817334, rs10838738, and rs1064608, which were associated with BMI 19 , were also mQTL genetic variants associated with the CpG cg17580616 and eQTL genetic variants associated with the gene ACP2 (probability of colocalization H4 = 100%).

Discussion
Our study of DNA methylation in relation to gene expression in 2115 FHS participants identified over 70,000 significant cis CpG-transcript pairs and over 246,000 significant trans CpG-transcript pairs.We generated a comprehensive database of whole blood cis and trans eQTM CpG-transcript pairs that can be used to investigate disease mechanisms, pathways, and processes.We previously evaluated eQTM CpG-transcript pairs in the FHS using the Illumina 450 K DNA methylation array in conjunction with array-based gene expression data 7 .The present study expands on this prior work by incorporating more extensive DNA methylation profiling (Illumina EPIC DNA methylation array) and RNA-seq expression, resulting in a powerful resource for capturing associations between DNA methylation and gene expression.Not only did we identify four times as many cis-eQTM CpG-transcript pairs (70,047 vs. 16,416) and 24% more trans-pairs (246,667 vs. 198,960) than we previously identified using array-based gene expression data 7 , but we also provide eQTM CpG-transcript pairs for the expression of lncRNAs.
DNA methylation can alter the regulation of gene transcription in multiple ways, such as aiding in the recruitment of proteins involved in increasing gene expression or inhibiting transcription factor binding to a specific DNA sequence.At promoter sites, DNA methylation generally precludes transcription directly by blocking the binding of transcriptional activators or indirectly by recruiting methyl-binding proteins and co-repressor complexes.In this study, we used eQTM analysis to identify numerous CpG sites that, when methylated, are associated with cis and trans eGene expression.Associations between SNPs and DNA methylation (i.e., mQTL associations) may contribute to bimodal distributions in eQTM CpG-transcript associations, many of which were not attributable to any nearby SNPs (i.e., there was no evidence of SNP-on-CpG effects 20 contributing to bimodality).Indeed, conditional analyses of nearby SNPs on select associations showed that the CpG-transcript associations are still highly significant.Furthermore, to showcase the potential clinical utility of our findings, we evaluated the role of our eQTM resource when applied to the analysis of three cardiometabolic risk factors-serum triglycerides, fasting blood glucose, and BMI-in mediation and colocalization analyses.We illustrate specific examples of how the eQTM resource can be used to suggest molecular mechanisms linking DNA methylation to clinical phenotypes: Mediation analyses identified putative pathways by which DNA methylation is associated with clinical traits in EWAS, and colocalization analysis suggested shared genetic regulation of CpGs, transcripts, and clinical traits.For example, the results of our mediation analysis showed that expression of the genes ABCG1 and SREBF1 mediates the association between DNA methylation and all three clinical traits.
While the eQTM resource we created is robust and highly replicable, several limitations must be noted.First, FHS is an observational study consisting of participants of predominantly European ancestry; however, we provide evidence of substantial independent external replication of these results, including among individuals with African ancestry from the JHS.Second, while the use of DNA methylation and gene expression data derived from whole blood samples makes our methods more accessible and more easily replicated in other study cohorts, evaluation of CpG-transcript pairs in other tissues may be more relevant to specific disease processes.Additionally, these analyses do not account for linkage disequilibrium and other factors that may correlate www.nature.com/scientificreports/proximal CpGs with one another.Finally, this study does not examine the effect of environmental factors on these associations; lifestyle factors such as cigarette smoking are known to affect DNA methylation and gene expression 7,[21][22][23] .Future eQTM research may incorporate environmental data to better understand the epigenetic effects of environmental exposures.

Conclusions
We created a powerful eQTM resource that leverage DNA methylation and RNA-seq data to characterize associations of DNA methylation with gene expression.We provide proof of concept that eQTM resources can be leveraged to explore molecular mechanisms of disease.

Methods
Cohort description.DNA methylation and RNA-seq data were collected from 2115 participants in the FHS Offspring (n = 686) and Third Generation (n = 1429) cohorts.Peripheral whole blood samples were collected in the fasting state at the ninth examination cycle from FHS Offspring participants (2011-2014) 24 and at the second examination cycle (2006-2009) 25 from FHS Third Generation participants.All study protocols were approved by the Boston Medical Center Institutional Review Board and performed in accordance with the Declaration of Helsinki and relevant regulations, and all study participants provided their written informed consent.
DNA methylation data collection.Buffy coats were isolated from whole blood samples and prepared with bisulfite conversion for the DNA methylation assays.Samples from 1045 FHS Third Generation participants were assayed for DNA methylation using the Infinium HumanMethylation450 BeadChip (Illumina Inc., San Diego, CA).Samples from the remaining 384 FHS Third Generation participants and all FHS Offspring participants were assayed for DNA methylation using the Infinium MethylationEPIC BeadChip (Illumina Inc., San Diego, CA).Methylation probes on autosomal chromosomes were analyzed while probes containing polymorphic SNPs were excluded.
DNA methylation data pre-processing.DNA methylation data were pooled across both the Human-Methylation450 and the MethylationEPIC platforms, which shared 452,568 CpG loci.In addition to the data captured at these loci, samples assayed by the HumanMethylation450 platform had methylation data for 32,945 additional CpG sites, while those assayed by the MethylationEPIC platform had methylation data for 413,524 additional CpG sites.Quality control (QC) and normalization were performed on the DNA methylation β values using the "dasen" function in the R wateRmelon package (version 1.16.0) 26 .Normalized β values were then residualized after accounting for batch effects, row effects, column effects, and four PCs constructed from the normalized β values.These residualized β values were subsequently winsorized at the mean ± 3 * standard deviation (SD) of the distribution for each probe.
Ten principal components (PCs) representing technical confounders were identified from the winsorized residuals.These ten PCs and the winsorized β values representing the proportion of methylation per CpG locus were subsequently used in statistical analyses.
RNA-seq data collection.Whole blood samples were collected in PAXgene tubes (PreAnalytiX, Hombrechtikon, Switzerland).Isolation of total RNA was performed according to standard protocols (Asuragen, Inc., Austin, TX) using the PAXgene Blood RNA System Kit as described previously 27,28 RNA-seq was performed in accordance with TOPMed protocols (University of Washington Northwest Genomics Center), and the mRNAseq library was generated using the Illumina TruSeq (Illumina Inc., San Diego, CA).Stranded mRNA kit and sequencing was performed using the Illumina NovaSeq system (Illumina Inc., San Diego, CA).Gene expression was evaluated using RNA SeQC version 2.3.3 29 and transcript expression was evaluated using RSEM version 1.3.1 30 .
RNA-seq data pre-processing.Gene expression data were normalized using the trimmed mean of M-values (TMM) approach in the R edgeR package 31 .A log2 transformation was applied to the TMM-normalized values after addition of 1 unit to avoid taking log of zero.This normalized gene expression value was residualized to account for batch effects, RNA concentration, and RNA integrity number.
Statistical methods: eQTM analysis.We identified PCs from the residualized DNA methylation and gene expression data and performed an extensive search of the number of PCs to optimize the cross-validated replication of the CpG-transcript pairs.This study identified ten PCs for the DNA methylation data (percent of variation 53.7%) and five PCs for the gene expression data (percent of variation 51.6%).
We then calculated the association between gene-level CpG sites and gene transcripts to identify significant eQTM CpG-transcript pairs.For each CpG-transcript pair, residualized gene expression was modeled as the outcome with residualized DNA methylation β values as the primary explanatory variable.These models were adjusted for age, sex, white blood cell count, blood cell fraction 32 , platelet count, five gene expression PCs, and ten DNA methylation PCs.Each CpG-transcript pair association was calculated individually according to this model, which was repeated for each pair using Graphical Processing Units (GPUs) to accelerate computation.Cis gene-level CpG-transcript pairs were defined as those where the CpG site and the eGene (i.e., the gene transcript associated with the CpG) were within 1 Mb kb of one another and were deemed statistically significant at a Bonferroni-corrected p-value of 1E−7 33 .Trans pairs were defined as those where the CpG site and the eGene www.nature.com/scientificreports/identify overlapping single nucleotide polymorphisms (SNPs) associated with the CpG and gene transcript, respectively.We then evaluated the overlap of cis-SNPs associated with CpG sites (mQTL) and cis-SNPs associated with gene transcripts (eQTL) with SNPs associated with serum log triglycerides 17 , fasting blood glucose 18 , and BMI 19 in published genome-wide association studies (GWAS).We used the R package coloc (version 5.1.0) 41o quantify the probabilities (probability of colocalization H4) that a single genetic variant was associated with both DNA methylation and gene expression (i.e., colocalization of the mQTL and eQTL SNPs) and that a single genetic variant was associated with both DNA methylation and the clinical trait of interest (i.e., colocalization of the mQTL and GWAS SNPs).

Figure 2 .
Figure 2. Pathways showing cis mediated effects of DNA methylation at cg11024682 on serum triglycerides, fasting blood glucose, and BMI through the expression of SREBF1.The effects of DNA methylation at the CpG site cg11024682 on all three cardiometabolic traits of interest are mediated through cis effects on SREBF1 expression.Coefficients presented in the figure represent the change in gene expression associated with a 10% increase in DNA methylation; the change in serum triglycerides (log(mg/dL)), fasting blood glucose (mmol/L), or BMI (kg/m 2 ) associated with a 10% increase in DNA methylation; or the change in serum triglycerides (log(mg/dL)), fasting blood glucose (mmol/L), or BMI (kg/m 2 ) associated with a 1-unit increase in gene expression.

Table 1 .
Characteristics of FHS participants.