Abstract
High-dimensional omics datasets provide valuable resources to determine the causal role of molecular traits in mediating the path from genotype to phenotype. Making use of molecular quantitative trait loci (QTL) and genome-wide association study (GWAS) summary statistics, we propose a multivariable Mendelian randomization (MVMR) framework to quantify the proportion of the impact of the DNA methylome (DNAm) on complex traits that is propagated through the assayed transcriptome. Evaluating 50 complex traits, we find that on average at least 28.3% (95% CI: [26.9%–29.8%]) of DNAm-to-trait effects are mediated through (typically multiple) transcripts in the cis-region. Several regulatory mechanisms are hypothesized, including methylation of the promoter probe cg10385390 (chr1:8’022’505) increasing the risk for inflammatory bowel disease by reducing PARK7 expression. The proposed integrative framework can be extended to other omics layers to identify causal molecular chains, providing a powerful tool to map and interpret GWAS signals.
Introduction
In the past decade, genome-wide association studies (GWASs) have identified thousands of genetic variants associated with complex traits^{1}; however, linking these variants to molecular pathways remains challenging^{2}. GWAS signals of common diseases predominantly fall into the non-coding genome^{3}, and both their enrichment in regulatory elements (e.g., quantitative trait loci (QTL)^{3,4}) and advances in omics technology^{5} have motivated the establishment of large-scale consortia providing publicly available QTL datasets for molecular phenotypes such as DNA methylation (DNAm)^{6}, transcript^{7,8}, protein^{9,10,11} and metabolite^{12,13} levels.
Integrative statistical methods combining GWAS and omics QTL summary data include colocalization tests^{14,15}, summary versions of transcriptome-wide association studies (TWAS)^{16,17} and Mendelian randomization (MR) studies^{18,19}. Colocalization methods identify shared QTL and GWAS signals, and while this might indicate causality between the molecular and GWAS trait, signal overlap can also arise due to reverse causality (i.e., causal effect of the GWAS trait on the molecular trait^{20}) or horizontal pleiotropy (i.e., the identified shared genetic variant drives the molecular and trait perturbation independently). In comparison, MR studies, which are conceptually similar to TWAS, use multiple genetic variants as instrumental variables (IVs) and are less prone to reverse causality and artefacts arising from LD patterns^{21}, although horizontal pleiotropy can never be ruled out entirely. In addition, MR analyses allow the quantification (direction and magnitude) of the causal effect of the omic on the outcome trait.
With the advent of QTL datasets with increased sample sizes^{6,8}, opportunities to integrate GWAS data with multiple molecular traits are no longer hampered by low statistical power. Previous efforts integrating multiple QTL omics data either adopted colocalization strategies^{22,23} or combined pairwise MR associations (two-step MR)^{24,25} testing only a single molecular mediator. Multivariable MR (MVMR) approaches have been proposed to identify multiple mediators of exposure-outcome relationships^{26,27}. These approaches enable the dissection of the total causal effect of an exposure on an outcome into a direct and indirect effect measured via mediators. Similar to classical MR approaches, the use of genetic instruments allows for robust causal inference and MVMR has proven to be an unbiased approach for mediation analyses, even in the presence of confounders^{26,27}. Hence, in addition to identifying causal effects through multiple layers, MVMR allows the quantification of mediation effects.
Here, we propose a three-sample MVMR (3SMVMR) framework to quantify the role of cis-transcripts in mediating DNAm → complex trait causal relationships (Fig. 1). To do so, we integrated methylation and transcript QTLs (mQTLs and eQTLs, respectively) with GWAS summary data of 50 clinically relevant traits to estimate global mediation proportions (MPs), i.e., the proportion of transcript-mediated causal effect relative to the total effect of DNAm on complex traits. In contrast with previous multi-omics integration methods, each 3SMVMR regression analysis makes use of at least 5 near-independent instrumental variables (IVs), allowing for more robust causal inference and post-hoc sensitivity analyses. We performed simulation studies to assess biases of the 3SMVMR estimates for MP under various parameter settings. In addition to quantifying the regulatory connectivity between DNAm and transcript levels, we investigated underlying factors driving high MPs, and hypothesized several mechanistic pathways between DNAm, gene expression and complex traits.
Results
Overview of the methods
We performed univariable and multivariable MR to estimate total (\(\hat{\theta}_{T}\)) and direct (\(\hat{\theta}_{D}\)) causal effects, respectively, with MP (mediation proportion) estimates being calculated as the ratio of the indirect effect (i.e., mediated through the molecular mediators) to the total effect of the exposure on the outcome trait^{28} (Fig. 1; Eqs. (1) and (3)). While weak genetic instruments introduce a bias towards the null in a univariable MR setting^{29}, this bias can be in any direction for MVMR studies^{30}. Both sample size and the choice of instruments and mediators can introduce a bias in any direction^{30}, leading to under- or overestimation of the MP. To quantify these biases and assess the sensitivity of the estimated \(\widehat{\mathrm{MP}}\)s, we conducted simulation studies mimicking settings that emerge from real data applications (Methods; Supplementary Fig. 4).
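As a rough illustration of how the total, direct and mediated effects relate, the following minimal Python sketch computes an inverse-variance weighted (IVW) total effect, a multivariable direct effect and the resulting MP on noise-free toy summary statistics. All SNP effect values, the restriction to a single mediator and the function names are assumptions made for illustration; this is not the implementation behind Eqs. (1) and (3).

```python
def ivw_total_effect(beta_e, beta_y, se_y):
    """Univariable IVW estimate: weighted no-intercept regression of
    SNP-outcome effects on SNP-exposure effects."""
    w = [1.0 / s ** 2 for s in se_y]
    num = sum(wi * e * y for wi, e, y in zip(w, beta_e, beta_y))
    den = sum(wi * e * e for wi, e in zip(w, beta_e))
    return num / den


def mvmr_direct_effect(beta_e, beta_m, beta_y, se_y):
    """Multivariable estimate with one mediator: weighted no-intercept
    regression of SNP-outcome effects on SNP-exposure and SNP-mediator
    effects, solved through the 2x2 normal equations.
    Returns (direct exposure effect, mediator-to-outcome effect)."""
    w = [1.0 / s ** 2 for s in se_y]
    see = sum(wi * e * e for wi, e in zip(w, beta_e))
    smm = sum(wi * m * m for wi, m in zip(w, beta_m))
    sem = sum(wi * e * m for wi, e, m in zip(w, beta_e, beta_m))
    sey = sum(wi * e * y for wi, e, y in zip(w, beta_e, beta_y))
    smy = sum(wi * m * y for wi, m, y in zip(w, beta_m, beta_y))
    det = see * smm - sem ** 2
    theta_d = (sey * smm - smy * sem) / det
    alpha_my = (smy * see - sey * sem) / det
    return theta_d, alpha_my


# Hypothetical toy parameters (not taken from the study).
TRUE_DIRECT, ALPHA_EM, ALPHA_MY = 0.13, 0.5, 0.14
exp_iv = [0.30, -0.25, 0.20, 0.15, -0.10]  # SNP effects on the exposure
med_iv = [0.40, -0.35, 0.30]               # direct SNP effects on the mediator

# Mediator instruments get an exposure effect of zero.
beta_e = exp_iv + [0.0] * len(med_iv)
beta_m = [ALPHA_EM * e for e in exp_iv] + med_iv
beta_y = [TRUE_DIRECT * e + ALPHA_MY * m for e, m in zip(beta_e, beta_m)]
se_y = [0.01] * len(beta_y)

# Total effect from exposure instruments only; direct effect from all IVs.
n_exp = len(exp_iv)
theta_t = ivw_total_effect(beta_e[:n_exp], beta_y[:n_exp], se_y[:n_exp])
theta_d, alpha_my = mvmr_direct_effect(beta_e, beta_m, beta_y, se_y)

# MP = indirect / total = 1 - direct / total.
mp = 1.0 - theta_d / theta_t
print(f"total={theta_t:.3f} direct={theta_d:.3f} MP={mp:.3f}")
```

Because the toy data contain no sampling noise, the estimates recover the chosen parameters; with real summary statistics, the weak instrument and mediator selection issues described above apply.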
We then applied our framework in a genome-wide screen to estimate \(\hat{\theta}_{T}\) of DNAm sites on 50 outcomes and contrasted them to the effects not mediated by transcripts in cis (\(\hat{\theta}_{D}\)). Genetic effect sizes on the DNAm and transcript levels came from the largest publicly available mQTL and eQTL datasets, respectively, derived from whole blood^{6,8}. MP estimates were then computed only for DNAm-trait pairs with significant Bonferroni-corrected \(\hat{\theta}_{T}\) effects, grouped by trait, trait category and all pairs combined. We present MP results for DNAm-trait pairs with at least one mediator significantly associated with the exposure ("detectable mediation"), as well as for all pairs, including those without a significant causal effect on any potential transcript ("overall mediation"). The overall MP quantifies more accurately the role of cis-transcripts in mediating DNAm effects, as the restriction to only DNAm-trait pairs with a mediator could introduce a selection bias towards higher MPs. Additionally, we performed various sensitivity analyses on these MR results to assess the robustness of the MP estimates: assessing weak instruments (through conditional F-statistics), heterogeneity tests (through heterogeneity Q-statistics and leaving the strongest instrument out) and estimating bias due to by-chance signal overlap (through simulations).
Simulation results
We performed simulation studies to assess the bias in estimated MPs (\(\widehat{\mathrm{MP}}\)) by exploring a wide range of realistic parameter settings which cover at least the interquartile range as observed in real data (Supplementary Figs. 4-5; Supplementary Tables 1-2; Methods). Using default settings (i.e., median values for each parameter such as 2 true mediators N_{med} and a true MP of 35%), the bias in \(\widehat{\mathrm{MP}}\) is minimal with the mean \(\widehat{\mathrm{MP}}\) equalling 33.5% (95% CI: [32.0%–35.0%]; Supplementary Fig. 6; Supplementary Table 2). A determining factor in accurately estimating MPs was the available sample size to derive the mediator QTL effects. Low sample sizes resulted in significant underestimations of the MP, with a mediator sample size of 3000 compared to 30,000 resulting in a 17% relative decrease (6% in absolute values) of the estimated \(\widehat{\mathrm{MP}}\) (Fig. 2a). The reason for this significant underestimation was not only weak instrument bias, but also the omission of relevant mediators, with on average only 1.17 (N_{med,sig}) out of the 2 (N_{med}) relevant mediators detected at a sample size of 3000 (Fig. 2a). We further tested the robustness of the \(\widehat{\mathrm{MP}}\) with respect to the number of included mediators by varying the mediator selection threshold P_{EM}. Among a set of 20 potential mediators, those not passing the P_{EM} threshold, as determined by univariable MR effects of the exposure on each of these mediators, were excluded from the MVMR model (Methods). Using a too lenient or too stringent P_{EM} threshold resulted in downward biased \(\widehat{\mathrm{MP}}\)s (Fig. 2b), as the former leads to the inclusion of too many non-mediators in the model (giving rise to weak instrument bias), while the latter fails to include relevant mediators in the model.
The mQTL and eQTL datasets used provide SNP effect sizes in cis of the assessed DNAm probe and transcript levels, respectively, and were primarily restricted to significant mQTLs for the former. Thus, in the MVMR analysis, SNP-exposure effects for mediator instruments are often non-significant (hence unreported) and set to zero to reduce regression dilution bias (i.e., weak instrument bias). Our simulation studies, which mimicked this scenario by setting non-significant effects to zero (Methods), confirmed that this did not introduce any bias.
Furthermore, we investigated weak instrument bias of both exposure- and mediator-associated IVs. When mediator-associated IVs were weak (i.e., low direct mediator heritabilities, \(h_{M,\mathrm{direct}}^{2}\); Methods), a high variability and significant underestimation of the \(\widehat{\mathrm{MP}}\) were observed (Fig. 2c). In case of low mediator heritability, the conditional F-statistic of the exposure was also below the critical threshold of 10 (Methods), indicating weak instruments. Similarly, for low exposure heritability (\(h_{E}^{2}\)), underestimated \(\widehat{\mathrm{MP}}\)s were obtained, even in case of high conditional F-statistics (>120; Fig. 2d). Additional simulation studies with more polygenic exposures and an increased number of relevant mediators N_{med} for different exposure and mediator heritabilities corroborated the findings of underestimated \(\widehat{\mathrm{MP}}\)s in case of weak instruments (Supplementary Fig. 7).
Application to 50 complex traits
We first estimated the causal effects of DNAm probes on 50 complex traits, ranging from biomarkers indicative of diseases, such as low-density lipoprotein (LDL) and glucose levels, to diseases such as asthma and schizophrenia (Supplementary Data 1). DNAm-trait pairs with a significant total causal MR effect (P_{T} < 1e-6) were then further assessed to examine what fraction of the DNAm → trait causal effect is mediated by transcripts in cis (Fig. 3a; Supplementary Fig. 1). Mediation analyses could be conducted for 2069 pairs, for which at least 1 transcript was causally associated with the DNAm exposure (detectable mediation). First, we regressed \(\hat{\theta}_{D}\) against \(\hat{\theta}_{T}\) within each trait influenced by at least 10 DNAm probes while accounting for regression dilution bias^{31} (Eq. (6)). \(\widehat{\mathrm{MP}}\)s estimated for each of these 41 traits ranged from 18.0 to 78.0% (mean: 36.9%, 95% CI: [13.5%–60.3%]), with the trait with the highest \(\widehat{\mathrm{MP}}\) being grip strength and the one with the lowest being testosterone level (Fig. 3b). Regressing \(\hat{\theta}_{D}\) against \(\hat{\theta}_{T}\) for all pairs combined yielded an \(\widehat{\mathrm{MP}}\) of 37.8% (95% CI: [36.0%–39.5%]) (Fig. 3c). Grouping the traits into 10 physiological categories (Supplementary Data 1) showed that the \(\widehat{\mathrm{MP}}\) was highest for hepatic biomarkers (mean: 46.6%, 95% CI: [41.5%–51.7%]), followed by renal biomarkers (mean: 43.5%, 95% CI: [37.5%–49.5%]). In contrast, adiposity-related and hormonal traits exhibited the lowest \(\widehat{\mathrm{MP}}\) (Fig. 3b; Supplementary Fig. 8).
In addition to the 2069 DNAm-trait pairs with detectable mediation, there were 554 pairs testable for mediation, but with no detectable causally implicated transcript (Fig. 3a). Setting \(\hat{\theta}_{D}\) to \(\hat{\theta}_{T}\) for these pairs and regressing \(\hat{\theta}_{D}\) against \(\hat{\theta}_{T}\) for all 2623 DNAm-trait pairs combined reduced the \(\widehat{\mathrm{MP}}\) to 28.3% (95% CI: [26.9%–29.8%]) (Fig. 3d). We refer to this \(\widehat{\mathrm{MP}}\) as the overall \(\widehat{\mathrm{MP}}\), as it is a more objective measure of the importance of the transcriptome in mediating DNAm-to-phenotype effects. While more reflective of mediated DNAm effects, it may also be overly conservative since the set of testable transcript mediators (N = 19,250^{8}) is an order of magnitude smaller than the whole transcriptome^{32}.
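The relation between the detectable and the overall MP can be sketched in a few lines of Python. The sketch below assumes a simple no-intercept least-squares regression of direct on total effects, so that the MP is one minus the slope; it omits the regression dilution correction of Eq. (6), and all effect values and function names are hypothetical.

```python
def slope_through_origin(x, y):
    """Least-squares slope of y on x without an intercept."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


def mediation_proportion(theta_t, theta_d):
    """MP as one minus the slope of direct effects on total effects."""
    return 1.0 - slope_through_origin(theta_t, theta_d)


# Hypothetical pairs with at least one mediator: 40% of each effect is mediated.
theta_t_detect = [0.10, -0.20, 0.05, 0.15]
theta_d_detect = [0.6 * t for t in theta_t_detect]

# Hypothetical pairs without any detected mediator: direct effect := total effect.
theta_t_none = [0.10, -0.05]
theta_d_none = list(theta_t_none)

mp_detectable = mediation_proportion(theta_t_detect, theta_d_detect)
mp_overall = mediation_proportion(theta_t_detect + theta_t_none,
                                  theta_d_detect + theta_d_none)
print(f"detectable MP={mp_detectable:.3f}, overall MP={mp_overall:.3f}")
```

Adding pairs whose direct effect equals the total effect pulls the slope towards one, which is why the overall MP is lower than the detectable MP in this toy setting.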
The average number of mediator transcripts, potentially correlated, was 3.3 per methylation-trait pair with detectable mediation, indicating that the impact of methylation is not mediated by a single transcript. To further explore this observation, we assessed the extent to which DNAm → trait effects were mediated by the single most significantly DNAm-associated transcript ("top" transcript; Methods), as opposed to all transcripts in cis. This resulted in an \(\widehat{\mathrm{MP}}\)_{top} of 26.0% (range: [13.0%–46.8%]) averaged across the 41 traits, and an \(\widehat{\mathrm{MP}}\)_{top} of 26.6% (95% CI: [25.1%–28.1%]) when aggregating the 2069 DNAm-trait pairs (Supplementary Fig. 9). This significant drop in the \(\widehat{\mathrm{MP}}\) (P_{diff} < 5e-21) corroborates our initial hypothesis that DNAm sites regulate the expression of multiple transcripts in the cis-region.
MVMR sensitivity analyses
We conducted MVMR sensitivity analyses to assess potential sources of bias of the MP estimates such as weak instruments and pleiotropy.
To test whether the MVMR estimates suffer from weak instrument bias, we calculated conditional F-statistics^{33}. These statistics reflect whether genetic variants sufficiently explain the variance in the exposure given the presence of mediators. As demonstrated by Sanderson et al., direct effect estimates (\(\hat{\theta}_{D}\)) of exposure-trait pairs for which the F-statistic is ≤10 might be biased^{33}. Among the 2069 DNAm-trait pairs, 1061 had an F-statistic > 10 with an \(\widehat{\mathrm{MP}}\) of 35.5% (95% CI: [33.6%–37.5%]), which was not significantly lower than the one for all pairs combined (P_{diff}=0.09). Pairs with F-statistics ≤10 (N=1008) had significantly more mediators (4.32 vs 2.35, two-sided t-test: P=2.13e-64), but not a significantly higher \(\widehat{\mathrm{MP}}\) (mean: 40.9%, 95% CI: [37.8%–44.0%]; P_{diff}=0.08; Supplementary Figs. 10-11).
Pleiotropic IVs violate MR assumptions, and heterogeneity tests, such as Cochran’s Q-statistic, can be used to detect them, assuming that most IVs are valid^{34}. We calculated Q-statistics for the IV sets in both the univariable and multivariable MR analyses. Out of the 2069 DNAm-trait pairs, 1757 showed no signs of heterogeneity in the univariable MR analyses (P_{HET} > 0.01) and 1405 in neither the univariable nor multivariable analyses. The \(\widehat{\mathrm{MP}}\) of these 1405 pairs was not significantly different from the overall one (mean: 38.3%, 95% CI: [36.1%–40.6%]; P_{diff}=0.7; Supplementary Fig. 12).
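For readers unfamiliar with the heterogeneity test, a minimal Python sketch of Cochran's Q for a univariable IVW analysis is given below. It assumes first-order weights and a chi-square reference distribution with one fewer degree of freedom than the number of IVs; the exact form used in this study is described in the Methods and may differ. The helper names and toy SNP effects are hypothetical.

```python
import math


def chi2_sf(q, df):
    """Upper tail probability of a chi-square distribution with integer df,
    built from the recurrence of the regularized upper incomplete gamma."""
    x = q / 2.0
    if df % 2 == 0:
        a, val = 1.0, math.exp(-x)
    else:
        a, val = 0.5, math.erfc(math.sqrt(x))
    while a < df / 2.0:
        val += x ** a * math.exp(-x) / math.gamma(a + 1.0)
        a += 1.0
    return val


def cochran_q(beta_e, beta_y, se_y):
    """Cochran's Q over per-IV ratio estimates with first-order IVW weights.
    Returns (Q, p-value)."""
    ratios = [y / e for e, y in zip(beta_e, beta_y)]
    w = [e ** 2 / s ** 2 for e, s in zip(beta_e, se_y)]
    theta = sum(wi * r for wi, r in zip(w, ratios)) / sum(w)
    q = sum(wi * (r - theta) ** 2 for wi, r in zip(w, ratios))
    return q, chi2_sf(q, len(ratios) - 1)


# Hypothetical IV set: all SNPs agree on a causal effect of 0.2.
beta_e = [0.30, 0.25, 0.20, 0.15, 0.10]
se_y = [0.01] * 5
beta_y_ok = [0.2 * e for e in beta_e]
# Same set, but the last SNP implies a ratio of 1.0 (a pleiotropic outlier).
beta_y_out = beta_y_ok[:-1] + [1.0 * beta_e[-1]]

q_ok, p_ok = cochran_q(beta_e, beta_y_ok, se_y)
q_out, p_out = cochran_q(beta_e, beta_y_out, se_y)
print(f"homogeneous: Q={q_ok:.3g}, P={p_ok:.3g}")
print(f"with outlier: Q={q_out:.3g}, P={p_out:.3g}")
```

Under the P_{HET} > 0.01 criterion, the first toy set would be retained and the second flagged as heterogeneous.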
Next, we assessed the influence of the p-value threshold P_{EM} used to select mediators based on the exposure-to-mediator causal effect (default P_{EM}=0.01, for which N=2069 DNAm-trait pairs with at least 1 mediator were found). With a more lenient threshold (P_{EM}=0.05), more DNAm-trait pairs with mediators emerged (N=2189). Conversely, with a more stringent threshold (P_{EM}=0.001), fewer pairs were detected (N=1881). No differences in MPs between the three settings were found in these detectable mediation analyses (P_{diff} > 0.05; Supplementary Fig. 13). However, when calculating the overall MP (i.e., including all DNAm-trait pairs with potential transcript mediators in the cis-region) on a common set of DNAm-trait pairs (N=2543, \(\widehat{\mathrm{MP}}\)_{overall,P01}=27.6% (95% CI: [26.1%–29.2%])), we observed a significantly higher MP for the more lenient threshold (\(\widehat{\mathrm{MP}}\)_{overall,P05}=32.0% (95% CI: [30.4%–33.6%]); P_{diff}=1.1e-4) and a significantly lower MP for the more stringent threshold (\(\widehat{\mathrm{MP}}\)_{overall,P001}=24.6% (95% CI: [23.2%–26.1%]); P_{diff}=4.8e-3; Supplementary Fig. 14).
Finally, we conducted sensitivity analyses to determine whether significant MR associations were due to horizontal pleiotropy. Regulatory pathways between DNAm exposure probes and transcript mediators were assessed in cis. As such, SNPs in LD with significant QTLs for both quantities could give rise to an association merely because of horizontal pleiotropy (i.e., due to random overlap between cis-QTLs in close vicinity), an issue further exacerbated by the fact that molecular omics entities generally have fewer associated IVs than complex traits. To assess whether mediation results are only based on a single strong genetic instrument, we repeated the mediation analysis excluding the top IV (i.e., the exposure-associated IV with the lowest p-value) from both the total effect θ_{T} and direct effect θ_{D} calculations (Methods). The results show that while MR effect estimates remain concordant in magnitude and effect direction, the estimates are noisier due to the much weaker instruments (significantly lower F-statistics; two-sided t-test: P=5.37e-11; Supplementary Fig. 15). MP estimates were also higher when the top IV was excluded (\(\widehat{\mathrm{MP}}\)=47.3% (95% CI: [38.4%–56.2%]); P_{diff}=0.023; Supplementary Fig. 15); however, this was no longer the case when restricting to conditional F-statistics > 10 (\(\widehat{\mathrm{MP}}\)=40.9% (95% CI: [29.3%–52.4%]); P_{diff}=0.48). Additionally, we performed simulation analyses to assess the possibility of significant DNAm-transcript associations caused by cis-mQTL and -eQTL signals being in LD (Methods). The analysis shows that randomly picked eQTL-SNPs in the region result in slightly inflated, but much weaker MR associations than using the original eQTL data (Supplementary Figs. 16-17). The results indicate that by-chance LD between cis-mQTLs and -eQTLs can yield false positive findings, but those signals are substantially weaker than the ones observed in real data. In other words, mQTL and eQTL IVs are in much higher LD than expected by chance.
Overall, these sensitivity analyses showed that the estimated MPs remain robust when removing DNAm-trait pairs that potentially violate MVMR assumptions, while also suggesting that the set P_{EM} threshold of 0.01 may lead to underestimated MP estimates. Finally, we found strong evidence that molecular associations mediating DNAm-trait effects are predominantly due to vertical pleiotropy, even when only a limited number of IVs were available.
Determining factors of mediation proportions
We explored underlying factors driving high MPs through transcript levels (Fig. 4a). \(\widehat{\mathrm{MP}}\)_{top} decreased with increased distances between the DNAm site and the gene transcription start site (TSS) of the top transcript (ρ = −0.076, P = 5.2e-4; Fig. 4b). This distance was also negatively correlated with the DNAm-to-transcript MR squared effect size, \(\alpha_{\mathrm{EM}}^{2}\) (ρ = −0.13, P = 3.1e-19; Fig. 4c), which in turn was a good predictor for high MPs (ρ = 0.39, P = 2.5e-75; Fig. 4d). The mediation proportion was the highest for DNAm sites residing in the first exon, followed by those in the 5’UTR, within 200 bp of the TSS and lowest for those within 1500 bp of the TSS and in the gene body (Supplementary Fig. 18).
DNAm inhibiting the binding of transcription factors (TFs), thereby repressing gene expression, is often alluded to as the classical mechanism of action for DNAm^{35}. Of the 1,066,307 unique DNAm-to-transcript causal effects assessed, 47,445 were significant at P < 4.7e-8. Although negative effects had a larger magnitude than positive ones (two-sided t-test: P = 0.0082), only 53.4% of DNAm → transcript causal effects were negative. Stratifying DNAm sites with respect to their location on the assessed transcript, we found that DNAm sites situated in the first exon and near the TSS were enriched for negative effects (P = 2.7e-3, 1.2e-5 and 3.8e-4 for 1st exon, TSS ± 1500 bp and TSS ± 200 bp, respectively), whereas those in the gene body were enriched for positive ones (P = 2.2e-10; Supplementary Table 3). These observations are in line with previous studies that only showed a slight trend for negative methylation-gene expression correlations^{36,37,38,39}. We further tested whether the MR DNAm-to-transcript causal effects correlated with reported methylation-transcript correlations^{37} and found a strong agreement (ρ = 0.39, P = 2.6e-18, 471 DNAm-transcript pairs).
Consistent with higher MPs when mediating through multiple transcripts, we found a strong correlation between the number of mediators and the MP (ρ = 0.39, P = 4.4e-75; Fig. 4e). Many of these mediators were correlated amongst each other, which in theory should be accounted for by the MVMR model. To ensure that this was the case, we repeated the mediation analysis with uncorrelated mediators (R_{med} < 0.3; Methods). The mean number of selected mediators dropped by more than half, from 3.3 to 1.2 (Supplementary Fig. 19), and the \(\widehat{\mathrm{MP}}\) across all DNAm-trait pairs decreased (\(\widehat{\mathrm{MP}}\)_{uncorrelated} = 30.5% (95% CI: [28.8%–32.1%])), while remaining significantly higher than \(\widehat{\mathrm{MP}}\)_{top} (P_{diff} = 6.6e-4). Decreasing the R_{med} threshold to 0.2 and 0.1 did not significantly decrease \(\widehat{\mathrm{MP}}\)_{uncorrelated} (P_{diff} > 0.05), which stabilized at 29.2% (95% CI: [27.5%–30.8%]) for R_{med} < 0.1 (Supplementary Fig. 19).
Furthermore, we investigated whether \(\widehat{\mathrm{MP}}\)s are dependent on the DNAm → transcript causal effect directions, following the logic of a recent DNAm-transcript correlation study^{39}. To this end, we stratified DNAm-trait pairs by the α_{EM} sign and number of mediators (Table 1). If there was only a single mediator, \(\widehat{\mathrm{MP}}\)s were significantly higher if the DNAm was decreasing expression (P_{diff} = 3.49e-8). This is consistent with the observation that negative effects α_{EM} were larger than positive ones and the positive correlation between α_{EM} magnitudes and high \(\widehat{\mathrm{MP}}\)s (Fig. 4d). When there were multiple mediators, most DNAm sites had negative effects on some transcripts and positive effects on others. These bivalent DNAm probes exhibited the highest \(\widehat{\mathrm{MP}}\)s (\(\widehat{\mathrm{MP}}\) = 53.9% (95% CI: [51.2%–56.5%])), a consequence of being causally associated with more mediators than average (5.01 vs 3.31), with N_{med} being a strong predictor for high \(\widehat{\mathrm{MP}}\)s (Fig. 4e). Combining DNAm-trait pairs with single and multiple mediators, but with consistent negative or positive α_{EM} values, the observation of higher \(\widehat{\mathrm{MP}}\)s when DNAm was decreasing transcript levels persisted (P_{diff} = 0.020).
Putative regulatory mechanisms of action
In addition to providing insights into global patterns governing the mediation between different intermediate phenotypic layers and functional traits, our analyses generated plausible hypotheses regarding specific biological pathways. We chose to follow up putative regulatory mechanisms of DNAm-to-complex trait effects through transcript levels which showed both strong total effects (\(\hat{\theta}_{T} > 0.02\)) and a substantial mediation proportion (\(\widehat{\mathrm{MP}} > 0.2\); complete list in Supplementary Data 2).
Involvement of the antioxidant and anti-inflammatory protein PARK7 in inflammatory bowel disease (IBD) has recently been brought to light^{40,41,42,43}. While the exact role of the protein in the disease remains debated, reduced intestinal expression of PARK7 was observed in patients and mouse models for IBD^{43}. Moreover, Park7 knockout mice were shown to have increased levels of pro-colitis bacterial species in their microbiome^{42,44} and experience aggravated symptoms of experimentally induced colitis^{43}. In line with these observations, DNAm of the PARK7 promoter probe cg10385390 (chr1:8’022’505) decreased PARK7 transcript expression (\(\hat{\alpha}_{\mathrm{EM}}\) = −0.675, P = 2.7e-4; Fig. 5a). High transcript levels decrease IBD risk (\(\hat{\alpha}_{\mathrm{MY}}\) = −0.131, P = 1.7e-7), resulting in an overall increased IBD risk upon DNAm (\(\hat{\theta}_{T}\) = 0.114, P = 8.2e-9).
Although DNAm is often associated with decreased expression^{35}, our data provide examples of methylation boosting expression. For instance, DNAm of cg13428477 (chr3:122’748’086) increased PDIA5 expression (\(\hat{\alpha}_{\mathrm{EM}}\) = 0.333, P = 7.3e-11), whose levels subsequently increased platelet count (\(\hat{\alpha}_{\mathrm{MY}}\) = 0.062, P = 0.018), so that DNAm resulted in significantly increased platelet count (\(\hat{\theta}_{T}\) = 0.056, P = 1.3e-43) (Fig. 5b). Association between the PDIA5 locus and platelet count was reported through GWAS^{45}. Platelets are small cell fragments produced by megakaryocytes, which themselves are derived from hematopoietic stem cells. Accordingly, PDIA5 has a binding site for the hematopoietic stem and progenitor cell TF MEIS1^{46} and is overexpressed in megakaryocytes as compared to other blood cell types^{47}. Further studies showed that pdia5 protein knockdown in zebrafish resulted in strongly decreased platelet count^{48}, matching our findings and confirming the role of PDIA5 in thrombopoiesis.
In another example, we observed that DNAm of cg09070378 (chr1:161’183’762) decreased asthma risk (\(\hat{\theta}_{T}\) = −0.031, P = 8.1e-11) by reducing the expression of FCER1G (\(\hat{\alpha}_{\mathrm{EM}}\) = −1.0, P = 3.5e-18), a gene listed in the KEGG pathway for asthma (hsa05310) and whose expression was associated with an increased risk for asthma (\(\hat{\alpha}_{\mathrm{MY}}\) = 0.019, P = 3e-12) (Supplementary Fig. 21). The FCER1G promoter was found to be hypomethylated in patients with atopic dermatitis, with DNAm levels correlating negatively with the gene’s expression^{49}, suggesting a broad role of FCER1G in allergic disorders. Our data also support and provide a mechanistic explanation for the recent finding that reduced IFNAR2 expression causally increases the odds of severe coronavirus disease 2019 (COVID-19)^{50,51}, which was later supported by the increased susceptibility to severe COVID-19 in individuals with rare loss-of-function mutations in IFNAR2^{52}. Indeed, we found that DNAm of the IFNAR2 promoter probe cg13208562 (chr21:34’603’264) decreased the gene’s expression (\(\hat{\alpha}_{\mathrm{EM}}\) = −0.446, P = 2.4e-19) (Supplementary Fig. 22). As IFNAR2 expression protects against hospitalization following COVID-19 infection (\(\hat{\alpha}_{\mathrm{MY}}\) = −0.090, P = 4.2e-6), DNAm of the locus increased the risk of severe infection (\(\hat{\theta}_{T}\) = 0.064, P = 8.5e-13).
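The sign logic of these four examples can be checked with simple arithmetic. The Python sketch below multiplies the reported DNAm-to-transcript and transcript-to-trait coefficients and compares the product with the reported total effect. Treating each locus as having a single mediator and taking the product of coefficients as the indirect effect is a simplifying assumption, so the resulting shares are only rough readings and not the MVMR-based MP estimates listed in Supplementary Data 2.

```python
# (alpha_EM, alpha_MY, theta_T) as reported in the text above.
examples = {
    "PARK7 / IBD": (-0.675, -0.131, 0.114),
    "PDIA5 / platelet count": (0.333, 0.062, 0.056),
    "FCER1G / asthma": (-1.0, 0.019, -0.031),
    "IFNAR2 / severe COVID-19": (-0.446, -0.090, 0.064),
}

summary = {}
for name, (alpha_em, alpha_my, theta_t) in examples.items():
    # Product of coefficients: effect of DNAm on the trait via this transcript.
    indirect = alpha_em * alpha_my
    # Rough share of the total effect attributable to this single transcript.
    share = indirect / theta_t
    summary[name] = (indirect, share)
    print(f"{name}: indirect={indirect:+.4f}, "
          f"total={theta_t:+.3f}, share={share:.2f}")
```

In all four cases the product has the same sign as the total effect, in line with the direction of the mechanisms described above.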
Discussion
We presented a framework to quantify mediation of complex trait-impacting effects through an omics layer and demonstrated its application to assayed blood-derived DNAm (as exposure) and transcript levels (as mediator). We found that at least 28.3% of DNAm-to-trait effects were mediated through transcripts in cis for the 2623 DNAm-trait pairs with significant total causal effects that could be assessed. While many robust methods are available for univariable MR, this is not the case for MVMR^{26,27}. Still, we could confirm the robustness of our MVMR estimates through various sensitivity analyses (conditional F-statistic, heterogeneity Q-statistic, excluding the strongest IV), none of which pinpointed a factor drastically biasing our MP estimates. Importantly, simulation studies indicated that MP estimates were likely to be lower bounds. Low sample size was shown to lead to MP underestimations, as do weak instruments, both for exposure- and mediator-associated IVs.
Additionally, we quantified the causal connectivity and directionality between DNAm and transcript levels and its impact on MPs. We found that 46.6% of significant DNAm-to-transcript effects were of positive sign (i.e., DNAm increasing transcription), particularly so when the DNAm site was situated in the gene body (P_{Enrichment}=2.15e-10). Interestingly, MPs were higher when DNAm was downregulating rather than upregulating transcripts. Previous genome-wide methylation and gene expression association studies reported high fractions of positive correlations (30–41%)^{36,37,39} and further investigations indicated that our estimated methylation-to-transcript causal effects agree strongly with the respective correlations reported by Grundberg et al. (P=2.6e-18). While poorly understood^{38}, several mechanisms have been proposed to explain the phenomenon of DNAm-induced transcription: preferential binding of some transcription factors to methylated DNA^{53,54}, prevention of repressor binding indirectly leading to increased expression through looping DNA^{24,55}, or DNAm in the gene body promoting elongation efficiency and preventing spurious initiation of transcription^{56}. Furthermore, MP estimates indicated that DNAm sites typically regulate multiple transcripts in cis and that mediation through transcripts decreased the further away the TSS of the mediator transcript was from the DNAm site. Collectively, these results describe a more diverse picture of the transcription machinery, going beyond the classical view that DNAm solely reduces gene expression in the TSS region.
Statistical methods to integrate GWAS with omics data have seen a surge in recent years. Namely, colocalization methods based on a single genetic signal or corroborated by a secondary one, as well as methods supported by the SMR HEIDI statistic, have been previously used in the study of DNAm-to-complex trait effects^{6,14,24}. In the most recent publication of the GoDMC consortium, the former strategy was applied to systematically evaluate DNAm and GWAS colocalizing signals and compare them to MR^{6}. This revealed a relatively poor overlap between colocalization and MR results, as both approaches have their weaknesses in detecting causal relationships. The major weakness of colocalization analysis is that it cannot detect directionality and does not estimate causal effect size. Colocalization of local association signals of two traits may be due to causal effects in either direction, a common local confounder effect (e.g., shared regulatory mechanism) or causal markers in very high LD. Lack of colocalization can happen even if there is a true causal relationship, but there are additional associations impacting only the outcome trait. On the other hand, the major weakness of MR is that it may falsely detect a causal relationship when the causal variants for each trait are in reasonably high LD. The comparison of these two approaches is out of the scope of this work, but to explore the above-mentioned weakness in our study, we performed simulations tailored to detect by-chance overlaps in the association signals for methylation and gene expression (see pleiotropy sensitivity analyses for details). These analyses indicated that indeed elevated false positive rates are expected for MR, but the resulting MR p-values under the null are much less significant than the ones observed for real methylation-transcript data.
Mapping genetic variants identified by GWASs to biological processes is notoriously difficult^{2}. In particular, a challenge in identifying causal chains through omics layers is the attenuation in the genetic association strengths when moving up along layers. In a linear model, the genetic effect on the phenotype is assumed to be the product of causal effects between the preceding layers and it was previously shown that the variance explained by the top associated QTL of the first layer weakens with each successive omics layer^{24}. In line with this observation, the examples depicted in Fig. 5 visualize the decrease in the genetic associations from the DNAm to the complex trait level. While in the future our 3SMVMR framework could be applied to further mediating layers (e.g., proteins or metabolites), current QTL datasets for these omics layers lack the dimensionality, both in terms of sample size and number of assessed entities. Once larger datasets become available, these could be used to support mechanistic findings resulting from transcript data.
While our method highlights candidate pathways and provides MP estimates, several limitations are to be considered. First, our MP estimates are based on a selection of 2623 DNAm-trait pairs with significant total effects (P_{T} < 1e-6), which inherently focuses on DNAm-trait pairs with larger (and hence detectable) effects. In theory, MPs could depend on the magnitude of the total causal effect, thus the reported MP may differ for weaker total effects. A special case of these weaker total effects is when direct and indirect effects differ in sign, leading to a weak total effect with an MP potentially outside the [0,1] range. Furthermore, selected DNAm sites were those with the strongest DNAm-trait signal in their region (up to 1 Mb). Thus, we omit secondary methylation signals, which may be mediated by transcripts to a different degree. Second, as for all MR-IVW approaches, included IVs might be pleiotropic, i.e., violating MR assumptions and potentially biasing effect estimates. Although filtering out DNAm-trait pairs with signs of heterogeneous IV sets did not change MP estimates, the presence of invalid IVs cannot be entirely excluded and could therefore compromise causal effect estimates^{57,58}. In particular, since selected IVs are in cis of the investigated molecular trait, they might be based on a single (pleiotropic) haplotype signal. Third, we select mediators based on their association to the exposure without taking into account their mediator potential, i.e., whether or not the mediator is additionally causally linked to the trait. Phrased differently, selected mediators are simply candidates and such selection serves as a first filter to remove non-mediators.
In line with our simulations, it has been shown that an extremely large number of such “false” mediators (88 out of 92) can cause MVMR regression models to fail^{30}, indicating that our framework is less suitable for large numbers of molecular mediators unless the selection threshold P_{EM} is made more stringent. Finally, while molecular mechanisms ought to be tissue- or even cell type-specific, QTL data used in this study were derived from whole blood. However, not correcting for blood cell types when analyzing gene expression data can introduce important artefacts^{59}. It is also known that different tissues express different isoforms^{60}, with many splicing and expression QTLs shown to differ across tissues^{61}. Accordingly, MPs for blood biomarkers were generally higher than those for diseases, for which blood might not be the most relevant tissue. Differences between biomarker and disease MPs might also be due to the fact that indirect pathways, through unmeasured mediators, play a greater role for the latter trait category. Once tissue-stratified multi-omics datasets of larger sample size become available, more accurate, and potentially higher, MPs will be obtained in trait-relevant tissues.
To conclude, by adapting existing MVMR mediation techniques to molecular exposures and mediators, we quantified the causal connectivity between DNAm and transcript levels, and their importance in shaping complex traits. Overall, we found solid evidence that almost a third of DNAm-to-complex trait effects are mediated by transcripts in cis. Our integrative omics framework can be extended to other omics-GWAS combinations and provides a powerful tool for mapping GWAS signals to biological pathways and prioritizing functional follow-up experiments.
Methods
Univariable and multivariable Mendelian randomization
Univariable Mendelian randomization (MR) was applied to estimate the total causal effect (θ_{T}) and multivariable MR (MVMR) to estimate the direct causal effect (θ_{D}) of an exposure E on an outcome Y. The mediation proportion (MP) was defined as 1 − θ_{D}/θ_{T}. Under the MR assumptions, genetic variants G used as IVs must be i) associated with E, ii) independent of any confounder of the E − Y relationship, and iii) conditionally independent of Y given E. We analysed exposures with at least five LD-pruned (r^{2} < 0.05) IVs associated (P < 1e-6) with the molecular exposure and located in cis (< 1 Mb). To estimate θ_{T} we used the inverse-variance weighted (IVW) MR method, while accounting for (mildly) correlated instruments^{19,62} as follows:
where β_{E} and β_{Y} are vectors of genetic effect sizes obtained from summary statistics for E and Y, respectively. C is the linkage disequilibrium (LD) matrix with pairwise correlations between IVs estimated from the UK10K reference panel^{63}. Sensitivity analyses confirmed that accounting for the LD matrix safeguards against MR estimates being influenced by the pruning threshold r^{2} (Supplementary Figs. 2–3). Since in the following MVMR model more IVs than mediators are required, we chose a more lenient pruning threshold (r^{2} < 0.05), including IVs in mild LD (Supplementary Fig. 2). Prior to the causal effect calculations, IVs were Steiger-filtered to exclude IVs whose effect on Y is significantly larger than their effect on E^{64}, and were thus required to pass a threshold \(t_{\mathrm{rev}} < \frac{|\beta_{E_i}| - |\beta_{Y_i}|}{\sqrt{\mathrm{var}(\beta_{E_i}) + \mathrm{var}(\beta_{Y_i})}}\) with t_{rev} set at −2, equivalent to a one-sided test p-value threshold of 0.023^{34}. IVs not passing this threshold are prone to violating the third MR assumption (no horizontal pleiotropy), since they are more directly linked to the outcome. As a result, MR estimates including such IVs would potentially mix up forward and reverse causal effects. The standard error (SE) of θ_{T} can be approximated by the Delta method^{65}:
where Σ is a diagonal matrix with each diagonal element i equalling the maximum of the regression variance s^{2} and var(\(\beta_{Y_i}\))^{34}.
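To make the point estimate of θ_{T} concrete, the following Python sketch computes an IVW estimate for correlated instruments. It assumes the commonly used generalized least-squares form θ_T = (β_E' C⁻¹ β_E)⁻¹ β_E' C⁻¹ β_Y; the helper names (`solve`, `ivw_correlated`) and the toy inputs are hypothetical, and the standard error is not computed here.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, copied so that the inputs are left untouched.
    m = [[float(v) for v in a[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("LD matrix is singular; prune instruments more strictly")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for j in range(col, n + 1):
                m[row][j] -= factor * m[col][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ivw_correlated(beta_e, beta_y, ld):
    """Assumed estimator: (bE' C^-1 bE)^-1 bE' C^-1 bY for a symmetric LD matrix C."""
    c_inv_be = solve(ld, beta_e)  # C^-1 bE
    numerator = sum(w * y for w, y in zip(c_inv_be, beta_y))
    denominator = sum(w * e for w, e in zip(c_inv_be, beta_e))
    return numerator / denominator


# Toy example: five instruments in mild LD, true causal effect 0.5, no noise.
beta_e = [0.10, 0.08, -0.12, 0.09, 0.11]
beta_y = [0.5 * b for b in beta_e]
ld = [
    [1.0, 0.2, 0.0, 0.0, 0.0],
    [0.2, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, -0.1, 0.0],
    [0.0, 0.0, -0.1, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]
theta_t = ivw_correlated(beta_e, beta_y, ld)
print(theta_t)
```

With an identity LD matrix the same function reduces to the ordinary ratio of sums Σ β_E β_Y / Σ β_E².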
Through the inclusion of mediators M_{k} and their associated cis genetic variants (r^{2} < 0.05, P < 1e-6), θ_{D} can be estimated analogously to θ_{T} using a multivariable regression model^{28} as the first element of the vector θ_{D}:
where B is a matrix with k + 1 columns containing the effect sizes of the IVs on the exposure in the first column and on each mediator in the subsequent columns. The remaining elements of θ_{D} represent the direct effects of the mediators on the outcome and were referred to as α_{MY,k}. In the estimation of MPs, we were not interested in α_{MY,k} values per se, but we took these effect sizes into account for inferring molecular mechanisms. If the number of mediator-associated instruments was sufficient (≥3) to conduct a univariable MR from the mediator to the outcome, we estimated α_{MY,k} from this analysis instead. In fact, the (marginal) contribution of an individual mediator can be better disentangled in univariable analyses when mediators are highly correlated.
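The relation between θ_{T}, θ_{D} and the MP can be illustrated with a deliberately simplified Python sketch: a single mediator and uncorrelated instruments (i.e., C taken as the identity matrix, which is an assumption made here for brevity). The function name `mvmr_one_mediator` and all numbers are hypothetical.

```python
def mvmr_one_mediator(beta_e, beta_m, beta_y):
    """Least-squares fit of beta_y on the columns [beta_e, beta_m], no intercept.

    Returns (theta_d, alpha_my): the direct effect of the exposure and the
    direct effect of the mediator on the outcome.
    """
    see = sum(e * e for e in beta_e)
    smm = sum(m * m for m in beta_m)
    sem = sum(e * m for e, m in zip(beta_e, beta_m))
    sey = sum(e * y for e, y in zip(beta_e, beta_y))
    smy = sum(m * y for m, y in zip(beta_m, beta_y))
    det = see * smm - sem * sem  # determinant of the 2 x 2 matrix B'B
    theta_d = (smm * sey - sem * smy) / det
    alpha_my = (see * smy - sem * sey) / det
    return theta_d, alpha_my


# Toy data: three exposure IVs followed by two mediator-only IVs.
true_theta_d, alpha_em, true_alpha_my = 0.2, 0.5, 0.4
beta_e = [0.10, -0.08, 0.12, 0.0, 0.0]
mediator_only = [0.0, 0.0, 0.0, 0.09, -0.07]
beta_m = [alpha_em * e + d for e, d in zip(beta_e, mediator_only)]
beta_y = [true_theta_d * e + true_alpha_my * m for e, m in zip(beta_e, beta_m)]

# Total effect from a univariable fit on the exposure IVs only.
theta_t = (sum(e * y for e, y in zip(beta_e[:3], beta_y[:3]))
           / sum(e * e for e in beta_e[:3]))
theta_d, alpha_my = mvmr_one_mediator(beta_e, beta_m, beta_y)
mp = 1 - theta_d / theta_t
print(theta_t, theta_d, mp)
```

In this noise-free toy setting the total effect equals θ_D + α_EM · α_MY, so half of the total effect is mediated.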
As our MVMR model assumes a chain of causal effects from the exposure to the mediator and then to the outcome, we conducted several Steiger filtering steps to reduce biases due to reverse causation. Although it has been proposed that DNAm could be a consequence of gene expression in the same locus^{66}, our model investigates the commonly assumed concept of DNAm regulating gene expression. In addition to meeting the Steiger criterion described above, exposure-associated IVs were required to pass that same threshold t_{rev} of no larger mediator than exposure effects for each of the mediators M_{k}. Similarly, to mitigate reverse causal effects from the outcome on the mediators, mediator-associated instruments with larger Y than M effects were removed if not passing the t_{rev} threshold. The SE of \(\hat{\theta}_{D}\) was derived analogously to the univariable form (Eq. (2)) as shown in^{19}.
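The Steiger filtering steps described above can be sketched as a single reusable check in Python. The statistic is assumed here to compare absolute effect sizes, (|β_up| − |β_down|)/√(var_up + var_down), with "up" the trait assumed to be causally upstream; the function name `passes_steiger` and the example values are hypothetical.

```python
import math


def passes_steiger(beta_up, se_up, beta_down, se_down, t_rev=-2.0):
    """Return True if the IV effect on the downstream trait is not
    significantly larger than its effect on the upstream trait."""
    t = (abs(beta_up) - abs(beta_down)) / math.sqrt(se_up ** 2 + se_down ** 2)
    return t > t_rev


# (exposure beta, exposure SE, outcome beta, outcome SE) per instrument
instruments = {
    "iv1": (0.10, 0.01, 0.02, 0.01),  # stronger on the exposure: kept
    "iv2": (0.02, 0.01, 0.10, 0.01),  # much stronger on the outcome: removed
    "iv3": (0.05, 0.01, 0.06, 0.01),  # slightly stronger on the outcome: kept
}
kept = [name for name, stats in instruments.items() if passes_steiger(*stats)]
print(kept)
```

The same function can be applied to exposure-mediator and mediator-outcome pairs by changing which trait is passed as upstream.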
MVMR sensitivity analyses
Conditional F-statistic
Conditional F-statistics of the exposure were calculated following the approach of Sanderson et al.^{33}. This method involves the regression of the exposure on the mediators based on the IV effect sizes on each of these quantities. The residuals of this regression are then used to derive the conditional F-statistic. The original method additionally includes the phenotypic correlation matrix between the exposure and mediators, which we omitted by default due to the lack of these data, using the identity matrix instead. However, as a sensitivity analysis, we calculated conditional F-statistics incorporating the phenotypic correlations between transcript mediators. Transcript correlations were calculated on RNA-seq data from the Cohorte Lausannoise (CoLaus) based on 555 samples^{67}. Transcript correlations could be estimated for 19,517 transcripts, of which 15,021 overlapped with the eQTL dataset (Methods: Omics and trait summary statistics)^{8}. We then calculated conditional F-statistics that included mediator correlations for all DNAm-trait pairs with at least 2 mediators and for which at least half of them had available correlation data. Conditional F-statistics > 10 allow rejection of the null hypothesis that the IVs are too weak to reliably estimate the multivariable effect of the exposure in the presence of the mediators.
Heterogeneity Q-statistic
Heterogeneity Q-statistics were computed as implemented in the TwoSampleMR package (v0.5.6, IVW method)^{34}. This test statistic quantifies the deviation of the MR effect estimates of each individual IV from the IVW estimate based on all IVs^{68}. Under the null hypothesis of homogeneity within the IV set, the statistic follows a chi-squared distribution with m − 1 degrees of freedom for the univariable MR, and m − k degrees of freedom for the MVMR, where m is the number of IVs and k the number of mediators.
Mediator selection threshold P_{EM}
For transcripts to be included as mediators in the MVMR regression model they had to be i) in cis of the DNAm exposure probe (± 500 kb) and ii) causally associated to the DNAm probe. This latter condition was verified by univariable MR analyses (Eq. (1)) of the DNAm exposure probe on each mediator transcript k in the region, estimating the effect sizes α_{EM,k} and p-values P_{EM,k}. Transcripts satisfying P_{EM,k} < P_{EM} were included as mediators with the default threshold equalling 0.01. To assess the sensitivity of this threshold, we also tested milder and more stringent thresholds (P_{EM} = 0.05 and 1e-3).
Pleiotropy sensitivity analyses
To quantify whether significant MR estimates between the exposure and mediators were observed due to horizontal pleiotropy, we conducted two sensitivity analyses. First, we repeated the mediation analysis excluding the top IV (i.e., exposure-associated IV with the lowest p-value) from both the total effect θ_{T} and direct effect θ_{D} calculations. This analysis allowed us to assess whether mediation results are solely driven by a single strong IV. Second, we performed simulation analyses to quantify the possibility that causal links between DNAm probes and transcripts are driven by increased horizontal pleiotropy stemming from potential LD between methylation and transcript instruments due to their close genomic distance.
In the following, we outline step by step the workflow of the horizontal pleiotropy simulation study, for which a schematic representation is shown in Supplementary Fig. 16. First, we considered DNAm-transcript pairs with a significant MR effect at P_{EM} < 1e-6. For each of these selected DNAm-transcript pairs, we first fixed the SNP-DNAm and SNP-transcript effects as observed in the data. Then, using near-independent significant cis-eQTLs (r^{2} < 0.05, P < 1e-6) with observed marginal (univariable) effect sizes β_{M} (a vector of size m_{M}) and the corresponding pairwise local LD matrix C_{M}, we calculated the multivariable SNP effects on the transcript, β_{multi}, as:
Using the original data, we performed DNAm-transcript MR on m_{E} exposure (i.e., DNAm-associated) IVs, yielding the causal effect α_{EM} with corresponding p-value, P_{EM}. We then performed simulation analyses as follows to obtain MR effects for a hypothetical transcript with identical multivariable eQTL effect size distribution as the real transcript. To achieve this, for each simulation j, we randomly selected m_{M} leniently pruned (r^{2} < 0.5) SNPs and assigned β_{multi} as their multivariable eQTL effects. Hence the marginal SNP-transcript effects for the m_{E} exposure-associated SNPs can be calculated as follows:
where \(\mathbf{C}_{E,M_j}\) is the LD matrix between the m_{E} exposure-associated SNPs and the m_{M} randomly chosen SNPs (with multivariable SNP-transcript effect β_{multi}). This way we assign marginal SNP-transcript effect sizes for the m_{E} exposure-associated instruments, while keeping the multivariable eQTL effect size distribution identical to the one observed for the real transcript (but assigned to other SNPs). Univariable DNAm-transcript MR analyses could then be conducted (Eq. (1)) for each hypothetical transcript j, by using β_{marginal,j} as the outcome effect size vector. Thus, we generated MR estimates (α_{EM,j} and P_{EM,j}) for 100,000 (N_{sim}) hypothetical transcripts for 100 randomly selected DNAm-transcript pairs throughout the genome. The simulation p-value was then derived as P_{sim} = #(P_{EM,j} < P_{EM})/N_{sim}.
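Two building blocks of this simulation lend themselves to a short Python sketch: propagating multivariable eQTL effects to marginal effects at the exposure instruments, assumed here to be the matrix-vector product β_marginal,j = C_{E,M_j} β_multi, and the empirical p-value P_sim. The function names and toy values are hypothetical.

```python
def marginal_effects(ld_e_m, beta_multi):
    """Marginal SNP-transcript effects at the exposure IVs.

    ld_e_m has one row per exposure IV and one column per randomly
    selected SNP carrying a multivariable effect from beta_multi.
    """
    return [sum(c * b for c, b in zip(row, beta_multi)) for row in ld_e_m]


def simulation_p_value(p_simulated, p_observed):
    """Fraction of simulated MR p-values below the observed one."""
    return sum(p < p_observed for p in p_simulated) / len(p_simulated)


# Two exposure IVs, two randomly placed SNPs with multivariable effects.
ld_e_m = [[1.0, 0.0],
          [0.5, 0.2]]
beta_multi = [0.10, -0.05]
beta_marginal = marginal_effects(ld_e_m, beta_multi)

p_sim = simulation_p_value([1e-3, 1e-8, 0.2, 1e-7], p_observed=1e-6)
print(beta_marginal, p_sim)
```

A small P_sim indicates that an MR signal as strong as the observed one is rarely produced by LD between nearby instruments alone.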
DNAm-to-trait mediation analysis
A diagram of the workflow with each of the following steps is shown in Supplementary Fig. 1. First, univariable MRs were conducted to estimate the total causal effect \(\hat{\theta}_{T}\) of the DNAm sites on each trait. We assessed the impact of ~ 50,000 DNAm probes with ≥ 5 near-independent (r^{2} < 0.05) mQTLs after harmonization of the datasets. DNAm probes significantly associated to the outcome (P_{T} < 0.05/50,000 = 1e-6) were clumped based on the p-value of the total causal effect \(\hat{\theta}_{T}\), P_{T} (distance-pruning at 1 Mb), to be independent of each other.
Second, MVMR analyses were performed to estimate the direct effect \(\hat{\theta}_{D}\). Selected transcripts (see “Mediator selection threshold P_{EM}”) were included as mediators as well as their associated SNPs as additional instruments. Steiger filtering on mediator-associated IVs was applied using the same t_{rev} threshold as for exposure-associated IVs. Remaining IVs were then clumped based on a rank score determined as follows: 1) for each mediator, IVs were ranked according to their association p-value to the mediator and assigned an integer score, 2) for each IV, a final score was calculated as the sum of its individual mediator scores. Following the establishment of the B effect size matrix, \(\hat{\theta}_{D}\) was calculated, as well as \(\hat{\theta}_{D,\mathrm{top}}\), which was estimated from an MVMR model that includes the transcript with the lowest P_{EM,k} as sole mediator. If no transcript was causally associated with the DNAm probe, mediation is not detectable, and hence \(\hat{\theta}_{D}\) was set to \(\hat{\theta}_{T}\) for that probe (inclusion of such probes in MP calculation was termed “overall mediation proportion”). As the Steiger filter removed exposure-associated instruments with larger mediator than exposure effects (see “Univariable and multivariable Mendelian randomization”), the number of initial exposure-associated instruments (m_{E} ≥ 5) could decrease. Therefore, to avoid scenarios of reverse causality where the mediator exerts an effect on the outcome through the exposure, we required ≥ 3 exposure-associated IVs.
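The rank score used for clumping can be sketched in Python as follows. The sketch assumes that rank 1 is given to the IV with the smallest p-value for a mediator and that every IV has a p-value for every mediator; whether a lower summed score means higher clumping priority is likewise an assumption. The function name `rank_scores` and the identifiers are hypothetical.

```python
def rank_scores(pvalues_by_mediator):
    """Sum, for each IV, its p-value rank across all mediators.

    pvalues_by_mediator maps mediator -> {iv: association p-value}.
    """
    scores = {}
    for pvalues in pvalues_by_mediator.values():
        ordered = sorted(pvalues, key=pvalues.get)  # smallest p-value first
        for rank, iv in enumerate(ordered, start=1):
            scores[iv] = scores.get(iv, 0) + rank
    return scores


pvalues = {
    "M1": {"rs1": 1e-10, "rs2": 1e-4, "rs3": 1e-7},
    "M2": {"rs1": 1e-6, "rs2": 0.2, "rs3": 1e-3},
}
scores = rank_scores(pvalues)
priority = sorted(scores, key=scores.get)  # order in which IVs would be retained
print(scores, priority)
```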
We additionally conducted mediation analyses on independent mediators. To this end, selected mediators (those that passed P_{EM}) were clumped at various correlation thresholds R_{med} (default R_{med} < 0.3, with 0.2 and 0.1 being tested as well). Correlations among mediators were calculated based on QTL effect sizes of independent exposure and mediator IVs and priority was given to the mediator with the lowest P_{EM,k}.
Estimating and comparing mediation proportions
Mediation proportions (MPs) were estimated on sets of DNAm-trait pairs with significant total causal effects \(\hat{\theta}_{T}\), either grouped by trait (if there were at least 10 such pairs within a given trait), trait category (e.g. hepatic traits, inflammatory traits/diseases) or combining all pairs together. MPs were then calculated by regressing \(\hat{\theta}_{D}\) on \(\hat{\theta}_{T}\) (without intercept) to estimate the unmediated proportion, \(\hat{\gamma}\), which after correcting for regression dilution bias^{31} (Eq. (6)):
yielded \(\widehat{\mathrm{MP}} = 1 - \hat{\gamma}_{\mathrm{cor}}\) for a defined set of DNAm-trait pairs, together with a standard error. For individual DNAm-trait pairs, we report the \(\widehat{\mathrm{MP}}\) as \(1 - \hat{\theta}_{D}/\hat{\theta}_{T}\), without providing its variance estimate since this would require individual-level data^{26}. Note that \(\widehat{\mathrm{MP}}\) is an estimator of the true underlying MP and values outside the expected [0, 1] range can be observed, especially if \(\hat{\theta}_{D}\) and \(\hat{\theta}_{T}\) estimates are of opposite sign. Such situations are expected to be rare in our analysis, as the total effect would be expected to be small and hence non-detectable.
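The set-level estimate can be illustrated with a Python sketch of the regression through the origin. The regression dilution correction (Eq. (6)) is deliberately left out because its exact form is not reproduced here, so the sketch returns the uncorrected γ and 1 − γ; the function name and the toy effects are hypothetical.

```python
import math


def unmediated_proportion(theta_t, theta_d):
    """Regress theta_d on theta_t without intercept.

    Returns (gamma, se_gamma), with gamma the uncorrected unmediated
    proportion across a set of exposure-trait pairs.
    """
    sxx = sum(t * t for t in theta_t)
    gamma = sum(t * d for t, d in zip(theta_t, theta_d)) / sxx
    residual_ss = sum((d - gamma * t) ** 2 for t, d in zip(theta_t, theta_d))
    se_gamma = math.sqrt(residual_ss / (len(theta_t) - 1) / sxx)
    return gamma, se_gamma


# Toy set of four exposure-trait pairs with roughly 30% mediation.
theta_t = [0.20, -0.40, 0.50, 0.10]
theta_d = [0.15, -0.27, 0.35, 0.07]
gamma, se_gamma = unmediated_proportion(theta_t, theta_d)
mp_uncorrected = 1 - gamma
print(gamma, se_gamma, mp_uncorrected)
```

Because noise in the total effect estimates attenuates γ towards zero, an uncorrected value of this kind would tend to overstate the MP, which is the motivation for the dilution correction.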
In our approach, indirect effects θ_{M} are estimated by subtracting direct effects from total effects, which is also referred to as the difference in coefficients method^{26}. Alternatively, the indirect effect can be estimated by the product of coefficients method^{26}, where univariable MR estimates from the exposure on the mediator are multiplied with the direct effects of the mediator on the outcome (Eq. (3)) and summed across mediators. Direct effects of the exposure on the outcome can then be obtained by the difference between the total and indirect effect. As demonstrated earlier^{26}, the two approaches yield highly concordant results (Supplementary Fig. 20).
To test the statistical significance of the difference between \(\widehat{\mathrm{MP}}\)s estimated on two different sets of exposure-trait pairs (e.g. \(\widehat{\mathrm{MP}}\) of a given physiological category vs all categories combined) or on the same exposure-trait pairs, but with different parameter settings (e.g. changing P_{EM}), we made use of \(\hat{\gamma}\) and its corresponding standard error \(\mathrm{se}(\hat{\gamma})\) obtained from regressing \(\hat{\theta}_{D}\) on \(\hat{\theta}_{T}\) (both of which were corrected for regression dilution bias (Eq. (6))) to yield \(\hat{\gamma}_{\mathrm{cor}}\) and \(\mathrm{se}(\hat{\gamma})\). We then performed a two-sided z-test based on the following test statistic:
Significant difference between \(\widehat{\mathrm{MP}}\)s was defined by a two-sided p-value ≤ 0.05. Of note, this z-test assumes independence between \(\hat{\gamma}^{(1)}\) and \(\hat{\gamma}^{(2)}\), which is not always guaranteed (e.g., when comparing P_{EM} thresholds), hence the resulting p-values may be lenient.
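A two-sided z-test of this kind can be written in a few lines of Python. The sketch assumes the standard statistic for two independent estimates, z = (γ⁽¹⁾ − γ⁽²⁾)/√(se₁² + se₂²); the function name `two_sided_z_test` and the input values are hypothetical.

```python
import math


def two_sided_z_test(gamma_1, se_1, gamma_2, se_2):
    """Return (z, two-sided p-value) for the difference of two estimates,
    assuming they are independent and approximately normal."""
    z = (gamma_1 - gamma_2) / math.sqrt(se_1 ** 2 + se_2 ** 2)
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p


z, p = two_sided_z_test(0.70, 0.05, 0.60, 0.05)
print(z, p)
```

If the two estimates are positively correlated, as when the same pairs are re-analysed under different settings, the true variance of the difference is smaller than se₁² + se₂² and the independence assumption no longer holds.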
Omics and trait summary statistics
We used mQTL data from the GoDMC consortium (n = 32,851)^{6}, which contains > 170,000 whole blood DNAm sites with at least one significant cis-mQTL (P < 1e-6, < 1 Mb from the DNAm site, n > 5000). Cis-eQTL data were taken from the eQTLGen consortium (n = 31,684)^{8}, which includes cis-eQTLs (< 1 Mb from gene center, 2-cohort filter) for 19,250 transcripts (16,934 with at least one significant cis-eQTL at FDR < 0.05 corresponding to P < 1.8e-05).
GWAS summary statistics for outcome traits came from the largest (n_{average} > 320,000), predominantly European-descent, publicly available studies, as listed in Supplementary Data 1. Thirty-seven out of the 50 traits were continuous biomarkers or continuous physical measures with the GWAS conducted on the UK Biobank^{69} (http://www.nealelab.is/ukbiobank). Remaining GWAS data came mostly from case/control studies made available by the consortium of the respective disease. For binary outcome traits, log-odds ratios were used as effect sizes and results should be interpreted on the liability scale.
Prior to each mediation analysis, exposure and mediator omics, GWAS and the reference panel data were harmonized. The analysis was conducted on autosomal chromosomes, and palindromic single nucleotide variants (SNPs), as well as SNPs with an allele frequency difference > 0.05 between any pairs of datasets, were removed. If allele frequencies were not reported by the GWAS summary statistics, allele frequencies from the UK Biobank were used. Z-scores of summary statistics (molecular and outcome GWAS) were standardized by the square root of the sample size to be on the same SD scale.
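Two of the harmonization steps can be sketched in Python: the standardization of Z-scores, assumed here to mean β = Z/√N with standard error 1/√N, and the removal of palindromic SNPs (A/T or C/G allele pairs). The function names and the toy variants are hypothetical.

```python
import math


def standardize_z(z_score, n):
    """Convert a Z-score into a standardized effect size and its SE."""
    root_n = math.sqrt(n)
    return z_score / root_n, 1.0 / root_n


def is_palindromic(allele_1, allele_2):
    """True for A/T and C/G SNPs, whose strand cannot be inferred from alleles."""
    return {allele_1.upper(), allele_2.upper()} in ({"A", "T"}, {"C", "G"})


variants = [
    ("snp_a", "A", "G", 4.0, 10000),
    ("snp_b", "A", "T", 5.5, 10000),  # palindromic: dropped
    ("snp_c", "C", "T", -3.0, 40000),
]
harmonized = {
    name: standardize_z(z, n)
    for name, a1, a2, z, n in variants
    if not is_palindromic(a1, a2)
}
print(harmonized)
```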
DNAm-to-transcript MR analysis
As follow-up analyses, we calculated MR causal effects between all available DNAm sites and transcripts in cis (± 500 kb) following the same procedure as in the univariable MR to obtain total effects \(\hat{\theta}_{T}\). First, near-independent (r^{2} < 0.05) and significant (P < 1e-6) exposure IVs were selected and IVs not passing the aforementioned Steiger filter were discarded. MR causal effects were then computed based on Eq. (1) for pairs with ≥3 exposure IVs.
The Pearson correlation coefficient with previously reported DNAm-transcript correlations^{37} was calculated on common DNAm-transcript pairs to explore agreement. DNAm probe annotations with respect to the assessed transcript were from the IlluminaHumanMethylation450kanno.ilmn12.hg19 R package (v0.6.1)^{70}.
Simulation studies
We conducted simulation studies to assess the robustness of our model and to identify sources of bias in the estimated MP. Simulation settings were set up post hoc to replicate mediation results obtained for real data (Supplementary Figs. 4–5; Supplementary Table 1).
We considered an exposure with heritability \(h_{E}^{2}\) and m_{E} independent IVs. Effect sizes \(\beta_{i}^{E}\) for m_{E} IVs were drawn from a normal distribution \(\beta_{i}^{E} \sim \mathcal{N}(0, \sqrt{h_{E}^{2}/m_{E}})\) and rescaled to total \(h_{E}^{2}\). N_{med,pot} potential mediators were simulated, among which N_{med} were contributing to the indirect effect θ_{M}. Each mediator k was associated with m_{M} IVs with direct effects \(\beta_{\mathrm{direct},i}^{M_k} \sim \mathcal{N}(0, \sqrt{h_{M,\mathrm{direct},k}^{2}/m_{M}})\) rescaled to \(h_{M,\mathrm{direct},k}^{2}\), the direct heritability of the mediator that does not take into account the additional heritability coming through the exposure. Causal effects of the exposure on the mediator (α_{EM,k}) and of the mediator on the outcome (α_{MY,k}) for N_{med} mediators were drawn from a bivariate normal distribution \(\alpha_{\mathrm{EM},k}, \alpha_{\mathrm{MY},k} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma})\) with Σ the covariance matrix:
where ρ is the correlation between α_{EM,k} and α_{MY,k}. For the remaining N_{med,pot} − N_{med} mediators, α_{EM,k} and α_{MY,k} causal effects were set to zero. The vector of effect sizes \(\boldsymbol{\beta}^{M_k}\) of size m_{E} + N_{med} ⋅ m_{M} for each mediator k was constructed to have effect sizes equalling \(\beta_{i}^{E} \cdot \alpha_{\mathrm{EM},k}\) for m_{E} exposure SNPs and effect sizes equalling \(\beta_{\mathrm{direct},i}^{M_k}\) for m_{M} mediator-associated SNPs. The effect sizes of remaining IVs associated to mediators i ≠ k were set to zero. Likewise, effect sizes of the N_{med} ⋅ m_{M} IVs on the exposure in the β^{E} vector were set to zero.
The indirect effect θ_{M}, direct effect θ_{D} and total effect θ_{T} were calculated as:
These quantities allowed us to generate the outcome effect size vector β^{Y}:
For each scenario, we simulated 500 data sets, each time generating β^{E}, \(\boldsymbol{\beta}^{M_k}\) and β^{Y}. Normally distributed noise, as a function of the sample size N, \(\epsilon_{i}^{E} \sim \mathcal{N}(0, 1/N_{E})\), \(\epsilon_{i}^{M} \sim \mathcal{N}(0, 1/N_{M})\) and \(\epsilon_{i}^{Y} \sim \mathcal{N}(0, 1/N_{Y})\), was added to each simulated vector. To approximate our real data, exposure effect sizes of SNPs serving as mediator instruments were set to zero again. We then estimated for each model \(\hat{\theta}_{T}\) and \(\hat{\theta}_{D}\) by including mediators that satisfied P_{EM} (p-value of the causal effect from the exposure on the mediator), their number being denoted N_{med,sig}. Causal effects \(\hat{\theta}_{D}\) were regressed on \(\hat{\theta}_{T}\) to estimate the coefficient \(\hat{\gamma}\), which after accounting for regression dilution (Eq. (6)) yielded the estimated \(\widehat{\mathrm{MP}}\).
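The construction of the noise-free effect size vectors can be sketched in Python as below. Several ingredients are assumptions made for illustration: the indirect effect is taken as θ_M = Σ_k α_EM,k · α_MY,k, the total effect as θ_T = θ_D + θ_M, and the outcome vector as β^Y = θ_D β^E + Σ_k α_MY,k β^{M_k}; the causal effects are passed in as fixed values instead of being drawn from the bivariate normal, and sampling noise is omitted. All function and parameter names are hypothetical.

```python
import math
import random


def draw_effects(h2, m, rng):
    """Draw m effects from N(0, sd = sqrt(h2 / m)) and rescale so that
    their squared sum equals h2."""
    raw = [rng.gauss(0.0, math.sqrt(h2 / m)) for _ in range(m)]
    scale = math.sqrt(h2 / sum(b * b for b in raw))
    return [b * scale for b in raw]


def simulate_effects(h2_e, m_e, h2_m, m_m, alpha_em, alpha_my, theta_d, seed=1):
    rng = random.Random(seed)
    n_med = len(alpha_em)
    n_iv = m_e + n_med * m_m
    # Exposure effects: zero for all mediator-only instruments.
    beta_e = draw_effects(h2_e, m_e, rng) + [0.0] * (n_med * m_m)
    beta_m = []
    for k in range(n_med):
        # Effects inherited through the exposure, then the mediator's own IVs.
        vec = [b * alpha_em[k] for b in beta_e[:m_e]] + [0.0] * (n_med * m_m)
        start = m_e + k * m_m
        vec[start:start + m_m] = draw_effects(h2_m, m_m, rng)
        beta_m.append(vec)
    theta_t = theta_d + sum(a * b for a, b in zip(alpha_em, alpha_my))
    beta_y = [
        theta_d * beta_e[i] + sum(alpha_my[k] * beta_m[k][i] for k in range(n_med))
        for i in range(n_iv)
    ]
    return beta_e, beta_m, beta_y, theta_t


# One true mediator and one potential mediator without causal effects.
beta_e, beta_m, beta_y, theta_t = simulate_effects(
    h2_e=0.3, m_e=10, h2_m=0.2, m_m=5,
    alpha_em=[0.5, 0.0], alpha_my=[0.4, 0.0], theta_d=0.2,
)
print(theta_t, 1 - 0.2 / theta_t)
```

At the exposure instruments the simulated outcome effects equal θ_T · β^E, so a univariable MR recovers the total effect while the multivariable model separates out θ_D.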
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
Methylation QTLs used in this study are from the GoDMC mQTL meta-analysis and are available on the GoDMC Consortium website (http://mqtldb.godmc.org.uk/downloads). Expression QTLs are from the eQTLGen eQTL meta-analysis and are available on the eQTLGen Consortium website (https://www.eqtlgen.org/ciseqtls.html). The list of GWAS summary statistics used in this study is in Supplementary Data 1, all of which are publicly available. UK10K individual-level data are available upon request (https://www.uk10k.org/data_access.html). Source data are provided with this paper.
Code availability
Software to conduct univariable MR-IVW (molecular trait → outcome, molecular trait 1 → molecular trait 2) and multivariable MR-IVW (molecular trait 1 → molecular trait 2 → outcome) is available at https://github.com/masadler/smrivw (https://doi.org/10.5281/zenodo.7324709^{71}). Source code (C++, released under GPL v2 license) and an executable file (for Linux platforms, released under MIT license) are provided, which rely on functionalities and the data management architecture of the SMR software v1.03 (https://cnsgenomics.com/software/smr^{24}). The provided documentation hosted on the GitHub repository guides users in reproducing the mediation results and conducting univariable and multivariable MR on their own combinations of QTL and GWAS datasets.
References
Buniello, A. et al. The NHGRI-EBI GWAS catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucl. Acids Res. 47, D1005–D1012 (2019).
Tam, V. et al. Benefits and limitations of genome-wide association studies. Nat. Rev. Genet. 20, 467–484 (2019).
Maurano, M. T. et al. Systematic localization of common disease-associated variation in regulatory DNA. Science 337, 1190–1195 (2012).
Nicolae, D. L. et al. Trait-associated SNPs are more likely to be eQTLs: annotation to enhance discovery from GWAS. PLoS Genet. 6, e1000888 (2010).
Hasin, Y., Seldin, M. & Lusis, A. Multi-omics approaches to disease. Genome Biol. 18, 1–15 (2017).
Min, J. L. et al. Genomic and phenotypic insights from an atlas of genetic effects on DNA methylation. Nat. Genet. 53, 1311–1321 (2021).
Consortium, G. et al. Genetic effects on gene expression across human tissues. Nature 550, 204 (2017).
Võsa, U. et al. Large-scale cis- and trans-eQTL analyses identify thousands of genetic loci and polygenic scores that regulate blood gene expression. Nat. Genet. 53, 1300–1310 (2021).
Sun, B. B. et al. Genomic atlas of the human plasma proteome. Nature 558, 73–79 (2018).
Folkersen, L. et al. Genomic and drug target evaluation of 90 cardiovascular proteins in 30,931 individuals. Nat. Metab. 2, 1135–1148 (2020).
Ferkingstad, E. et al. Large-scale integration of the plasma proteome with genetics and disease. Nat. Genet. 53, 1712–1721 (2021).
Shin, S.-Y. et al. An atlas of genetic influences on human blood metabolites. Nat. Genet. 46, 543–550 (2014).
Lotta, L. A. et al. A cross-platform approach identifies genetic regulators of human metabolism and health. Nat. Genet. 53, 54–64 (2021).
Giambartolomei, C. et al. Bayesian test for colocalisation between pairs of genetic association studies using summary statistics. PLoS Genet. 10, e1004383 (2014).
Hormozdiari, F. et al. Colocalization of GWAS and eQTL signals detects target genes. Am. J. Hum. Genet. 99, 1245–1260 (2016).
Gusev, A. et al. Integrative approaches for large-scale transcriptome-wide association studies. Nat. Genet. 48, 245–252 (2016).
Barbeira, A. N. et al. Exploring the phenotypic consequences of tissue specific gene expression variation inferred from GWAS summary statistics. Nat. Commun. 9, 1–20 (2018).
Zhu, Z. et al. Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nat. Genet. 48, 481–487 (2016).
Porcu, E. et al. Mendelian randomization integrating GWAS and eQTL data reveals genetic determinants of complex and clinical traits. Nat. Commun. 10, 1–12 (2019).
Porcu, E. et al. Differentially expressed genes reflect disease-induced rather than disease-causing changes in the transcriptome. Nat. Commun. 12, 5647 (2021).
Burgess, S., Small, D. S. & Thompson, S. G. A review of instrumental variable estimators for Mendelian randomization. Stat. Methods Med. Res. 26, 2333–2355 (2017).
Giambartolomei, C. et al. A Bayesian framework for multiple trait colocalization from summary association statistics. Bioinformatics 34, 2538–2545 (2018).
Gleason, K. J., Yang, F., Pierce, B. L., He, X. & Chen, L. S. Primo: integration of multiple GWAS and omics QTL summary statistics for elucidation of molecular mechanisms of trait-associated SNPs and detection of pleiotropy in complex traits. Genome Biol. 21, 1–24 (2020).
Wu, Y. et al. Integrative analysis of omics summary data reveals putative mechanisms underlying complex traits. Nat. Commun. 9, 1–14 (2018).
Hannon, E. et al. Leveraging DNA-methylation quantitative-trait loci to characterize the relationship between methylomic variation, gene expression, and complex traits. Am. J. Hum. Genet. 103, 654–665 (2018).
Carter, A. R. et al. Mendelian randomisation for mediation analysis: current methods and challenges for implementation. Eur. J. Epidemiol. 36, 465–478 (2021).
Sanderson, E. Multivariable Mendelian randomization and mediation. Cold Spring Harb. Perspect. Med. 11, a038984 (2021).
Burgess, S. et al. Dissecting causal pathways using Mendelian randomization with summarized genetic data: application to age at menarche and risk of breast cancer. Genetics 207, 481–487 (2017).
Burgess, S. & Thompson, S. G. Bias in causal estimates from Mendelian randomization studies with weak instruments. Stat. Med. 30, 1312–1323 (2011).
Zuber, V., Colijn, J. M., Klaver, C. & Burgess, S. Selecting likely causal risk factors from high-throughput experiments using multivariable Mendelian randomization. Nat. Commun. 11, 1–11 (2020).
Knuiman, M. W., Divitini, M. L., Buzas, J. S. & Fitzgerald, P. E. Adjustment for regression dilution in epidemiological regression analyses. Ann. Epidemiol. 8, 56–63 (1998).
Howe, K. L. et al. Ensembl 2021. Nucl. Acids Res. 49, D884–D891 (2021).
Sanderson, E., Spiller, W. & Bowden, J. Testing and correcting for weak and pleiotropic instruments in two-sample multivariable Mendelian randomization. Stat. Med. 40, 5434–5452 (2021).
Hemani, G. et al. The MR-Base platform supports systematic causal inference across the human phenome. eLife 7, e34408 (2018).
Bird, A. DNA methylation patterns and epigenetic memory. Genes Dev. 16, 6–21 (2002).
Wan, J. et al. Characterization of tissue-specific differential DNA methylation suggests distinct modes of positive and negative gene expression regulation. BMC Genomics 16, 1–11 (2015).
Grundberg, E. et al. Global analysis of DNA methylation variation in adipose tissue from twins reveals links to disease-associated variants in distal regulatory elements. Am. J. Hum. Genet. 93, 876–890 (2013).
Rauluseviciute, I., Drabløs, F. & Rye, M. B. DNA hypermethylation associated with upregulated gene expression in prostate cancer demonstrates the diversity of epigenetic regulation. BMC Med. Genomics 13, 1–15 (2020).
Ruiz-Arenas, C. et al. Identification of autosomal cis expression quantitative trait methylation (cis eQTMs) in children’s blood. eLife 11, e65310 (2022).
Lippai, R. et al. Immunomodulatory role of Parkinson’s disease 7 in inflammatory bowel disease. Sci. Rep. 11, 14582 (2021).
Di Narzo, A. F. et al. High-throughput identification of the plasma proteomic signature of inflammatory bowel disease. J. Crohn’s Colitis 13, 462–471 (2019).
Singh, Y. et al. DJ-1 (Park7) affects the gut microbiome, metabolites and the development of innate lymphoid cells (ILCs). Sci. Rep. 10, 1–19 (2020).
Zhang, J. et al. Deficiency in the anti-apoptotic protein DJ-1 promotes intestinal epithelial cell apoptosis and aggravates inflammatory bowel disease via p53. J. Biol. Chem. 295, 4237–4251 (2020).
Moschen, A. R. et al. Lipocalin 2 protects from inflammation and tumorigenesis associated with gut microbiota alterations. Cell Host Microbe 19, 455–469 (2016).
Gieger, C. et al. New gene functions in megakaryopoiesis and platelet formation. Nature 480, 201–208 (2011).
Nürnberg, S. T. et al. A GWAS sequence variant for platelet volume marks an alternative DNM3 promoter in megakaryocytes near a MEIS1 binding site. Blood, J. Am. Soc. Hematol. 120, 4859–4868 (2012).
Watkins, N. A. et al. A HaemAtlas: characterizing gene expression in differentiated human blood cells. Blood, J. Am. Soc. Hematol. 113, e1–e9 (2009).
Bielczyk-Maczyńska, E. et al. A loss of function screen of identified genome-wide association study loci reveals new genes controlling hematopoiesis. PLoS Genet. 10, e1004450 (2014).
Liang, Y. et al. Demethylation of the FCER1G promoter leads to FcεRI overexpression on monocytes of patients with atopic dermatitis. Allergy 67, 424–430 (2012).
Pairo-Castineira, E. et al. Genetic mechanisms of critical illness in Covid-19. Nature 591, 92–98 (2021).
COVID-19 Host Genetics Initiative. Mapping the human genetic architecture of COVID-19. Nature 600, 472–477 (2021).
Smieszek, S. P. & Polymeropoulos, M. H. Loss of function mutations in the IFNAR2 in COVID-19 severe infection susceptibility. J. Glob. Antimicrob. Resist. 26, 239–240 (2021).
Zhu, H., Wang, G. & Qian, J. Transcription factors as readers and effectors of DNA methylation. Nat. Rev. Genet. 17, 551–565 (2016).
Yin, Y. et al. Impact of cytosine methylation on DNA binding specificities of human transcription factors. Science 356, eaaj2239 (2017).
Whalen, S., Truty, R. M. & Pollard, K. S. Enhancer–promoter interactions are encoded by complex genomic signatures on looping chromatin. Nat. Genet. 48, 488–496 (2016).
Jjingo, D., Conley, A. B., Soojin, V. Y., Lunyak, V. V. & Jordan, I. K. On the presence and role of human gene-body DNA methylation. Oncotarget 3, 462 (2012).
Richmond, R. C., Hemani, G., Tilling, K., Davey Smith, G. & Relton, C. Challenges and novel approaches for investigating molecular mediation. Hum. Mol. Genet. 25, R149–R156 (2016).
Verbanck, M., Chen, C.-Y., Neale, B. & Do, R. Detection of widespread horizontal pleiotropy in causal relationships inferred from Mendelian randomization between complex traits and diseases. Nat. Genet. 50, 693–698 (2018).
Pellegrino-Coppola, D. et al. Correction for both common and rare cell types in blood is important to identify genes that correlate with age. BMC Genomics 22, 1–12 (2021).
Merkin, J., Russell, C., Chen, P. & Burge, C. B. Evolutionary dynamics of gene and isoform regulation in mammalian tissues. Science 338, 1593–1599 (2012).
Garrido-Martín, D., Borsari, B., Calvo, M., Reverter, F. & Guigó, R. Identification and analysis of splicing quantitative trait loci across multiple tissues in the human genome. Nat. Commun. 12, 1–16 (2021).
Zhu, Z. et al. Causal associations between risk factors and common diseases inferred from GWAS summary data. Nat. Commun. 9, 1–12 (2018).
UK10K et al. The UK10K project identifies rare variants in health and disease. Nature 526, 82 (2015).
Hemani, G., Tilling, K. & Davey Smith, G. Orienting the causal relationship between imprecisely measured traits using GWAS summary data. PLoS Genet. 13, e1007081 (2017).
Lynch, M., Walsh, B. et al. Genetics and analysis of quantitative traits, vol. 1 (Sinauer Sunderland, MA, 1998).
Gutierrez-Arcelus, M. et al. Passive and active DNA methylation and the interplay with genetic variation in gene regulation. eLife 2, e00523 (2013).
Sönmez Flitman, R. et al. Untargeted metabolome- and transcriptome-wide association study suggests causal genes modulating metabolite concentrations in urine. J. Proteome Res. 20, 5103–5114 (2021).
Burgess, S., Bowden, J., Fall, T., Ingelsson, E. & Thompson, S. G. Sensitivity analyses for robust causal inference from Mendelian randomization analyses with multiple genetic variants. Epidemiol. 28, 30 (2017).
Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
Hansen, K. IlluminaHumanMethylation450kanno.ilmn12.hg19: annotation for Illumina’s 450k methylation arrays. R package version 0.6.0 10, B9 (2016).
Sadler, M. Quantifying the role of transcript levels in mediating DNA methylation effects on complex traits and diseases (2022). Masadler/smrivw, https://doi.org/10.5281/zenodo.7324709.
Acknowledgements
This work was supported by the Swiss National Science Foundation (310030_189147) to Z.K. LD was calculated based on the UK10K data resource (EGAD00001000740, EGAD00001000741). Computations were performed on the JURA cluster of the University of Lausanne. We used RNA-seq data from 557 CoLaus participants, kindly made available by Sven Bergmann, to compute gene-gene correlations.
Author information
Contributions
M.C.S., E.P. and Z.K. conceived and designed the study. M.C.S. performed statistical analyses. K.L. contributed to the statistical analyses. E.P. provided guidance on statistical analyses. Z.K. supervised all statistical analyses. All authors provided advice on the interpretation of results. C.A. contributed to the biological interpretation of the results. M.C.S., E.P. and Z.K. drafted the manuscript. C.A. contributed to the writing of specific sections. All authors read, approved, and provided feedback on the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Jean Morrison and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Sadler, M.C., Auwerx, C., Lepik, K. et al. Quantifying the role of transcript levels in mediating DNA methylation effects on complex traits and diseases. Nat Commun 13, 7559 (2022). https://doi.org/10.1038/s41467-022-35196-3