Mendelian randomization while jointly modeling cis genetics identifies causal relationships between gene expression and lipids

van der Graaf, Adriaan; Claringbould, Annique; Rimbert, Antoine; Westra, Harm-Jan; Li, Yang; Wijmenga, Cisca; Sanna, Serena

doi:10.1038/s41467-020-18716-x

Article
Open access
Published: 01 October 2020

Nature Communications volume 11, Article number: 4930 (2020)

Abstract

Inference of causality between gene expression and complex traits using Mendelian randomization (MR) is confounded by pleiotropy and linkage disequilibrium (LD) of gene-expression quantitative trait loci (eQTL). Here, we propose an MR method, MR-link, that accounts for unobserved pleiotropy and LD by leveraging information from individual-level data, even when only one eQTL variant is present. In simulations, MR-link shows false-positive rates close to expectation (median 0.05) and high power (up to 0.89), outperforming all other tested MR methods and coloc. Application of MR-link to low-density lipoprotein cholesterol (LDL-C) measurements in 12,449 individuals with expression and protein QTL summary statistics from blood and liver identifies 25 genes causally linked to LDL-C. These include the known SORT1 and ApoE genes as well as PVRL2, located in the APOE locus, for which a causal role in liver was not known. Our results showcase the strength of MR-link for transcriptome-wide causal inferences.

Introduction

Mendelian randomization (MR) is a method that can infer causal relationships between two heritable complex traits from observational studies^1,2. In recent years, MR has gained popularity in the epidemiological field and its application has provided valuable insights into the risk factors that cause diseases and complex traits^1,2,3. MR studies have, for example, successfully identified causal relationships between low-density lipoprotein cholesterol (LDL-C) and coronary artery disease, in turn informing therapeutic strategies^4,5. MR studies have also shown that a causal relationship between high-density lipoprotein cholesterol (HDL-C) and coronary artery disease is unlikely, which is in contrast to previous epidemiological associations⁶. The same approach has been applied to identify molecular marks that are causal to disease^7,8,9,10. Since gene expression is one of these marks, investigating its causal role in complex traits is of particular interest given that complex trait loci are enriched for expression quantitative trait loci (eQTLs)¹¹.

MR infers a causal relationship between an exposure (e.g., a risk factor) and an outcome (e.g., a complex trait) by leveraging QTL variants of the exposure as instrumental variables (IVs). The mathematical model behind MR relies on three main assumptions to correctly infer causality: the IVs have to be (i) associated with the exposure, (ii) independent of any confounder of the exposure-outcome association, and (iii) conditionally independent of the outcome given the exposure and confounders. One major challenge of applying MR to gene expression is correcting for deviations from the third assumption, which can occur in the presence of linkage disequilibrium (LD) between the eQTL variants used as IVs, or in the presence of pleiotropy, i.e., when IVs affect the outcome through pathways other than the exposure of interest. Accounting for LD is necessary when gene expression is the exposure trait in MR because, in contrast to the majority of complex traits, the genetic architecture of gene expression is characterized by the presence of strong-acting eQTLs located proximal to their transcript (in cis), which are often correlated through LD^12,13. On top of this, the presence of pleiotropy cannot be excluded a priori given that the majority of variants in our genome are likely to affect one or multiple phenotypes^14,15,16. There are MR methods^7,17,18,19,20,21 that extend standard MR analysis to correct for LD and pleiotropy; however, the application of these methods is not optimal because they require the removal of pleiotropic IVs from the statistical model^7,19,20, that all sources of pleiotropy are measured and incorporated into the model^22,23, or that both the exposure and the outcome are measured in the same cohort²¹. These constraints limit robust inference of gene-expression traits as there are often only a limited number of IVs (i.e., eQTL variants) available, and subsequent removal of outliers will substantially reduce power. Likewise, it is not always possible to measure all sources of pleiotropy because it could come from expression of a gene in a different tissue or even from other unobserved molecular marks or phenotypes.

Here we introduce MR-link, an MR method that allows for causal inference in the presence of LD and an unobserved pleiotropic effect, without requiring the removal of pleiotropic IVs or measuring all sources of pleiotropy. MR-link uses summary statistics of an exposure combined with individual-level data on the outcome to estimate the causal effect of an exposure from IVs (i.e., eQTLs if the exposure is gene expression), while at the same time correcting for pleiotropic effects using genetic variants that are in LD with these IVs (cis-genetics) (Fig. 1).

**Fig. 1: Graphical representation of the study.**

We assess the performance of MR-link using simulated data in 100 different scenarios that mimic the genetic architecture of gene expression. We derive this information from eQTL association patterns in a large cohort of samples with genetic and transcriptomics data¹³. Subsequently, we apply MR-link to individual-level data for LDL-C measurements in 12,449 individuals with four different eQTL summary statistic datasets: blood eQTLs identified in the BIOS cohort (Fig. 1) and eQTLs from blood, liver, and cerebellum from the GTEx Consortium²⁴ (Fig. 1). We further explore the performance of MR-link on another molecular layer, protein levels, through the application of MR-link on protein quantitative trait loci (pQTL) summary statistics from Sun et al. combined with our LDL-C measurements²⁵. Our results in simulated and real data show that MR-link can robustly identify causal relationships between molecular traits—such as gene expression and protein levels—and an outcome (e.g., a complex trait), even when the information for causal inference is very limited.

Results

eQTL variants between different genes are often in LD

In a standard MR analysis, IVs need to be independent (not in LD) and have to affect the outcome only through the exposure (absence of pleiotropy). Even in the absence of pleiotropy, correlated IVs in the cis locus may negatively influence an MR analysis (Fig. 2a). In the presence of pleiotropy, we distinguish two scenarios: (i) pleiotropic variants that are in LD with an IV (pleiotropy through LD, Fig. 2b) and (ii) when the IV and the pleiotropic variant are the same and affect the outcome through two distinct mechanisms (pleiotropy through overlap, Fig. 2c). If pleiotropy through LD is prevalent, genetic variants in the cis-region other than those selected as IVs can be used to explain the pleiotropic effects. Incorporating these variants in an MR model can then account for this pleiotropy through LD (Fig. 2b).

**Fig. 2: Typical scenarios of pleiotropy in causal inference of gene expression changes as an exposure.**

We investigated how often pleiotropy through LD occurs in gene expression by looking at how frequently eQTL variants are shared between genes in cis. Using data from the BIOS Consortium, a cohort of 3503 Dutch individuals whose genome and whole-blood transcriptome have been characterized (Fig. 1), we searched for eQTLs located 1.5 megabases (Mb) on both sides of the translated region of 19,960 genes (see “Methods”)¹³. We then applied a summary statistics-based stepwise linear regression approach (GCTA-COJO) to identify jointly significant variants, i.e., one or more variants that jointly associate significantly with expression changes of a gene²⁶ (“Methods”). We observed that 54% of the genes with an eQTL at p < 5 × 10⁻⁸ (13,778 genes) had two or more jointly significant eQTL variants at p < 5 × 10⁻⁸ (“Methods”) (Fig. 1 and Fig. 2a). These genetic effects were mostly non-overlapping: only 13.4% of the genes have overlapping (r² > 0.99) top eQTL variants. In contrast, genetic variants regulating gene expression of a gene were very often in LD with other eQTLs: 40.6% of top variants are in LD (r² > 0.5) between genes, and this percentage increased to 60.3% if all jointly significant eQTL variants were considered (“Methods”).
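The two LD categories used above (overlapping at r² > 0.99, in LD at r² > 0.5) can be illustrated with a small sketch. It assumes LD is measured as the squared Pearson correlation of genotype dosages (0, 1, or 2 copies of an allele), which is one common convention; the function names and the toy genotypes are hypothetical and are not taken from the study.

```python
def r_squared(dosage_a, dosage_b):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(dosage_a)
    mean_a = sum(dosage_a) / n
    mean_b = sum(dosage_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(dosage_a, dosage_b))
    var_a = sum((a - mean_a) ** 2 for a in dosage_a)
    var_b = sum((b - mean_b) ** 2 for b in dosage_b)
    return cov * cov / (var_a * var_b)


def classify_pair(r2):
    """Label a variant pair using the r2 cut-offs quoted in the text."""
    if r2 > 0.99:
        return "overlapping"
    if r2 > 0.5:
        return "in LD"
    return "low LD"


# Toy dosages for two eQTL variants of neighbouring genes (invented values).
variant_gene1 = [0, 1, 2, 1, 0, 2]
variant_gene2 = [0, 1, 2, 1, 0, 1]
r2 = r_squared(variant_gene1, variant_gene2)
print(round(r2, 3), classify_pair(r2))
```

In this toy case the two variants are correlated but not identical, which corresponds to the "pleiotropy through LD" situation rather than "pleiotropy through overlap".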

To strengthen our inferences on the genetic regulation of gene expression in cis, we performed statistical fine-mapping using FINEMAP v1.3.1²⁷ on 13,276 genes (“Methods”). Only 373 (2.8%) genes have full eQTL overlap (all variants in the top configuration of a gene are identical or in high LD (r² > 0.99)), while 33.2% of the genes have at least one variant in r² > 0.5 LD with a variant in the top configuration of another gene. These percentages are higher for configurations with larger posterior inclusion probabilities (“Methods”) (Supplementary Data 1), but overall the results are similar to our observations from the GCTA-COJO analysis, i.e., the genetics of gene expression in whole blood is mostly regulated by variants that do not overlap but are in moderate LD with variants associated with gene expression changes of another gene. Based on these results, it seems likely that pleiotropy through LD is more common than pleiotropy through overlap in gene-expression traits.

MR-link outperforms other methods in discriminative ability

We have developed an MR method, MR-link, that uses the genetic region surrounding IVs as a covariate to correct for pleiotropic effects (“Methods”, Fig. 2 and Supplementary Note 1). The model underlying MR-link is informed by the observation that the genetic regulation of gene expression is characterized mostly by eQTLs that are in LD, but not overlapping, between genes. This suggests that the variants in the genetic vicinity of the IVs can be used to correct for pleiotropic effects.

MR-link gathers information from all genetic variants in LD with an IV to jointly model the outcome through the IVs and their genetic vicinity (“Methods”). Compared to other MR methods that require summary statistics of both the exposure and the outcome (two-sample MR), our approach adds a requirement of individual-level data for the outcome, but has the advantage that it can perform causal inference even when only a single IV is available. Strictly speaking, MR-link corrects for pleiotropy under the assumption that pleiotropy can be better explained by variants in LD with the IV (pleiotropy through LD) (Fig. 2b) and that pleiotropy through overlap is absent (Fig. 2c). In the case of a single IV, this assumption must hold fully, but when multiple IVs are available, it can be relaxed somewhat. Differences in effect sizes between IVs can be used to distinguish the causal effect of interest from a pleiotropic effect in the same way that multivariable MR corrects for pleiotropy²². Of note, MR-link does not require the source of pleiotropy to be specified in the model; MR-link can account for pleiotropic effects arising from, for instance, gene expression in other tissues or from other molecular layers or phenotypes.
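The idea of modeling the outcome jointly through the IVs and their genetic vicinity can be illustrated with a toy regression. This is a minimal sketch and not the MR-link estimator itself: the ridge penalty, the helper names (`solve`, `ridge_fit`), and the toy genotypes and effect sizes are all assumptions made for illustration, and the p value calibration and variant selection of the actual method are not reproduced here.

```python
def center(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def ridge_fit(columns, outcome, penalty):
    """Ridge regression of a centered outcome on centered predictor columns."""
    cols = [center(c) for c in columns]
    y = center(outcome)
    k = len(cols)
    xtx = [[sum(a * b for a, b in zip(cols[i], cols[j])) + (penalty if i == j else 0.0)
            for j in range(k)] for i in range(k)]
    xty = [sum(a * b for a, b in zip(cols[i], y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical toy data: one IV and one neighbouring variant in LD with it.
iv = [0, 0, 1, 1, 1, 2, 2, 2]
tag = [0, 1, 1, 1, 2, 1, 2, 2]
eqtl_beta = 0.5                                   # assumed eQTL effect of the IV
predicted_expression = [eqtl_beta * g for g in iv]
# Outcome = causal effect 0.2 via expression + pleiotropic effect 0.3 via the tag variant.
outcome = [0.2 * x + 0.3 * t for x, t in zip(predicted_expression, tag)]

naive = ridge_fit([predicted_expression], outcome, penalty=1e-9)
joint = ridge_fit([predicted_expression, tag], outcome, penalty=1e-9)
print("naive estimate:", round(naive[0], 3))
print("joint estimate:", round(joint[0], 3))
```

In this noise-free toy example, the estimate that ignores the neighbouring variant absorbs the pleiotropic signal, whereas including the variant in LD as a covariate recovers the simulated effect of 0.2.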

We assessed the performance of MR-link under different scenarios and compared it to four other MR methods: Inverse variance weighting (IVW), which assumes the absence of LD and pleiotropy, and the pleiotropy-robust methods MR-Egger, LDA-MR-Egger, and MR-PRESSO (Table 1)^17,18,19,28. In addition, we compared MR-link to the widely used Bayesian colocalization method coloc²⁹, although this is not a formal test for assessing causal relationships, but rather a way to evaluate if two traits share the same causal variant(s) in a locus²⁹.
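For reference, the baseline IVW estimator can be sketched in a few lines. This is the standard fixed-effect IVW formula, which assumes independent IVs and no pleiotropy; the function name and the summary statistics below are invented for illustration and do not come from the study.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse variance weighted MR estimate from summary statistics."""
    numerator = sum(bx * by / (s * s)
                    for bx, by, s in zip(beta_exposure, beta_outcome, se_outcome))
    denominator = sum(bx * bx / (s * s)
                      for bx, s in zip(beta_exposure, se_outcome))
    estimate = numerator / denominator
    std_err = math.sqrt(1.0 / denominator)
    z = estimate / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p value
    return estimate, std_err, p_value


# Invented per-IV effects: outcome effects are exactly 0.2 times the eQTL effects.
eqtl_betas = [0.5, 0.3, 0.8]
outcome_betas = [0.10, 0.06, 0.16]
outcome_ses = [0.02, 0.03, 0.02]
estimate, std_err, p_value = ivw_estimate(eqtl_betas, outcome_betas, outcome_ses)
print(round(estimate, 3), round(std_err, 4))
```

Because every IV enters with a weight that ignores LD and pleiotropy, any pleiotropic signal shared across IVs feeds directly into the estimate, which is consistent with the inflated FPRs reported for IVW in the simulations.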

Table 1 MR methods assessed in this study.

We simulated causal relationships between an exposure and an outcome in a 5 Mb region, based on LD structure estimated for 403 European samples from the 1000 Genomes project³⁰ (“Methods”). All tested MR methods were assessed in 1500 simulated datasets for 100 different scenarios that varied with respect to the absence or presence of causality, the absence or presence of pleiotropy, and the number of causal eQTL variants. We initially evaluated two approaches to select QTL variants as IVs: GCTA-COJO (v1.26.0) and p value clumping (“Methods”)^26,31. We observed that GCTA-COJO was best suited for IV selection because: (i) the median number of IVs identified by GCTA-COJO better represented the number of simulated causal variants (Supplementary Data 2) and (ii) the false-positive rates (FPRs) in the MR analysis using the IVW method were lower (median FPR was 0.057 using GCTA-COJO versus 0.115 using clumping) (Supplementary Fig. 1 and Supplementary Data 2). We therefore selected IVs for the exposure using the GCTA-COJO approach in subsequent analyses.

When we simulated pleiotropy through LD with no causal effect of the known exposure on the outcome (Figs. 2b, 3a, Supplementary Data 3 and “Methods”), all existing MR-methods showed inflated FPRs (up to 0.71, 0.15, 0.13, and 0.27 for IVW, MR-Egger, LDA-MR-Egger, and MR-PRESSO, respectively), whereas MR-link presented an FPR close to expectation (median: 0.05, maximum: 0.058). In addition, for LDA-MR-Egger, MR-Egger, and MR-PRESSO, the FPR was undesirably dependent on the number of causal SNPs simulated (Fig. 3a).

**Fig. 3: Relative performance of different MR methods.**

In the scenarios of pleiotropy through LD and non-null causal effects (b_E = 0.05, b_E = 0.1, b_E = 0.2, and b_E = 0.4), MR-link has high detection power (up to 0.89) and strongly outperforms all other pleiotropy-robust methods (maximum detected power was 0.28 for MR-Egger, 0.26 for LDA-MR-Egger and 0.65 for MR-PRESSO) (Fig. 3b, c, Supplementary Data 3 and “Methods”). Among all the methods tested, including MR-link, and for all scenarios, IVW had the greatest detection power but also an inflated FPR (minimum FPR: 0.63), making this MR method unreliable in such pleiotropic scenarios (“Methods”).
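FPR and power in such simulations are both rejection rates, computed under different ground truths. The sketch below assumes a nominal significance level of 0.05, which matches the expected FPR quoted above; the function name and the p values are invented for illustration.

```python
def rejection_rate(p_values, alpha=0.05):
    """Fraction of simulated datasets in which the null hypothesis is rejected."""
    return sum(p < alpha for p in p_values) / len(p_values)


# Invented p values from a handful of simulated datasets per scenario.
p_null_scenario = [0.01, 0.20, 0.50, 0.74, 0.33, 0.91, 0.62, 0.12]    # no causal effect
p_causal_scenario = [0.001, 0.03, 0.20, 0.004, 0.0002, 0.04, 0.6, 0.01]  # causal effect

fpr = rejection_rate(p_null_scenario)      # rejections when no effect was simulated
power = rejection_rate(p_causal_scenario)  # rejections when an effect was simulated
print(fpr, power)
```

A well-calibrated method should give a rejection rate near the chosen level in null scenarios and as high a rate as possible in causal scenarios.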

When we simulated increasing levels of pleiotropy through overlap (Fig. 2c and “Methods”), a situation we expect to be rare in real-world scenarios based on our observation in the BIOS cohort, we observed that all methods including MR-link have increased FPRs (up to 0.22 for MR-link, 0.77 for IVW, 0.10 for LDA-MR-Egger, 0.13 for MR-Egger, and 0.30 for MR-PRESSO) (Supplementary Data 4). Nonetheless, MR-link remains a powerful method when a causal effect is simulated: maximum power was 0.79 for MR-link, 0.98 for IVW, 0.29 for MR-Egger, 0.28 for LDA-MR-Egger, and 0.65 for MR-PRESSO (Supplementary Data 4). Although IVW again had the highest power (0.98) here, the FPR was likewise highly inflated (0.77).

Finally, we compared MR-link to the coloc package using the area under the receiver operating characteristic curve (AUC) metric as well as FPRs and power (calculated using coloc PP4 > 0.9 as a threshold) (“Methods”). We used the AUC metric because coloc provides posterior probabilities of causal variant sharing and not p values (“Methods”). As coloc assumes that the exposure and the outcome share only one causal variant, we also included the recently implemented coloc variations (coloc-cond and coloc-masked) in our comparison. These variations are expected to perform better in scenarios with multiple causal variants³². When comparing MR-link to the coloc variations through the AUC metric, we find that MR-link consistently outperforms coloc and coloc-masked in all scenarios, and coloc-cond in pleiotropic scenarios. In non-pleiotropic scenarios, MR-link and coloc-cond have approximately the same performance (Supplementary Fig. 2 and Supplementary Data 5). As expected, coloc-cond has better discriminative performance compared to the original coloc when multiple causal variants are simulated (Supplementary Fig. 2 and Supplementary Data 5).
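The AUC lets a p value-based method and a posterior probability-based method be compared on the same scale, because it depends only on how scores rank causal versus non-causal simulations. A minimal sketch of the rank-based (Mann-Whitney) form of the AUC follows; the function name and the scores are invented, and for an MR method one could use, for example, the negative logarithm of the p value as the score, which is an assumption rather than the procedure stated in the paper.

```python
def auc(scores_causal, scores_null):
    """Probability that a causal simulation scores higher than a null one (ties count half)."""
    wins = 0.0
    for causal in scores_causal:
        for null in scores_null:
            if causal > null:
                wins += 1.0
            elif causal == null:
                wins += 0.5
    return wins / (len(scores_causal) * len(scores_null))


# Invented scores, e.g. coloc PP4 values for causal and non-causal simulations.
pp4_causal = [0.9, 0.8, 0.4]
pp4_null = [0.5, 0.3, 0.2]
print(round(auc(pp4_causal, pp4_null), 3))
```

An AUC of 0.5 corresponds to no discrimination and 1.0 to perfect separation of causal from non-causal scenarios.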

To illustrate detection rates in standard coloc settings as they may be used in a real-world analysis, we determined power and FPR for all coloc variations at a PP4 threshold of > 0.9 (Supplementary Fig. 3 and Supplementary Data 6). In the non-pleiotropic case, coloc and coloc-cond have the best detection power (up to 0.79 for coloc and 0.76 for coloc-cond), combined with near zero FPRs (max: 0 for coloc and 0.0006 for coloc-cond) while coloc-masked has lower power (up to 0.40) with a zero FPR (Supplementary Fig. 3a–c) (Supplementary Data 6). In simulations of pleiotropy through LD, all coloc methods have increased FPRs (medians: 0.026 for coloc, 0.142 for coloc-cond, and 0.0037 for coloc-masked) with a decrease in power relative to the non-pleiotropic simulations (max: 0.37 for coloc, 0.43 for coloc-cond, and 0.14 for coloc-masked) (Supplementary Fig. 3d–f and Supplementary Data 6). These patterns were even more apparent in cases of pleiotropy through overlap (Supplementary Fig. 3g–i and Supplementary Data 6). This comparison through FPRs and power indicates again that MR-link has superior discriminative ability over coloc variations, especially in the presence of pleiotropy.

MR-link identifies gene expression causal to LDL-C levels

We applied MR-link to four separate summary statistics-based eQTL datasets combined with individual-level genotype data and LDL-C measurements in 12,449 individuals from the Lifelines cohort³³ (Fig. 1). We assessed the causal effect of gene expression changes in (i) whole blood (using eQTLs from BIOS (n = 3503) and GTEx (n = 369)), (ii) liver as the main tissue important for cholesterol metabolism (using eQTLs from GTEx, n = 153), and (iii) cerebellum tissue (using eQTLs from GTEx, n = 154) as a tissue not involved in cholesterol metabolism but with similar sample size (and thus power) to liver tissue^24,34.

Transcriptome-wide application of MR-link to these eQTL datasets identified 24 significant genes whose variation in blood (18 using BIOS eQTLs, 2 using GTEx eQTLs) or liver (4 genes) was causally related to LDL-C (Tables 2, 3, Supplementary Tables 1 and 2). No significant genes were found in the cerebellum (Supplementary Table 2).

Table 2 MR-link results using BIOS blood eQTLs.

Table 3 MR-link results using GTEx liver eQTLs.

The MR analysis that used whole-blood eQTLs from GTEx was, as expected, underpowered compared to the analysis using BIOS eQTLs. Only two genes were found to be significant here, but they were not significant in the analysis that used BIOS eQTLs, where a more robust estimate could be made thanks to the higher number of IVs identified (Supplementary Fig. 4a). Despite the limited power, we observed high concordance between effect sizes from the two analyses for all genes that showed nominal significance (p < 0.05) in the analysis that used BIOS eQTLs, with 94.8% of genes showing the same effect direction (Supplementary Fig. 4b).

Several genes located in genome-wide association study (GWAS) loci for cholesterol metabolism were found significant in the MR analysis that used blood eQTLs from BIOS, using a Bonferroni threshold that accounted for 13,778 genes being tested (0.05/13778 = 3.6 × 10⁻⁶). These include ABO, located in an LDL-C locus, AOC1, TMEM176A, and TMEM176B, which are all located in the same HDL-C-associated locus^35,36, and SYCP2L, which is located in a GWAS locus for polyunsaturated fatty acids and related to LDL-C levels^37,38. For the other genes identified, there was no evidence in the literature for a direct role in cholesterol metabolism, although some interesting patterns were evident. For example, we observed multiple genes involved in immunoglobulin production (IGLC5, IGLC6, IGLV4-69, and IGLVI-70) and insulin metabolism (UNC5B, DEPP1), mechanisms that are consistent with the role of cholesterol in inflammation and insulin resistance^39,40. For all 18 genes, the effect direction estimated by MR-link was concordant with the direction estimated by other MR methods when they were available, except in the case of MSLN, where only LDA-MR-Egger gave discordant results compared to all other methods (Table 1, Supplementary Fig. 5, and Supplementary Table 3). Interestingly, 17 of the 18 genes did not pass significance after multiple testing correction using the other tested methods: only ABO passed Bonferroni significance and only when using the IVW method (Table 1, Supplementary Fig. 5, and Supplementary Table 3). In 13 genes, a causal effect could not be estimated by MR-Egger, LDA-MR-Egger, and MR-PRESSO because there were too few IVs. Furthermore, MR-PRESSO did not make a causal estimate in the remaining 5 genes as it identified too many outliers (Table 1, Supplementary Fig. 5, and Supplementary Table 3).

In the MR analysis using eQTLs from liver, all the genes identified at the Bonferroni significance level of 3.2 × 10⁻⁵ (0.05/1557) fall within LDL-C GWAS loci. Among these, we found a negative causal effect for the well-known SORT1 gene (MR-link calibrated two-sided p = 5.9 × 10⁻⁹). Multiple functional studies have shown that Sortilin, the protein encoded by SORT1, affects plasma LDL-C levels by acting on clearance of LDL-C and on secretion of very-LDL (VLDL) by the liver^41,42,43 (Table 3 and Supplementary Table 2). We also found two other genes in the same GWAS locus, PSRC1 and CELSR2, but the IV (only one was found) for these genes was identical to that of SORT1 due to the high correlation between expression levels of these genes. Full overlap of a single IV in this locus makes it impossible to discern causal from pleiotropic genes using MR methods, including MR-link. The fourth gene found to be significant using liver eQTLs is PVRL2 (MR-link calibrated two-sided p = 3 × 10⁻¹⁴), which is located in the APOE locus associated with LDL-C (Table 3)^35,36. For PVRL2, we estimated a positive causal effect; higher expression of PVRL2 is causally related to higher LDL-C (Table 3). PVRL2 is 17.5 kb downstream of the APOE gene, and two common missense polymorphisms in APOE account for a large fraction of the association signal^36,44. Interestingly, in the most recent GWAS meta-analysis for lipids, 19 jointly significant LDL-C variants were found spanning a 162 kb region that encompasses PVRL2³⁶. This indicates that, while missense mutations in APOE play a major role, other genes in this locus are also likely involved in LDL-C regulation and that pleiotropic effects are to be expected. Our analyses indicate that PVRL2 is one of the causal genes at this locus. The positive effect of PVRL2 on LDL-C was also seen in the analysis that used blood eQTLs from BIOS (MR-link calibrated two-sided p = 4.3 × 10⁻⁵), although it did not pass our significance threshold in that analysis. Likewise, variation in gene expression of PVRL2 in blood has been found to be associated with LDL-C in a transcriptome-wide association analysis carried out in a very large genetic association study³⁶. Of note, since the LD between IVs used in the analysis of blood and liver eQTLs was low (r² < 0.2), the results potentially indicate a dual causal role for PVRL2 across these two tissues.
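The two Bonferroni thresholds quoted for the blood and liver analyses follow directly from the number of genes tested. The short sketch below reproduces that arithmetic and checks the reported p values against the thresholds; the function name is hypothetical.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


blood_threshold = bonferroni_threshold(13778)  # genes tested with BIOS blood eQTLs
liver_threshold = bonferroni_threshold(1557)   # genes tested with GTEx liver eQTLs
print(f"{blood_threshold:.1e}", f"{liver_threshold:.1e}")

# Reported MR-link p values from the text.
sort1_liver_significant = 5.9e-9 < liver_threshold
pvrl2_liver_significant = 3e-14 < liver_threshold
pvrl2_blood_significant = 4.3e-5 < blood_threshold
print(sort1_liver_significant, pvrl2_liver_significant, pvrl2_blood_significant)
```

This makes explicit why PVRL2 is significant in liver but only suggestive in blood: the blood analysis tests roughly nine times more genes, so its threshold is correspondingly stricter.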

PVRL2 has mostly been studied in the context of atherosclerosis, where it has been shown to act as a cholesterol-responsive gene involved in trans-endothelial migration of leukocytes in vascular endothelial cells, a key feature in atherosclerosis development^45,46,47. Our results indicate a role for PVRL2 in modulating plasma levels of LDL-C via its expression variation in the liver. Biologically, the role in liver could be explained by increased production of very-LDL or decreased LDL-C uptake (Fig. 4). In line with this hypothesis, an siRNA screen in hepatic cell lines of genes in the APOE locus showed that downregulation of PVRL2 gene expression promotes LDL-C uptake⁴⁸ (Fig. 4). Overall, our results and existing functional evidence support that PVRL2 expression is correlated with LDL-C levels and show a causal effect in liver (Fig. 4).

**Fig. 4: Biological interpretation of *PVRL2*.**

MR-link confirms ApoE changes affect LDL-C levels

To assess the effectiveness of MR-link in proteomics measurements, we combined the aforementioned LDL-C measurements in the Lifelines cohort with cis-pQTL summary statistics of 471 plasma protein measurements (measured using the SOMAscan platform in a cohort of 3301 individuals) (“Methods”)^25,49. One protein passes the Bonferroni multiple testing threshold (p < 1.05 × 10⁻⁴): ApoE3, an isoform of ApoE (causal effect: 0.40 (+/−0.13 s.e.), MR-link calibrated two-sided p = 4.65 × 10⁻⁵, SOMAmer ID: APOE.2937.10.2). pQTLs were also available for ApoE2 (SOMAmer ID: APOE.5312.49.3), another isoform of ApoE, but the causal effect was weaker and did not pass the Bonferroni threshold (causal effect = 0.56 (+/−0.24 s.e.), MR-link calibrated two-sided p = 0.002)⁴⁴. These results are in line with the well-known causal relationship between increased ApoE plasma levels and LDL-C, and the widely described stronger impact of the E3 isoform compared to the E2 isoform⁴⁴. Interestingly, MR-link did not estimate BGAT, the protein product of ABO, to be significant in this dataset (SOMAmer ID: ABO.9253.52.3, MR-link calibrated two-sided p = 0.18). We compared the IVs identified for BGAT (rs9411463 and rs72775494) with those used in the ABO blood eQTL analysis and found that only one IV for the BGAT protein (rs9411463) was in LD with any of the four IVs for ABO expression in BIOS. This scenario is in line with the overall pattern observed in the proteomics study, where only a small fraction of eQTLs in blood also affect protein levels, but our results could also reflect targeting of the SOMAmer to a specific ABO protein isoform²⁵. Unfortunately, further isoform information for BGAT was not available in the original study.

Discussion

Identification of genes whose changes in expression are causally linked to a phenotype is crucial for understanding the mechanisms behind complex traits. While several methods exist that infer causal relationships between two phenotypes, these rely on a set of assumptions that are often violated when gene expression is the exposure. Specifically, LD and pleiotropy between the genetic variants chosen as IVs are the main causes of violations of such assumptions^17,18,19,28. Here we interrogated a large gene-expression dataset and showed that the eQTLs of a gene, which can be used as IVs, are very likely to be in LD, but not overlapping, with eQTLs of other genes, indicating that potential sources of pleiotropy in transcriptome-wide MR analyses are likely to come from variants in LD with the IVs.

We therefore developed MR-link, a causal inference method that is robust to unobserved pleiotropy. Our in silico results show that MR-link has the best discriminative ability compared to all other MR methods we tested, as well as to the Bayesian colocalization method coloc. MR-link jointly models the outcome using jointly significant eQTLs as IVs, combined with variants in LD, to correct for all potential sources of pleiotropy. To our knowledge, this approach has never been used in a causal inference method.

We applied MR-link to real data, combining LDL-C measurements with eQTLs derived from blood, cerebellum, and liver. This identified known and previously unknown causal genes within and outside GWAS loci. For example, we identified the well-known negative causal relationship between expression of SORT1 in liver and LDL-C^41,42,43. In liver, and suggestively in blood, we detected a causal effect for PVRL2, a gene located in the APOE locus. While a role for this gene is mostly known for immune and endothelial cells and in the context of atherosclerosis^45,47, our results indicate that regulation of expression of this gene in both blood and liver causally affects LDL-C levels. Given its established role in atherogenesis, PVRL2 has been proposed as a potential therapeutic target for atherosclerosis. Our study indicates that such strategies should not only take into account the effect on atherosclerotic plaques, but also consider the hepatic function of PVRL2 in regulating plasma LDL-C levels in humans.

All the genes identified in the analyses that used eQTLs from blood were different from those identified using eQTLs from liver. While this is partly due to statistical power, as the BIOS cohort is more than 20 times larger than the GTEx cohort used to derive eQTLs in liver, this may also be related to tissue-specific mechanisms. We expect that causal genes found in whole blood will affect LDL-C through pathways that signal for lipid changes or regulate lipid binding to erythrocytes, as hypothesized for the ABO gene, whereas genes found in liver are more likely to be involved in lipid metabolism^50,51.

MR-link has several advantages over other recent MR methods developed to overcome bias from LD and pleiotropy^17,23. First, MR-link can model unobserved pleiotropy, whereas sources of pleiotropy need to be specified in multivariable MR methods. This is particularly important because sources of pleiotropy may be context-dependent and may arise from a phenotype other than those being measured in a cohort^14,34. Second, MR-link can derive robust causal estimates even when only one or two IVs are available. The majority of genes tested in our large eQTL dataset have fewer than three IVs (68%), which makes it impossible for MR-PRESSO, MR-Egger, and LDA-MR-Egger to make causal estimates^17,18,19.

One of the MR-link assumptions is that the IVs affect the outcome only through the exposure, conditional on the unmeasured pleiotropic effect. This assumption is violated when the IVs of the exposure and of the pleiotropic effect are fully overlapping. This assumption must not be violated when a single IV is available, but can be relaxed when multiple IVs are used in the model, as the relative effects of the IVs help to discriminate between a true causal effect and a pleiotropic effect, similar to multivariable Mendelian randomization methods²². In the case of multiple IVs that are fully overlapping, we have shown that MR-link has an increased FPR, yet still maintains higher power compared to other MR-methods and superior discriminative ability compared to coloc.

The application of MR-link is not restricted to gene expression or proteomics datasets; it can also be applied to other molecular layers that are known to have a similar genetic architecture to gene expression, such as metabolites. Given the increases in sharing of summary statistics from functional genomics QTL studies, coupled with the development of very large biobanks such as the UK Biobank, the Estonian Biobank, the Lifelines cohort study, and the Million Veteran Program cohort^33,52,53,54, we foresee many opportunities for applications of MR-link to individual-level data for the identification of the molecular mechanisms underlying complex traits. Of note, while we have limited our simulations to quantitative traits as an outcome in this paper, MR-link could be applied to binary traits such as human diseases. However, we have not investigated its performance in detail for binary outcome phenotypes. Furthermore, as for all MR studies, our method can be applied to populations of any ethnicity, provided that the summary statistics of the exposure are derived from a population that is ethnically matched with the outcome cohort.

We foresee that many causal relationships will be discovered if highly powered causal inference methods such as MR-link are applied to many human traits. This could make it possible to build extensive causal networks similar in size and complexity to metabolic networks of small molecules, which would provide valuable insights into the mechanisms behind human traits and diseases.

Methods

BIOS consortium cohort genotype and expression analysis

We used genotype and expression measurements on 3746 Dutch individuals from the Biobank-based Integrative Omics Study (BIOS; http://www.bbmri.nl/acquisition-use-analyze/bios/), a collection of six different data cohorts: Lifelines DEEP⁵⁵, Prospective ALS Study Netherlands⁵⁶, Leiden Longevity Study⁵⁷, Netherlands Twin Registry⁵⁸, The Cohort on Diabetes and Atherosclerosis Maastricht⁵⁹, and the Rotterdam Study⁶⁰. All cohorts from the BIOS consortium were approved by their ethical committees, as follows: the LLDEEP was approved by the medical ethics committee of the University Medical Center Groningen; the Prospective ALS Study Netherlands was conducted with the approval of the institutional review board of the University Medical Centre Utrecht; the Leiden Longevity Study was approved by the Medical Ethical Committee of the Leiden University Medical Center; the Netherlands Twin Registry was approved by Central Ethics Committee on Research Involving Human Subjects of the VU University Medical Center, Amsterdam, an Institutional Review Board certified by the US Office of Human Research Protections (IRB number IRB-2991 under Federal-wide Assurance-3703; IRB/institute codes, NTR 03-180); the Rotterdam Study was approved by the institutional review board (Medical Ethics Committee) of the Erasmus Medical Center and by the review board of The Netherlands Ministry of Health, Welfare and Sports; the CODAM study was approved by the medical ethics committee of Maastricht University. An informed consent form was obtained from all the participants. Genotyping was performed separately per cohort (see references). All combined genotypes were imputed to the Haplotype reference consortium dataset⁶¹ using the Michigan imputation server⁶². We retained only biallelic SNPs and confined our analyses to variants with minor allele frequency (MAF) > 0.01, Hardy–Weinberg equilibrium (HWE) p value >10⁻⁶ and an imputation quality RSQR > 0.8. 
A genetic relationship matrix (GRM) was derived based on LD-pruned genotypes using the Plink 1.9 command --indep 50 5 2, and one individual was kept from all pairs of individuals that had a GRM value > 0.1 using the --rel-cutoff Plink 1.9 command³¹. Population outliers were identified using a principal component analysis of the GRM, and individuals more distant than three standard deviations from the mean of principal component 1 and principal component 2 were removed.
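The population-outlier rule described above can be sketched as follows. This is a minimal illustration that assumes the scores for principal components 1 and 2 are already available as plain lists; the function name is hypothetical, and removing an individual who is an outlier on either component is an assumed reading of the text, not the authors' code.

```python
from statistics import mean, stdev

def keep_non_outliers(pc1, pc2, n_sd=3.0):
    """Return indices of individuals within n_sd SDs of the mean on both PCs.

    Assumption: an individual is dropped when it is an outlier on PC1 or PC2.
    """
    m1, s1 = mean(pc1), stdev(pc1)
    m2, s2 = mean(pc2), stdev(pc2)
    return [
        i for i, (a, b) in enumerate(zip(pc1, pc2))
        if abs(a - m1) <= n_sd * s1 and abs(b - m2) <= n_sd * s2
    ]

# Toy scores: 20 unremarkable individuals and one extreme value on PC1.
pc1 = [0.0] * 20 + [100.0]
pc2 = [float(i % 5) for i in range(21)]
kept = keep_non_outliers(pc1, pc2)
```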

RNA-seq gene-expression quality control and processing are the same as those of Zhernakova et al.¹³. RNA extracted from whole blood was paired-end sequenced using the Illumina HiSeq 2000 instrument. RNA-seq read alignment was performed using STAR (version 2.3.0e)⁶³. During alignment, variants with MAF < 0.01 from the Genome of the Netherlands were masked⁶⁴. Gene expression was quantified using HTSeq (version v0.6.1p1)⁶⁵. Samples with < 80% of reads mapping to exons were considered of low quality and removed. Samples were also removed if they had < 85% of mapped reads, or if they had a median 3′ bias larger than 70% or smaller than 45%. To further account for unobserved confounders, the expression matrix was corrected for the first 25 principal components as well as 5′ bias, 3′ bias, GC content, intron base-pair percentage, and sex following the procedure of Zhernakova et al.¹³. After genotype and expression quality control filters, 3503 individuals with expression data of 19,960 transcripts and genotype information of 7,838,327 SNPs were available for analyses. In this set, 57% were female and the average age was 52.8 years (±16.0 Stand. Dev.). eQTL association analysis was performed for SNPs located ±1.5 Mb of the transcript using Plink 1.9 and the --assoc command³¹. For 13,778 genes, at least one eQTL at p < 5 × 10⁻⁸ was identified, and those genes were used for all the analyses described in this manuscript.

We quantified how many genetic variants are necessary to explain gene expression using a conditional joint analysis approach. We identified jointly significant eQTLs by applying GCTA-COJO (v1.26.0)²⁶ to eQTL summary statistics, using the BIOS cohort as LD reference panel, and selecting jointly significant variants that showed a p < 5 × 10⁻⁸ in this analysis step. To infer how often eQTLs are shared between genes, we assessed the percentage of genes with top eQTLs (or jointly significant variants) that have LD r² > 0.99. We used the r² > 0.5 threshold to see how often eQTL variants were in LD with each other.

We performed statistical fine-mapping of all genes using the FINEMAP v1.3.1 program²⁷. First, we searched for associated eQTL variants (p < 5 × 10⁻⁸) in the cis-associated region. We then padded the associated regions with 100 kb and only looked for variants in this extended region. FINEMAP requires the same number of individuals across all variants; we therefore analyzed only the genes with the associated variants available in all subcohorts. We ran FINEMAP on these genes with the --sss option, using LD computed with Plink v1.9, with the --r command. Furthermore, genes were not run if they had fewer than 25 variants available in the region, or if a combination of variants led to an invalid posterior probability, leaving 13,276 genes that were successfully fine-mapped.

FINEMAP provides several configurations of statistically fine-mapped variants, along with their posterior probability of being causal. Studies that aim to identify causal variants usually consider multiple causal-variant configurations up to a high cumulative posterior inclusion probability, to make sure that the causal variant is captured in the analysis. In MR studies it is not necessary to identify the true causal variants, as the IVs only need to explain the exposure signal as well as possible. In our analysis of LD between FINEMAP variants, we have therefore only considered the most likely configuration identified by FINEMAP, as these variants best explain the exposure variation.

Lifelines cohort genotype data and LDL-C levels

Lifelines is a multi-generational cohort study of 167,000 individuals from the north of The Netherlands. It was approved by the medical ethics committee of the University Medical Center Groningen and conducted in accordance with Helsinki Declaration Guidelines. All participants signed an informed consent form prior to enrollment. A subset of 13,436 Lifelines samples were genotyped with the cytoSNP array and underwent the quality control steps described in Scholtens et al.³³: Genotyped variants were retained based on three criteria: MAF > 0.001, HWE p > 10⁻⁴, and a genotyping call rate > 0.95. After genotype quality control, samples were imputed using the Genome of the Netherlands reference panel⁶⁴ and Minimac version 2012.10.3⁶⁶. Variants were further excluded if they were of bad imputation quality (RSQR < 0.3), showed deviation from HWE (p < 10⁻⁶), or if they were absent in the set of quality controlled genotyped and imputed variants of the BIOS cohort.

Low-density lipoprotein cholesterol (LDL-C) was estimated using the Friedewald equation⁶⁷, based on triglycerides, high-density lipoprotein, and total cholesterol levels³³. Total cholesterol levels of individuals who were prescribed cholesterol-lowering medication were divided by 0.8 prior to calculating LDL-C. Individuals with >4.52 mmol per liter total triglycerides were removed⁶⁷. In addition, LDL-C levels were corrected for age, age squared, and sex. After genotype and LDL-C quality control, 12,449 individuals (of which 58.8% were female and the average age was 48.7 years (±11.5 Stand. Dev.)) and 7,336,374 variants remained for analyses. Association analysis for additive effects on LDL-C was performed using linear regression on standardized genotypes, i.e., genotypes transformed into a distribution with mean 0 and variance 1. Summary statistics of this analysis were used to perform MR analyses using the existing MR methods listed in Table 1.
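As a small worked illustration of the LDL-C derivation, the sketch below applies the medication adjustment and the triglyceride cut-off described above. It assumes all inputs are in mmol per liter and uses the customary mmol-per-liter form of the Friedewald equation (triglycerides divided by 2.2), which is not spelled out in the text; the function name and arguments are hypothetical, and the subsequent correction for age, age squared, and sex is not shown.

```python
def friedewald_ldl(total_chol, hdl, triglycerides, on_medication=False):
    """Estimate LDL-C (mmol/L) from total cholesterol, HDL-C and triglycerides.

    Returns None for individuals above the 4.52 mmol/L triglyceride cut-off,
    who were removed from the analysis.
    """
    if triglycerides > 4.52:
        return None
    if on_medication:
        # Adjust total cholesterol for cholesterol-lowering medication.
        total_chol = total_chol / 0.8
    # Friedewald equation in mmol/L units (divisor 2.2 is the customary form).
    return total_chol - hdl - triglycerides / 2.2

ldl_untreated = friedewald_ldl(5.0, 1.5, 1.1)
ldl_treated = friedewald_ldl(4.0, 1.5, 1.1, on_medication=True)
ldl_excluded = friedewald_ldl(5.0, 1.5, 5.0)
```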

GTEx download and analysis

We downloaded GTEx version 7 eQTL summary statistics, including non-significant results, from the GTEx website (https://gtexportal.org/home/datasets/)²⁴. For every gene with at least one eQTL at p < 5 × 10⁻⁸, conditional analysis using GCTA-COJO was performed to select secondary variants at the same threshold, using the BIOS cohort as an LD reference. This resulted in 4028, 1557, and 1726 genes with at least one jointly significant eQTL for whole blood, liver, and brain (cerebellum) tissues, respectively.

pQTL summary statistics download and analysis

We downloaded the proteomics summary statistics of Sun et al.²⁵ from the GWAS catalog (ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/SunBB_29875488_GCST005806). We isolated cis-regions by selecting variants within ±1.5 Mb from each transcript. These variants already passed the quality control steps of Sun et al.²⁵: (i) INFO score ≥ 0.7; (ii) minor allele count ≥ 8; (iii) Hardy–Weinberg equilibrium p ≥ 5 × 10⁻⁶. For all these variants we used UK10K minor allele frequencies (ftp://ngs.sanger.ac.uk/production/uk10k/UK10K_COHORT/REL-2012-06-02/UK10K_COHORT.20160215.sites.vcf.gz), as this information was not provided in the summary statistics but is required for GCTA-COJO IV selection. We selected IVs using Lifelines genotypes as an LD reference³³. To run MR-link, we first selected proteins with significantly (p < 5 × 10⁻⁸) associated variants that were shared between the cis summary statistics and the Lifelines cohort. This resulted in 471 proteins with significantly associated variants (p < 5 × 10⁻⁸) that overlap with the variants in the Lifelines cohort and for which GCTA-COJO was able to identify IVs.

Simulation of genotypes

Four hundred and three non-Finnish European individuals were isolated from the 1000 Genomes phase 3 release and used as a starting point for genotype simulation³⁰. We simulated genotype data for 25,000 individuals in a chromosomal region (Chromosome 2, 100–105 Mb, human genome build 37) using the HAPGEN2 program (v.2.2.0), combined with interpolated HAPMAP3 recombination rates⁶⁸. The region was then reduced to 1 Mb in length: between 102 Mb and 103 Mb. Only biallelic SNPs with MAF > 0.01 were retained from simulated genotypes, leaving 3101 variants in this region. Simulated individuals were separated into an outcome cohort of 15,000 individuals, and into an exposure cohort and an LD reference cohort of 5000 individuals each. These cohort sizes were chosen to roughly represent the sizes of the BIOS and Lifelines cohorts.

Simulation of phenotypes

We simulated quantitative phenotypes representing the exposures by randomly selecting SNPs from the simulated genetic region, and subsequently assigning these an effect. Causal SNPs were selected to represent both pleiotropy through LD (Fig. 2b) and pleiotropy through overlap (Fig. 2c). For the scenario of pleiotropy through LD (Fig. 2b), one to ten causal SNPs (subset s_E) for the exposure were randomly selected from the entire simulated genetic region, and the same number of causal SNPs (subset s_U) for the unobserved (pleiotropic) exposure was randomly selected from all SNPs in moderate LD (0.25 < r² < 0.95) with SNPs in s_E.

When pleiotropy through overlap was simulated (Fig. 2c), the causal SNPs for the observed and unobserved exposure were selected to be identical: s_E = s_U. A combination of pleiotropy through overlap and pleiotropy through linkage was simulated by choosing some or all of the SNPs of the unobserved exposure (subset s_U) to be overlapping and some being in LD (0.25 < r² < 0.95) with SNPs in s_E.

The mathematical framework for the simulation of phenotypes is as follows. For each selected causal SNP of the exposure (subset s_E), we simulated an effect-size from the uniform distribution U(−0.5,0.5) and then simulated the observed exposure y_E as:

$${\mathbf{y}}_{\mathrm{E}} = {\mathbf{X}}{\mathbf{\beta}}_{\mathrm{E}} + {\mathbf{C}} + {\mathbf{\epsilon }}_{\mathrm{E}},$$

(1)

where X is a genotype matrix of size n × m, with n being the number of individuals (5000) and m the number of variants in the region (3101 in the simulated data), β_E is the vector of effects ${\mathbf{\beta}}_{{\mathrm{E}},j} = \left\{ {\begin{array}{*{20}{c}} { \sim U\left( { - 0.5,0.5} \right)} & {{\mathrm{if}}\,j \in {\mathbf{s}}_{\mathrm{E}}} \\ 0 & {{\mathrm{otherwise}}} \end{array}} \right.,\forall j \in \{ 1, \ldots ,m\}$, and C ~ N(0,0.5)ⁿ is an n-vector of independent scalar draws from N(0,0.5), representing a cohort-specific confounder value per individual. Finally, ${\mathbf{\epsilon }}_{\mathrm{E}} \sim N\left( {0,1} \right)^n$ is an n-vector of the measurement error of the exposure. Similarly, the unobserved exposure y_U was simulated as:

$${\mathbf{y}}_{\mathrm{U}} = {\mathbf{X}}{\mathbf{\beta}}_{\mathrm{U}} + {\mathbf{C}} + {\mathbf{\epsilon }}_{\mathrm{U}},$$

(2)

where β_U is the vector of effects defined as: ${\mathbf{\beta}}_{{\mathrm{U}},j} = \left\{ {\begin{array}{*{20}{c}} { \sim U\left( { - 0.5,0.5} \right)} & {{\mathrm{if}}\;j \in {\mathbf{s}}_{\mathrm{U}}} \\ 0 & {{\mathrm{otherwise}}} \end{array}} \right.,\forall \;j \in \{ 1, \ldots ,m\}$, s_U is the selection of SNPs for the unobserved exposure and ${\mathbf{\epsilon }}_{\mathrm{U}}$ are measurement errors distributed as ${\mathbf{\epsilon }}_{\mathrm{E}}$. The outcome phenotype y_O was then simulated as a linear combination of the observed and unobserved exposures:

$${\mathbf{y}}_{\mathrm{O}} = {\mathbf{y}}_{\mathrm{E}}b_{\mathrm{E}} + {\mathbf{y}}_{\mathrm{U}}b_{\mathrm{U}} + {\mathbf{C}} + {\mathbf{\epsilon }}_{\mathrm{O}},$$

(3)

where the causal effect of interest is parameterized per simulation run as $b_{\mathrm{E}} \in \{ 0,0.05,0.1,0.2,0.4\}$ and the (unknown) pleiotropic effect is the parameter $b_{\mathrm{U}} \in \left\{ {0,0.4} \right\}$ reflecting absence and presence of a pleiotropic effect in a locus. Again, the measurement error ${\mathbf{\epsilon }}_{\mathrm{O}}$ is drawn from N(0,1)ⁿ.

The genetic variants of the exposures (s_E, s_U) and their effect sizes β_E, β_U were drawn and used in both cohorts (exposure and outcome), while the other random variables C, ${\mathbf{\epsilon }}_{\mathrm{U}},{\mathbf{\epsilon }}_{\mathrm{E}},{\mathbf{\epsilon }}_{\mathrm{O}}$ were randomly drawn in a cohort-specific manner. Since our model was built to account for unobserved pleiotropy, the observed and unobserved exposure were used to generate the outcome phenotype as in Eq. (3), but only the outcome phenotypes and the summary statistics of the (observed) exposure phenotype were used in the causal inference analysis.
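Equations (1) to (3) can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the authors' simulation code: genotypes are drawn as independent binomial counts instead of being generated by HAPGEN2, the cohort is much smaller, the second parameter of N(0,0.5) is read as a variance, the pleiotropy-through-overlap scenario (s_U = s_E) is used, and all function names are hypothetical.

```python
import numpy as np

def draw_effects(m, causal_idx, rng):
    """Effect vector: U(-0.5, 0.5) at the causal SNPs, 0 elsewhere."""
    beta = np.zeros(m)
    beta[causal_idx] = rng.uniform(-0.5, 0.5, size=len(causal_idx))
    return beta

def simulate_phenotypes(X, beta_E, beta_U, b_E, b_U, rng):
    """Simulate y_E, y_U and y_O for one cohort following Eqs. (1)-(3)."""
    n = X.shape[0]
    # Assumption: 0.5 in N(0, 0.5) is a variance, so the SD is sqrt(0.5).
    C = rng.normal(0.0, np.sqrt(0.5), size=n)
    y_E = X @ beta_E + C + rng.normal(0.0, 1.0, size=n)          # Eq. (1)
    y_U = X @ beta_U + C + rng.normal(0.0, 1.0, size=n)          # Eq. (2)
    y_O = y_E * b_E + y_U * b_U + C + rng.normal(0.0, 1.0, size=n)  # Eq. (3)
    return y_E, y_U, y_O

rng = np.random.default_rng(0)
n, m = 500, 50                      # toy sizes, far below those in the paper
X = rng.binomial(2, 0.3, size=(n, m)).astype(float)  # stand-in genotypes
s_E = rng.choice(m, size=3, replace=False)
s_U = s_E                           # pleiotropy through overlap
beta_E = draw_effects(m, s_E, rng)
beta_U = draw_effects(m, s_U, rng)
y_E, y_U, y_O = simulate_phenotypes(X, beta_E, beta_U, b_E=0.2, b_U=0.4, rng=rng)
```

To mimic the two-cohort design, the same s_E, s_U, β_E, and β_U would be reused with a second genotype matrix, while C and the error terms are redrawn inside `simulate_phenotypes`.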

Simulation parameters and scenarios

We simulated 1500 runs per scenario, each with a unique outcome (O) and two exposures (E and U). The scenarios differed in the number of causal SNPs (which varied from one to ten for both the observed and unobserved exposure), the strength of the causal relationship of interest (varied from no causal effect up to a large effect, $b_{\mathrm{E}} \in \left\{ {0,0.05,0.1,0.2,0.4} \right\}$), and the presence (b_U = 0.4) or absence (b_U = 0.0) of the pleiotropic effect. This resulted in 10 × 5 × 2 = 100 different scenarios.

In certain cases, an estimate cannot be made by an MR method, for instance when insufficient IVs are identified or a solution is not found in the estimation method. As a result, there are sometimes fewer estimates than expected in the final results. To ensure the stability of our FPR and power estimates, we have only reported results for an MR method in a specific scenario if we had more than 100 estimates out of the 1500 simulated runs.
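The scenario grid and the reporting rule can be enumerated directly. The snippet below is only a bookkeeping sketch; the variable and function names are hypothetical.

```python
from itertools import product

n_causal_snps = range(1, 11)               # one to ten causal SNPs
b_E_values = (0.0, 0.05, 0.1, 0.2, 0.4)    # causal effect of interest
b_U_values = (0.0, 0.4)                    # absence / presence of pleiotropy

scenarios = list(product(n_causal_snps, b_E_values, b_U_values))

def is_reportable(n_estimates, minimum=100):
    """Report a method in a scenario only if it gave more than `minimum` estimates."""
    return n_estimates > minimum
```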

Instrumental variable selection

IV selection can be difficult when there is LD between association signals. In simulations, we used two IV selection techniques: GCTA-COJO²⁶ and p value clumping, using standard settings of Plink 1.9 except for the r² threshold, which was set to 0.1³¹. Both selection methods used a p value threshold of p < 5 × 10⁻⁸. When selecting IVs for BIOS and GTEx, we only used the GCTA-COJO technique.
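The idea behind p value clumping can be sketched as a greedy selection. This is a simplified illustration, not Plink's implementation: it ignores Plink's physical-distance window and secondary thresholds, takes a precomputed r² matrix as input, and the function name is hypothetical.

```python
def clump(pvalues, r2, p_threshold=5e-8, r2_threshold=0.1):
    """Greedy clumping: walk SNPs from most to least significant and keep a
    SNP only if its r2 with every already selected SNP is below the threshold."""
    significant = [i for i, p in enumerate(pvalues) if p < p_threshold]
    selected = []
    for i in sorted(significant, key=lambda idx: pvalues[idx]):
        if all(r2[i][j] < r2_threshold for j in selected):
            selected.append(i)
    return selected

# Toy example: SNP 1 is in strong LD with the lead SNP 0; SNP 3 is not significant.
pvalues = [1e-20, 1e-10, 1e-9, 0.01]
r2 = [
    [1.0, 0.8, 0.0, 0.0],
    [0.8, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
instruments = clump(pvalues, r2)
```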

MR-link

MR-link is a method for causal inference that is robust to the presence of LD and unobserved pleiotropy. It is an MR approach that requires individual-level data from the outcome cohort and summary statistics (effect sizes, standard errors and MAFs) from an exposure. Conceptually, MR-link jointly models a known exposure with SNPs that are in LD with the exposure IVs (tag-SNPs). Tag-SNPs are used to account for the unobserved pleiotropic effect present in a locus.

We defined our model in the following manner. Let X be a genotype matrix of size n × m, where n is the number of individuals in the outcome study and m is the number of SNPs in a cis-region around the transcript (±1.5 Mb of the transcript), in which SNPs at indices s_E are the causal genetic variants (IVs) for the exposure E. If we define the exposure E and the unobserved (pleiotropic) exposure U as in Eqs. (1) and (2), then the outcome phenotype y_O from Eq. (3) can be represented as a function of E and U with the following equation:

$${\mathbf{y}}_{\mathrm{O}} = {\mathbf{X}}{\mathbf{\beta }}_{\mathrm{E}}b_{\mathrm{E}} + {\mathbf{X}}{\mathbf{\beta }}_{\mathrm{U}}b_{\mathrm{U}} + {\mathbf{C}}_{\mathrm{O}} + {\mathbf{\epsilon }}_{\mathrm{O}},$$

(4)

where b_E is the causal effect of interest of the exposure on the outcome, b_U is the causal effect of the unobserved exposure, C_O is an n-vector of independent scalars representing a cohort-specific confounder value per individual, and ${\mathbf{\epsilon }}_{\mathrm{O}}$ is the measurement error of the outcome. In the hypothetical case that the genetic effects for both the exposure E and the pleiotropic exposure U are known, we can estimate b_E by solving Eq. (4) in an analysis that is similar to multivariable MR²². In a real-world scenario, only the IV(s) for the exposure are known, while the variants that contribute to the unobserved (pleiotropic) exposure and their effect on the outcome are unknown.

Under Eq. (4), MR-link relies on the assumption that the SNPs in s_E influence the outcome y_O only through their effect on y_E, when conditioning on the SNPs in s_U.

MR-link uses the following procedure to estimate causal effects:

(1) A selection ${\hat{\mathbf{s}}}_{\mathrm{E}}$ of IVs for the exposure and conditional effect sizes $\widehat {\mathbf{\beta }}_{\mathrm{E}}$ for these IVs are determined using the GCTA-COJO method²⁶. A vector of effect sizes $\widehat {\mathbf{\beta }}_{\mathrm{E}}$ for all SNPs in the region is thus defined as: $\widehat {\mathbf{\beta }}_{{\mathrm{E}},j} = \left\{ {\begin{array}{*{20}{c}} { \ne 0} & {{\mathrm{if}}\,j \in {\hat{\mathbf{s}}}_{\mathrm{E}}} \\ 0 & {{\mathrm{otherwise}}} \end{array}} \right.,\forall j \in \{ 1, \ldots ,m\}$.
(2) All SNPs in LD 0.1 < r² < 0.99 with the exposure IVs are potential tag-SNPs. These variants are iteratively pruned for high LD so that tag-SNPs, s_T, are always r² < 0.95 with each other in order to reduce collinearity and computation time.
(3) The following equation is solved for b_E using ridge regression:
$${\mathbf{y}}_{\mathrm{O}} = \left( {\begin{array}{*{20}{c}} \vdots & \vdots \\ {\frac{{{\mathbf{X}}\widehat {\mathbf{\beta }}_{\mathrm{E}}}}{{m_{\mathrm{E}}}}} & {\frac{{{\mathbf{X}}_{\mathrm{T}}}}{{\surd m_{\mathrm{T}}}}} \\ \vdots & \vdots \end{array}} \right)\left( {\begin{array}{*{20}{c}} {b_{\mathrm{E}}} \\ \vdots \\ {{\mathbf{\beta }}_{\mathrm{U}}b_{\mathrm{U}}} \\ \vdots \end{array}} \right) + {\mathbf{\epsilon }},$$
(5)
where X_T is the genotype matrix of the outcome containing only tagging variants as defined in step (2), m_T is the number of tagging variants and is used to normalize for the number of tags in the region, and m_E represents the number of IVs selected by the selection method and is a parameter used to remove the dependency of the model on the number of IVs. The resulting coefficient vector contains the causal effect of interest b_E, and the vector β_Ub_U of length m_T is a nuisance parameter that captures pleiotropic effects.
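Steps (2) and (3) can be sketched with NumPy as below. This is a minimal illustration and not the genome_integration implementation: the ridge penalty `alpha`, the decision to penalize b_E together with the nuisance terms, the centring of the design matrix, and the exclusion of the IVs themselves from the tag set are all assumptions made here, and the toy genotypes are generated ad hoc rather than with HAPGEN2. The calibration of p values is not shown.

```python
import numpy as np

def select_tag_snps(X, iv_idx, lower=0.1, upper=0.99, prune=0.95):
    """Step (2): SNPs with lower < r2 < upper to any IV, greedily pruned so
    that all kept tags have r2 < prune with each other."""
    r2 = np.corrcoef(X, rowvar=False) ** 2
    ivs = {int(i) for i in iv_idx}
    # Assumption: the IVs themselves are never used as tag-SNPs.
    candidates = [j for j in range(X.shape[1])
                  if j not in ivs and any(lower < r2[j, i] < upper for i in ivs)]
    tags = []
    for j in candidates:
        if all(r2[j, t] < prune for t in tags):
            tags.append(j)
    return tags

def mr_link_ridge(y_O, X, beta_hat_E, tag_idx, alpha=1.0):
    """Step (3): ridge solution of Eq. (5); returns (b_E estimate, nuisance)."""
    m_E = np.count_nonzero(beta_hat_E)
    m_T = len(tag_idx)
    design = np.column_stack([X @ beta_hat_E / m_E,
                              X[:, tag_idx] / np.sqrt(m_T)])
    design = design - design.mean(axis=0)      # centre instead of an intercept
    y = y_O - y_O.mean()
    penalty = alpha * np.eye(design.shape[1])  # alpha is a hypothetical choice
    coef = np.linalg.solve(design.T @ design + penalty, design.T @ y)
    return coef[0], coef[1:]

# Toy data: one IV (column 0), three SNPs in LD with it, five unlinked SNPs.
rng = np.random.default_rng(1)
n = 20000
iv = rng.binomial(2, 0.4, size=n).astype(float)

def noisy_copy(g):
    """Copy a genotype vector, redrawing 20% of entries to create partial LD."""
    redraw = rng.random(n) < 0.2
    return np.where(redraw, rng.binomial(2, 0.4, size=n), g).astype(float)

X = np.column_stack([iv] + [noisy_copy(iv) for _ in range(3)]
                    + [rng.binomial(2, 0.4, size=n).astype(float) for _ in range(5)])
beta_hat_E = np.zeros(X.shape[1])
beta_hat_E[0] = 0.5                     # stand-in for a GCTA-COJO effect size
y_O = 1.0 * (X @ beta_hat_E) + rng.normal(size=n)   # true b_E = 1, no pleiotropy

tags = select_tag_snps(X, [0])
b_E_hat, nuisance = mr_link_ridge(y_O, X, beta_hat_E, tags)
```

In this toy setting the tag-SNPs carry no pleiotropic signal, so the nuisance coefficients should be close to zero and `b_E_hat` close to the simulated value of 1.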

Because individual-level data of the outcome is modeled by MR-link, MR-link does not use any summary statistics of the outcome.

We also considered solving Eq. (5) using ordinary least squares (OLS). However, due to the multicollinear nature of the $\left( {\begin{array}{*{20}{c}} \vdots & \vdots \\ {\frac{{{\mathbf{X}}\widehat {\mathbf{\beta }}_{\mathrm{E}}}}{{m_{\mathrm{E}}}}} & {\frac{{{\mathbf{X}}_{\mathrm{T}}}}{{\surd m_{\mathrm{T}}}}} \\ \vdots & \vdots \end{array}} \right)$ matrix, this approach leads to very low detection power (Supplementary Figs. 6–9; Supplementary Data 2–4, 7, and Supplementary Note 1). We therefore applied ridge regression to solve the equation and determined a T statistic and subsequent two-sided Wald test p value for ridge regression⁶⁹. Due to the over-conservative nature of the resulting p value in simulations and real data (Supplementary Figs. 6–8, 10; Supplementary Data 2–4, 7, and Supplementary Note 1), we calibrated the p value distribution of each scenario by fitting a beta distribution to null estimates to derive the final p values (Supplementary Note 1). When we report results for MR-link, it is these calibrated p values that we are referring to.

Mendelian randomization analyses

Causal relationships were estimated with MR-link and four other existing methods: Inverse variance weighting (IVW)²⁸, LDA-MR-Egger regression¹⁷, MR-Egger regression¹⁸, and MR-PRESSO¹⁹. All methods were (re-)implemented in Python and verified to give results equal to those of their original implementations. The corresponding code is available at https://github.com/adriaan-vd-graaf/genome_integration.

The IVW method is a weighted meta-analysis of causal estimates from single IVs. Specifically, a causal estimate b′_i for an IV i is estimated as $b_{i}^\prime = \frac{{\beta _{{\mathrm{O}},i}^\prime }}{{\beta _{{\mathrm{E}},i}^\prime }}$, where β′_O,i is the marginal effect of SNP i on the outcome and β′_E,i is the marginal effect of SNP i on the exposure. For the estimation of the causal effect, single-IV causal estimates are combined using weights proportional to the inverse variance of these estimates, using the two-term definition of the standard error: $se\left( {b_i^\prime } \right) = \sqrt {\frac{{se\left( {\beta _{{\mathrm{O}},i}^\prime } \right)^2}}{{\beta _{{\mathrm{E}},i}^{\prime 2}}} + \frac{{\beta _{{\mathrm{O}},i}^{\prime 2}se\left( {\beta _{{\mathrm{E}},i}^\prime } \right)^2}}{{\beta _{{\mathrm{E}},i}^{\prime 4}}}}$, following Burgess and Thompson⁷⁰.
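A compact sketch of the IVW calculation is given below. It uses the standard Wald ratio (SNP-outcome effect divided by SNP-exposure effect) and the two-term delta-method standard error; the function names are hypothetical and this is not the genome_integration code.

```python
from math import sqrt

def wald_ratio(beta_E, se_E, beta_O, se_O):
    """Single-IV causal estimate and its two-term (delta method) standard error."""
    estimate = beta_O / beta_E
    se = sqrt(se_O ** 2 / beta_E ** 2 + beta_O ** 2 * se_E ** 2 / beta_E ** 4)
    return estimate, se

def ivw(beta_E, se_E, beta_O, se_O):
    """Inverse-variance weighted combination of the single-IV estimates."""
    ratios = [wald_ratio(*snp) for snp in zip(beta_E, se_E, beta_O, se_O)]
    weights = [1.0 / se ** 2 for _, se in ratios]
    estimate = sum(w * b for w, (b, _) in zip(weights, ratios)) / sum(weights)
    return estimate, sqrt(1.0 / sum(weights))

# Two IVs whose Wald ratios are both 0.5.
ivw_estimate, ivw_se = ivw(beta_E=[0.2, 0.4], se_E=[0.02, 0.02],
                           beta_O=[0.1, 0.2], se_O=[0.03, 0.03])
```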

MR-Egger regression adjusts for average pleiotropy by fitting a weighted linear regression between the exposure SNP-effects and the outcome SNP-effects¹⁸. It assumes that the pleiotropic effects of the variants are independent of their effects on the exposure (the InSIDE assumption). MR-Egger can be applied when three or more instruments are available.
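The regression at the core of MR-Egger can be sketched as a weighted least-squares fit with an intercept. This is an illustrative sketch only: weighting by the inverse variance of the SNP-outcome effects and orienting each SNP to a positive exposure effect follow the usual description of the method, the function name is hypothetical, and standard errors of the estimates are not computed.

```python
def mr_egger(beta_E, beta_O, se_O):
    """Weighted regression of outcome effects on exposure effects.

    Returns (slope, intercept): the slope is the causal estimate and the
    intercept captures average directional pleiotropy.
    """
    if len(beta_E) < 3:
        raise ValueError("MR-Egger needs three or more instruments")
    # Orient every SNP so that its effect on the exposure is positive.
    x = [abs(e) for e in beta_E]
    y = [o if e > 0 else -o for e, o in zip(beta_E, beta_O)]
    w = [1.0 / s ** 2 for s in se_O]
    sw = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / sw
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Toy effects lying exactly on the line y = 0.01 + 0.5 x after orientation.
egger_slope, egger_intercept = mr_egger(beta_E=[0.1, 0.2, -0.3],
                                        beta_O=[0.06, 0.11, -0.16],
                                        se_O=[0.01, 0.01, 0.01])
```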

LDA-MR-Egger is similar to MR-Egger but also accounts for LD between the IVs. LDA-MR-Egger can only be used when LD information between the IVs is available^17,18.

MR-PRESSO is a method of causal inference that implements an approach to identify and remove outliers from the IVW framework¹⁹. It assumes that <50% of the variants have a pleiotropic effect. MR-PRESSO is unable to adjust for the presence of pleiotropy if fewer than three IVs are available, or if fewer than two IVs are left after outlier correction.

We applied these four methods to both simulated and real data. For real data, we used the LDL-C full GWAS summary statistics derived from the association carried out in the Lifelines study, as described above.

Prior to MR analyses, for each IV, we selected the allele with a positive effect on the exposure as the effect allele.

Colocalization analyses

We ran colocalization analyses on the simulated data using the R package coloc v4, git commit 6f3cbb1e5e90f07de772339d6e4af362140affc3, specifically its coloc.abf() function for the original coloc functionality and the coloc.signals() function for the masked (coloc-masked) and conditional (coloc-cond) estimates^29,32. As input, we used marginal effect sizes, standard errors, and MAFs that were calculated separately for the exposure and the outcome. The LD for the conditional and masked coloc analyses was derived from the simulated reference cohort. For original coloc, we used the posterior probability of hypothesis H4 (PP4) reported by the coloc.abf() function as our result metric, which is the posterior probability that the two traits being tested share a causal variant. For the coloc-cond and coloc-masked results, we used the maximum PP4 reported by the coloc.signals() function, as this represents the largest posterior probability that a causal variant is shared between traits. We compared the discriminative ability of all coloc variations with that of MR-link using (i) false-positive rate and power when using PP4 > 0.9 to declare colocalization and (ii) the area under the curve (AUC) of the receiver operating characteristic (ROC) curve, where scenarios with b_E = 0 (null causal effect of the exposure) were considered true negative observations and scenarios with $b_{\mathrm{E}} \ne 0$ were considered true positive observations. We determined the AUC using the sklearn library⁷¹.
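The AUC comparison can be illustrated with a dependency-free sketch. The analysis in the paper used the sklearn library; the pure-Python function below computes the same quantity through the rank (Mann-Whitney) formulation and is only meant to show how runs are labeled and scored. The function name and the toy scores are hypothetical.

```python
def roc_auc(scores, is_positive):
    """AUC as the probability that a positive run scores higher than a
    negative run, counting ties as one half."""
    positives = [s for s, flag in zip(scores, is_positive) if flag]
    negatives = [s for s, flag in zip(scores, is_positive) if not flag]
    wins = sum((p > q) + 0.5 * (p == q) for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))

# Toy runs: simulated causal effect per run and a score such as PP4.
b_E_per_run = [0.0, 0.0, 0.1, 0.4]
pp4_per_run = [0.1, 0.6, 0.5, 0.9]
labels = [b != 0 for b in b_E_per_run]   # b_E = 0 is a true negative
auc_value = roc_auc(pp4_per_run, labels)
```

For MR-link, a score that increases with evidence for causality, for example one minus the calibrated p value, could be passed in place of PP4.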

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

Individual-level data of Lifelines cohorts are available to all bona-fide researchers upon request to the Lifelines biobank (https://www.lifelines.nl/researcher). Individual-level data (genotypes and RNA-seq data) of the BIOS Consortium cohorts can be downloaded by researchers of Dutch Institutes, or analyzed (but not downloaded) by any non-Dutch researcher in a Cloud environment (https://www.bbmri.nl/acquisition-use-analyze/bios). GTEx summary statistics can be downloaded from the GTEx website (https://gtexportal.org/home/datasets). pQTLs summary statistics can be downloaded from GWAS Catalog (ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/SunBB_29875488_GCST005806/). Simulated data can be recreated using the code at the link provided in Code availability statement. Imputation with the Haplotype Reference Consortium dataset can be done at the following link: https://imputationserver.sph.umich.edu/index.html#!. Raw data used to draw Fig. 3 can be found in Supplementary Data 3.

Code availability

An implementation of MR-link, the methods to recreate the simulated data, and instructions on usage can be found at https://github.com/adriaan-vd-graaf/genome_integration. This repository also includes implementation of the other MR methods and the coloc method used in this paper.

References

Burgess, S., Foley, C. N. & Zuber, V. Inferring causal relationships between risk factors and outcomes from genome-wide association study data. Annu. Rev. Genomics Hum. Genet. 19, 303–327 (2018).
Pingault, J. B. et al. Using genetic data to strengthen causal inference in observational research. Nat. Rev. Genet. 19, 566–580 (2018).
Evans, D. M. & Davey Smith, G. Mendelian randomization: new applications in the coming age of hypothesis-free causality. Annu. Rev. Genomics Hum. Genet. 16, 327–350 (2015).
Ference, B. A. et al. Effect of long-term exposure to lower low-density lipoprotein cholesterol beginning early in life on the risk of coronary heart disease: a Mendelian randomization analysis. Ration. Pharmacother. Cardiol. 9, 90–98 (2013).
Ference, B. A. et al. Association of genetic variants related to CETP inhibitors and statins with lipoprotein levels and cardiovascular risk. JAMA - J. Am. Med. Assoc. 318, 947–956 (2017).
Voight, B. F. et al. Plasma HDL cholesterol and risk of myocardial infarction: a mendelian randomisation study. Lancet 380, 572–580 (2012).
Zhu, Z. et al. Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nat. Genet. 48, 481–487 (2016).
Gusev, A. et al. Integrative approaches for large-scale transcriptome-wide association studies. Nat. Genet. 48, 245–252 (2016).
Luijk, R. et al. Genome-wide identification of directed gene networks using large-scale population genomics data. Nat. Commun. 9, 3097 (2018).
Gamazon, E. R. et al. A gene-based association method for mapping traits using reference transcriptome data. Nat. Genet. 47, 1091–1098 (2015).
Li, Y. I. et al. RNA splicing is a primary link between genetic variation and disease. Science 352, 600–604 (2016).
Dobbyn, A. et al. Landscape of conditional eQTL in dorsolateral prefrontal cortex and co-localization with schizophrenia GWAS. Am. J. Hum. Genet. 102, 1169–1184 (2018).
Zhernakova, D. V. et al. Identification of context-dependent expression quantitative trait loci in whole blood. Nat. Genet. 49, 139–145 (2017).
Boyle, E. A., Li, Y. I. & Pritchard, J. K. An expanded view of complex traits: from polygenic to omnigenic. Cell 169, 1177–1186 (2017).
Liu, B., Gloudemans, M. J., Rao, A. S., Ingelsson, E. & Montgomery, S. B. Abundant associations with gene expression complicate GWAS follow-up. Nat. Genet. 51, 768–769 (2019).
Liu, X., Li, Y. I. & Pritchard, J. K. Trans effects on gene expression can drive omnigenic inheritance. Cell 177, 1022–1034.e6 (2019).
Barfield, R. et al. Transcriptome-wide association studies accounting for colocalization using Egger regression. Genet. Epidemiol. 42, 418–433 (2018).
Bowden, J., Smith, G. D. & Burgess, S. Mendelian randomization with invalid instruments: Effect estimation and bias detection through Egger regression. Int. J. Epidemiol. 44, 512–525 (2015).
Verbanck, M., Chen, C. Y., Neale, B. & Do, R. Detection of widespread horizontal pleiotropy in causal relationships inferred from Mendelian randomization between complex traits and diseases. Nat. Genet. 50, 693–698 (2018).
Zhu, Z. et al. Causal associations between risk factors and common diseases inferred from GWAS summary data. Nat. Commun. 9, 224 (2018).
Berzuini, C., Guo, H., Burgess, S. & Bernardinelli, L. A Bayesian approach to Mendelian randomization with multiple pleiotropic variants. Biostatistics 21, 86–101 (2020).
Burgess, S. & Thompson, S. G. Multivariable Mendelian randomization: the use of pleiotropic genetic variants to estimate causal effects. Am. J. Epidemiol. 181, 251–260 (2015).
Porcu, E. et al. Mendelian randomization integrating GWAS and eQTL data reveals genetic determinants of complex and clinical traits. Nat. Commun. 10, 377267 (2019).
Aguet, F. et al. Genetic effects on gene expression across human tissues. Nature 550, 204–213 (2017).
Sun, B. B. et al. Genomic atlas of the human plasma proteome. Nature 558, 73–79 (2018).
Yang, J. et al. Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits. Nat. Genet. 44, 369–375 (2012).
Benner, C. et al. FINEMAP: efficient variable selection using summary data from genome-wide association studies. Bioinformatics 32, 1493–1501 (2016).
Burgess, S., Butterworth, A. & Thompson, S. G. Mendelian randomization analysis with multiple genetic variants using summarized data. Genet. Epidemiol. 37, 658–665 (2013).
Giambartolomei, C. et al. Bayesian test for colocalisation between pairs of genetic association studies using summary statistics. PLoS Genet. 10, e1004383 (2014).
Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 7 (2015).
Wallace, C. Eliciting priors and relaxing the single causal variant assumption in colocalisation analyses. PLoS Genet. 16, e1008720 (2020).
Scholtens, S. et al. Cohort Profile: LifeLines, a three-generation cohort study and biobank. Int. J. Epidemiol. 44, 1172–1180 (2015).
Ongen, H. et al. Estimating the causal tissues for complex traits and diseases. Nat. Genet. 49, 1676–1683 (2017).
Willer, C. J. et al. Discovery and refinement of loci associated with lipid levels. Nat. Genet. 45, 1274–1285 (2013).
Klarin, D. et al. Genetics of blood lipids among ~300,000 multi-ethnic participants of the Million Veteran Program. Nat. Genet. 50, 1514–1523 (2018).
Ander, B. P., Dupasquier, C. M. C., Prociuk, M. A. & Pierce, G. N. Polyunsaturated fatty acids and their effects on cardiovascular disease. Exp. Clin. Cardiol. 8, 164–172 (2003).
Lemaitre, R. N. et al. Genetic loci associated with plasma phospholipid N-3 fatty acids: a meta-analysis of genome-wide association studies from the charge consortium. PLoS Genet. 7, e1002193 (2011).
Barchetta, I. et al. Neurotensin is a lipid-induced gastrointestinal peptide associated with visceral adipose tissue inflammation in obesity. Nutrients 10, 526 (2018).
Earnest, C. P., Jordan, A. N., Safir, M., Weaver, E. & Church, T. S. Cholesterol-lowering effects of bovine serum immunoglobulin in participants with mild hypercholesterolemia. Am. J. Clin. Nutr. 81, 792–798 (2005).
Kjolby, M. et al. Sort1, encoded by the cardiovascular risk locus 1p13.3, is a regulator of hepatic lipoprotein export. Cell Metab. 12, 213–223 (2010).
Musunuru, K. et al. From noncoding variant to phenotype via SORT1 at the 1p13 cholesterol locus. Nature 466, 714–719 (2010).
Wang, X. et al. Interrogation of the atherosclerosis-associated SORT1 (sortilin 1) locus with primary human hepatocytes, induced pluripotent stem cell-hepatocytes, and locus-humanized mice. Arterioscler. Thromb. Vasc. Biol. 38, 76–82 (2018).
Phillips, M. C. Apolipoprotein E isoforms and lipoprotein metabolism. IUBMB Life 66, 616–623 (2014).
Erbilgin, A. et al. Gene expression analyses of mouse aortic endothelium in response to atherogenic stimuli. Arterioscler. Thromb. Vasc. Biol. 33, 2509–2517 (2013).
Rossignoli, A. et al. Poliovirus receptor-related 2: a cholesterol-responsive gene affecting atherosclerosis development by modulating leukocyte migration. Arterioscler. Thromb. Vasc. Biol. 37, 534–542 (2017).
Skogsberg, J. et al. Transcriptional profiling uncovers a network of cholesterol-responsive atherosclerosis target genes. PLoS Genet. 4, e1000036 (2008).
Blattmann, P., Schuberth, C., Pepperkok, R. & Runz, H. RNAi-based functional profiling of loci from blood lipid genome-wide association studies identifies genes with cholesterol-regulatory function. PLoS Genet. 9, e1003338 (2013).
Candia, J. et al. Assessment of variability in the SOMAscan assay. Sci. Rep. 7, 1–13 (2017).
Klop, B. et al. Erythrocyte-bound apolipoprotein B in relation to atherosclerosis, serum lipids and ABO blood group. PLoS ONE 8, e75573 (2013).
McLachlan, S. et al. Replication and characterization of association between ABO SNPs and red blood cell traits by meta-analysis in Europeans. PLoS ONE 11, e0156914 (2016).
Gaziano, J. M. et al. Million Veteran Program: a mega-biobank to study genetic influences on health and disease. J. Clin. Epidemiol. 70, 214–223 (2016).
Leitsalu, L. et al. Cohort profile: Estonian Biobank of the Estonian Genome Center, University of Tartu. Int. J. Epidemiol. 44, 1137–1147 (2015).
Sudlow, C. et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015).
Tigchelaar, E. F. et al. Cohort profile: LifeLines DEEP, a prospective, general population cohort study in the northern Netherlands: study design and baseline characteristics. BMJ Open 5, e006772 (2015).
Huisman, M. H. B. et al. Population based epidemiology of amyotrophic lateral sclerosis using capture-recapture methodology. J. Neurol. Neurosurg. Psychiatry 82, 1165–1170 (2011).
Deelen, J. et al. Employing biomarkers of healthy ageing for leveraging genetic studies into human longevity. Exp. Gerontol. 82, 166–174 (2016).
Lin, B. D. et al. The genetic overlap between hair and eye color. Twin Res. Hum. Genet. 19, 595–599 (2016).
van Greevenbroek, M. M. J. et al. The cross-sectional association between insulin resistance and circulating complement C3 is partly explained by plasma alanine aminotransferase, independent of central obesity and general inflammation (the CODAM study). Eur. J. Clin. Invest. 41, 372–379 (2011).
Hofman, A. et al. The Rotterdam Study: 2016 objectives and design update. Eur. J. Epidemiol. 30, 661–708 (2015).
McCarthy, S. et al. A reference panel of 64,976 haplotypes for genotype imputation. Nat. Genet. 48, 1279–1283 (2016).
Das, S. et al. Next-generation genotype imputation service and methods. Nat. Genet. 48, 1284–1287 (2016).
Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
Boomsma, D. I. et al. The genome of the Netherlands: design, and project goals. Eur. J. Hum. Genet. 22, 221–227 (2014).
Anders, S., Pyl, P. T. & Huber, W. HTSeq-A Python framework to work with high-throughput sequencing data. Bioinformatics 31, 166–169 (2015).
Howie, B., Fuchsberger, C., Stephens, M., Marchini, J. & Abecasis, G. R. Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat. Genet. 44, 955–959 (2012).
Friedewald, W. T., Levy, R. I. & Fredrickson, D. S. Estimation of the concentration of low-density lipoprotein cholesterol in plasma, without use of the preparative ultracentrifuge. Clin. Chem. 18, 499–502 (1972).
Su, Z., Marchini, J. & Donnelly, P. HAPGEN2: simulation of multiple disease SNPs. Bioinformatics 27, 2304–2305 (2011).
Cule, E., Vineis, P. & De Iorio, M. Significance testing in ridge regression for genetic data. BMC Bioinforma. 12, 372 (2011).
Burgess, S. & Thompson, S. G. Mendelian Randomization: Methods for Using Genetic Variants in Causal Estimation. https://doi.org/10.1201/b18084 (CRC Press, 2015).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Hoffmann, T. J. et al. A large electronic-health-record-based genome-wide study of serum lipids. Nat. Genet. 50, 401–413 (2018).
Maisse, C. et al. Lipid raft localization and palmitoylation: Identification of two requirements for cell death induction by the tumor suppressors UNC5H. Exp. Cell Res. 314, 2544–2552 (2008).
Falk, J. et al. Functional mutation analysis provides evidence for a role of REEP1 in lipid droplet biology. Hum. Mutat. 35, 497–504 (2014).
Veniaminova, N. A. et al. Niche-specific factors dynamically regulate sebaceous gland stem cells in the skin. Dev. Cell 51, 326–340 (2019).
Sugiura-Ogasawara, M. et al. The first genome-wide association study identifying new susceptibility loci for obstetric antiphospholipid syndrome. J. Hum. Genet. 62, 831–838 (2017).
Li, W. et al. DEPP/DEPP1/C10ORF10 regulates hepatic glucose and fat metabolism partly via ROS-induced FGF21. FASEB J. 32, 5459–5469 (2018).

Acknowledgements

We are very grateful for the altruistic donation of biological materials and questionnaire data by our generous study participants; without them, this study would not have been possible. In addition, we thank the UMCG Genomics Coordination center and the UG Center for Information Technology, and their sponsors BBMRI-NL & TarGet, for storage and computing infrastructure. We thank BBMRI-NL for providing the transcriptome and genotype data for the BIOS cohort. We thank P. Visscher, N. Wray, J. Yang, E. Lopera-Maya, O. Bakker, and N. de Klein for valuable advice during the development and writing of this work. We also thank K. McIntyre for editorial assistance and C. Benner for support on running FINEMAP. This work is financed by the Netherlands Organization for Scientific Research (NWO): NWO Spinoza Prize SPI 92-266 (to C.W.), by Fondation Lefoulon-Delalande (to A.R.), and by Radboud University Medical Centre Hypatia Grant 2018 (to Y.L.).

Author information

These authors jointly supervised this work: Yang Li, Cisca Wijmenga, Serena Sanna.

Authors and Affiliations

University of Groningen, University Medical Centre Groningen, Department of Genetics, Antonius Deusinglaan 1, 9713 AV, Groningen, The Netherlands
Adriaan van der Graaf, Annique Claringbould, Lude Franke, Harm-Jan Westra, Yang Li, Cisca Wijmenga & Serena Sanna
Oncode Institute, Office Jaarbeurs Innovation Mile (JIM), Jaarbeursplein 6, 3521 AL, Utrecht, The Netherlands
Annique Claringbould, Lude Franke & Harm-Jan Westra
University of Groningen, University Medical Centre Groningen, Department of Pediatrics, Section Molecular Genetics, Antonius Deusinglaan 1, 9713 AV, Groningen, The Netherlands
Antoine Rimbert
Université de Nantes, CNRS, INSERM, l’institut du thorax, F-44000, Nantes, France
Antoine Rimbert
Department of Computational Biology for Individualised Infection Medicine, Centre for Individualised Infection Medicine (CiiM) & TWINCORE, joint ventures between the Helmholtz-Centre for Infection Research (HZI) and the Hannover Medical School (MHH), 30625, Hannover, Germany
Yang Li
Department of Internal Medicine and Radboud Center for Infectious Diseases, Radboud University Medical Center, 6525 HP, Nijmegen, The Netherlands
Yang Li
Istituto di Ricerca Genetica e Biomedica (IRGB), Consiglio Nazionale delle Ricerche (CNR), Cittadella Universitaria di Monserrato, 09042, Monserrato, Italy
Serena Sanna
Molecular Epidemiology, Department of Biomedical Data Sciences, Leiden University Medical Center, Einthovenweg 20, 2333 ZC, Leiden, The Netherlands
Bastiaan T. Heijmans
Department of Human Genetics, Leiden University Medical Center, Einthovenweg 20, 2333 ZC, Leiden, The Netherlands
Peter A. C. ’t Hoen
Department of Internal Medicine, Erasmus MC, Dr. Molewaterplein 40, 3015 GD, Rotterdam, The Netherlands
Joyce B. J. van Meurs
Department of Psychiatry, VU University Medical Center, Neuroscience Campus Amsterdam, De Boelelaan 1118, 1081 HV, Amsterdam, The Netherlands
Rick Jansen

Authors

Adriaan van der Graaf
Annique Claringbould
Antoine Rimbert
Harm-Jan Westra
Yang Li
Cisca Wijmenga
Serena Sanna

Consortia

BIOS Consortium

Bastiaan T. Heijmans, Peter A. C. ’t Hoen, Joyce B. J. van Meurs, Rick Jansen & Lude Franke

Contributions

A.v.d.G. conceived and designed MR-link with critical input from S.S.; A.v.d.G. performed all simulations and data analyses on the datasets used in this study; A.v.d.G. and A.C. performed quality control analyses on the BIOS and Lifelines cohorts; B.C. and C.W. provided access to the datasets used in this study; A.v.d.G. and S.S. wrote the paper with critical inputs from Y.L., H.J.W., A.R., and A.C.; Y.L., S.S., and C.W. supervised the study; C.W. provided funding for this study. All authors read and approved the paper.

Corresponding author

Correspondence to Serena Sanna.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Communications thanks Carlo Berzuini, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Peer Review

Reporting Summary

Description of Additional Supplementary Files

Supplementary Data 1

Supplementary Data 2

Supplementary Data 3

Supplementary Data 4

Supplementary Data 5

Supplementary Data 6

Supplementary Data 7

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

van der Graaf, A., Claringbould, A., Rimbert, A. et al. Mendelian randomization while jointly modeling cis genetics identifies causal relationships between gene expression and lipids. Nat Commun 11, 4930 (2020). https://doi.org/10.1038/s41467-020-18716-x

Received: 09 July 2019
Accepted: 08 September 2020
Published: 01 October 2020
DOI: https://doi.org/10.1038/s41467-020-18716-x
