Abstract
Mendelian randomization (MR) uses genetic variants as instrumental variables (IVs) to investigate causal relationships between traits. Unlike conventional MR, cisMR focuses on a single genomic region using only cisSNPs. For example, using cispQTLs for a protein as exposure for a disease opens a cost-effective path for drug target discovery. However, few methods effectively handle pleiotropy and linkage disequilibrium (LD) of cisSNPs. Here, we propose cisMRcML, a method based on constrained maximum likelihood, robust to IV assumption violations with strong theoretical support. We further clarify the severe but largely neglected consequences of the current practice of modeling marginal, instead of conditional, genetic effects and of using only exposure-associated SNPs in cisMR analysis. Numerical studies demonstrated our method’s superiority over other existing methods. In a drug-target analysis for coronary artery disease (CAD), including a proteome-wide application, we identified three potential drug targets for CAD: PCSK9, COLEC11 and FGFR1.
Introduction
Mendelian randomization is a widely used approach that uses genetic variants as instrumental variables (IVs) to infer the causal relationship between a pair of traits, one called the exposure and the other the outcome. Since genetic variants are randomly allocated and fixed at conception, MR reduces the risk of confounding and reverse causation that affects observational data^{1,2}. Within the IV regression framework, MR requires three IV assumptions for valid inference: the IVs must be (1) associated with the exposure; (2) independent of any confounders of the exposure-outcome relationship; (3) not associated with the outcome conditional on the exposure and confounders. Subject to these assumptions, MR can provide evidence of a (putative) causal relationship between the exposure and the outcome, and the inverse variance weighting (IVW) method^{3} can be applied. However, only the first IV assumption can be tested, and it is relatively easy to satisfy in practice by using genome-wide significant SNPs associated with the exposure; in contrast, the second and third assumptions cannot be tested empirically and are likely to be violated due to the presence of widespread (horizontal) pleiotropy. Numerous MR methods have been proposed for the likely presence of horizontal pleiotropy^{4,5,6,7,8}, but most of them require the use of independent IVs, as in most MR analyses.
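As a point of reference for the IVW method mentioned above, the following is a minimal sketch of the standard fixed-effect IVW estimator for independent IVs, written in Python. The function name `ivw_estimate` and the toy summary statistics are hypothetical and chosen only for illustration; the sketch assumes that all IVs are valid and that the SNP-exposure estimates are measured without error.

```python
import math

def ivw_estimate(beta_x, beta_y, se_y):
    """Fixed-effect IVW estimate of the causal effect from summary statistics.

    beta_x: SNP-exposure effect estimates
    beta_y: SNP-outcome effect estimates
    se_y:   standard errors of the SNP-outcome estimates
    Assumes independent SNPs that are all valid IVs.
    """
    # Each SNP contributes a ratio estimate beta_y / beta_x with
    # inverse-variance weight beta_x^2 / se_y^2.
    weights = [bx ** 2 / sy ** 2 for bx, sy in zip(beta_x, se_y)]
    ratios = [by / bx for bx, by in zip(beta_x, beta_y)]
    theta = sum(w * r for w, r in zip(weights, ratios)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return theta, se

# Hypothetical toy data: outcome effects are 0.5 times the exposure effects.
beta_x = [0.10, 0.20, 0.15]
beta_y = [0.05, 0.10, 0.075]
se_y = [0.01, 0.01, 0.01]
theta_hat, se_hat = ivw_estimate(beta_x, beta_y, se_y)
print(theta_hat, se_hat)
```

With a single SNP the same formula reduces to the ratio of the SNP-outcome estimate to the SNP-exposure estimate.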
Meanwhile, there has been a growing interest in MR studies focusing on a small genomic region using some local and correlated cisSNPs as IVs, known as cisMR. One of the most promising applications of cisMR is for drug target discovery, including drug target prioritization, validation or repositioning^{9,10,11}. A drug-target MR analysis uses a protein (as a potential drug target) or its downstream biomarker as the exposure, and corresponding cisSNPs of the gene encoding the protein as IVs. Despite the significance of such an analysis, it still depends crucially on the three valid IV assumptions. While using proteins as exposures makes it less likely to violate the assumption of no horizontal pleiotropy, as proteins are causally upstream of many common risk factors used in traditional/polygenic MR^{9,12}, different biological mechanisms may still exist even among the cisSNPs in the same gene/protein region. For example, genetic variation in a transcription factor (TF)-binding site will potentially influence the binding affinity or efficiency of the TF, which may subsequently affect the production of the associated RNA and proteins; given that most genes have multiple potential TF-binding sites, genetic variations in the cisregion of a gene may involve distinct biological mechanisms^{13}. One can first perform linkage disequilibrium (LD) clumping to obtain some (approximately) independent IVs before applying one or more of the existing robust MR methods based on independent IVs. However, this may lead to a severe loss of power because only one or a few independent SNPs remain^{12}; in fact, with only one or two SNPs, many robust MR methods cannot be applied. Finally and more importantly, as will be shown in subsequent numerical studies, using only independent SNPs is highly likely to result in the absence of a valid IV in the analysis due to their correlations with other SNPs in the region. As an alternative, we would rather use multiple correlated IVs in cisMR.
However, only a few cisMR methods are robust to the violation of the IV assumptions. Perhaps the most widely used cisMR method is the generalized MRIVW^{14}, which uses generalized linear regression to account for LD (among correlated SNPs) but assumes that all IVs are valid. Similarly, a generalized version of MREgger^{15} and another closely related method, LDAEgger^{16}, require a stringent (so-called InSIDE) assumption on the relationship between the unknown IV strengths and pleiotropic effects; furthermore, more generally, MREgger has low power and is sensitive to the coding of the SNPs^{17}. There are several recently proposed methods to account for both LD and horizontal pleiotropy, such as MRLDP^{18}, MRCorr2^{19}, MRCUE^{20}, MRAID^{21} and RBMR^{22}. All these methods impose different modeling assumptions on the distribution of the latent/hidden pleiotropic effects, and some can handle only correlated pleiotropy or only uncorrelated pleiotropy, but not both. Furthermore, these methods were proposed in the context of using a complex trait as the exposure (or in a polygenic MR setup), including a relatively large number of SNPs across the whole genome as IVs, and may not be the most suitable for cisMR applications, as will be demonstrated later in our simulation studies.
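To make the idea of a generalized (LD-aware) IVW regression concrete, here is a minimal Python sketch of the usual weighted generalized least squares form, in which the covariance of the SNP-outcome estimates is taken as Ω with entries se_i × se_j × ρ_ij. The helper names (`solve`, `generalized_ivw`) and the toy inputs are hypothetical, and the sketch assumes all IVs are valid, as the method itself does.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def generalized_ivw(beta_x, beta_y, se_y, ld):
    """Generalized IVW estimate using marginal estimates and an LD matrix."""
    n = len(beta_x)
    # Omega[i][j] = se_y[i] * se_y[j] * ld[i][j]
    omega = [[se_y[i] * se_y[j] * ld[i][j] for j in range(n)] for i in range(n)]
    u = solve(omega, beta_y)  # Omega^{-1} beta_y
    v = solve(omega, beta_x)  # Omega^{-1} beta_x
    denom = sum(bx * vi for bx, vi in zip(beta_x, v))
    theta = sum(bx * ui for bx, ui in zip(beta_x, u)) / denom
    return theta, math.sqrt(1.0 / denom)

# Hypothetical toy data: three correlated SNPs, true effect 0.3, no pleiotropy.
ld = [[1.0, 0.6, 0.36], [0.6, 1.0, 0.6], [0.36, 0.6, 1.0]]
beta_x = [0.10, 0.08, 0.12]
beta_y = [0.3 * b for b in beta_x]
se_y = [0.010, 0.012, 0.011]
theta_hat, se_hat = generalized_ivw(beta_x, beta_y, se_y, ld)
print(theta_hat, se_hat)
```

When the LD matrix is the identity, this reduces to the ordinary IVW estimator for independent IVs.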
In this work, we propose a robust cisMR method called cisMRcML, extending MRcML^{7} to allow for correlated SNPs as IVs. Like its predecessor with independent IVs, cisMRcML is robust to violation of any one, two or all three IV assumptions, imposing minimal modeling assumptions with strong theoretical support. Furthermore, we clarify two main differences between cisMRcML and MRcML. First, in cisMRcML we model conditional/joint SNP effects, instead of marginal effects as directly taken from GWAS summary data. Second, when selecting SNPs as IVs in the analysis, we include not only SNPs associated with the exposure, but also those associated with the outcome. These two differences are important: because the SNPs are correlated, ignoring either one may result in all of the IVs used being invalid. These two differences have been largely neglected in the literature, but have severe consequences for any extension of other robust MR methods to cisMR with correlated IVs, such as the median-based^{4} and mode-based^{23} MR methods. We show the robustness of the proposed method to the presence of invalid IVs in simulation studies, and illustrate the severe consequence of using only SNPs that are (conditionally) associated with the exposure. Lastly, we demonstrate the effectiveness of the proposed method in two real data applications for drug target discovery for coronary artery disease (CAD). In the first application, we use downstream biomarkers to serve as a proxy for the perturbation of a drug target, while in the second one, we perform a proteome-wide analysis to identify some proteins as potential drug targets for CAD.
Results
Overview of the proposed cisMRcML method
We propose cisMRcML to estimate the causal effect of an exposure (e.g. a gene or a protein) on an outcome in the possible presence of invalid IVs using publicly available GWAS summary data. It is an important extension of MRcML^{7} that allows for correlated SNPs as IVs, as encountered with a molecular exposure such as a gene or a protein. By allowing for the use of correlated IVs, cisMRcML is suitable for cisMR analysis, whereas it may not be feasible to have at least three independent IVs within a cisregion for the use of MRcML. Additionally, it has two key distinctions from MRcML that enhance its robustness to invalid IVs. First, as depicted in the general causal model (Fig. 1), cisMRcML models the conditional effects between SNPs and the exposure, and those between SNPs and the outcome, which differs from the conventional approach of modeling marginal GWAS estimates in MRcML and other MR methods. This distinction significantly mitigates the risk of introducing additional (and unnecessary) horizontal pleiotropy when dealing with correlated SNPs in cisMR (see Modeling conditional effects versus marginal effects). Second, unlike the usual practice of only using SNPs relevant to the exposure, cisMRcML uses variants that are jointly associated with either the exposure or the outcome as IVs, i.e., variants in \(\mathcal{I}_{X}\cup \mathcal{I}_{Y}\). We use a conditional and joint association analysis called GCTA-COJO^{24} to select these variants. Properly accounting for outcome-associated SNPs further helps to avoid additional horizontal pleiotropy (see Selection of genetic variants as IVs in cisMRcML). Although these two essential considerations proposed in cisMRcML may seem statistically elementary, they are often overlooked in current cisMR applications. 
For example, the two most widely used cisMR methods, generalized IVW and generalized Egger (see Methods), directly model the marginal GWAS estimates, and two recent drug-target applications (e.g. Zhao et al.^{10}; Zheng et al.^{25}) applied them after selecting conditionally independent pQTLs.
Once the genetic variants are chosen as candidate IVs, the LD matrix among these variants can be estimated using a publicly available reference panel. The marginal estimates from GWAS summary data are then converted into conditional GWAS estimates. Then, under the two-sample MR framework, cisMRcML is implemented in a maximum likelihood framework under a constraint on the number of invalid IVs with horizontal (correlated and/or uncorrelated) pleiotropy; the number of invalid IVs is selected consistently by the Bayesian Information Criterion (BIC). In short, cisMRcML selects valid IVs from the candidate set \(\mathcal{I}_{X}\cup \mathcal{I}_{Y}\) to infer the causal relationship from X to Y. We establish statistical theory to demonstrate some desirable properties of cisMRcML, such as its estimation consistency and asymptotic normality in the presence of invalid IVs with correlated or uncorrelated pleiotropy. Finally, we also implement a data perturbation (DP) approach to account for the uncertainty in model selection (see Methods).
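The conversion from marginal to conditional estimates can be illustrated with a minimal Python sketch. It rests on a simplifying assumption that is not spelled out in this section: genotypes and traits are standardized, so that marginal effects are approximately the LD matrix times the joint effects, and the joint effects are recovered by solving that linear system. The function names and toy values are hypothetical, and the exact transformation used by cisMRcML is described in its Methods rather than here.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def marginal_to_conditional(beta_marginal, ld):
    """Recover joint (conditional) effects b from marginal effects R b."""
    return solve(ld, beta_marginal)

# Hypothetical example: only the first of three correlated SNPs has an effect.
ld = [[1.0, 0.6, 0.36], [0.6, 1.0, 0.6], [0.36, 0.6, 1.0]]
joint_true = [0.2, 0.0, 0.0]
# Marginal effects: every SNP in LD with SNP 1 appears associated.
marginal = [sum(ld[i][j] * joint_true[j] for j in range(3)) for i in range(3)]
recovered = marginal_to_conditional(marginal, ld)
print(marginal)
print(recovered)
```

In this toy example all three marginal effects are non-zero, whereas the conditional effects isolate the single SNP with a direct effect.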
Simulations: cisMRcML outperforms existing methods in the presence of invalid IVs
We conducted two sets of simulation studies to compare our proposed method (with data perturbation unless specified otherwise) and other existing cisMR methods including generalized IVW and Egger (GIVW and GEgger)^{14}, LD-aware Egger^{16} (LEgger) (see Methods), and different implementations of these methods. In the first set of simulation studies, we directly generated GWAS summary statistics for 10 SNPs from an autoregressive LD pattern with a weak correlation (ρ = 0.2), a moderate correlation (ρ = 0.6), or a strong correlation (ρ = 0.8), and considered two scenarios: (1) all 10 SNPs had an effect on the exposure, i.e., \(|\mathcal{I}_{X}|=10\); (2) only half of the SNPs had an effect on the exposure, i.e., \(|\mathcal{I}_{X}|=5\), and \(|\mathcal{I}_{Y}\setminus \mathcal{I}_{X}|=2\). In both scenarios, we varied the number of invalid IVs in \(\mathcal{I}_{X}\cap \mathcal{I}_{Y}\), denoted as K_{1}. We applied several methods: cisMRcML and LEgger with the conditional estimates calculated based on all 10 SNPs; GIVW and GEgger with the marginal GWAS estimates. In scenario (1), we also selected independent IVs (with r^{2} < 0.001) and applied the independent versions of IVW, Egger and MRcML with marginal GWAS estimates. We referred to these implementations as IVWIND, EggerIND and cMLIND. We further applied four polygenic MR methods that can account for LD but were not specifically proposed for cisMR analysis, including MR.LDP, MR.Corr2, MR.CUE and MRAID. In scenario (2), we additionally investigated the performance of different cisMR methods using only SNPs in \(\mathcal{I}_{X}\). Specifically, we applied cisMRcML and LEgger with the conditional estimates calculated only based on the 5 SNPs in \(\mathcal{I}_{X}\); and applied GIVW and GEgger with the GWAS summary data of the 5 SNPs in \(\mathcal{I}_{X}\). 
We referred to these implementations as cisMRcMLX, LEggerX, GIVWX and GEggerX respectively.
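The autoregressive LD setup of scenario (2) can be sketched in Python to show why SNPs in the exposure set can look pleiotropic on the marginal scale. The effect sizes, SNP positions and function names below are hypothetical (the actual simulation parameters are given in the Methods and Supplementary), and the sketch works with noise-free effects on a standardized scale.

```python
def ar1_ld(n_snps, rho):
    """Autoregressive LD matrix with correlation rho^|i - j|."""
    return [[rho ** abs(i - j) for j in range(n_snps)] for i in range(n_snps)]

def marginal_effects(joint, ld):
    """Marginal effects implied by joint effects: b_marg = R b_joint."""
    n = len(joint)
    return [sum(ld[i][j] * joint[j] for j in range(n)) for i in range(n)]

n_snps, rho = 10, 0.6
theta = 0.0  # no causal effect of the exposure on the outcome
ld = ar1_ld(n_snps, rho)

# Hypothetical layout: SNPs 0-4 affect the exposure, SNPs 5-6 affect only
# the outcome, and SNPs 7-9 are irrelevant.
b_x_joint = [0.1] * 5 + [0.0] * 5
direct_y = [0.0] * 5 + [0.1, 0.1] + [0.0] * 3
b_y_joint = [theta * bx + r for bx, r in zip(b_x_joint, direct_y)]

b_y_marginal = marginal_effects(b_y_joint, ld)

# Conditionally, SNPs 0-4 have no effect on the outcome (valid IVs) ...
print(b_y_joint[:5])
# ... but marginally they all pick up the direct effects of SNPs 5-6 via LD.
print(b_y_marginal[:5])
```

Under this layout the five exposure-associated SNPs are valid IVs conditionally, yet each of them carries a non-zero marginal outcome association even though the causal effect is zero.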
In the first scenario, where all 10 IVs had effects on the exposure (\(|\mathcal{I}_{X}|=10\)), representative results are presented in Fig. 2, and full results can be found in Supplementary Section S3.1. Throughout our simulations, type-I error was evaluated at the significance level of 5%. First, when all 10 IVs were valid (Fig. 2A and Supplementary Table S1), all methods yielded well-controlled type-I error rates. cisMRcML (with the data perturbation implementation) was more conservative than other methods in this ideal scenario with no invalid IV, which was similar to what was observed before in MRcML^{7} and MVMRcML^{26}. It is also noted that, even in such an ideal scenario, GEgger had a relatively larger root mean squared error (RMSE) and less precise estimates than the other three cisMR methods, which may be due to the allele orientation step implemented in the method (see Lin et al.^{17} for more discussion on this issue). IVWIND, EggerIND and cMLIND all had lower power than their correlated version counterparts, namely GIVW, GEgger and cisMRcML. In the presence of 4 invalid IVs (Fig. 2B), only cisMRcML could control the type-I error and at the same time maintain high power. Furthermore, it had a much lower RMSE than the other three methods. On the other hand, GIVW, GEgger and LEgger had increasingly inflated type-I errors as the correlations among SNPs increased. The three approaches using independent IVs also had highly inflated type-I errors, as the direct effects of invalid IVs were absorbed in the marginal GWAS effects used in the analysis due to LD. As shown in Supplementary Table S2, the four polygenic MR methods exhibited unstable performance, with either extremely low power, inflated type-I errors, or unsuccessful convergence. Finally, we further investigated the use of marginal GWAS estimates in cisMRcML as shown in Eq. (15) (referred to as cisMRcMLMarg). 
As detailed in Supplementary Section S3.2, cisMRcMLMarg had highly inflated type-I errors due to the violation of the plurality assumption in the marginal model, as discussed in Modeling conditional effects versus marginal effects.
In the second scenario with only 5 SNPs having effects on the exposure, we further examined the performance of the four methods using only the data from these 5 SNPs, as indicated by the suffix ‘X’ in Fig. 3. When K_{1} = 0 (Fig. 3A, Supplementary Table S7), it seemed that all IVs in \(\mathcal{I}_{X}\) were valid. However, due to their correlations with those in \(\mathcal{I}_{Y}\), they absorbed the direct effects of the SNPs in \(\mathcal{I}_{Y}\) on the outcome if we failed to include the SNPs in \(\mathcal{I}_{Y}\). Therefore, all the IVs in \(\mathcal{I}_{X}\) became invalid and the plurality condition was violated in cisMRcMLX, which yielded highly inflated type-I errors. Similarly, GIVWX and GEggerX, which only used SNPs conditionally associated with the exposure, also yielded inflated type-I errors. On the other hand, cisMRcML using all 10 SNPs yielded well-controlled type-I errors, high power and the smallest RMSE across all scenarios. This example shows the importance of including the SNPs in \(\mathcal{I}_{Y}\) besides those in \(\mathcal{I}_{X}\) when calculating the conditional estimates, because otherwise the plurality condition required by cisMRcML (or more generally by model identification) may be violated (unless there was no or little LD between the SNPs in \(\mathcal{I}_{X}\) and \(\mathcal{I}_{Y}\), which will be considered later). Additionally, it is worth mentioning that when K_{1} = 1 in Fig. 3B (Supplementary Table S8), among the 10 IVs used in cisMRcML, some violated only the ‘relevance’ assumption, some violated only the ‘no horizontal pleiotropy’ assumption, and some violated both assumptions. Notably, the proposed method performed robustly in the presence of different types of invalid IVs, producing unbiased estimates and well-controlled type-I errors.
In the second scenario, we also considered two other ways of selecting the set of SNPs to be used in cisMRcML. Specifically, we compared mistakenly including some irrelevant IVs (i.e., the previous implementation of using all 10 SNPs including the three SNPs in \({(\mathcal{I}_{X}\cup \mathcal{I}_{Y})}^{C}\)) versus only using SNPs in \(\mathcal{I}_{X}\cup \mathcal{I}_{Y}\). We also considered the situation where we failed to include some relevant SNPs in \(\mathcal{I}_{X}\) by using \(\mathcal{I}_{X_{s}}\cup \mathcal{I}_{Y}\), where \(\mathcal{I}_{X_{s}}\subset \mathcal{I}_{X}\) and \(|\mathcal{I}_{X_{s}}|=3\). We found that although all three approaches yielded unbiased estimates and well-controlled type-I errors, using SNPs in \(\mathcal{I}_{X}\cup \mathcal{I}_{Y}\) gave the highest power, while using irrelevant SNPs or omitting some truly (conditionally) relevant SNPs led to a loss of power. More details and results are provided in Section S3.5 in the Supplementary.
We further evaluated the performance of different approaches under a few more scenarios (see Simulations with autoregressive correlation structure for simulation details). First, when only weak IVs were available, which were simulated by reducing the genetic effect sizes on the exposure or the exposure GWAS sample size in scenario 1 with K_{1} = 0, all methods could control type-I error. Notably, only the proposed cisMRcML yielded unbiased estimates, while other methods exhibited different degrees of bias (Supplementary Tables S10, 11). Second, we considered the scenario where the SNPs in \(\mathcal{I}_{Y}\) were weakly invalid with pleiotropic effects \({r}_{i}=\kappa /\sqrt{{N}_{Y}}\), i.e., pleiotropic effects decreased as the outcome GWAS sample size increased at rate \(1/\sqrt{{N}_{Y}}\). We varied the value of κ and N_{Y} while keeping N_{X} the same. We observed that as κ increased, cisMRcMLBIC had improved performance in selecting out invalid IVs based on BIC. When κ = 1, cisMRcMLBIC barely selected out any invalid IVs and thus had inflated type-I errors; however, cisMRcMLDP (with data perturbation) still performed reasonably well with controlled type-I errors and nearly unbiased estimates (Table S12). When κ = 5, cisMRcMLBIC often failed to select out all 4 invalid IVs, and cisMRcMLDP also yielded biased estimates and slightly inflated type-I errors (Table S13). When κ became large (κ = 20), cisMRcMLBIC often successfully identified all 4 invalid IVs and had well-controlled type-I errors (Table S14). We also explored the performance of cisMRcMLBIC based on asymptotic inference when both N_{X} and N_{Y} increased and r_{i} diminished at different rates. Results are discussed in Supplementary Section S1 and Tables S16–S18. Third, when SNPs in \(\mathcal{I}_{Y}\) were not correlated with SNPs in \(\mathcal{I}_{X}\), all examined approaches, including cisMRcML, cisMRcMLX, GIVWX, GEggerX and LEggerX, performed well. 
This was expected since no pleiotropy was introduced via LD and the SNPs in \(\mathcal{I}_{X}\) were still valid IVs. However, when the SNPs in \(\mathcal{I}_{Y}\) were weakly correlated with the SNPs in \(\mathcal{I}_{X}\), cisMRcMLX and GIVWX had highly inflated type-I errors, which again highlighted the importance of including the SNPs in \(\mathcal{I}_{Y}\) in the cisMR analysis (Supplementary Table S15). Detailed results for the above three additional scenarios are provided in Supplementary Section S3.6.
Next, we turned to the second set of simulation studies, in which more realistic LD patterns were examined by generating data based on our real data analysis. Briefly, we started by generating the exposure and outcome data from the UK Biobank individual genotypes in the cisregion of a given gene/protein, then performed exposure and outcome GWAS on two non-overlapping samples. We then applied GCTA-COJO to the exposure and the outcome GWAS to select SNPs jointly associated with X (denoted as \(\mathcal{C}_{X}\)) and jointly associated with Y (denoted as \(\mathcal{C}_{Y}\)) respectively, using a third non-overlapping sample as the reference LD panel. We considered two scenarios. In scenario 1, we randomly selected 50 proteins with at least five SNPs associated with the exposure and one SNP associated with the outcome based on the COJO analysis in the proteome-wide application. Then we specified SNPs in \(\mathcal{I}_{X}\) and \(\mathcal{I}_{Y}\) and corresponding genetic effect sizes exactly based on the real-data COJO results and real pQTL/GWAS data. In scenario 2, 50 proteins were randomly selected, and SNPs in \(\mathcal{I}_{X}\) and \(\mathcal{I}_{Y}\) and their corresponding genetic effect sizes were generated randomly. See Simulations with real LD patterns derived from UK Biobank data for data-generating details.
We evaluated the performance of cisMRcML and LEgger using the SNPs in \(\mathcal{C}_{X}\cup \mathcal{C}_{Y}\) selected by COJO based on the simulated data, and the common practice with cisMRcMLX, GIVWX and GEggerX only using the SNPs in \(\mathcal{C}_{X}\) for cisMR analysis. We applied IVWIND with the set of independent SNPs marginally associated with the exposure after LD clumping, as well as the four polygenic MR methods (MR.LDP, MR.Corr2, MR.CUE and MRAID) with the set of correlated SNPs marginally associated with the exposure (see Simulations with real LD patterns derived from UK Biobank data for implementation details). Figure 4 shows the results for 500 replicates (10 replicates per gene) in scenarios 1 (top row) and 2 (bottom row) respectively. Among the methods assessed, cisMRcML was the only method that effectively controlled the type-I error and yielded the smallest RMSE. In contrast, LEgger demonstrated generally low power, and other approaches using only SNPs associated with the exposure in the analysis had unsatisfactory performance with inflated type-I errors. See Supplementary Table S19 for full results of all examined methods.
Causal effects of downstream biomarkers on CAD
In the first real data application, we applied cisMR in a setup where we used a downstream biomarker as a proxy of protein concentration and activity (see Drug-target MR application with the use of downstream biomarkers). Our analysis here mainly aims to illustrate how to apply cisMRcML using a downstream biomarker of the target protein to confirm/replicate some well-established results.
Specifically, we assessed the causal relationship of low-density lipoprotein cholesterol (LDL) on coronary artery disease (CAD) using the genetic variants restricted to the PCSK9 region. PCSK9 can bind to and break down LDL receptors, therefore decreasing the clearance of LDL cholesterol. PCSK9 inhibitors are a new type of drug that can lower LDL levels by blocking PCSK9 protein from breaking down LDL receptors. The causal effect of LDL on CAD has been extensively studied by randomized trials and MR^{27}. In particular, Ference et al.^{28} found a protective effect on CAD of lowering LDL using a weighted PCSK9 genetic score to mimic the effect of a PCSK9 inhibitor. In our analysis of LDL and CAD, GCTA-COJO selected 9 SNPs located in the PCSK9 region, 8 of which were associated with LDL and one of which was associated with CAD. cisMRcML, LEgger, GIVWX and GEggerX all suggested a significant positive causal effect of LDL on CAD risk, with p-values 7.3 × 10^{−4}, 9.2 × 10^{−3}, 8.6 × 10^{−8} and 0.02 respectively.
Following Gkatzionis et al.^{11}, we also assessed the causal relationship of testosterone level on CAD using the genetic variants in the SHBG region. While an association between low testosterone level and CAD risk has been reported in some observational studies, its causal relationship is still unclear. Sex hormone-binding globulin (SHBG) can bind to sex hormones in the blood and help control the amount of sex hormones. Multiple variants in this region have been demonstrated to be associated with testosterone. In the analysis of testosterone level and CAD, GCTA-COJO selected 14 SNPs associated with testosterone, and no SNP associated with CAD. Using the 14 variants in the SHBG region, no method identified any significant causal effect of testosterone on CAD risk, which was consistent with previous findings in Burgess et al.^{12}; Schooling et al.^{29}; Gkatzionis et al.^{11}.
Proteome-wide analysis for CAD risk
In the second real data application, we used protein expression data as the exposure, which was a more direct proxy of the drug target, and we assessed their causal effects on the risk of CAD. Specifically, we did a proteome-wide scan using the pQTL summary data derived from the ARIC European ancestry (EA) cohort with sample size N_{X} = 7213^{30}. After data preprocessing (see A proteome-wide application to CAD), a total of 773 proteins were analyzed. Among the 773 proteins, 183 proteins had at least one SNP in the cisregion associated with the outcome according to the COJO analysis (i.e., \(|\mathcal{C}_{Y}\setminus \mathcal{C}_{X}|\ge 1\)), and cisMRcMLBIC detected over 98.7% of such invalid IVs. Furthermore, 30 protein-CAD pairs had one invalid IV in \(\mathcal{C}_{Y}\cap \mathcal{C}_{X}\) detected by cisMRcMLBIC. It is also noted that, with a finite sample size, the selection of invalid IVs based on BIC may not be perfect, especially in the presence of weak invalid IVs. With data perturbation, cisMRcML detected 310 protein-CAD pairs with at least one invalid IV in \(\mathcal{C}_{Y}\cap \mathcal{C}_{X}\) over 10% of the time during the 100 data perturbations.
We used the Benjamini-Hochberg approach to account for multiple testing in our proteome-wide analysis, and reported significant MR findings with a false discovery rate (FDR) less than 0.05. We also conducted colocalization analysis on the significant proteins with an FDR-adjusted p-value less than 0.05 (see Methods). cisMRcML identified three proteins with a putative causal effect on CAD risk: PCSK9, COLEC11 and FGFR1. Using a threshold of H4PP ≥ 0.7, there was colocalization evidence for both PCSK9 and COLEC11. As discussed in the previous application, PCSK9 inhibitors can lower LDL levels, which is a major risk factor for CAD. Several trials found that evolocumab, a PCSK9 inhibitor, can significantly lower LDL levels and cardiovascular disease risk^{31,32,33}. COLEC11 is involved in the lectin complement activation pathway and plays an important role in the innate immune system. The vital role of the complement system in heart diseases has been studied, including promoting inflammation, tissue damage, etc.^{34,35}. While complement inhibitors have been suggested as a potential therapeutic target for heart disease, more studies on the relationship between COLEC11 and CAD are warranted. As for FGFR1, colocalization only identified the causal variant for the protein with H1PP ≈ 96%. This was the scenario with insufficient evidence for association with CAD in the CAD GWAS data^{36}. FGF/FGFR signalling plays an important role in cell proliferation and angiogenesis, and several FGFR1 inhibitors have been used to treat various types of cancer^{37}. While overexpression of FGFR1 may play a role in the development of cardiac hypertrophy^{38,39}, it is also likely that the FGFR expression pattern is altered in response to cardiac stress and injury and facilitates cardiac remodeling^{40,41}. Further studies are needed to fully understand their complex relationship.
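For readers unfamiliar with the multiple-testing step, the Benjamini-Hochberg adjustment can be sketched in a few lines of Python. This is the standard step-up procedure, not code from the study; the function name and the toy p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg FDR-adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values for four proteins.
p_values = [0.001, 0.04, 0.03, 0.2]
adjusted = benjamini_hochberg(p_values)
significant = [q < 0.05 for q in adjusted]
print(adjusted)
print(significant)
```

A finding is then reported when its adjusted p-value falls below the chosen FDR level, here 0.05.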
On the other hand, GIVWX identified 18 proteins, and four of them had colocalization evidence, including BMP1, COLEC11, PCSK9, and ERAP2. However, there were five proteins with an H3PP greater than 0.7, including SWAP70, HTRA1, CXCL12, PDE5A, and ITIH3, which suggested that each protein and CAD might have distinct causal variants that were in linkage disequilibrium, and thus MR assumptions may be violated. In fact, for all these five proteins, COJO selected at least one SNP associated with CAD in the corresponding cisregion. cisMRcML yielded non-significant p-values for four of them, the exception being PDE5A, which had a marginally significant p-value of 0.003 by cisMRcML. Interestingly, PDE5 inhibitors have been found in some animal studies to have a potential cardioprotective effect^{42}, and a recent transcriptome-wide association analysis (TWAS) has also identified a positive effect of PDE5A on CAD in the aorta tissue^{43}. However, in our proteome-wide analysis, both GIVWX and cisMRcML suggested a negative effect size of PDE5A on CAD. We further found that the QTL pairs, eQTLs from the GTEx (V8) aorta tissue and pQTLs from plasma used in our analysis, had opposite directions of effects. Such a discordance has been systematically investigated in Robinson et al.^{44}, and may be partially due to the difference in tissues. Nonetheless, such discordance may also be informative in drug target validation, and the mechanism of PDE5 inhibitors on CAD is worth more investigation. Similarly, GEggerX identified seven proteins, two of which had colocalization evidence (PCSK9 and TIRAP), and two had an H3PP greater than 0.7. We note that this could be the scenario observed in our simulation studies, where using only SNPs conditionally associated with the exposure yielded inflated type-I errors in GIVWX and GEggerX. Such significant MR findings may be attributable to genetic confounding through a variant in linkage disequilibrium, as suggested by a high H3PP. 
The Wald-ratio test using the most significant pQTL identified 24 proteins, 4 of which had colocalization evidence, including PCSK9, TIRAP, COLEC11, and ERAP2, but 10 of them had an H3PP greater than 0.7. Lastly, LEgger did not identify any significant results. We show the Q-Q plots of all methods in Fig. 5, in which we can see that in the left tail, cisMRcML and LEgger had good alignment with the identity line, while GIVWX, GEggerX and Wald-ratio were inflated. The inflation factor for cisMRcML was 1.00 (rounded to the second decimal), suggesting that the type-I error was controlled satisfactorily, while LEgger, GEggerX, GIVWX and the Wald-ratio test yielded inflation factors of 0.95, 1.14, 1.57 and 1.54 respectively.
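The inflation factors quoted above can be related to p-values through the commonly used genomic inflation factor, sketched below in Python. This assumes the conventional definition (the median of the 1-degree-of-freedom chi-square statistics implied by two-sided p-values, divided by the median of the chi-square distribution with one degree of freedom); the text here does not state which definition was used, so the definition, the function name and the toy p-values are assumptions.

```python
from statistics import NormalDist, median

# Median of a chi-square(1) variable: the square of the 75% normal quantile.
CHI2_1_MEDIAN = NormalDist().inv_cdf(0.75) ** 2

def inflation_factor(p_values):
    """Genomic inflation factor from two-sided p-values strictly inside (0, 1)."""
    nd = NormalDist()
    # A two-sided p-value p corresponds to chi-square statistic z^2
    # with z = Phi^{-1}(1 - p / 2).
    chi2 = [nd.inv_cdf(1.0 - p / 2.0) ** 2 for p in p_values]
    return median(chi2) / CHI2_1_MEDIAN

# Hypothetical check: evenly spaced p-values mimic a well-calibrated test.
uniform_p = [(i + 0.5) / 1000 for i in range(1000)]
lam_uniform = inflation_factor(uniform_p)

# Hypothetical inflated test: p-values shifted towards zero.
inflated_p = [p ** 1.5 for p in uniform_p]
lam_inflated = inflation_factor(inflated_p)
print(lam_uniform, lam_inflated)
```

Values close to 1 indicate good calibration, values above 1 indicate inflation, and values below 1 indicate conservativeness.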
Finally, as a sensitivity analysis, we performed the analysis using different p-value thresholds in COJO to select IVs. As shown in the Supplementary, using 5 × 10^{−8} led to a slightly elevated inflation factor of 1.08 for cisMRcML, which may be partially due to the omission of SNPs in \(\mathcal{I}_{Y}\), potentially causing cisMRcML to violate the plurality condition as discussed in Selection of genetic variants as IVs in cisMRcML. On the other hand, using looser thresholds resulted in deflated values of the inflation factor in the Q-Q plot, indicating that the proposed method may suffer from a loss of power, which could be partially due to the inclusion of many irrelevant (and highly correlated) SNPs. Therefore, as a trade-off we recommend 5 × 10^{−6} as a starting point in COJO to select SNPs to be used in cisMRcML. Lastly, regardless of the threshold used, alternative methods that utilized the marginal effect estimates of SNPs in \(\mathcal{C}_{X}\), i.e., GIVWX, GEggerX, and the Wald-ratio test, all showed inflation.
Discussion
We have proposed a robust cisMR method called cisMRcML, which uses correlated SNPs in a local genomic region to infer the causal relationship between a molecular exposure (e.g. a protein) and an outcome (e.g. CAD), and is robust to invalid IVs. It is an important extension of the existing MRcML method, which has been shown to have good performance in practice but requires the use of independent SNPs^{7}. While such an extension may seem straightforward at first glance by incorporating LD information in the likelihood, we have pointed out several important implementation details with significant implications for final results. To prevent inducing pleiotropy via LD, we adopted conditional association estimates, instead of marginal estimates, by suitably transforming GWAS summary data in cisMRcML. We also discussed and demonstrated the importance of using SNPs associated with the outcome, in addition to those associated with the exposure, in cisMRcML, which is in stark contrast to the common practice of only using SNPs associated with the exposure in MR, e.g. MRcML. These caveats are expected to be applicable to the extensions of other existing robust MR methods only requiring the majority or plurality condition, such as weightedmedian and modebased methods. While we have mainly focused on the application of cisMR using one genomic region in this paper, our method can be generalized to multiple independent LD blocks and serve as a useful MR method accounting for both LD and horizontal pleiotropy.
In our simulation studies, we have shown that the proposed cisMRcML performs much better than several commonly used cisMR methods, including generalized IVW and Egger, and LDAEgger. We have also compared different choices of the set of SNPs used in different methods. In particular, we have found that applying generalized IVW and Egger with only SNPs conditionally associated with the exposure may yield false positive findings, partly because some outcome-associated SNPs not included in the model are in LD with the SNPs in the model, which violates the no-pleiotropy assumption. This was confirmed in both simulation studies and real data applications, and we hope to raise attention to this largely neglected issue in future applications.
We note some unique challenges with cisMR (as compared to the standard polygenic MR) due to the use of correlated SNPs in the same cis-region as IVs. First, if an invalid IV is not included in the model, its correlations with the IVs that are included would render all of them invalid. Second, if there is only one causal SNP and it is already included as an IV, we would have only one valid IV. In both situations, the plurality condition will be violated. Our proposed approach of selecting and including any outcome-associated SNPs in the model, while conducting model selection and using conditional effect estimates, would alleviate, but not necessarily eliminate, the problem.
There are several limitations in this work. First, while cisMRcML imposes minimal modeling assumptions, in particular no additional assumption on the distribution of pleiotropic effects (except that their sizes are of order O(1)), it depends critically on the plurality condition, which in turn depends on which SNPs are used in the model because of the correlations among all the SNPs, selected or not, in a local region. We currently use GCTA-COJO to select and include the SNPs that are associated with either the exposure or the outcome, which has previously been found to perform better than p-value clumping^{45}. Moreover, clumping is based on marginal genetic associations. In cisMR, many marginal signals may stem from the same SNP, and when modeled in the conditional framework, their effect sizes may shrink to zero, potentially introducing more noise into the proposed approach. However, COJO is by no means the only method for conditional analysis. SNP selection in cisMR analysis is still an ongoing research topic (see Gkatzionis et al.^{11} for a detailed review, and Schmidt et al.^{9} for another example), and there seems to be no consensus yet. How to incorporate other robust SNP selection techniques or develop new ones in cisMRcML (or similar extensions) is of interest for future work. Moreover, while the current algorithm works reasonably well in our experience, an algorithm with global optimization is desired to bridge the gap between the implementation and the theory. We also observed that cisMRcML with the data perturbation implementation was conservative, similarly to its independent-IV (MRcML^{7,46}) and multivariable MR (MVMRcML^{26}) versions, which should be investigated further in the future. Second, since individual-level genotypes in the exposure and outcome GWAS data are often unavailable, as in most applications, we propose using a reference panel of similar ancestry to approximate the LD matrix.
Such an approximation is known to introduce extra variation that is not taken into account in our method (and almost all methods); using a larger reference panel, such as the UK Biobank samples as used in our analysis, is expected to alleviate the problem^{47}. Third, we have considered only the twosample design with two independent GWAS datasets for the exposure and outcome. To account for overlapping samples between the two GWAS datasets, we may model the exposure estimates and the outcome estimates jointly with a multivariate normal distribution, instead of treating them as independent. Fourth, the proteomewide application presented in this work is based on the protein levels measured in plasma samples, while using diseaserelevant tissue samples may be preferred in drugtarget MR. Since largescale tissuespecific pQTL data are not yet available, one alternative is to use tissuespecific eQTL data as a proxy^{9}. Relatedly, as shown in our example for protein PDE5A, discordance between eQTL data and pQTL data could happen, and only using one type of molecular data may yield misleading results. It is therefore recommended to combine and consolidate evidence from eQTLs and pQTLs for drug target validation^{44}. Finally and importantly, triangulation with evidence from applying different cisMR methods and colocalization analysis to observational data^{36}, and direct experimental studies when possible are warranted for more reliable causal inference.
Methods
Model
Based on Fig. 1 with m SNPs (G_{1},…,G_{m}), assuming that both the genotypes and the traits have been mean centered and standardized, the true models for the exposure X and the outcome Y are
where ϵ_{X} and ϵ_{Y} are random error terms independent of SNPs \({\{{{{{{{{\bf{G}}}}}}}}_{i}\}}_{i=1}^{m}\). In general, ϵ_{X} and ϵ_{Y} are correlated due to the presence of hidden confounding. Plugging Eq. (1) in Eq. (2), we have
where \({\{{b}_{Xi}\}}_{i=1}^{m}\) and \({\{{b}_{Yi}\}}_{i=1}^{m}\) are the joint/conditional effects of the m SNPs on the exposure and the outcome respectively. Note that in Fig. 1, for SNPs \(i\in {{{{{{{\mathcal{I}}}}}}}}_{X}\), we have b_{Xi} ≠ 0; for SNPs \(i\in {{{{{{{\mathcal{I}}}}}}}}_{Y}\), we have r_{i} ≠ 0. Based on the three valid IV assumptions, a valid IV is a SNP with b_{Xi} ≠ 0 and r_{i} = 0, i.e. \(i\in {{{{{{{\mathcal{I}}}}}}}}_{X}\setminus {{{{{{{\mathcal{I}}}}}}}}_{Y}\). As discussed in Theorem 1 in Guo et al.^{48}, Eq. (4) is identifiable if and only if the valid IVs form the largest group of IVs sharing the same causal parameter value (i.e., the plurality condition).
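Written out under the definitions above, one form of the models that is consistent with the text is sketched below. The grouping into separate displays and the symbol \(\epsilon'_{Y}\) for the combined error term are assumptions made here for illustration.

```latex
\begin{align*}
X &= \sum_{i=1}^{m} b_{Xi} G_i + \epsilon_X, \\
Y &= \theta X + \sum_{i=1}^{m} r_i G_i + \epsilon_Y, \\
Y &= \sum_{i=1}^{m} (\theta b_{Xi} + r_i)\, G_i + (\theta \epsilon_X + \epsilon_Y)
   = \sum_{i=1}^{m} b_{Yi} G_i + \epsilon'_Y,
   \qquad b_{Yi} = \theta b_{Xi} + r_i .
\end{align*}
```

Under this form, a valid IV (b_{Xi} ≠ 0, r_{i} = 0) satisfies b_{Yi} = θb_{Xi} exactly, which is the relationship the plurality condition exploits.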
Eqs. (1) and (2) are joint models for conditional association of m SNPs on the exposure and on the outcome respectively, while typically in GWAS, the marginal associations of each SNP with the exposure and with the outcome are modeled:
Accordingly, we denote such GWAS summary statistics of the exposure and the outcome as \({\{{\hat{\beta }}_{Xi}^{ * },{\hat{\beta }}_{Yi}^{ * },{\sigma }_{Xi}^{ * },{\sigma }_{Yi}^{ * }\}}_{i=1}^{m}\).
Transformation between marginal and conditional SNPeffect estimates
From Eq. (1), we have \(\hat{\boldsymbol{\beta}}_{X}=(\mathbf{G}_{X}^{T}\mathbf{G}_{X})^{-1}(\mathbf{G}_{X}^{T}\mathbf{X})\), where G_{X} is the N_{X} × m standardized genotype matrix and X is the standardized phenotype vector of length N_{X}. Since, for standardized data, \(\mathbf{G}_{X}^{T}\mathbf{G}_{X}/N_{X}\) is the LD matrix R_{X} and \(\mathbf{G}_{X}^{T}\mathbf{X}/N_{X}\) is the vector of marginal GWAS estimates \(\hat{\boldsymbol{\beta}}_{X}^{*}\), it follows that \(\hat{\boldsymbol{\beta}}_{X}=\mathbf{R}_{X}^{-1}\hat{\boldsymbol{\beta}}_{X}^{*}\). Denote the (estimated) covariance matrix of the joint effect estimate (\(\hat{\boldsymbol{\beta}}_{X}\)) as \(\mathbf{\Sigma}_{X}=\mathbf{R}_{X}^{-1}\boldsymbol{\Omega}_{X}\mathbf{R}_{X}^{-1}\), where \(\boldsymbol{\Omega}_{X}=\mathbf{R}_{X}\cdot \boldsymbol{\sigma}_{X}^{*}{\boldsymbol{\sigma}_{X}^{*}}^{T}\) is the covariance matrix of the marginal effect estimate \(\hat{\boldsymbol{\beta}}_{X}^{*}\) and “ ⋅ ” denotes elementwise multiplication. In practice, if the individual genotype matrix for calculating R_{X} is not available, the LD matrix can be estimated using some publicly available reference panel, denoted as R, where R ≈ R_{X}. Furthermore, if the GWAS estimates are not calculated on the standardized genotypes and phenotype, they can be approximated as \(\hat{\beta}_{Xi}^{*}/\sqrt{\hat{\beta}_{Xi}^{*^{2}}+(N_{X}-2)\cdot \sigma_{Xi}^{*^{2}}}\)^{49}. Similarly, we can estimate \(\hat{\boldsymbol{\beta}}_{Y}\) and Σ_{Y} from \(\hat{\boldsymbol{\beta}}_{Y}^{*}\) and \(\boldsymbol{\sigma}_{Y}^{*}\).
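The transformation above can be illustrated with a short NumPy sketch. The function name `marginal_to_conditional` and the toy inputs are hypothetical and chosen only for illustration; the code assumes standardized marginal estimates and an LD matrix that is invertible.

```python
import numpy as np

def marginal_to_conditional(beta_marg, se_marg, R):
    """Convert marginal GWAS estimates to joint/conditional estimates.

    beta_marg, se_marg: length-m standardized marginal effects and SEs.
    R: m x m LD (correlation) matrix, assumed invertible.
    Returns beta_cond = R^{-1} beta_marg and
    Sigma = R^{-1} (R * se se^T) R^{-1}, with * the elementwise product.
    """
    beta_marg = np.asarray(beta_marg, dtype=float)
    se_marg = np.asarray(se_marg, dtype=float)
    R = np.asarray(R, dtype=float)
    omega = R * np.outer(se_marg, se_marg)   # covariance of marginal estimates
    R_inv = np.linalg.inv(R)
    return R_inv @ beta_marg, R_inv @ omega @ R_inv

# Toy example: 3 SNPs with autoregressive LD, noise-free marginal effects.
m, rho = 3, 0.6
idx = np.arange(m)
R = rho ** np.abs(idx[:, None] - idx[None, :])
b_true = np.array([0.1, 0.0, -0.05])          # hypothetical joint effects
beta_marg = R @ b_true                        # marginal effects implied by b_true
se_marg = np.full(m, 0.01)
beta_cond, sigma = marginal_to_conditional(beta_marg, se_marg, R)
```

In this toy case the second SNP has a nonzero marginal effect purely through LD, while its recovered conditional effect is zero.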
In the following sections, unless specified otherwise, we assume (asymptotic) normal distributions for the conditionaleffect estimates \({\hat{{{{{{{\boldsymbol{\beta }}}}}}}}}_{X} \sim {{{{{{\mathcal{MVN}}}}}}}({{{{{{{\bf{b}}}}}}}}_{X},{{{{{{{\mathbf{\Sigma }}}}}}}}_{X})\) and \({\hat{{{{{{{\mathbf{\beta }}}}}}}}}_{Y} \sim {{{{{{\mathcal{MVN}}}}}}}({{{{{{{\bf{b}}}}}}}}_{Y},{{{{{{{\mathbf{\Sigma }}}}}}}}_{Y})\) with \({{{{{{{\bf{b}}}}}}}}_{X}={({b}_{X1},\ldots,{b}_{Xm})}^{T}\) and \({{{{{{{\bf{b}}}}}}}}_{Y}={({b}_{Y1},\ldots,{b}_{Ym})}^{T}\), which is reasonable given large sample sizes of GWAS data.
MRIVW and MREgger
The inverse-variance weighted method (MRIVW)^{3} and MREgger regression^{50} are two of the most widely used MR methods. These two methods are most often discussed in the context of independent SNPs/IVs, where MRIVW and MREgger can be regarded as weighted linear regression (with weights equal to \(\sigma_{Yi}^{*^{-2}}\)) of \(\hat{\boldsymbol{\beta}}_{Y}^{*}\) on \(\hat{\boldsymbol{\beta}}_{X}^{*}\), without and with an intercept term, respectively. Both methods have previously been extended to account for correlated IVs; we next give a brief overview of these extensions, and details can be found in the corresponding references.
Generalized IVW and Egger
To account for the correlations among IVs, MRIVW and MREgger have been extended based on generalized weighted linear regression^{14,15}, and we refer to them as GIVW and GEgger throughout the paper. The GIVW and GEgger estimators are:
where \({{{{{{{\boldsymbol{\Omega }}}}}}}}_{Y}={{{{{{\bf{R}}}}}}}\cdot {{{{{{{\boldsymbol{\sigma }}}}}}}}_{Y}^{ * }{{{{{{{\boldsymbol{\sigma }}}}}}}}_{Y}^{{ * }^{T}}\) is the covariance matrix of \({\hat{{{{{{{\boldsymbol{\beta }}}}}}}}}_{Y}^{ * }\). When the SNPs are independent, i.e., Ω_{Y} becomes a diagonal matrix with the ith diagonal element \({\sigma }_{Yi}^{{ * }^{2}}\), GIVW and GEgger become the original MRIVW and MREgger. The GIVW and GEgger methods are implemented in the R package MendelianRandomization^{51}.
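As an illustration, the generalized estimators can be sketched as generalized least squares fits of the marginal outcome estimates on the marginal exposure estimates. This sketch assumes the standard GLS form with weight matrix Ω_{Y}^{−1}; the function names `givw` and `gegger` are hypothetical, and the code is not the MendelianRandomization implementation (for example, it returns point estimates only).

```python
import numpy as np

def givw(beta_x, beta_y, omega_y):
    """GLS slope of beta_y on beta_x without an intercept."""
    w = np.linalg.solve(omega_y, beta_x)          # Omega_Y^{-1} beta_x
    return (w @ beta_y) / (w @ beta_x)

def gegger(beta_x, beta_y, omega_y):
    """GLS fit of beta_y on beta_x with an intercept; returns (intercept, slope)."""
    design = np.column_stack([np.ones_like(beta_x), beta_x])
    w = np.linalg.solve(omega_y, design)          # Omega_Y^{-1} [1, beta_x]
    coef = np.linalg.solve(design.T @ w, w.T @ beta_y)
    return coef[0], coef[1]

# Toy example with 4 correlated SNPs (all numbers are illustrative).
idx = np.arange(4)
R = 0.5 ** np.abs(idx[:, None] - idx[None, :])
se_y = np.full(4, 0.02)
omega_y = R * np.outer(se_y, se_y)
beta_x = np.array([0.1, 0.2, -0.1, 0.3])
theta_ivw = givw(beta_x, 0.4 * beta_x, omega_y)
intercept, slope = gegger(beta_x, 0.05 + 0.4 * beta_x, omega_y)
```

When Ω_{Y} is diagonal, both functions reduce to ordinary weighted regressions with inverse-variance weights.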
LD-aware (LDA) IVW and Egger
LD-aware (LDA) MRIVW and MREgger are two other variants of MRIVW and MREgger proposed by Barfield et al.^{16} to account for LD among IVs, and we refer to them as LIVW and LEgger throughout. These LDA estimators are very similar to the generalized MRIVW (GIVW) and MREgger (GEgger), except that the input data are the conditional estimates \(\{\hat{\beta}_{X},\hat{\beta}_{Y}\}\), instead of the marginal estimates \(\{\hat{\beta}_{X}^{*},\hat{\beta}_{Y}^{*}\}\):
with \(\hat{\boldsymbol{\beta}}_{X}=\mathbf{R}^{-1}\hat{\boldsymbol{\beta}}_{X}^{*}\), \(\hat{\boldsymbol{\beta}}_{Y}=\mathbf{R}^{-1}\hat{\boldsymbol{\beta}}_{Y}^{*}\) and Σ_{Y} = R^{−1}Ω_{Y}R^{−1}. Comparing Eq. (6) with Eq. (9), and Eq. (8) with Eq. (10), we see that \(\hat{\theta}_{LIVW}=\hat{\theta}_{GIVW}\), but in general \(\hat{\theta}_{LEgger}\ne \hat{\theta}_{GEgger}\). The LDAEgger method can be implemented with the code provided by the original authors (https://rbarfield.github.io/Barfield_website/pages/Rcode.html).
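The equality of the two IVW estimators can be checked in a few lines. The derivation below assumes that LIVW is the generalized least squares slope (no intercept) computed from the conditional estimates with weight matrix Σ_{Y}^{−1}, and uses the symmetry of R.

```latex
\begin{align*}
\hat{\theta}_{LIVW}
 &= (\hat{\beta}_X^{T}\Sigma_Y^{-1}\hat{\beta}_X)^{-1}\,
    \hat{\beta}_X^{T}\Sigma_Y^{-1}\hat{\beta}_Y,
 \qquad \Sigma_Y^{-1} = R\,\Omega_Y^{-1}R, \\
\hat{\beta}_X^{T}\Sigma_Y^{-1}\hat{\beta}_X
 &= \hat{\beta}_X^{*T}R^{-1}\,(R\,\Omega_Y^{-1}R)\,R^{-1}\hat{\beta}_X^{*}
  = \hat{\beta}_X^{*T}\Omega_Y^{-1}\hat{\beta}_X^{*}, \\
\hat{\beta}_X^{T}\Sigma_Y^{-1}\hat{\beta}_Y
 &= \hat{\beta}_X^{*T}\Omega_Y^{-1}\hat{\beta}_Y^{*}
 \;\Longrightarrow\;
 \hat{\theta}_{LIVW} = \hat{\theta}_{GIVW}.
\end{align*}
```

The same cancellation does not occur for the Egger versions: the intercept column of ones is not transformed by R^{−1}, so \(\mathbf{1}^{T}\Sigma_Y^{-1}\mathbf{1}=\mathbf{1}^{T}R\,\Omega_Y^{-1}R\,\mathbf{1}\) generally differs from \(\mathbf{1}^{T}\Omega_Y^{-1}\mathbf{1}\).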
While the extensions of MRIVW and MREgger allow for correlations among IVs, they inherit the same limitations in their corresponding original versions. For example, both MRIVW and MREgger may yield biased causal inference unless all IVs are valid or under some stringent (socalled InSIDE) condition between the instrument strengths and their direct effects^{15,17}.
New method: cisMRcML
We propose a robust cisMR method, called cisMRcML, that accounts for possible violations of any of the valid IV assumptions. Suppose we have the estimated joint/conditional associations of the m SNPs with the exposure, \(\hat{\boldsymbol{\beta}}_{X}=(\hat{\beta}_{X1},\ldots,\hat{\beta}_{Xm})^{T}\), with covariance matrix Σ_{X}, and those with the outcome, \(\hat{\boldsymbol{\beta}}_{Y}=(\hat{\beta}_{Y1},\ldots,\hat{\beta}_{Ym})^{T}\), with covariance matrix Σ_{Y}, which can be calculated from the GWAS summary statistics and LD matrix. The model for the proposed cisMRcML is
where Σ_{X} and Σ_{Y} are the covariance of \({\hat{{{{{{{\boldsymbol{\beta }}}}}}}}}_{X}\) and \({\hat{{{{{{{\boldsymbol{\beta }}}}}}}}}_{Y}\) respectively, θ is the causal effect of interest, b_{X} is a vector of the unknown joint effects of m SNPs on the exposure, and r is a vector of the unknown direct effects on the outcome not mediated through the exposure. Note that r captures both the correlated and uncorrelated (horizontal) pleiotropic effects. Assuming the independence between the exposure GWAS dataset and the outcome GWAS dataset, we have the loglikelihood for the proposed model Eq. (11) (up to some constants):
Under the constraint that the number of invalid IVs is 0≤K < m − 1, we obtain the constrained maximum likelihood estimator (cMLE) by solving
where I( ⋅ ) is the indicator function, and K is a tuning parameter representing the unknown number of invalid IVs to be determined by a model selection criterion to be discussed. The plurality valid condition implies K < m − 1.
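For concreteness, the model, log-likelihood and constrained problem described in this subsection can be written as follows. This is a sketch assembled from the definitions in the text (multivariate normal conditional estimates, independent exposure and outcome samples); writing the constraint as an equality in K is an assumption.

```latex
\begin{align*}
&\hat{\beta}_X \sim \mathcal{MVN}(b_X, \Sigma_X), \qquad
 \hat{\beta}_Y \sim \mathcal{MVN}(\theta b_X + r, \Sigma_Y), \\
&l(\theta, b_X, r) = -\tfrac{1}{2}\Big[
   (\hat{\beta}_X - b_X)^{T}\Sigma_X^{-1}(\hat{\beta}_X - b_X)
 + (\hat{\beta}_Y - \theta b_X - r)^{T}\Sigma_Y^{-1}(\hat{\beta}_Y - \theta b_X - r)
 \Big], \\
&(\hat{\theta}, \hat{b}_X, \hat{r}) = \arg\max_{\theta,\, b_X,\, r}\; l(\theta, b_X, r)
 \quad \text{subject to} \quad \sum_{i=1}^{m} I(r_i \neq 0) = K .
\end{align*}
```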
For any candidate value of K, we use Box 1 to estimate the set of invalid IVs \({{{{{{\mathcal{A}}}}}}}:=\{i:{r}_{i}\ne 0\}\) and the causal parameter θ. Solving Eq. (13) is a bestsubset selection problem, for which it is computationally too demanding to obtain a global solution for a moderate to large number of candidate IVs (i.e. m). We develop a heuristic algorithm based on iteratively ranking the variants based on their estimated pleiotropic effects. The algorithm is computationally efficient, but may not converge to the global solution. As an alternative, we tried a recently proposed splicing algorithm that was proven to provide a global solution with a high probability^{52}; it performed no better than the current algorithm in our simulations, though computationally more demanding. Alternatively, we also tried to use the Lasso penalty to shrink r_{i} to zero for valid IVs (see Supplementary Section S2), which was also observed to have suboptimal performance in simulations (Supplementary Table S6). Hence we decided to use the heuristic and fast algorithm, which is to be shown to perform well in our numerical examples. Denote \({{{{{{{\bf{a}}}}}}}}^{{{{{{{\mathcal{A}}}}}}}}\) the vector whose ith entry is a_{i} if \(i\in {{{{{{\mathcal{A}}}}}}}\), and is zero otherwise. The algorithm is given in Box 1.
As in the independentIV case^{7}, it is notable that at the convergence the (estimated) invalid IVs (with \({\hat{r}}_{i}\ne 0\)) do not contribute to estimating θ, and the resulting cMLE of θ is the same as the maximum (profile) likelihood estimator being applied to all (selected) valid IVs. In practice, besides the default starting value of \({\theta }^{(0)}=0,{{{{{{{\bf{b}}}}}}}}_{X}^{(0)}={{{{{{\bf{0}}}}}}}\), we can use multiple random starts \({\theta }^{(0)},{{{{{{{\bf{b}}}}}}}}_{X}^{(0)}\) and take the estimate with the largest likelihood among those from the multiple starting points as the cMLE under the constraint of K invalid IVs.
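Box 1 is not reproduced here, so the following is only a minimal block-coordinate sketch of how an iterative ranking scheme of this kind could look for a fixed K. The function name `cml_fixed_k`, the ranking statistic (absolute residual divided by its standard error), and the update order are all assumptions for illustration and should not be read as the authors' algorithm.

```python
import numpy as np

def cml_fixed_k(beta_x, beta_y, sigma_x, sigma_y, K,
                theta0=0.0, b0=None, max_iter=500, tol=1e-10):
    """Heuristic constrained-likelihood fit with K nonzero entries in r."""
    m = len(beta_x)
    P_x, P_y = np.linalg.inv(sigma_x), np.linalg.inv(sigma_y)
    sd_y = np.sqrt(np.diag(sigma_y))
    theta = theta0
    b = np.zeros(m) if b0 is None else np.asarray(b0, dtype=float).copy()
    r = np.zeros(m)
    for _ in range(max_iter):
        theta_old = theta
        # Step 1: rank SNPs by standardized residual; treat the top K as invalid.
        d = beta_y - theta * b
        r = np.zeros(m)
        if K > 0:
            A = np.argsort(-np.abs(d) / sd_y)[:K]
            Ac = np.setdiff1d(np.arange(m), A)
            # Likelihood-maximizing r on A, holding r = 0 on the complement.
            r[A] = d[A] + np.linalg.solve(P_y[np.ix_(A, A)],
                                          P_y[np.ix_(A, Ac)] @ d[Ac])
        # Step 2: update b_X given theta and r.
        y = beta_y - r
        b = np.linalg.solve(P_x + theta**2 * P_y,
                            P_x @ beta_x + theta * (P_y @ y))
        # Step 3: update theta given b_X and r.
        theta = (b @ P_y @ y) / (b @ P_y @ b)
        if abs(theta - theta_old) < tol:
            break
    return theta, b, r

# Noise-free toy data: 5 SNPs, the third one has a direct effect on the outcome.
idx = np.arange(5)
R = 0.5 ** np.abs(idx[:, None] - idx[None, :])
sigma = 0.01**2 * np.linalg.inv(R)
b_true = np.array([0.2, 0.15, 0.1, 0.2, 0.15])
r_true = np.array([0.0, 0.0, 0.8, 0.0, 0.0])
theta_hat, b_hat, r_hat = cml_fixed_k(b_true, 0.5 * b_true + r_true,
                                      sigma, sigma, K=1)
```

Each step maximizes the likelihood over one block with the others held fixed, so the likelihood never decreases while the selected set is unchanged; like the heuristic described above, such a scheme is not guaranteed to reach the global solution.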
Denote the estimates for a given K as \(\hat{\theta}(K),\hat{\mathbf{b}}_{X}(K),\hat{\mathbf{r}}(K),\hat{\mathcal{A}}(K),\hat{\mathcal{I}}(K)\). We select K from a candidate set \(\mathcal{K}\subseteq \{0,1,\ldots,m-2\}\) based on the following Bayesian information criterion (BIC):
where \(N=\min ({N}_{X},{N}_{Y})\). Then \(\hat{K}=\arg {\min }_{K\in {{{{{{\mathcal{K}}}}}}}}{{{{{{\rm{BIC}}}}}}}(K)\), \(\hat{{{{{{{\mathcal{I}}}}}}}}=\hat{{{{{{{\mathcal{I}}}}}}}}(\hat{K})\), and the final causal estimate of Eq. (13) is \(\hat{\theta }=\hat{\theta }(\hat{K})\). In the proposed algorithm, the resulting constrained maximum likelihood estimator is the same as the maximum profile likelihood estimator being applied to all IVs in \(\hat{{{{{{{\mathcal{I}}}}}}}}\). The standard error of \(\hat{\theta }\) can be estimated based on the observed Fisher information from the profile likelihood with IVs in \(\hat{{{{{{{\mathcal{I}}}}}}}}\). With \(\hat{\theta }\) and its corresponding standard error, the statistical inference is drawn based on the standard normal distribution, the theory of which is to be established in Theory.
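The selection step can be sketched as below. The sketch assumes the BIC takes the usual form of −2 times the maximized log-likelihood plus log(N)·K; the helper names `neg2_loglik` and `select_k_by_bic` and the hand-made fits are hypothetical.

```python
import numpy as np

def neg2_loglik(theta, b_x, r, beta_x, beta_y, sigma_x, sigma_y):
    """-2 x log-likelihood (up to a constant) of the cisMRcML-type model."""
    ex = beta_x - b_x
    ey = beta_y - theta * b_x - r
    return ex @ np.linalg.solve(sigma_x, ex) + ey @ np.linalg.solve(sigma_y, ey)

def select_k_by_bic(fits, beta_x, beta_y, sigma_x, sigma_y, n):
    """fits maps K -> (theta, b_x, r); returns the BIC-minimizing K and all BICs."""
    bic = {k: neg2_loglik(*fit, beta_x, beta_y, sigma_x, sigma_y) + np.log(n) * k
           for k, fit in fits.items()}
    return min(bic, key=bic.get), bic

# Toy example: the third SNP has a large direct effect on the outcome.
beta_x = np.array([1.0, 1.0, 1.0])
beta_y = np.array([0.5, 0.5, 2.5])
sigma = 0.01 * np.eye(3)
fits = {
    0: (7.0 / 6.0, beta_x, np.zeros(3)),            # no invalid IV allowed
    1: (0.5, beta_x, np.array([0.0, 0.0, 2.0])),    # third SNP treated as invalid
}
k_hat, bic = select_k_by_bic(fits, beta_x, beta_y, sigma, sigma, n=10_000)
```

Here the penalty log(N) for the extra parameter is far smaller than the gain in likelihood from freeing the third SNP, so K = 1 is selected.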
The validity of the above inference relies on the selection consistency of invalid IVs (with r_{i} ≠ 0), which may not always be realized with finite samples. Instead, to account for the uncertainty/variation in model selection, we will use data perturbation as before for better finitesample statistical inference^{7}. As shown in Lin et al.^{53}, the data perturbation procedure (on a GWAS summary dataset) is equivalent to bootstrapping the corresponding individuallevel data. Briefly, for b = 1, …, B, we generate perturbed conditional estimates \({\hat{{{{{{{\boldsymbol{\beta }}}}}}}}}_{X}^{(b)} \sim {{{{{{\mathcal{MVN}}}}}}}({\hat{{{{{{{\boldsymbol{\beta }}}}}}}}}_{X},{{{{{{{\mathbf{\Sigma }}}}}}}}_{X})\) and \({\hat{{{{{{{\boldsymbol{\beta }}}}}}}}}_{Y}^{(b)} \sim {{{{{{\mathcal{MVN}}}}}}}({\hat{{{{{{{\boldsymbol{\beta }}}}}}}}}_{Y},{{{{{{{\mathbf{\Sigma }}}}}}}}_{Y})\), and apply the estimation procedure above on the perturbed data to obtain \({\hat{\theta }}^{(b)}\). And we use the sample mean and sample standard deviation of \({\hat{\theta }}^{(1)},\ldots,{\hat{\theta }}^{(B)}\) as the final causal estimate and its corresponding standard error.
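The data perturbation step can be sketched as follows. The function name `perturbation_inference` is hypothetical, and the estimator plugged in here is a simple GLS ratio used as a stand-in for the full cisMRcML fit (including model selection), which would be re-run on every perturbed dataset.

```python
import numpy as np

def perturbation_inference(beta_x, beta_y, sigma_x, sigma_y, estimator,
                           B=100, seed=0):
    """Mean and SD of an estimator over B perturbed summary datasets."""
    rng = np.random.default_rng(seed)
    draws = np.empty(B)
    for b in range(B):
        bx_b = rng.multivariate_normal(beta_x, sigma_x)   # perturbed exposure estimates
        by_b = rng.multivariate_normal(beta_y, sigma_y)   # perturbed outcome estimates
        draws[b] = estimator(bx_b, by_b, sigma_x, sigma_y)
    return draws.mean(), draws.std(ddof=1)

def gls_ratio(bx, by, sigma_x, sigma_y):
    """Stand-in estimator: GLS slope of by on bx (ignores sigma_x)."""
    w = np.linalg.solve(sigma_y, bx)
    return (w @ by) / (w @ bx)

beta_x = np.array([0.2, 0.3, 0.25])
beta_y = 0.5 * beta_x
sigma = 1e-6 * np.eye(3)
theta_dp, se_dp = perturbation_inference(beta_x, beta_y, sigma, sigma, gls_ratio)
```

Because the whole estimation procedure is repeated on each perturbed dataset, the spread of the draws reflects variation in which IVs are selected as invalid as well as variation in the estimate given that selection.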
Modeling conditional effects versus marginal effects
A possible and seemingly effective alternative as in the current practice of MR analysis is to model marginal effects, instead of conditional effects, of SNPs (Eq. (11)). That is, we have
where r^{*} = Rr. We can also have a similar relationship \({b}_{Yi}^{ * }=\theta {b}_{Xi}^{ * }+{r}_{i}^{ * }\) as in Eq. (4). However, one pitfall is that, IVs without horizontal pleiotropy in the conditional model (i.e. with r_{i} = 0) may have \({r}_{i}^{ * }\ne 0\) in the marginal model. For example, let \({{{{{{\bf{r}}}}}}}={({r}_{1},0,\ldots,0)}^{T}\) and r_{1} ≠ 0, then r^{*} = Rr will have nonzero elements for all m SNPs when they are all correlated with the first SNP (i.e., the first column of R are all nonzeros). Hence, although the plurality condition holds in the conditional model, it is violated in the marginal model. In general, the plurality condition is more likely to hold in the conditional model than in the marginal model. Therefore, in cisMRcML, we use the joint/conditional effect estimates instead of the marginal effect estimates.
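The example in the paragraph above is easy to verify numerically. The sketch below uses an autoregressive LD matrix with ρ = 0.6 and m = 5 purely for illustration.

```python
import numpy as np

m, rho = 5, 0.6
idx = np.arange(m)
R = rho ** np.abs(idx[:, None] - idx[None, :])   # AR(1) LD matrix

r = np.zeros(m)
r[0] = 0.1                     # only the first SNP has a direct (pleiotropic) effect
r_star = R @ r                 # implied pleiotropic effects in the marginal model

n_invalid_conditional = int(np.count_nonzero(r))
n_invalid_marginal = int(np.count_nonzero(r_star))
```

With a single invalid IV in the conditional model, every SNP in the region appears invalid in the marginal model, so no plurality of valid IVs remains.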
Selection of genetic variants as IVs in cisMRcML
In this section, we discuss which SNPs should be used in the proposed method and how to select them. First, it is crucial to include all m SNPs associated with either the exposure or outcome, i.e. those in \({{{{{{{\mathcal{I}}}}}}}}_{X}\cup {{{{{{{\mathcal{I}}}}}}}}_{Y}\) in Fig. 1, not any of their proper subsets, and calculate their joint estimates with the exposure and the outcome. This is in striking contrast with the current practice of MR with independent IVs, where only SNPs significantly associated with the exposure (i.e., SNPs in \({{{{{{{\mathcal{I}}}}}}}}_{X}\)) are used^{7}. This is because, as shown in Fig. 1, conditional on G_{k}, G_{i} in \({{{{{{{\mathcal{I}}}}}}}}_{X}\setminus {{{{{{{\mathcal{I}}}}}}}}_{Y}\) does not have a direct path to the outcome; but if we do not include SNPs in \({{{{{{{\mathcal{I}}}}}}}}_{Y}\) (e.g. G_{k}), then it will open alternative paths of all other correlated SNPs (with G_{k}) to the outcome not through the exposure. This will in turn break the plurality condition required by model identifiability since all SNPs will have direct effects on Y. On the other hand, such an issue is unlikely to occur when SNPs are all independent. We also note that, when we include SNPs in \({{{{{{{\mathcal{I}}}}}}}}_{Y}\setminus {{{{{{{\mathcal{I}}}}}}}}_{X}\), cisMRcML is expected to select them out as invalid IVs.
In practice, to select these m SNPs, we apply the COJO (Conditional and Joint association analysis) method^{24} on the exposure and the outcome respectively to select SNPs in \({{{{{{{\mathcal{I}}}}}}}}_{X}\) and \({{{{{{{\mathcal{I}}}}}}}}_{Y}\). COJO is suitable in our application since it can identify SNPs that jointly are significantly associated with the phenotype via a stepwise selection procedure. It is applicable to both quantitative traits and casecontrol studies. Furthermore, it only uses GWAS summary statistics and an estimated LD matrix from a reference panel as in cisMRcML.
Theory
The proposed cisMRcML enjoys nice asymptotic properties, including selection consistency of the proposed BIC and asymptotic normality of the cMLE. Here we state the assumptions and main conclusions, with the proofs relegated to the Supplementary.
Assumption 1
(Plurality valid condition.) Suppose that \(\mathcal{A}_{0}=\{i:r_{i} \ne 0\}\) is the index set of the true invalid IVs with a nonzero horizontal-pleiotropy effect, and \(K_{0}=|\mathcal{A}_{0}|\). For any \(\mathcal{A}\subseteq \{1,\ldots,m\}\) with \(|\mathcal{A}|=K_{0}\), if \(\mathcal{A} \ne \mathcal{A}_{0}\), then there does not exist any constant \(\tilde{\theta} \ne \theta\) such that \(b_{Yi}=\tilde{\theta}b_{Xi}\) for all \(i\in \mathcal{A}^{C}\).
Assumption 2
The joint effect estimates \({\hat{{{{{{{\boldsymbol{\beta }}}}}}}}}_{X} \sim {{{{{{\mathcal{MVN}}}}}}}({{{{{{{\bf{b}}}}}}}}_{X},{{{{{{{\mathbf{\Sigma }}}}}}}}_{X})\) and \({\hat{{{{{{{\boldsymbol{\beta }}}}}}}}}_{Y} \sim {{{{{{\mathcal{MVN}}}}}}}({{{{{{{\bf{b}}}}}}}}_{Y},\, {{{{{{{\mathbf{\Sigma }}}}}}}}_{Y})\) with the known covariance matrices Σ_{X} and Σ_{Y}.
Assumption 3
(Orders of the variances and sample sizes.) Let \(N=\min ({N}_{X},{N}_{Y})\), there exist positive constants c_{1}, c_{2} such that \({c}_{1}/N\le {({{{{{{{\mathbf{\Sigma }}}}}}}}_{X})}_{ij}\le {c}_{2}/N\) and \({c}_{1}/N\le {({{{{{{{\mathbf{\Sigma }}}}}}}}_{Y})}_{ij}\le {c}_{2}/N\) for i = 1, …, m, j = 1, …, m, i.e., Σ_{X} and Σ_{Y} are Θ(1/N).
Assumption 1 is the plurality condition, which is equivalent to that in Theorem 1 of Guo et al.^{48}, a sufficient and necessary condition for the identifiability of model Eq. (4). Assumptions 2 and 3 are reasonable given that GWAS summary data are usually based on large sample sizes. The following theorem then establishes the selection consistency of the BIC and the consistency and asymptotic normality of the proposed estimator.
Theorem 1
With Assumptions 1 to 3 satisfied, if \(K_{0}\in \mathcal{K}\), we have \(P(\hat{K}=K_{0})\to 1\) and \(P(\hat{\mathcal{A}}_{\hat{K}}=\mathcal{A}_{0})\to 1\) as N → ∞. Moreover, the constrained maximum likelihood estimator \(\hat{\theta}\) of Eq. (13), combined with the use of the BIC selection criterion, is consistent for the true causal effect size θ_{0}, and
where V is the expected Fisher information for the profile loglikelihood with all IVs in \({{{{{{{\mathcal{A}}}}}}}}_{0}^{C}\) that can be consistently estimated by its sample version.
We note that, as implied by the constraint we use in Eq. (13), the invalid IVs in the proposed method are referred to as those in \({{{{{{{\mathcal{I}}}}}}}}_{Y}\) with a nonzero direct effect on the outcome (r_{i} ≠ 0), which can be consistently selected out by the proposed BIC. On the other hand, although an irrelevant IV with b_{Xi} = r_{i} = b_{Yi} = 0 is also considered invalid, cisMRcML will not select it out but including such an IV will not affect the validity of our inference as long as the conditions of Theorem 1 are satisfied. In summary, cisMRcML is highly robust in the sense of allowing the presence of some invalid IVs violating any of the three valid IV assumptions; these invalid IVs can be more than half of all the IVs used.
Simulations with autoregressive correlation structure
In this simulation study, we simulated the GWAS summary statistics largely following the simulation procedure used in the LDAEgger paper^{16}:

1. Generated the true joint effects of the \(|\mathcal{I}_{X}|\) SNPs on the exposure, \(b_{Xi} \sim \mathcal{N}(0,1)\) for \(i\in \mathcal{I}_{X}\) and b_{Xi} = 0 for \(i\notin \mathcal{I}_{X}\); rescaled the effects according to the proportion of variability in exposure due to SNPs: \(\mathbf{b}_{X}=\sqrt{h_{X}^{2}/(\mathbf{b}_{X}^{T}\mathbf{R}\mathbf{b}_{X})}\,\mathbf{b}_{X}\), where \(h_{X}^{2}=0.05\), and R was the LD matrix generated from an autoregressive model with R_{ij} = ρ^{∣i−j∣};
2. Generated the direct effects of the \(|\mathcal{I}_{Y}|\) SNPs on the outcome, \(r_{i} \sim \mathcal{N}(0,1)\) iid, with K_{1} SNPs from \(\mathcal{I}_{X}\cap \mathcal{I}_{Y}\) and K_{2} SNPs from \(\mathcal{I}_{Y}\setminus \mathcal{I}_{X}\); rescaled the direct effects according to the proportion of variability in outcome due directly to SNPs: \(\mathbf{r}=\sqrt{h_{Y}^{2}/(\mathbf{r}^{T}\mathbf{R}\mathbf{r})}\,\mathbf{r}\), where \(h_{Y}^{2}=0.05\);
3. Generated the true joint effects of SNPs on the outcome b_{Y} = θb_{X} + r;
4. Generated the observed exposure GWAS estimates \(\hat{\boldsymbol{\beta}}_{X}^{*} = \mathbf{R}\mathbf{b}_{X}+\mathbf{L}^{T}\boldsymbol{\epsilon}_{X}\), \(\boldsymbol{\epsilon}_{X} \sim \mathcal{N}(\mathbf{0},\frac{1-h_{X}^{2}}{N_{X}}\mathbf{I}_{m})\), where L was the Cholesky decomposition of the LD matrix R, and N_{X} = 10000. Note \(\boldsymbol{\sigma}_{X}^{*}=\sqrt{\frac{1-h_{X}^{2}}{N_{X}}}\mathbf{1}_{m}\);
5. Generated the observed outcome GWAS estimates \(\hat{\boldsymbol{\beta}}_{Y}^{*} = \mathbf{R}\mathbf{b}_{Y}+\mathbf{L}^{T}\boldsymbol{\epsilon}_{Y}\), \(\boldsymbol{\epsilon}_{Y} \sim \mathcal{N}(\mathbf{0},\frac{1-\theta^{2}h_{X}^{2}-h_{Y}^{2}}{N_{Y}}\mathbf{I}_{m})\), and N_{Y} = 50000. Note \(\boldsymbol{\sigma}_{Y}^{*}=\sqrt{\frac{1-\theta^{2}h_{X}^{2}-h_{Y}^{2}}{N_{Y}}}\mathbf{1}_{m}\).
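The five steps above can be sketched in NumPy as follows. The function and argument names are hypothetical. Two details are assumptions: the Cholesky factor is taken so that the noise term has covariance proportional to R, and the direct-effect variance term is dropped when no SNP has a direct effect.

```python
import numpy as np

def simulate_summary_stats(idx_x, idx_y, theta, rho, m=10, h2_x=0.05, h2_y=0.05,
                           n_x=10_000, n_y=50_000, rng=None):
    """Simulate marginal GWAS summary statistics under AR(1) LD (steps 1 to 5)."""
    rng = np.random.default_rng() if rng is None else rng
    idx = np.arange(m)
    R = rho ** np.abs(idx[:, None] - idx[None, :])
    # Step 1: joint effects on the exposure, rescaled so that b' R b = h2_x.
    b_x = np.zeros(m)
    b_x[idx_x] = rng.standard_normal(len(idx_x))
    b_x *= np.sqrt(h2_x / (b_x @ R @ b_x))
    # Step 2: direct effects on the outcome, rescaled so that r' R r = h2_y.
    r = np.zeros(m)
    if len(idx_y) > 0:
        r[idx_y] = rng.standard_normal(len(idx_y))
        r *= np.sqrt(h2_y / (r @ R @ r))
    else:
        h2_y = 0.0  # assumption: no direct-effect variance without invalid SNPs
    # Step 3: joint effects on the outcome.
    b_y = theta * b_x + r
    # Steps 4 and 5: marginal estimates with LD-correlated noise.
    C = np.linalg.cholesky(R)            # lower factor, R = C C^T (plays the role of L^T)
    var_x = (1 - h2_x) / n_x
    var_y = (1 - theta**2 * h2_x - h2_y) / n_y
    beta_x = R @ b_x + C @ rng.normal(0.0, np.sqrt(var_x), m)
    beta_y = R @ b_y + C @ rng.normal(0.0, np.sqrt(var_y), m)
    return {"R": R, "b_x": b_x, "r": r, "b_y": b_y,
            "beta_x": beta_x, "se_x": np.full(m, np.sqrt(var_x)),
            "beta_y": beta_y, "se_y": np.full(m, np.sqrt(var_y))}

# Example in the spirit of scenario (2): 5 exposure SNPs, K1 = 2 and K2 = 2.
sim = simulate_summary_stats(idx_x=[0, 1, 2, 3, 4], idx_y=[0, 1, 5, 6],
                             theta=0.1, rho=0.6, rng=np.random.default_rng(1))
```

The returned marginal estimates and standard errors can then be converted to conditional estimates with the LD matrix before applying a cis-MR method.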
In total m = 10 SNPs were generated. We considered two scenarios: (1) \(|\mathcal{I}_{X}|=10\); (2) \(|\mathcal{I}_{X}|=5\) and \(K_{2}=|\mathcal{I}_{Y}\setminus \mathcal{I}_{X}|=2\). We note that in the second scenario, only 5 SNPs had effects on the exposure, 2 SNPs had effects on the outcome but not on the exposure, while 3 SNPs had no effect on either the exposure or the outcome. We investigated the impact of only including the 5 SNPs in \(\mathcal{I}_{X}\) in the analysis. In both scenarios, we varied K_{1}, the number of invalid IVs in \(\mathcal{I}_{X}\cap \mathcal{I}_{Y}\).
Given the simulated GWAS summary statistics \((\hat{\boldsymbol{\beta}}_{X}^{*},\boldsymbol{\sigma}_{X}^{*},\hat{\boldsymbol{\beta}}_{Y}^{*},\boldsymbol{\sigma}_{Y}^{*})\), we transformed them to the conditional estimates as \(\hat{\boldsymbol{\beta}}_{X}=\mathbf{R}^{-1}\hat{\boldsymbol{\beta}}_{X}^{*}\), \(\hat{\boldsymbol{\beta}}_{Y}=\mathbf{R}^{-1}\hat{\boldsymbol{\beta}}_{Y}^{*}\) and \(\mathbf{\Sigma}_{X}=\mathbf{R}^{-1}(\mathbf{R}\cdot \boldsymbol{\sigma}_{X}^{*}{\boldsymbol{\sigma}_{X}^{*}}^{T})\mathbf{R}^{-1}\), \(\mathbf{\Sigma}_{Y}=\mathbf{R}^{-1}(\mathbf{R}\cdot \boldsymbol{\sigma}_{Y}^{*}{\boldsymbol{\sigma}_{Y}^{*}}^{T})\mathbf{R}^{-1}\). Note that in scenario (2), applying cisMRcML and LEgger with the conditional estimates calculated only from the 5 SNPs in \(\mathcal{I}_{X}\) was different from using the corresponding 5 elements of \(\hat{\boldsymbol{\beta}}_{X}\) and \(\hat{\boldsymbol{\beta}}_{Y}\) calculated from all 10 SNPs.
Each simulation setup was repeated 500 times. Throughout this simulation, cisMRcML was implemented with 5 random starts where \(\theta^{(0)} \sim \mathcal{U}(-0.5,0.5)\) and \(\mathbf{b}_{X}^{(0)} \sim \mathcal{N}(\hat{\boldsymbol{\beta}}_{X},\mathbf{\Sigma}_{X})\), and B = 100 data perturbations. IVWIND, EggerIND, cMLIND, GIVW and GEgger were implemented in the R package MendelianRandomization (v.0.9.0) with their default settings. LEgger was implemented in the R code provided at https://rbarfield.github.io/Barfield_website/pages/Rcode.html. MR.LDP (https://github.com/QingCheng0218/MR.LDP), MR.Corr2 (https://github.com/QingCheng0218/MR.Corr2), MR.CUE (https://github.com/QingCheng0218/MR.CUE) and MRAID were implemented using corresponding R packages on GitHub, with parameters provided in their examples.
To investigate the method’s robustness to weak IVs, we further considered the first scenario where \(|\mathcal{I}_{X}|=10\), K_{1} = 0 and ρ = 0.6, but we reduced the instrument strength by reducing either the magnitude or the precision of the IV-exposure association. To reduce the precision, we varied the sample size of the exposure GWAS N_{X} ∈ {500, 1000, 5000}. This corresponded to average F-statistics (across 500 replications) of 3.6, 6.3 and 27.3, respectively. To reduce the magnitude of b_{Xi}, we varied \(h_{X}^{2}\in \{0.005,0.01\}\), corresponding to average F-statistics of 6.0 and 11.1, respectively.
As suggested by a reviewer, we also simulated IVs that were only weakly invalid in the sense that the pleiotropic effects on the outcome decreased at the same rate as the sampling error \(1/\sqrt{N_{Y}}\). Specifically, in the first scenario where \(|\mathcal{I}_{X}|=10\), K_{1} = 4 and a moderate LD of ρ = 0.6, we directly generated the direct effects \(r_{i}=\kappa /\sqrt{N_{Y}}\) in step 2 of the above simulation procedure. We varied the sample size of the outcome GWAS N_{Y} ∈ {5e4, 1e5, 5e5} and κ ∈ {1, 5, 20}.
Finally, we considered an extreme scenario where the outcome-associated SNPs (\(\mathcal{I}_Y\)) were uncorrelated or only weakly correlated with the exposure-associated SNPs (\(\mathcal{I}_X\)). Specifically, we simulated 10 SNPs, with 5 SNPs in \(\mathcal{I}_X\) and 5 SNPs in \(\mathcal{I}_Y\) (\(|\mathcal{I}_X \cap \mathcal{I}_Y| = 0\)). SNPs within \(\mathcal{I}_X\) and within \(\mathcal{I}_Y\) were each correlated with an autoregressive structure with ρ = 0.6, while SNPs in \(\mathcal{I}_X\) and SNPs in \(\mathcal{I}_Y\) were uncorrelated (ρ_{XY} = 0) or weakly correlated (ρ_{XY} = 0.1). In other words, the LD matrix among the SNPs in \(\mathcal{I}_X \cup \mathcal{I}_Y\) was a block matrix whose two diagonal blocks were autoregressive LD blocks and whose off-diagonal blocks had all entries equal to either 0 or 0.1.
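The block LD matrix just described can be written down explicitly. The sketch below assumes a first-order autoregressive form, with entry ρ^{|i−j|} inside each block, which is the usual reading of "autoregressive structure"; the function names are hypothetical.

```python
def ar1_block(n, rho):
    """n x n AR(1) correlation block with entries rho ** |i - j|."""
    return [[rho ** abs(i - j) for j in range(n)] for i in range(n)]


def block_ld_matrix(n_x=5, n_y=5, rho=0.6, rho_xy=0.0):
    """LD matrix with two AR(1) diagonal blocks and constant off-diagonal blocks."""
    size = n_x + n_y
    ld = [[rho_xy] * size for _ in range(size)]
    for i, row in enumerate(ar1_block(n_x, rho)):
        for j, value in enumerate(row):
            ld[i][j] = value
    for i, row in enumerate(ar1_block(n_y, rho)):
        for j, value in enumerate(row):
            ld[n_x + i][n_x + j] = value
    return ld


# The weakly correlated variant of the scenario (rho_XY = 0.1).
example_ld = block_ld_matrix(rho_xy=0.1)
```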
Simulations with real LD patterns derived from UK Biobank data
To mimic a realistic application of the proposed method, we generated data directly from UK Biobank individual genotypes in this simulation study. We simulated data for the jth individual as follows:
where \(U_j, \epsilon_{Xj}, \epsilon_{Yj} \sim \mathcal{N}(0, 1)\) independently, and \(\theta \in \{0, 0.05\}\). We used the proteome-wide application to CAD as a reference to generate the data. Specifically, we considered the following two setups:

S1.
50 proteins with \(|\mathcal{C}_X| \ge 5\) and \(|\mathcal{C}_Y| \ge 1\) based on our real data analysis were randomly selected. For each protein, we specified the sets of SNPs \(\mathcal{I}_X = \mathcal{C}_X\) and \(\mathcal{I}_Y = \mathcal{C}_Y\) based on the COJO results in the real-data analysis, and set the effect sizes b_{Xi} and r_{i} to the corresponding GWAS effect sizes in the real pQTL and CAD GWAS datasets, respectively.

S2.
50 proteins were randomly selected. For each protein, we randomly selected 7 SNPs in the cis-region (with minor allele frequency (MAF) greater than 5%) for \(\mathcal{I}_X\) and 2 nearby SNPs (with MAF greater than 5%) for \(\mathcal{I}_Y\), ensuring that the absolute pairwise correlations among the SNPs in \(\mathcal{I}_X \cup \mathcal{I}_Y\) were less than 0.95. The effects of the SNPs in \(\mathcal{I}_X\) on X were generated from b_{Xi} ~ Unif((−0.2, −0.1) ∪ (0.1, 0.2)), and the pleiotropic effects of the SNPs in \(\mathcal{I}_Y\) on Y were generated from r_{i} ~ Unif(0.1, 0.2).
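The effect-size distributions in setup S2 can be sampled as in the following sketch. Because the two intervals of the union have equal length, drawing a magnitude from Unif(0.1, 0.2) and attaching a random sign gives a uniform draw on the union. The function names and the seed are hypothetical.

```python
import random


def sample_exposure_effect(rng):
    """Draw b_Xi from Unif((-0.2, -0.1) U (0.1, 0.2))."""
    magnitude = rng.uniform(0.1, 0.2)
    return magnitude if rng.random() < 0.5 else -magnitude


def sample_pleiotropic_effect(rng):
    """Draw r_i from Unif(0.1, 0.2)."""
    return rng.uniform(0.1, 0.2)


rng = random.Random(2024)  # arbitrary seed, for reproducibility only
b_x = [sample_exposure_effect(rng) for _ in range(7)]      # 7 SNPs in I_X
r_y = [sample_pleiotropic_effect(rng) for _ in range(2)]   # 2 SNPs in I_Y
```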
After generating data for the exposure and the outcome, we randomly selected two non-overlapping sets of individuals, with sizes N_{X} = 10000 and N_{Y} = 100000, to perform GWAS on X and Y, respectively. Note that the GWASs were performed for all SNPs in a specific cis-region using PLINK 2.0^{54}. Additionally, we randomly selected a third non-overlapping sample, with a size of N_{ref} = 5000, to mimic the use of an external reference panel for estimating the LD structure.
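One simple way to obtain such non-overlapping samples is to shuffle the individual identifiers once and cut the shuffled list into consecutive pieces. The sketch below is an illustration with a hypothetical function name and a toy cohort, not the authors' code.

```python
import random


def split_non_overlapping(ids, sizes, seed=0):
    """Randomly partition ids into disjoint groups with the requested sizes."""
    if sum(sizes) > len(ids):
        raise ValueError("not enough individuals for the requested sizes")
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    groups, start = [], 0
    for size in sizes:
        groups.append(shuffled[start:start + size])
        start += size
    return groups


# Toy cohort of 200 individuals; the study used N_X = 10000, N_Y = 100000
# and N_ref = 5000 drawn from UK Biobank.
exposure_ids, outcome_ids, reference_ids = split_non_overlapping(
    range(200), [20, 100, 10])
```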
Given the exposure and outcome GWAS datasets, we first applied GCTA-COJO (https://yanglab.westlake.edu.cn/software/gcta/#COJO, version 1.92.3beta3) with a significance threshold of 5 × 10^{−6} to select SNPs jointly associated with X or Y, denoted as \(\mathcal{C}_X\) or \(\mathcal{C}_Y\), respectively. We then applied the proposed cisMR-cML to the SNPs in \(\mathcal{C}_X \cup \mathcal{C}_Y\), with 5 random starts, where \(\theta^{(0)} \sim \mathcal{U}(-0.1, 0.1)\) and \(\mathbf{b}_X^{(0)} \sim \mathcal{N}(\hat{\boldsymbol{\beta}}_X, \boldsymbol{\Sigma}_X)\), and B = 100 data perturbations. We also applied cisMR-cML-X and the current common practice of GIVW-X and GEgger-X with the SNPs in \(\mathcal{C}_X\) only. To apply MR.LDP, MR.Corr2, MR.CUE and MRAID, we first extracted the SNPs that were marginally associated with X at p-value < 5 × 10^{−6} and then pruned them so that no pairwise absolute Pearson correlation exceeded 0.9. Finally, we considered the standard practice for polygenic MR of using only independent IVs associated with X: we performed LD clumping on the exposure GWAS dataset with a threshold of r^{2} = 0.001, the default value in the TwoSampleMR package. Since the number of independent IVs was usually less than three, in which case the independent-IV versions of MR-cML and Egger regression were not applicable, we only applied the IVW method, referred to as IVW-IND. Simulations were repeated 10 times per gene, for a total of 500 replicates.
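The select-then-prune step used for the LD-aware competitors can be illustrated as below. The text only specifies the p-value threshold and the correlation cap of 0.9; the greedy order (most significant SNP first) is an assumption, and the function name and toy data are hypothetical.

```python
def prune_by_ld(pvalues, ld, p_threshold=5e-6, max_abs_r=0.9):
    """Keep significant SNPs greedily so that no kept pair has |r| > max_abs_r.

    pvalues: dict SNP -> marginal p-value; ld: dict of dicts with correlations.
    """
    candidates = sorted((s for s in pvalues if pvalues[s] < p_threshold),
                        key=lambda s: pvalues[s])
    kept = []
    for snp in candidates:
        if all(abs(ld[snp][other]) <= max_abs_r for other in kept):
            kept.append(snp)
    return kept


# Toy example: "b" is in high LD with the more significant "a";
# "d" does not pass the p-value threshold.
toy_p = {"a": 1e-10, "b": 1e-8, "c": 1e-7, "d": 0.01}
toy_ld = {
    "a": {"a": 1.0, "b": 0.95, "c": 0.3, "d": 0.1},
    "b": {"a": 0.95, "b": 1.0, "c": 0.2, "d": 0.1},
    "c": {"a": 0.3, "b": 0.2, "c": 1.0, "d": 0.1},
    "d": {"a": 0.1, "b": 0.1, "c": 0.1, "d": 1.0},
}
toy_kept = prune_by_ld(toy_p, toy_ld)
```

LD clumping at r^{2} = 0.001 follows the same greedy logic with a far stricter cap, which is why it typically leaves fewer than three IVs in a single cis-region.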
Reference panel used in real data applications
In the following two real data applications, we used the UK Biobank individual-level genotype data^{55} as the reference panel. As the following analyses were based on GWAS datasets of (mostly) European ancestry, 337426 unrelated (field ‘22020’=1) and self-reported White-British individuals with similar genetic ancestry (field ‘22006’=1) in the UK Biobank were used to calculate the LD matrix among SNPs.
Drug-target MR application with the use of downstream biomarkers
GWAS summary data for both LDL cholesterol and testosterone were taken from the Neale Lab UK Biobank GWAS round 2 results (http://www.nealelab.is/ukbiobank/). The GWAS summary data for CAD were obtained from the CARDIoGRAMplusC4D Consortium^{56}. We first extracted genetic variants located within 500 kb on either side of a gene, retained those present in both the biomarker and CAD GWAS data, and confined our analysis to variants with a genotype missing rate < 10%, minor allele frequency (MAF) > 0.01 and Hardy-Weinberg equilibrium (HWE) p > 1 × 10^{−6} in the reference panel. Then we performed GCTA-COJO on the exposure (or outcome) GWAS data to select SNPs jointly associated with the exposure (or the outcome) at p < 5 × 10^{−6}, denoted as \(\mathcal{C}_X\) (or \(\mathcal{C}_Y\)), respectively.
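The variant filters in this preprocessing step amount to a window check, an intersection of the two GWAS variant lists and three quality-control thresholds. The sketch below illustrates that logic only; the record fields (`id`, `pos`, `missing_rate`, `maf`, `hwe_p`) and function names are hypothetical, and in practice these filters would typically be applied with a tool such as PLINK.

```python
def passes_qc(variant, gene_start, gene_end, window=500_000):
    """Window and QC filters applied to one reference-panel variant record."""
    in_window = gene_start - window <= variant["pos"] <= gene_end + window
    return (in_window
            and variant["missing_rate"] < 0.10
            and variant["maf"] > 0.01
            and variant["hwe_p"] > 1e-6)


def select_variants(reference, exposure_ids, outcome_ids, gene_start, gene_end):
    """IDs present in both GWAS datasets that pass the window and QC filters."""
    shared = set(exposure_ids) & set(outcome_ids)
    return [v["id"] for v in reference
            if v["id"] in shared and passes_qc(v, gene_start, gene_end)]


# Toy reference panel for a gene spanning positions 1,000,000-1,020,000.
toy_reference = [
    {"id": "rs1", "pos": 900_000, "missing_rate": 0.01, "maf": 0.20, "hwe_p": 0.5},
    {"id": "rs2", "pos": 1_010_000, "missing_rate": 0.02, "maf": 0.005, "hwe_p": 0.5},
    {"id": "rs3", "pos": 2_000_000, "missing_rate": 0.01, "maf": 0.30, "hwe_p": 0.5},
    {"id": "rs4", "pos": 1_005_000, "missing_rate": 0.01, "maf": 0.30, "hwe_p": 0.5},
]
toy_selected = select_variants(toy_reference, ["rs1", "rs2", "rs3"],
                               ["rs1", "rs2", "rs3", "rs4"],
                               1_000_000, 1_020_000)
```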
We transformed the marginal association estimates \((\hat{\boldsymbol{\beta}}_X^{*}, \hat{\boldsymbol{\beta}}_Y^{*})\) to the conditional estimates \((\hat{\boldsymbol{\beta}}_X, \hat{\boldsymbol{\beta}}_Y)\) for the SNPs in the set \(\mathcal{C}_X \cup \mathcal{C}_Y\), and calculated the corresponding covariance matrices Σ_{X} and Σ_{Y} according to Model. Then we applied cisMR-cML with B = 100 data perturbations and 5 random starts, where \(\theta^{(0)} \sim \mathcal{U}(-0.1, 0.1)\) and \(\mathbf{b}_X^{(0)} \sim \mathcal{N}(\hat{\boldsymbol{\beta}}_X, \boldsymbol{\Sigma}_X)\), and LDA-Egger with the conditional estimates. We also applied GIVW-X and GEgger-X using only the marginal association estimates of the SNPs in \(\mathcal{C}_X\).
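To give a feel for the marginal-to-conditional transformation, the sketch below uses a deliberately simplified setting: with standardized genotypes and traits, the marginal effects equal the LD matrix R times the conditional (joint) effects, so the conditional estimates solve R b = β*. This is an assumption made for illustration; the model used in the paper (and in GCTA-COJO) also involves allele frequencies, sample sizes and standard errors, and yields the covariance matrices Σ_{X} and Σ_{Y}, none of which is shown here. Function names are hypothetical.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("LD matrix is (numerically) singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def marginal_to_conditional(beta_marginal, ld):
    """Conditional effects b solving ld @ b = beta_marginal (simplified model)."""
    return solve_linear_system(ld, beta_marginal)


# Two SNPs with r = 0.5: the marginal signal of SNP 2 is entirely due to LD.
toy_conditional = marginal_to_conditional([0.2, 0.1], [[1.0, 0.5], [0.5, 1.0]])
```

In the toy example the second SNP has a non-zero marginal estimate but a zero conditional effect, which is the kind of LD-induced association that modeling marginal effects would misattribute.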
A proteome-wide application to CAD
We confined our analysis to a list of 1034 proteins with ≥3 identified pQTLs in the EA population according to Supplementary Table 6.1 in Zhang et al.^{30}. For the CAD GWAS, we used the one with a larger sample size of N_{Y} = 547261, which was a meta-analysis of UK Biobank and CARDIoGRAMplusC4D^{57}.
The data preprocessing step was similar to that in Drug-target MR application with the use of downstream biomarkers, except that the LDL (or testosterone) GWAS data (i.e., the exposure GWAS) were replaced by the pQTL dataset. We ran GCTA-COJO with 5 × 10^{−6} as the p-value threshold on both the pQTL data and the CAD GWAS data, with the UK Biobank data as the reference panel, to obtain \(\mathcal{C}_X\) and \(\mathcal{C}_Y\). Here, we used a slightly less stringent p-value threshold of 5 × 10^{−6}, rather than the usual/default 5 × 10^{−8}, because omitting some relevant SNPs in \(\mathcal{I}_X\) might lose power, while omitting some SNPs in \(\mathcal{I}_Y\) would affect the validity of the proposed method, as discussed previously in Selection of genetic variants as IVs in cisMR-cML. We performed additional sensitivity analyses using different p-value thresholds, which are detailed in Supplementary Section S5. We retained proteins with ≥3 SNPs in \(\mathcal{C}_X\), and excluded proteins with highly correlated SNPs in \(\mathcal{C}_X\) and \(\mathcal{C}_Y\) (using “--cojo-collinear 0.9”). After the preprocessing step, 773 proteins remained for analysis. We applied cisMR-cML and LDA-Egger with the conditional estimates \((\hat{\boldsymbol{\beta}}_X, \hat{\boldsymbol{\beta}}_Y, \boldsymbol{\Sigma}_X, \boldsymbol{\Sigma}_Y)\) calculated on the SNP-set \(\mathcal{C}_X \cup \mathcal{C}_Y\), as well as GIVW-X and GEgger-X on the pQTLs that were conditionally associated with the proteins (i.e., those in \(\mathcal{C}_X\))^{10,25}. We also applied the Wald-ratio test using the pQTL with the smallest marginal p-value for each protein.
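For reference, the Wald-ratio estimate from a single pQTL can be computed as in the sketch below. The text does not say which standard error was used; the first-order delta-method approximation se(β̂_Y)/|β̂_X|, which ignores the uncertainty in β̂_X, is an assumption here, as are the function name and the example numbers.

```python
import math


def wald_ratio(beta_x, beta_y, se_y):
    """Wald-ratio estimate, first-order standard error and two-sided p-value."""
    if beta_x == 0:
        raise ValueError("the SNP-exposure estimate must be non-zero")
    estimate = beta_y / beta_x
    se = se_y / abs(beta_x)
    z = estimate / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return estimate, se, p_value


# Made-up summary statistics for the lead pQTL of one protein.
toy_estimate, toy_se, toy_p_value = wald_ratio(beta_x=0.5, beta_y=0.1, se_y=0.02)
```

Because it relies on a single variant, the Wald ratio cannot separate a causal effect from pleiotropy or from LD with an outcome-associated SNP, which is why it serves only as a comparison here.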
The number of SNPs used in cisMR-cML ranged from 3 to 20, with a mean of around 5. Running on an AMD 7763 processor, the computation times for cisMR-cML ranged from 1.6 to 680 seconds, with a mean of 20 seconds.
We further conducted colocalization analysis on the significant proteins, i.e., those with an FDR-adjusted p-value less than 0.05 (using “p.adjust(method=‘fdr’)” in R). Colocalization analysis has become more regularly used, and is strongly recommended, following MR analysis^{36}. In this analysis, we used a Bayesian colocalization method called COLOC^{58}, in which a high H4PP suggests that the protein and CAD share the same causal variant at the locus, while a high H3PP suggests that they have different causal variants at the locus. The former case supports the significant result from MR; the latter case, however, suggests that the significant MR result may be driven by genetic confounding through LD between pQTLs and CAD-associated SNPs, e.g. SNPs in \(\mathcal{I}_Y \setminus \mathcal{I}_X\). COLOC was implemented with coloc.abf() in the R package coloc (v.5.2.3) with the default settings.
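In R, the "fdr" method of p.adjust is the Benjamini-Hochberg procedure. A minimal Python equivalent is sketched below for readers who want to see the computation; the function name and example p-values are made up.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Made-up p-values for four proteins.
toy_adjusted = bh_adjust([0.01, 0.04, 0.03, 0.005])
toy_significant = [q < 0.05 for q in toy_adjusted]
```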
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The GWAS summary datasets used in the real data analysis are all publicly available at the URLs below. ARIC pQTL, http://nilanjanchatterjeelab.org/pwas/; GWAS Catalog studies GCST003116 and GCST005194 for coronary artery disease, https://www.ebi.ac.uk/gwas/home; Neale lab UK Biobank round 2 GWAS of LDL and testosterone, https://www.nealelab.is/ukbiobank/. The UK Biobank individual-level data are available under restricted access. Researchers can apply for access at https://www.ukbiobank.ac.uk/. Access to UK Biobank individual-level data was approved through UKB Application #35107. The processed pQTL data used in the real data application are available at https://doi.org/10.6084/m9.figshare.25411957. Source data are provided with this paper.
Code availability
R code for simulation studies and real data analysis is available at https://github.com/ZhaotongL/cisMRpaper^{59}. The software for cisMR-cML is publicly available on GitHub at https://github.com/ZhaotongL/cisMRcML^{60}.
References
Lawlor, D. A., Harbord, R. M., Sterne, J. A., Timpson, N. & Davey Smith, G. Mendelian randomization: using genes as instruments for making causal inferences in epidemiology. Stat. Med. 27, 1133–1163 (2008).
Sanderson, E. et al. Mendelian randomization. Nat. Rev. Methods Prim. 2, 1–21 (2022).
Burgess, S., Butterworth, A. & Thompson, S. G. Mendelian randomization analysis with multiple genetic variants using summarized data. Genet. Epidemiol. 37, 658–665 (2013).
Bowden, J., Davey Smith, G., Haycock, P. C. & Burgess, S. Consistent estimation in Mendelian randomization with some invalid instruments using a weighted median estimator. Genet. Epidemiol. 40, 304–314 (2016).
Qi, G. & Chatterjee, N. Mendelian randomization analysis using mixture models for robust and efficient estimation of causal effects. Nat. Commun. 10, 1–10 (2019).
Morrison, J., Knoblauch, N., Marcus, J. H., Stephens, M. & He, X. Mendelian randomization accounting for correlated and uncorrelated pleiotropic effects using genome-wide summary statistics. Nat. Genet. 52, 740–747 (2020).
Xue, H., Shen, X. & Pan, W. Constrained maximum likelihood-based Mendelian randomization robust to both correlated and uncorrelated pleiotropic effects. Am. J. Hum. Genet. 108, 1251–1269 (2021).
Boehm, F. J. & Zhou, X. Statistical methods for Mendelian randomization in genome-wide association studies: a review. Computational Struct. Biotechnol. J. 20, 2338–2351 (2022).
Schmidt, A. F. et al. Genetic drug target validation using Mendelian randomisation. Nat. Commun. 11, 3255 (2020).
Zhao, H. et al. Proteome-wide Mendelian randomization in global biobank meta-analysis reveals multi-ancestry drug targets for common diseases. Cell Genomics 2, 100195 (2022).
Gkatzionis, A., Burgess, S. & Newcombe, P. J. Statistical methods for cis-Mendelian randomization with two-sample summary-level data. Genet. Epidemiol. 47, 3–25 (2023).
Burgess, S., Zuber, V., Valdes-Marquez, E., Sun, B. B. & Hopewell, J. C. Mendelian randomization with fine-mapped genetic data: choosing from large numbers of correlated instrumental variables. Genet. Epidemiol. 41, 714–725 (2017).
Lambert, S. A. et al. The human transcription factors. Cell 172, 650–665 (2018).
Burgess, S., Dudbridge, F. & Thompson, S. G. Combining information on multiple instrumental variables in Mendelian randomization: comparison of allele score and summarized data methods. Stat. Med. 35, 1880–1906 (2016).
Burgess, S. & Thompson, S. G. Interpreting findings from Mendelian randomization using the MR-Egger method. Eur. J. Epidemiol. 32, 377–389 (2017).
Barfield, R. et al. Transcriptome-wide association studies accounting for colocalization using Egger regression. Genet. Epidemiol. 42, 418–433 (2018).
Lin, Z., Pan, I. & Pan, W. A practical problem with Egger regression in Mendelian randomization. PLoS Genet. 18, e1010166 (2022).
Cheng, Q. et al. MR-LDP: a two-sample Mendelian randomization for GWAS summary statistics accounting for linkage disequilibrium and horizontal pleiotropy. NAR Genomics Bioinforma. 2, lqaa028 (2020).
Cheng, Q. et al. MR-Corr2: a two-sample Mendelian randomization method that accounts for correlated horizontal pleiotropy using correlated instrumental variants. Bioinformatics 38, 303–310 (2022).
Cheng, Q., Zhang, X., Chen, L. S. & Liu, J. Mendelian randomization accounting for complex correlated horizontal pleiotropy while elucidating shared genetic etiology. Nat. Commun. 13, 6490 (2022).
Yuan, Z. et al. Likelihood-based Mendelian randomization analysis with automated instrument selection and horizontal pleiotropic modeling. Sci. Adv. 8, eabl5744 (2022).
Wang, A., Liu, W. & Liu, Z. A two-sample robust Bayesian Mendelian Randomization method accounting for linkage disequilibrium and idiosyncratic pleiotropy with applications to the COVID-19 outcomes. Genet. Epidemiol. 46, 159–169 (2022).
Hartwig, F. P., Davey Smith, G. & Bowden, J. Robust inference in summary data Mendelian randomization via the zero modal pleiotropy assumption. Int. J. Epidemiol. 46, 1985–1998 (2017).
Yang, J. et al. Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits. Nat. Genet. 44, 369–375 (2012).
Zheng, J. et al. Multi-ancestry Mendelian randomization of omics traits revealing drug targets of COVID-19 severity. EBioMedicine 81, 104112 (2022).
Lin, Z., Xue, H. & Pan, W. Robust multivariable Mendelian randomization based on constrained maximum likelihood. Am. J. Hum. Genet. 110, 592–605 (2023).
Baigent, C. et al. Efficacy and safety of more intensive lowering of LDL cholesterol: a meta-analysis of data from 170,000 participants in 26 randomised trials. Lancet (Lond., Engl.) 376, 1670–1681 (2010).
Ference, B. A. et al. Variation in PCSK9 and HMGCR and risk of cardiovascular disease and diabetes. N. Engl. J. Med. 375, 2144–2153 (2016).
Schooling, C. M. et al. Genetic predictors of testosterone and their associations with cardiovascular disease and risk factors: A Mendelian randomization investigation. Int. J. Cardiol. 267, 171–176 (2018).
Zhang, J. et al. Plasma proteome analyses in individuals of European and African ancestry identify cispQTLs and models for proteomewide association studies. Nat. Genet. 54, 593–602 (2022).
Robinson, J. G. et al. Efficacy and safety of alirocumab in reducing lipids and cardiovascular events. N. Engl. J. Med. 372, 1489–1499 (2015).
Sabatine, M. S. et al. Evolocumab and clinical outcomes in patients with cardiovascular disease. N. Engl. J. Med. 376, 1713–1722 (2017).
Gaba, P. et al. Association between achieved low-density lipoprotein cholesterol levels and long-term cardiovascular and safety outcomes: an analysis of FOURIER-OLE. Circulation 147, 1192–1203 (2023).
Lappegård, K. T. et al. A vital role for complement in heart disease. Mol. Immunol. 61, 126–134 (2014).
Shahini, N. et al. The alternative complement pathway is dysregulated in patients with chronic heart failure. Sci. Rep. 7, 1–10 (2017).
Zuber, V. et al. Combining evidence from Mendelian randomization and colocalization: Review and comparison of approaches. Am. J. Hum. Genet. 109, 767–782 (2022).
Katoh, M. FGFR inhibitors: Effects on cancer cells, tumor microenvironment and whole-body homeostasis. Int. J. Mol. Med. 38, 3–15 (2016).
Faul, C. et al. FGF23 induces left ventricular hypertrophy. J. Clin. Investig. 121, 4393–4408 (2011).
Freundlich, M. et al. Paricalcitol downregulates myocardial renin–angiotensin and fibroblast growth factor expression and attenuates cardiac hypertrophy in uremic rats. Am. J. Hypertens. 27, 720–726 (2014).
Khosravi, F., Ahmadvand, N., Bellusci, S. & Sauer, H. The multifunctional contribution of FGF signaling to cardiac development, homeostasis, disease and repair. Front. Cell Dev. Biol. 9, 672935 (2021).
Faul, C. Cardiac actions of fibroblast growth factor 23. Bone 100, 69–79 (2017).
Reffelmann, T. & Kloner, R. A. Phosphodiesterase 5 inhibitors: are they cardioprotective? Cardiovascular Res. 83, 204–212 (2009).
Hao, K. et al. Integrative prioritization of causal genes for coronary artery disease. Circulation: Genom. Precis. Med. 15, e003365 (2022).
Robinson, J. W. et al. Evaluating the potential benefits and pitfalls of combining protein and expression quantitative trait loci in evidencing drug targets. bioRxiv 2022–03 (2022).
van Der Graaf, A. et al. Mendelian randomization while jointly modeling cis genetics identifies causal relationships between gene expression and lipids. Nat. Commun. 11, 4930 (2020).
Lin, Z., Xue, H. & Pan, W. Combining Mendelian randomization and network deconvolution for inference of causal networks with GWAS summary data. PLoS Genet. 19, e1010762 (2023).
Xue, H., Shen, X. & Pan, W. Causal inference in transcriptome-wide association studies with invalid instruments and GWAS summary data. J. Am. Stat. Assoc. 118, 1525–1537 (2023).
Guo, Z., Kang, H., Tony Cai, T. & Small, D. S. Confidence intervals for causal effects with invalid instruments by using two-stage hard thresholding with voting. J. R. Stat. Soc. Ser. B: Stat. Methodol. 80, 793–815 (2018).
Xue, H. & Pan, W. Inferring causal direction between two traits in the presence of horizontal pleiotropy with GWAS summary data. PLoS Genet. 16, e1009105 (2020).
Bowden, J., Davey Smith, G. & Burgess, S. Mendelian randomization with invalid instruments: effect estimation and bias detection through Egger regression. Int. J. Epidemiol. 44, 512–525 (2015).
Yavorska, O. O. & Burgess, S. MendelianRandomization: an R package for performing Mendelian randomization analyses using summarized data. Int. J. Epidemiol. 46, 1734–1739 (2017).
Zhu, J., Wen, C., Zhu, J., Zhang, H. & Wang, X. A polynomial algorithm for best-subset selection problem. Proc. Natl Acad. Sci. 117, 33117–33123 (2020).
Lin, Z., Deng, Y. & Pan, W. Combining the strengths of inverse-variance weighting and Egger regression in Mendelian randomization using a mixture of regressions model. PLoS Genet. 17, e1009922 (2021).
Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, s13742–015 (2015).
Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
the CARDIoGRAMplusC4D Consortium. A comprehensive 1000 Genomes–based genome-wide association meta-analysis of coronary artery disease. Nat. Genet. 47, 1121–1130 (2015).
Van Der Harst, P. & Verweij, N. Identification of 64 novel genetic loci provides an expanded view on the genetic architecture of coronary artery disease. Circulation Res. 122, 433–443 (2018).
Giambartolomei, C. et al. Bayesian test for colocalisation between pairs of genetic association studies using summary statistics. PLoS Genet. 10, e1004383 (2014).
Lin, Z. Simulation and real data analysis code for “A robust cis-Mendelian randomization method with application to drug target discovery” https://doi.org/10.5281/zenodo.12523227 (2024).
Lin, Z. ZhaotongL/cisMRcML https://doi.org/10.5281/zenodo.12523233 (2024).
Acknowledgements
This research was supported by NIH grants R01 AG065636 (to Z.L. and W.P.), R01 AG069895 (W.P.), RF1 AG067924 (W.P.), U01 AG073079 (W.P.), R01 AG074858 (W.P.), R01 HL116720 (W.P.) and R01 GM126002 (W.P.), and by the Minnesota Supercomputing Institute at the University of Minnesota (to Z.L. and W.P.). This manuscript is largely based on a chapter of the first author’s PhD dissertation at the University of Minnesota. There is no issue of copyright with the dissertation; see https://pqstaticcontent.proquest.com/collateral/media2/documents/umi_embargorest.pdf.
Author information
Contributions
Z.L. and W.P. conceived the methods and wrote the manuscript. Z.L. performed all the data analysis and simulation studies.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Lin, Z., Pan, W. A robust cis-Mendelian randomization method with application to drug target discovery. Nat Commun 15, 6072 (2024). https://doi.org/10.1038/s41467-024-50385-y