Mendelian Randomization Analysis Using Mixture Models (MRMix) for Genetic Effect-Size-Distribution Leads to Robust Estimation of Causal Effects

We propose a novel method for robust estimation of causal effects in two-sample Mendelian randomization analysis using potentially large number of genetic instruments. We consider a “working model” for bi-variate effect-size distribution across pairs of traits in the form of normal-mixtures which assumes existence of a fraction of the genetic markers that are valid instruments, i.e. they have only direct effect on one trait, while other markers can have potentially correlated, direct and indirect effects, or have no effects at all. We show that model motivates a simple method for estimating causal effect (θ) through a procedure for maximizing the probability concentration of the residuals, , at the “null” component of a two-component normal-mixture model. Simulation studies showed that MRMix provides nearly unbiased or/and substantially more robust estimates of causal effects compared to alternative methods under various scenarios. Further, the studies showed that MRMix is sensitive to direction and can achieve much higher efficiency (up to 3–4 fold) relative to other comparably robust estimators. We applied the proposed methods for conducting MR analysis using largest publicly available datasets across a number of risk-factors and health outcomes. Notable findings included identification of causal effects of genetically determined BMI and ageat-menarche, which have relationship among themselves, on the risk of breast cancer; detrimental effect of HDL on the risk of breast cancer; no causal effect of HDL and triglycerides on the risk of coronary artery disease; a strong detrimental effect of BMI, but no causal effect of years of education, on the risk of major depressive disorder.


Introduction
Discoveries of genetic susceptibility variants underlying complex traits continue to increase rapidly with ever growing size of genome-wide association studies 1-5 . Mendelian randomization (MR) -a form of instrumental variable analysis for the assessment of the causal effect of one trait on another -provides major opportunity for translation of the increasing knowledge of genetics to improve human health 6,7 . MR analysis has already been widely used for obtaining evidence for drug targets, causal basis for epidemiologic associations and cascading effects among complex molecular traits 8,9 . While MR was originally designed to be used with individual genetic instruments with known biologic functions, the recent trend has been to exploit the multitude of genetic variants emerging from large GWAS. The use of multiple genetic variants can allow to increase the power of MR analysis, correct for weak instrument bias and can provide evidence of causality across a broader set of underlying mechanisms for intervening on the traits 7,8 .
When all of the selected variants satisfy the key assumption of MR analysis, i.e. they only have direct effects on one trait, then the causal effect of that trait on the other can be efficiently estimated by meta-analysis of the well-known ratio estimates of causal effects across the different variants. It is, however, recognized that while availability of many variants provides an opportunity to strengthen MR analysis, there is also major potential for bias as the key assumption, i.e. the variants have no direct effect on the second trait, can be violated due to pleiotropy 7,8 . Indeed, recent empirical studies 3,10-15 have unequivocally shown that common variants have wide spread pleiotropic effects, and consequently, polygenic MR analysis can be susceptible to bias. Originally, some methods were developed that allow genetic variants to have pleiotropic effects, but they require the strong InSIDE assumption, i.e. the direct and indirect effects are uncorrelated [15][16][17][18] . Under this assumption, any genetic correlation between the two traits, as measure by the selected SNPs, can arise solely due to underlying causal relationship between the two traits.
The effects of genetic variants on multiple traits can be correlated when they are mediated through common causal factors, a scenario that is likely to be prevalent in practice. Thus, most recently there has been effort to develop methods that can allow for the presence of invalid instruments which may have complex, possibly correlated pleiotropic effects. In particular, median-and mode-based ratio estimators have been proposed for removing effects of invalid instruments for the estimation of causal effects under different assumptions 19,20 . Further, a recent study proposed the use of methods for outlier detection for conducting sensitivity analysis of ratio estimates to the presence of invalid instruments 21 . While these and other new methods present important progress, there remains important gaps as they can be susceptible to substantial residual bias in the presence of large number of invalid instruments or/and can be inefficient in the regard that they may produce estimates of causal effects with large uncertainty.
In this article, we propose a novel method for estimation of causal effects in multi-marker MR analysis by taking advantage of a working parametric model for the underlying bivariate effect-size distribution of the SNPs across pairs of traits. The model allows genetic correlation to arise from both causal and non-causal relationships, i.e. due to correlated pleiotropic effects. For estimation of causal effects, however, we do not use the likelihood of the data under the model itself to reduce sensitivity to the underlying assumptions.
Instead, we propose an estimating equation approach that essentially requires maximization of the probability concentration of the residuals, − , at the "null" component of a normal-mixture model, the form of which is derived from the proposed model. We use extensive simulation studies to show that the proposed method can provide much better trade-off between bias and variance than existing estimators even in a wide set of scenarios. We apply the proposed and existing methods for conducting MR analysis across a variety of exposures and health outcomes using publicly available summary-statistics from very large GWAS. Analysis reveals important differences across methods and new insights to causal relationships underlying some of these traits.

Model setup
We propose a method for two-sample MR analysis that requires only summary-level GWAS association statistics for a putative exposure (Trait 1) and the outcome (Trait 2) of interest from separate studies. We describe the proposed method in the context of independent SNPs. Let ( , , , ) , = , , … , denote the underlying true association coefficients for the SNPs for the exposure ( ) and the outcome .

Simulation setup
We conducted extensive simulation studies to evaluate the proposed method under different scenarios.
We simulated summary statistics of 200,000 independent SNPs. We first simulate the direct effect sizes ( , ) and compute the total effect sizes as: = and = + . Then we generated the summary statistics by simulating independently from ∼ ( , ) and ∼ ( , ), where and are the sample sizes for studies associated with and , respectively. This mimics the two-sample MR setup where the exposure and the outcome are measured on independent samples. In all our simulations, we set = where we vary to = , to reflect the fact that for many disease outcomes the effective sample size for can be expected to be lower than continuous exposures .
We simulated true effect-sizes for the SNPs under the hypothesized four component model as well as more complex models that includes additional mixture components.
Under Scenario A, we simulated effect sizes and from the four-component mixture model (Figure 1) In Scenario B, we simulated effect-sizes using more complex normal mixture model that allows existence of clusters of non-null SNPs with distinctly larger effects than others 4 .
In particular, we allow a fraction of causal SNPs for to have distinctly larger effects and we assume the SNPs that have larger effects are more likely to be valid IVs.
Similarly, among SNPs which have direct effect on , we allow existence of SNPs which have distinctly larger effects and these SNPs are less likely to have direct effect on .
We allow SNPs with distinct cluster of effect-sizes through incorporation of additional normal-mixture components with distinct variance-component parameters (see Supplementary Notes for more details).

Inclusion of SNPs with liberal threshold
While in our main analysis, we focused on analysis based on SNPs as "instruments" that have achieved genome-wide significance in the study associated with , we explored the ability of the method to handle additional SNPs below genome-wide significance. When SNPs are included using more liberal threshold, one would expect a fraction of these SNPs to be "null". In the presence of null SNPs, the probability concentration of the two-component mixture model for the residuals at the null component is + , where is the proportion of valid instruments and is the proportion of null SNPs. Thus, while inclusion of more SNPs as potential instruments could lead to increase in efficiency due to increase in the underlying valid instruments, if a very liberal threshold is used, then large value of can obscure the enrichment that is expected for the distribution of residual at the null component due to . Thus, one would expect there would be is an optimal threshold for SNP selection, as is typically observed for building polygenic risk scores for risk-prediction. We varied the p-value threshold for instrument selection to 0.005, 5×10 RV , 5×10 RW and studied the bias and standard errors for resulting MRMix estimates. Further, as the winner's curse problem can create bias when selection of SNPs and estimation of their effects are done based on the same study, we also studied the performance of MRMix when effect-sizes of the SNPs associated with are estimated based on an independent dataset than the one used to select the SNPs.

Summary level data
We applied MRMix for the analysis of publicly available GWAS summary level data to explore causal relationships underlying a variety of exposure-outcome pairs of interest.
We selected these pairs based on available sample sizes and number of underlying instruments, existing evidence of epidemiologic associations in the literature or/and evidence of causality from recent MR studies. On the exposure side, we accessed data for height and body mass index 5 , blood lipids 23 , education attainment 24 , blood pressure 25 and age at menarche 26,27 . For data analysis, we only selected SNPs to be potential "instruments" if they reach genome-wide significance (p-value < 5×10 RY ) in the respective studies. Further, we used LD-clumping with an threshold of 0.1 to select a set of independent instruments for each trait. The number of instruments across the different exposures varied between 70-3801, with the largest numbers being available for height (K=3801), BMI (K=973) and blood pressure (K=237) due to availability of results from the large UK Biobank study.
On the outcome side, we accessed data for coronary artery disease (CAD) 28 for its analysis in relationship to known major risk factors BMI, blood lipids and blood pressure [29][30][31][32][33][34][35] ; breast cancer in relationship to several known epidemiologic risk-factors, including height, age-at-menarche, BMI, cholesterol level [36][37][38][39] ; and major depressive disorder (MDD) 40 in relationship to BMI and years of education [41][42][43] . In addition, we explored potential causal interrelationship among some of the exposures themselves, such as between BMI and blood pressure, and between BMI and age-at-menarche 44

Simulations
Simulation studies show that MRMix can be far more robust compared to existing popular methods for MR analysis in a wide range of scenarios ( Simulation studies also reveal MRMix estimates have much higher precision, i.e.
smaller standard errors, relative to comparably robust estimators. In particular, the relative efficiency of MRMix compared to the weighted mode estimator, evaluated as the inverse of the ratio of respective variances, reached up to 3-4 fold across the different scenarios considered. This gain in efficiency was larger when higher proportion of the IVs was valid. As expected, the IVW method had the smallest standard errors across all the scenarios, but because it produces severe bias its efficiency is not comparable to that of MRMix. There were also several scenarios where the weighted median estimator had smaller standard error compared to MRMix, but in all of these cases the former method produced substantially more biased estimates. Finally, across all scenarios, the egger regression method produced estimates with much larger standard errors than all the other methods considered.
Reverse directional MR analysis shows that MRMix is highly sensitive to the causal direction (Supplementary Table 3). In contrast, the IVW, egger regression and the weighted median method reported estimates of causal effect of substantial magnitude from the outcome ( ) to the exposure ( ) even though no such relationship existed. The weighted mode method, similar to MRMix, was found to be robust in the regard that it estimated the causal effect in the opposite direction close to its expected null value.
In alternative simulation scenario, where we allow SNPs with larger effects to be more likely to be valid IVs, the methods rank similarly as described above, although all the methods tend to be less biased (Supplementary Tables 4). Simulation studies also showed that when the number of selected instruments were large, the bootstrap method for estimation of standard error through re-sampling of SNPs is generally quite accurate (Supplementary Table 5).

Data Analysis
The MRMix analysis detected significant causal effects of genetically determined LDL-C, BMI and blood pressure, but not that for HDL-C and triglycerides, on the risk of coronary artery diseases (CAD) ( Table 2) although did not explicitly account for BMI, produced estimate of causal effect that is both qualitatively and quantitative similar as those reported from BMI adjusted IVW in previous studies 27,50 . The weighted mode method, while pulled the estimate more towards the right direction compared to IVW, the estimate was attenuated and had large standard error resulting in statistically non-significant finding. Further, bi-directional MR analysis between BMI and AAM using MRMix suggested that genetically predicted BMI has an inverse causal effect on AAM, and not the other way (Supplementary Tables 6   and 7), and thus it appears that the SNPs which are associated with AAM through BMI, which, on its own, influences the risk of BC, are the underlying invalid instruments. The example demonstrates that MRMix has the ability to produce robust and efficient estimate of causal effects in the presence of potentially unobservable confounding factors.
Additional data analyses also illustrated the distinct property of MRMix compared to alternatives. In particular, the MRMix method estimated the causal effect of HDL and triglycerides on the risk of CHD to be virtually null, while standard IVW and weighted median methods detected these effects to be significantly away from null in directions consistent with known epidemiologic associations. A recent study reported significant putative causal effect of years of education on reducing risk of major depressive disorder based on standard IVW analyis 40 . The MRMix analysis indicated that such a causal effect is virtually non-existent and relationship reported previously likely to have been driven by pleiotropic effects. In contrast, MRMix found the causal effect of genetically determined BMI on increasing the risk of MDD to be notably stronger than that is indicated by the traditional IVW method. Thus, it is likely that that there are common genetic pathways underlying these traits which leads to genetic correlation in opposite directions than that is due to the direct causal effect of BMI on MDD. Another study proposed analysis of causal relationship between traits based on genetic relationship, but using a different framework than that for the standard MR analysis. The study defined one trait to be partially or fully causal for another, if there is an underlying genetically determined latent variable which influences both traits, but has a stronger relationship with the first than the second 52 . The study defined moment equations for the estimation of parameters quantifying degree of partial causality using GWAS summarystatistics. We believe this novel framework and more traditional MR hypothesis can complement each other to provide improved understanding of the nature of genetic correlation across traits. The use of latent variable framework, for example, detected evidence of partial causality of several cholesterol traits on blood pressure level. The MRMix analysis, as well as none of the other MR analysis methods, detected any evidence of direct causal effects underlying these traits (see Supplementary Table 6).
Thus the evidence of partial causality is likely to have been primarily driven by the existence of underlying common genetic pathways which are more strongly related to cholesterol level than blood pressure.
The MRMix method has limitations as well. First, the method relies on certain model for underlying effect-size distribution. As we have noted before, the mis-specification of effect-size distribution is not as critical as we do not use the model directly to perform maximum likelihood estimation, but instead use the model as an efficient way of identifying certain "mode"-based estimator based on underlying estimating equations.
Simulation studies show that even when the underlying effect-size distribution has more complexity than the assumed model, the estimation of causal effect parameters can remain relatively robust. Nevertheless, more extensive simulation and theoretical studies are needed to further understand the property of the method under complex but realistic models for effect-size distributions as has been evidenced from recent study 4 .
Second, the method does require pre-selection of SNPs as genetic instruments based on p-value for significance of their association. We have observed that as long as the significance threshold is stringent (e.g. p-value < 5×10 RY ), there was not substantial winner's curse bias due to selection of SNPs and estimation of their coefficients from the same study (Supplementary In conclusion, MRMix provides a novel tool for conducting robust and powerful MR analysis using large number of genetic instruments that are now rapidly becoming available from recent expansion of GWAS. We demonstrate through simulation studies, as well as variety of real data analyses, that the method has unique ability to trade-off bias and efficiency for estimation of causal effects in the presence of invalid instruments. Application of MRMix for future MR studies will lead to improved understanding of causal basis of genetic correlation across traits. (2) direct effects on both ! and "; (3) direct effect on " but no relationship with !; (4) related to neither ! nor ". Direct effects are denoted by $ % and $ & and the causal effect is denoted by '. .57 (0.05) a . / : sample size of the study associated with (; . 0 : sample size of the study associated with ). b IVs are defined as SNPs which reach genome-wide significance (1 < 5×10 78 ) in the study associated with (. IVs are defined as SNPs which reach genome-wide significance (! < 5×10 '( ) in the study associated with ). c LDSC: LD score regression estimates of causal effects is defined as * + /ℎ . / , the ratio between the genetic covariance and the heritability of the exposure (see Supplementary Notes for details).