Effect of selection bias on two sample summary data based Mendelian randomization

Wang, Kai; Han, Shizhong

doi:10.1038/s41598-021-87219-6

Download PDF

Article
Open access
Published: 07 April 2021

Effect of selection bias on two sample summary data based Mendelian randomization

Kai Wang¹ &
Shizhong Han^2,3

Scientific Reports volume 11, Article number: 7585 (2021) Cite this article

13 Citations
2 Altmetric
Metrics details

Subjects

Abstract

Mendelian randomization (MR) is becoming more and more popular for inferring causal relationship between an exposure and a trait. Typically, instrument SNPs are selected from an exposure GWAS based on their summary statistics and the same summary statistics on the selected SNPs are used for subsequent analyses. However, this practice suffers from selection bias and can invalidate MR methods, as showcased via two popular methods: the summary data-based MR (SMR) method and the two-sample MR Steiger method. The SMR method is conservative while the MR Steiger method can be either conservative or liberal. A simple and yet more powerful alternative to SMR is proposed.

Mendelian randomization

Article 10 February 2022

Understanding the assumptions underlying Mendelian randomization

Article 26 January 2022

Robust Mendelian randomization in the presence of residual population stratification, batch effects and horizontal pleiotropy

Article Open access 01 March 2022

Introduction

As a feasible alternative to expensive and sometimes impossible randomized clinical trials, Mendelian randomization (MR) is becoming more and more popular for inferring causal relationship between an exposure and a trait^1,2,3. Summary data-based two-sample MR methods often take the following two steps:

Step 1:: Obtain instruments (typically SNPs) from exposure GWAS (Genome-Wide Association Study) that are significant at genome-wide level (typically $p<5\times 10^{-8}$);
Step 2:: Investigate the causal relationship between the exposure and the trait, using the summary exposure GWAS statistics at the selected SNPs and a trait GWAS. The summary exposure GWAS statistics are those used in Step 1 for SNP selection.

One appealing feature of these methods is that they only rely on summary statistics on the exposure GWAS and the trait GWAS. Individual-level data are not needed.

The inference validity of this two-step approach is affected by selection bias. When conducting causal inference in Step 2 with respect to the SNPs selected in Step 1, the summary statistics from the exposure GWAS can not be regarded as random samples for the true population association strength^4,5,6. Treating them as random samples leads to over-estimation of the effect size of these SNPs on the exposure. Association strength in a random sample is often much weaker, a phenomenon commonly seen in studies aimed at replicating previous findings. This selection bias has been noted in the literature^6,7. But its effect on hypothesis testing related to two sample summary data based Mendelian randomization is largely unknown.

Two popular MR methods, the summary data-based MR method² and the two-sample MR Steiger method¹, are considered. For the summary data-based MR method, the most significant SNP (instead of several SNPs) from a gene is selected as the instrument from the exposure GWAS. For the two-sample MR Steiger method, a SNP significantly associated with both the exposure GWAS and the trait GWAS is selected. The genotype score (0, 1, or 2) at this SNP is denoted by g. The exposure level is denoted by x and the trait value is denoted by y. The Wald statistic on chi-square scale for testing the association between the SNP and the exposure is denoted by $W_{gx}$. Its value is supposed to be large because it satisfies the selection criterion used in Step 1. For instance, when the selection criterion is $p<5\times 10^{-8}$, there must be $W_{gx}>29.71679$. The Wald statistic for testing the association between the SNP and the trait is denoted by $W_{gy}$, which is independent of $W_{gx}$.

Results

Summary data-based MR

Summary data-based MR² (SMR) is a popular MR method for inferring causality between x and y. Its null is $H_0: b_{xy}=0$, where $b_{xy}$ is the true regression coefficient for x with y the response. The two-stage least square (2SLS) estimate of $b_{xy}$ is

$$\begin{aligned} \hat{b}_{xy} = \frac{\hat{b}_{gy}}{\hat{b}_{gx}}, \end{aligned}$$

(1)

where $\hat{b}_{gx}$ is the least square estimate of $b_{gx}$, the regression coefficient for g with x the response, and $\hat{b}_{gy}$ is the least square estimate of $b_{gy}$, the regression coefficient for g with y the response. $\hat{b}_{xy}$ is also known as the Wald ratio⁵. Causal relationship between exposure x and y exists if the following test statistic is significant²:

$$\begin{aligned} T_\text {SMR} = \frac{W_{gx}W_{gy}}{W_{gx}+W_{gy}}, \end{aligned}$$

where $W_{gx} = [\hat{b}_{gx} / SE({\hat{b}}_{gx})]^2$ and $W_{gy} = [{\hat{b}}_{gy} / SE({\hat{b}}_{gy})]^2$ are Wald statistics on chi-square scale. The null distribution of $T_\text {SMR}$ is approximated by 1-df chi-square using the Delta method².

There are several issues with statistic $T_\text {SMR}$. The derivation of its null distribution assumes that ${\hat{b}}_{gx}$ is a consistent estimator of $b_{gx}$ and (asymptotically) follows a normal distribution (², Online Methods). However, these two conditions do not hold. If the significance level used in Step 1 is $5\times 10^{-8}$, there must be $W_{gx}\ge 29.71679$ which implies $|\hat{b}_{gx}| \ge \sqrt{29.71679}\times SE(\hat{b}_{gx})$. As a result, the distribution of $\hat{b}_{gx}$ is not (asymptotically) normal and ${\hat{b}}_{gx}$ is not a consistent estimator of $b_{gx}$. To numerically demonstrate this point, 10,000 random samples of $W_{gx}$ are generated from a 1-df chi-square with a large non-centrality 13 (to make sure there are reasonable number of $\{W_{gx}\}$). Among them, 322 are significant at genome-wide significant level $5\times 10^{-8}$. The quantile-quantile plot of these selected $\{W_{gx}\}$ against 322 random samples $\{W_{gy}\}$ from 1-df chi-square with non-centrality 13 is shown in Fig. 1. The distribution of $\{W_{gx}: W_{gx}\ge 29.71679\}$ is clearly different from the distribution of $\{W_{gx}\}$.

The applicability of the Delta method to approximating the distribution of $T_\text {SMR}$ is in doubt even in the absence of the selection process imposed on $W_{gx}$. Approximating the null distribution of $T_\text {SMR}$ by a 1-df chi-square is equivalent to approximating the null distribution of ${\hat{b}}_{xy}$ by a normal distribution. However, according to Eq. (1), ${\hat{b}}_{xy}$ is a ratio of two normals. In general, the distribution of the ratio of two normal variables can not be approximated by a normal as it can take a variety of shapes such as bimodal, unimodal, or asymmetric⁸. It is known that if $b_{gx}$ and $b_{gy}$ are both equal to 0, the distribution of ${\hat{b}}_{xy}$ would be a Cauchy, a fat-tailed distribution whose mean and variance do not exist. For the case $b_{gx}\not =0$ and $b_{gy}\not =0$, the distribution of ${\hat{b}}_{xy}$ can be approximated by a normal only in certain intervals⁸.

For the case $b_{gy}=0$, to our best knowledge, there are no known theoretical results regarding whether the distribution of ${\hat{b}}_{xy}$ can be approximated by a normal. The only thing we are sure about is that the distribution of ${\hat{b}}_{xy}$ is symmetric because the distribution of $-{\hat{b}}_{xy}=(-{\hat{b}}_{gy})/{\hat{b}}_{gx}$ is the same as the distribution of ${\hat{b}}_{xy}$. A numerical example is used to examine the distribution of ${\hat{b}}_{xy}$. Ten thousand random ${\hat{b}}_{gx}$’s are generated from $N(\sqrt{13}, 1)$ and 10,000 random ${\hat{b}}_{gy}$ are generated from N(0, 1). A normal quantile plot of ${\hat{b}}_{gy}/{\hat{b}}_{gx}$ is generated using the qqnorm and qqline functions in R with their default settings and is shown in Fig. 2 (left panel). Similar to a Cauchy distribution, the distribution of ${\hat{b}}_{xy}={\hat{b}}_{gy}/{\hat{b}}_{gx}$ is apparently fat-tailed compared to a normal: the lower end is more negative while the upper end is more positive.

A normal quantile plot is also generated for $\{{\hat{b}}_{xy}: \hat{W}_{gx}\ge 29.71679\}$ and is shown in the right panel of Fig. 2. It may be surprising that the distribution of $\{{\hat{b}}_{xy}: W_{gx}\ge 29.71679\}$ appears to be a normal. The reason of this phenomenon is that the range of ${\hat{b}}_{gx}$ is greatly reduced under the selection criterion. According to Eq. (1), ${\hat{b}}_{xy}$ is roughly proportional to ${\hat{b}}_{gy}$ with high probability.

A more general argument that the approximating distribution of $T_\text {SMR}$ is not 1-df chi-square is the following. Since

$$\begin{aligned} T_\text {SMR} = W_{gy}\cdot \frac{1}{1+W_{gy}/W_{gx}}, \end{aligned}$$

there is $T_\text {SMR}< W_{gy}$ regardless of the distribution of $W_{gx}$. That is, $T_\text {SMR}$ is always dominated by $W_{gy}$. Similarly, $T_\text {SMR}$ is always dominated by $W_{gx}$. Therefore, $T_\text {SMR}<\min \{W_{gx}, W_{gy}\}$. Since $W_{gx}$ and $W_{gy}$ approximately follow independent 1-df chi-square distributions, the approximating distribution of $\min \{W_{gx}, W_{gy}\}$ can not be 1-df chi-square. Neither the approximate distribution of $T_\text {SMR}$. Using a 1-df chi-square distribution for $T_\text {SMR}$ results in a conservative test.

We performed extensive simulations to investigate the null distribution of the SMR statistic in a more realistic setting by using imputed GWAS genotype data from the Atherosclerosis Risk in Communities (ARIC) study of European-ancestry samples⁹. Specifically, we simulated gene expression levels for each Ensemble gene on autosomes at varying numbers of causal eQTLs (n = 1, 5, and 10), (narrow sense) heritability levels ($h^2$ = 0.1, 0.2, 0.4, 0.8), and sample sizes (N = 250, 500, 1000, and 2000). We tested association between all SNPs within each gene and expression levels of the gene, and only genes whose top associated SNP met the selection criteria ($p < 5 \times 10^{-8}$) were subjected to SMR test. GWAS association signals were randomly assigned from a standard normal distribution. Figure 3 shows the QQ plot for the SMR statistics when instrumental eQTLs were selected from genes with 5 causal eQTLs and a level of heritability = 0.4 at all four sample sizes. Clearly, the SMR statistics were lower than expected null values at the tail of distribution, though the distribution became closer to the null at larger sample size, which may be explained by the stronger eQTL signals as shown in our numerical example above. The complete set of QQ plots for the SMR test statistic are shown in Supplementary Figs. S1–S12 online. Overall, our simulations showed that the SMR statistics were conservative and did not strictly follow the 1-df chi-squire distribution, especially when the effect size of each individual eQTL was small on average. These results are consistent with our theoretical insights.

More on SMR and a conditional test

One may want to use an estimate of $b_{gx}$ that takes into account the selection. However, such an estimate is not expected to be simple given the complexity of the selection (e.g., the SNP is the most significant one among a number of SNPs). Another alternative is to use another exposure GWAS independent of the exposure GWAS used in Step 1 to estimate $b_{gx}$ and then compute $T_\text {SMR}$. However, this is not recommended because $T_\text {SMR}$ is inherently conservative. $T_\text {SMR}$ is equal to the half of the harmonic mean of $W_{gx}$ and $W_{gy}$. Fixing one of $W_{gx}$ and $W_{gy}$, say $W_{gx}$, and change $W_{gy}$, $T_\text {SMR}$ reaches its smallest value $W_{gx}/2$ when $W_{gy} = W_{gx}$ and converges to $W_{gx}$ when $W_{gy}\rightarrow \infty$. The conservativeness of $T_\text {SMR}$ is also observed in simulation studies by Veturi and Ritchie¹⁰.

The null hypothesis for $T_\text {SMR}$ was not specifically defined in Zhu et al.². It is unlikely to be the intended null $H_0: b_{xy}=0$. Actually, similar to the Sobel’s statistic popular in mediation analysis, the null corresponding to $T_\text {SMR}$ is $H_0: b_{gx}=0 \text { or } b_{gy}=0$. For this null, a statistic more powerful than $T_\text {SMR}$ is $\min \{W_{gx}, W_{gy}\}$. The statistic $\min \{W_{gx}, W_{gy}\}$ rejects the null $H_0: b_{gx}=0 \text { or } b_{gy}=0$ if and only if both $W_{gx}$ and $W_{gy}$ are significant. Therefore, whenever $\min \{W_{gx}, W_{gy}\}$ rejects the null, $T_\text {SMR}$ will but not vice versa. This is because $T_\text {SMR} < \min \{W_{gx}, W_{gy}\}$.

A test more powerful than $\min \{W_{gx}, W_{gy}\}$ (hence also more powerful than $T_\text {SMR}$) in the current situation is a conditional test. Because the SNP is selected for its significant association with the exposure, the situation $b_{gx}=0$ can be excluded. Given this information, a meaningful null would be $H_0: b_{gy}=0, b_{gx}\not =0$ for which a test statistic is $W_{gy}$. The null is rejected when $W_{gy}$ is significant. This test, conditional on a significant $W_{gx}$ statistic, assumes that there is no pleiotropy. That is, the selected SNP affects the trait only through the exposure and there are no other paths. In other words, the selected SNP is a valid instrument. In light of Eq. (1), $b_{gy}=0$ if and only if $b_{xy}=0$ when the possibility of $b_{gx}=0$ is excluded. Hence the null for this conditional test is equivalent to $H_0: b_{xy}=0$. This test is asymptotically valid because $W_{gy}$ asymptotically follows a 1-df chi-square distribution. The threshold for significance for this test is not at the genome level. Rather, it is at the gene level and only needs to be corrected for the number of genes for which SNPs are selected for instruments. This results in a more powerful testing procedure than using a genome-wide threshold.

An empirical study

We compared the performance of conditional test we proposed and the SMR test on an empirical study of schizophrenia. We used to-date the largest GWAS summary statistics for schizophrenia¹¹ and the eQTL results from analysis of 1387 brain samples (prefrontal cortex) by the PsychENCODE¹² (downloaded from the SMR data resource website). In total, 9639 genes were tested for SMR at a top associated cis-eQTL ($p < 5 \times 10^{-8}$) and 65 genes were significant after Bonferroni correction. In contrast, the conditional test, whose test statistic is $W_{gy}$ and considers only those instrumental eQTLs, discovered 127 Bonferroni-significant genes, including 62 genes not detected by SMR ($p<0.05/9639 = 5.18726\times 10^{-6}$. Supplementary Table S1 online). Among those genes missed by SMR, there were several strong candidates for schizophrenia, such as AKT3^13,14,15, RGS6^16,17, and KCNN3. It may not be surprising that AKT3 and RGS6 were identified as these two harbored genome-wide significant variants ($p < 5 \times 10^{-8}$) in original GWAS¹¹, but the discovery of KCNN3 was novel and the strongest SNP-level association evidence for this gene was only at $p = 9 \times 10^{-7}$ (rs10796933). Of note, our previous study also showed evidence for the association of KCNN3 with schizophrenia through integrated analysis of GWAS with methylation QTL¹⁸.

Two-sample MR Steiger method

The two-sample MR Steiger method^1,19 assumes that there is a causal relationship between the exposure and the trait and that the selected SNP is a valid instrument for one of them (but it is unknown for which one)¹. A SNP is selected not only for its association with the exposure but also for its association with the trait^1,19. The null for the two-sample MR Steiger test is $H_0: \rho _{gx} = \rho _{gy}$ where $\rho _{gx}=Corr(g, x)$ and $\rho _{gy}=Corr(g, y)$ are the (population) Pearson correlation coefficients. Let $\hat{\rho }_{gx}$ and $\hat{\rho }_{gy}$ be the sample correlation coefficients corresponding to $\rho _{gx}$ and $\rho _{gy}$, respectively. Using Fisher’s Z transformation, there are

$$\begin{aligned} z_{gx}&:= \frac{1}{2}\ln \frac{1+|\hat{\rho }_{gx}|}{1-|\hat{\rho }_{gx}|} \nonumber \\&\quad \sim N\left( \frac{1}{2}\ln \frac{1+|\rho _{gx}|}{1-|\rho _{gx}|}, \frac{1}{n_x-3}\right) , \text { and} \end{aligned}$$

(2)

$$\begin{aligned} z_{gy}&:= \frac{1}{2}\ln \frac{1+|\hat{\rho }_{gy}|}{1-|\hat{\rho }_{gy}|} \nonumber \\&\quad \sim N\left( \frac{1}{2}\ln \frac{1+|\rho _{gy}|}{1-|\rho _{gy}|}, \frac{1}{n_y-3}\right) , \end{aligned}$$

(3)

where $n_x$ and $n_y$ are sample sizes. The null $H_0: \rho _{gx} = \rho _{gy}$ is equivalent to saying that the mean of $z_{qx}$ is equal to the mean of $z_{qy}$. The two-sample MR Steiger method uses the following statistic^1,19:

$$\begin{aligned} T_\text {Steiger} = \frac{z_{gx} - z_{gy}}{\sqrt{1/(n_x-3) + 1/(n_y-3)}} \sim N(0,1). \end{aligned}$$

If $T_\text {Steiger}$ is significant and positive, the causal direction is from x to y. If $T_\text {Steiger}$ is significant and negative, the causal direction is from y to x.

However, the statistic $T_\text {Steiger}$ does not approximately follow a standard normal distribution because the SNP is selected for its significant p-values. Using a selection criterion $p<5\times 10^{-8}$, or $W_{gx}$ and $W_{gy}$ greater than 29.71679 on 1-df chi-square scale, the sample correlation coefficients $|\hat{\rho }_{gx}|$ and $|\hat{\rho }_{gy}|$ would be at least 0.4823663, 0.1700451, or 0.05443772 for $n_x=$ 100, 1000, or 10,000 given the relationship $|\hat{\rho }_{gx}| = 1/\sqrt{1+(n_x-2)/W_{gx}}$. Although this selection procedure is useful for selecting the instrument SNP, it imposes a lower limit on $|\hat{\rho }_{gx}|$ and $|\hat{\rho }_{gy}|$. $|\hat{\rho }_{gx}| \left( |\hat{\rho }_{gy}|\right)$ over-estimates $|\rho _{gx}| \left( |\rho _{gy}|\right)$ and is not consistent. The mean of the statistic $T_\text {Steiger}$ is not around 0 even when $H_0: \rho _{gx} = \rho _{gy}$ holds if $n_x\not =n_y$. The distribution of $z_{gx}$ is truncated and is not normal. So is the distribution of $z_{gy}$. The variance of $z_{gx}$ is smaller than $1/(n_x-3)$ due to selection. Similarly, the variance of $z_{gy}$ is smaller than $1/(n_y-3)$. When $n_x=n_y$, the numerator of $T_\text {Steiger}$ is around 0 and $T_\text {Steiger}$ is conservative. When $n_x\not =n_y$, the numerator of $T_\text {Steiger}$ is no longer around 0 and $T_\text {Steiger}$ is liberal. Overall, the distributions of $z_{gx}$ and $z_{gy}$ are truncated normal instead of normal. The argument that the statistic $T_\text {Steiger}$ follows asymptotically a standard normal does not hold. The two-sample MR Steiger method can be either liberal or conservative.

Numerical examples are constructed. First we consider the case $n_x = 1000, n_y = 10{,}000$ and demonstrate the effect of selection severity. Ten thousand random samples of $z_{gx}$ and $z_{gy}$ are independently generated from the normal distributions shown in Eqs. (2) and (3). These $z_{gx}$ and $z_{gy}$ form a $10{,}000\times 2$ matrix. The first column contains values for $z_{gx}$ and the second for $z_{gy}$. Only the rows satisfying $z_{gx}\ge 0.5\ln [(1+0.1700451)/(1-0.1700451)]=0.17171315$ and $z_{gy}\ge 0.5\ln [(1+0.05443772)/(1-0.05443772)]=0.05449159$ are kept. This selection criterion corresponds to $5\times 10^{-8}$ on the p-value scale. When $\rho _{gx}=\rho _{gy}=0.15$, there are 2557 $(z_{gx}, z_{gy})$ selected on which the statistic $T_\text {Steiger}$ is computed. The sample mean of selected $\{z_{gx}\}$ is 0.1903508 while the sample mean of selected $\{z_{gy}\}$ ($=0.1512602$) is lower, as expected. A normal quantile-quantile plot of $T_\text {Steiger}$ is shown in the left panel of Fig. 4. Clearly the distribution of $T_\text {Steiger}$ is different from normal. Type I error rates are inflated. At significance level 0.05 and 0.01, the type I error rates (i.e., the proportion of significant $T_\text {Steiger}$ statistics) are 0.08916699 and 0.01486117, respectively. If $\rho _{gx}=\rho _{gy}=0.19$, the selection is less severe. Almost 75% (7434 out of 10,000) $(z_{gx}, z_{gy})$s are selected. Even so, the distribution of $T_\text {Steiger}$ shows apparent departure from normal as shown in the right panel of Fig. 4. At significance level 0.05 and 0.01, the type I error rates are 0.0306699 and 0.005111649, respectively. In this case, $T_\text {Steiger}$ appears to be conservative.

We also considered larger sample sizes. When $n_x=100{,}000, n_y=300{,}000$, and $\rho _{gx}=\rho _{gy}=0.015$, 2,375 $(z_{gx}, z_{gy})$s are selected. When $n_x=150{,}000, n_y=400{,}000$ and $\rho _{gx}=\rho _{gy}=0.015$, 6,410 $(z_{gx}, z_{gy})$s are selected. As shown in Fig. 5, there is apparent departure of the distribution of $T_\text {Steiger}$ from a normal. At significance level 0.05, the type I error rate is 0.09515789 for the case $n_x=100{,}000, n_y=300{,}000$ and is 0.03728549 for $n_x=100,100, n_y=400{,}000$. At significance level 0.01, the type I error rates are 0.01515789 and 0.006084243, respectively. The type I error rates can be either inflated or deflated.

One remedy would be to estimate $\rho _{gx}$ and $\rho _{gy}$ by maximizing the conditional likelihood given the SNP selection criteria. Let $\phi (\cdot )$ and $\Phi (\cdot )$ denote the density function and the distribution function of the standard normal, respectively. The likelihood ratio statistic for testing $H_0$ is $2\log (L_1/L_0)$ where

$$\begin{aligned} L_1&= \left[ \max _{\mu _{gx}}\frac{\phi (\sqrt{n_x-3}(z_{gx} - \mu _{gx}))}{1-\Phi (\sqrt{n_x-3}(c_{gx} - \mu _{gx}))}\right] \cdot \left[ \max _{\mu _{gy}}\frac{\phi (\sqrt{n_y-3}(z_{gy} - \mu _{gy}))}{1-\Phi (\sqrt{n_y-3}(c_{gy} - \mu _{gy}))}\right] , \\ L_0&= \max _{\mu _g}\left[ \frac{\phi (\sqrt{n_x-3}(z_{gx} - \mu _g))}{1-\Phi (\sqrt{n_x-3}(c_{gx} - \mu _g))} \cdot \frac{\phi (\sqrt{n_y-3}(z_{gy} - \mu _g))}{1-\Phi (\sqrt{n_y-3}(c_{gy} - \mu _g))}\right] \end{aligned}$$

with $c_{gx}$ and $c_{gy}$ selection thresholds corresponding to $z_{gx}$ and $z_{gy}$, respectively. However, due to selection, computation of $L_1$ and $L_0$ can be challenging. One alternative method is to use an exposure GWAS and a trait GWAS that are independent of those used to select the SNP. However, such studies may be impractical to obtain⁶.

Discussion

Summary statistics MR is subject to selection bias, resulting in excessive false positives (for instance, the MR Steiger method) or missed discoveries (for instance, the SMR method). This bias is a form of winner’s curse. Selection bias has been discussed in the literature in the context of the choice of the instrument SNPs⁷, colocalisation test²⁰, and estimation of exposure effect^5,6.

Our work complements previous studies on selection bias due to selection of SNPs. While previous work focused on the effect of this bias on the Wald ratio^5,6 (i.e., estimation), ours focuses on testing whether the exposure causally affects the outcome (i.e., inference). Selection bias leads to underestimation of the Wald ratio⁵ but its effect on type I error rate can be either liberal or conservative depending on the MR method used. Most importantly, the SMR method is conservative even in the absence of selection bias where ${\hat{b}}_{gx}$ is approximately normal.

Correcting for selection bias is a challenging task. Zhao et al.⁶ get around this issue by using an independent exposure GWAS. On the other hand, our conditional test, an alternative to the SMR method, uses only the trait GWAS. It may be expanded to accommodate multiple instrumental SNPs and the presence of pleiotropy.

References

Hemani, G., Tilling, K. & Smith, G. D. Orienting the causal relationship between imprecisely measured traits using GWAS summary data. PLoS Genet. 13, e1007081 (2017).
Article Google Scholar
Zhu, Z. et al. Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nat. Genet. 48, 481 (2016).
Article CAS Google Scholar
Morrison, J., Knoblauch, N., Marcus, J. H., Stephens, M. & He, X. Mendelian randomization accounting for correlated and uncorrelated pleiotropic effects using genome-wide summary statistics. Nat. Genet. 52, 740-747 (2020).
Bowden, J. & Dudbridge, F. Unbiased estimation of odds ratios: Combining genomewide association scans with replication studies. Genet. Epidemiol. 33, 406–418 (2009).
Article Google Scholar
Haycock, P. C. et al. Best (but oft-forgotten) practices: The design, analysis, and interpretation of Mendelian randomization studies. Am. J. Clin. Nutr. 103, 965–978 (2016).
Article CAS Google Scholar
Zhao, Q. et al. Statistical inference in two-sample summary-data mendelian randomization using robust adjusted profile score. Ann. Stat. 48, 1742–1769 (2020).
Article MathSciNet Google Scholar
Hemani, G. et al. The MR-Base platform supports systematic causal inference across the human phenome. Elife 7, e34408 (2018).
Article Google Scholar
Díaz-Francés, E. & Rubio, F. J. On the existence of a normal approximation to the distribution of the ratio of two independent normal random variables. Stat. Pap. 54, 309–323 (2013).
Article MathSciNet Google Scholar
The ARIC investigators. The Atherosclerosis Risk in Communities (ARIC) Study: Design and objectives. Am. J. 129, 687–702 (1989).
Google Scholar
Veturi, Y. & Ritchie, M. D. How powerful are summary-based methods for identifying expression-trait associations under different genetic architectures?. Pac. Symp. Biocomput. 23, 228–239 (2018).
PubMed PubMed Central Google Scholar
Pardiñas, A. F. et al. Common schizophrenia alleles are enriched in mutation-intolerant genes and in regions under strong background selection. Nat. Genet. 50, 381–389 (2018).
Article Google Scholar
Wang, D. et al. Comprehensive functional genomic resource and integrative model for the human brain. Science 362, eaat8464 (2018).
Schizophrenia Working Group of the Psychiatric Genomics Consortium and othersSchizophrenia Working Group of the Psychiatric Genomics Consortium and others. Biological insights from 108 schizophrenia-associated genetic loci. Nature 511, 421–427 (2014).
Article ADS Google Scholar
Ripke, S. et al. Genome-wide association analysis identifies 13 new risk loci for schizophrenia. Nat. Genet. 45, 1150 (2013).
Article CAS Google Scholar
Ruderfer, D. M. et al. Polygenic overlap between schizophrenia risk and antipsychotic response: A genomic medicine approach. Lancet Psychiatry 3, 350–357 (2016).
Article Google Scholar
Ahlers, K. E., Chakravarti, B. & Fisher, R. A. RGS6 as a novel therapeutic target in CNS diseases and cancer. AAPS J. 18, 560–572 (2016).
Article CAS Google Scholar
Radulescu, E. et al. Identification and prioritization of gene sets associated with schizophrenia risk by co-expression network analysis in human brain. Mol. Psychiatry 25, 791-804 (2018).
Han, S. et al. Integrating brain methylome with gwas for psychiatric risk gene discovery. bioRxiv 440206 (2018).
Xue, H. & Pan, W. Inferring causal direction between two traits in the presence of horizontal pleiotropy with GWAS summary data. PLoS Genet. 16, e1009105 (2020).
Article CAS Google Scholar
Wallace, C. Statistical testing of shared genetic control for potentially related traits. Genet. Epidemiol. 37, 802–813 (2013).
Article Google Scholar

Download references

Acknowledgements

This study was partially supported by National Institutes of Health grant R01MH121394 (to SH).

Author information

Authors and Affiliations

Department of Biostatistics, The University of Iowa, Iowa City, 52242, USA
Kai Wang
Lieber Institute for Brain Development, Johns Hopkins School of Medicine, Baltimore, 21205, USA
Shizhong Han
Department of Psychiatry and Behavioral Sciences, Johns Hopkins School of Medicine, Baltimore, 21205, USA
Shizhong Han

Authors

Kai Wang
View author publications
You can also search for this author in PubMed Google Scholar
Shizhong Han
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

K.W. conceived the experiment and conducted some simulations, S.H. conducted the simulation study and the empirical study. Both authors drafted and reviewed the manuscript.

Corresponding author

Correspondence to Kai Wang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Information.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Wang, K., Han, S. Effect of selection bias on two sample summary data based Mendelian randomization. Sci Rep 11, 7585 (2021). https://doi.org/10.1038/s41598-021-87219-6

Download citation

Received: 31 October 2020
Accepted: 18 March 2021
Published: 07 April 2021
DOI: https://doi.org/10.1038/s41598-021-87219-6

This article is cited by

Mendelian Randomization Analysis Identified Potential Genes Pleiotropically Associated with Polycystic Ovary Syndrome
- Qian Sun
- Yuan Gao
- Wen Yang
Reproductive Sciences (2022)

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.