Causal associations between risk factors and common diseases inferred from GWAS summary data

Health risk factors such as body mass index (BMI) and serum cholesterol are associated with many common diseases. It often remains unclear whether the risk factors are cause or consequence of disease, or whether the associations are the result of confounding. We develop and apply a method (called GSMR) that performs a multi-SNP Mendelian randomization analysis using summary-level data from genome-wide association studies to test the causal associations of BMI, waist-to-hip ratio, serum cholesterols, blood pressures, height, and years of schooling (EduYears) with common diseases (sample sizes of up to 405,072). We identify a number of causal associations including a protective effect of LDL-cholesterol against type-2 diabetes (T2D) that might explain the side effects of statins on T2D, a protective effect of EduYears against Alzheimer’s disease, and bidirectional associations with opposite effects (e.g., higher BMI increases the risk of T2D but the effect of T2D on BMI is negative).


To investigate whether there is inflation in GSMR test-statistic under the null hypothesis
We performed simulations to investigate whether there is inflation in GSMR test-statistic under the null hypothesis that there is no causal association between exposure (x) and outcome (y) but x and y are associated due to a non-genetic confounding factor. We simulated the genotypes of Note that if x and y are measured in the same sample, x will appear to be associated with y due to the confounding factor c. To mimic a two-sample MR analysis, we simulated 50,000 individuals as sample #1 (n1) and 20,000 individuals as sample #2 (n2). We then performed simple regression analysis to estimate ! '" in sample #1 and ! '# in sample #2. In addition, we simulated 5,000 individuals as sample #3 (n3) to calculate LD correlation matrix. Each simulation was replicated 1,000 times. For simplicity but without loss of generality, we used 100 SNPs in the simulations below.

SNP instruments are independent:
We simulated x using the same method as above with three levels of

SNP instruments are in LD:
We extracted SNP genotypes of a 15Mb segment on chromosome 22 from the UKB data. After pruning SNPs for LD with a LD r 2 threshold of 0.5 and a window size of 100kb, 2,006 SNPs were retained. We then randomly sampled 100 SNPs as causal variants, and simulated x and y using the same method as above. GSMR and weighted generalized linear regression (i.e., generalized MR-IVW) were used to estimate ! "# . We randomly sampled 50,000 individuals as sample #1 and 20,000 individuals as sample #2. For data analysis, the LD correlations among SNPs were estimated from a reference sample (sample #3) of 5,000 individuals randomly sampled from the ARIC data. The estimates of ! "# from 1,000 simulation replicates are shown in Supplementary Table 1.
The effect of sample size for the exposure on the unbiasedness of GSMR : We simulated x and y using the same method as above with 7 '" 2 = 0.15, 7 "# 2 = 0.05, and the sample size of sample #1 varying from 50,000 to 200,000 with an increment of 25,000. The mean estimates of ! "# from 1,000 simulation replicates are shown in Supplementary Figure 20.

To investigate whether we can interpret bxy as logOR
We generated x and y under alternative hypothesis that x has a causal effect on y using the same method as described above with 7 '" 2 = 0.15, 7 '# 2 = 0.05, n1 = 50,000 and n2 = 2,000,000. Note that in order to demonstrate the equivalence between the estimate of bxy from GSMR and logOR from logistic regression we did not simulate a confounding factor in this case. In sample #2, we transformed y to a dichotomous phenotype (y01 = 1 for affected and 0 for unaffected) based a liability-threshold model. The disease prevalence (k) was simulated from U(0.005, 0.01). We then randomly sampled 10,000 cases (affected) and 10,000 controls (unaffected) from the entire sample. We estimated logOR by logistic regression of y01 on x in sample #2. The GSMR estimate of bxy was obtained using the summary data from the two samples. The simulation was performed with 1,000 replicates. A comparison between the estimates of bxy using two-sample GSMR and logOR using one-sample logistic regression from 1,000 simulation replicates is shown in Supplementary Fig. 2.

To investigate the statistical power of GSMR
We simulated 1,000 independent SNPs from a binomial distribution z ~ B(2, f). We then generated x and y under the assumption that x has a causal effect on y, using the same method as above with 7 '" 2 = 0.6, 7 "# 2 = 0.005, n1 = 50,000 and n2 = 20,000. We conducted analyses using GSMR, MR-Egger 1 , MR-IVW 2 , generalized MR-IVW 3 , and the Pickrell methods 4 based on a random subset of the genome-wide significant SNPs associated with x, the number of instruments m ~ U (10, mt) where mt is the total number of genome-wide significant SNPs. We ran the simulations with 10,000 replicates.

SNP instruments are in LD:
We extracted SNP genotypes from the UKB data following the strategy described above, and simulated x and y using the same models as above. Samples #1 (n1=50,000) and #2 (n2=20,000) were from the UKB data, and sample #3 were from the ARIC data (n3=5,000). We investigated the (un)biasedness of GSMR in estimating ! "# under a pleotropic model in 1,000 simulation replicates. The results are presented in the Supplementary Table 2.

To investigate the test statistic of reverse-GSMR
We simulated x based on 110 simulated SNPs (z) from a binomial distribution z ~ B(2, f), including 100 SNPs (& 1 ) that have effects on y mediated by x, and 10 SNPs (& U ) that have direct effects on y.
Supplementary Note 2: Multi-trait-based conditional and joint GWAS analysis using summary data (mtCOJO)

To investigate the accuracy of mtCOJO
We performed simulations to investigate the accuracy of the approximate multi-trait based conditional and joint (mtCOJO) association analysis. The simulation was conducted in the imputed UKB data after QC (Methods). For the ease of computing, we only included the SNPs in common with HapMap phase 3. SNPs were partitioned into 2 subsets, those on the odd chromosomes and even chromosomes. Causal variants as described below were sampled from the even chromosomes. We assume that there were 3 correlated traits, x1, x2 and x3. We using the mtCOJO approach as described in Methods (based only on summary data) and the exact approach (regressing x3 on z, gx1 and gx2, where gx1 and gx2 could be observed in the simulations). The GSMR instruments were selected by clumping analysis (r 2 threshold = 0.05, pvalue threshold = 5e-8, and window size = 5Mb). The ARIC data was used as the reference sample to estimate LD between each pair of instruments. A comparison between the estimates of ! '" Y |(! " [ " Y , ! " A " Y ) using the mtCOJO approach and the exact is shown in Supplementary   Figure 5.
We further investigated whether the test-statistics are inflated or not when there are differences in the sample sizes between x1, x2 and x3. Under a causal model, we would not expect to see inflation because there would not be remaining effect of SNPs on x3 conditioning on gx1 and gx2 for SNPs on even chromosomes, and we would also not expect to see inflation for the null SNPs on odd chromosomes. To do this, we performed GWAS analysis for x1 and x2 using subsets of individuals, n = 30,000 for x1 and 20,000 for x2 without sample overlap. We the performed GWAS for x3 1) using a subset of individuals (n = 50,000) without sample overlap with x1 and x2, and 2) using all the individuals in the whole sample (n = 108,039). The QQ plot of p-values for Supplementary Figure 5.

To quantify the bias described in Aschard et al. (AJHG 2015)
To investigate whether the estimate of conditional SNP effect from mtCOJO is suffered from the bias described in Aschard et al. 5 or not, we performed simulation under the scenario described in their Figure 1c that there is no direct effect of the SNP in question on the phenotype but the phenotype and covariate are correlated due to shared environmental or genetic factors. We randomly sampled 600 SNPs (zc) as the confounding genetic factors from the odd chromosomes and simulated a phenotype y using a model similar to that described above, i.e. f is the allele frequency. Under this simulation model, the direct effect of z on y is expected to be zero. However, since y and x are correlated due to the shared genetic factors (zc) that are independent of the SNPs to be tested (z), the estimate of the effect of z on y conditional on x from a linear regression analysis using individual-level data is expected to be biased 5 . The estimated ! '# conditional on x from mtCOJO using summary-level data should be unbiased because mtCOJO uses GSMR to distinguish causal effect from confounding effect (Methods). The simulation results are presented in Supplementary Figure 6.

GWAS
We know that the sampling variance (SE 2 ) of the estimate of logOR from a logistic regression where v is the proportion of cases in the sample, x is the genotype indicator variable for a SNP (coded as 0, 1 or 2), and n the sample size.
Assuming Hardy-Weinberg equilibrium, var(x) = 2f(1-f) where f is the allele frequency, which can be obtained from a reference sample with individual-level data. We therefore have

Supplementary Note 4: The effects of WHRadjBMI on common diseases
We included in the analysis two obesity-related traits, body mass index (BMI) and waist-to-hip raio adjusted for BMI (WHRadjBMI). BMI is a measure of the amount of tissue mass and WHRadjBMI is a measure of body fat distribution. Previous studies suggest that BMI-and WHRadjBMI-associated genetic loci are enriched for genes expressed in central nervous system 7 and adipose tissue 8 , respectively, which implies potentially different genetic aetiologies of the two traits. We found that the effects of WHRadjBMI and BMI on disease were largely concordant ( Supplementary Fig. 10a), suggesting that the genetic etiologies of the two traits may differ but their effects on health outcomes are similar. Note that WHRadjBMI has been adjusted for BMI so that that effect sizes of WHRadjBMI on diseases should be independent of BMI. Nevertheless, the effect sizes of WHRadjBMI on diseases were smaller than those of BMI, and WHRadjBMI was detected with significant effects on only 4 diseases, type-2 diabetes (T2D), dyslipidemia, hypertension and coronary artery disease (CAD) (Fig. 2), although the smaller number of detections could be because of the smaller number of instruments used for WHRadjBMI (m=43) than for BMI (m=84) as the power of GSMR is proportional to the number of instruments ( Supplementary Fig. 3).

Supplementary Note 5: The effect of height on common diseases
Results showed that height had significant protective effects against irritable bowel syndrome  Data #1 and Supplementary Table   12). We selected independent genome-wide significant SNPs from the disease GWAS data using the PLINK clumping approach (r 2 threshold = 0.05, window size = 1Mb and p-value threshold = 5×10 -8 ; LD from the 1000G-imputed ARIC data). We did not include in the analysis diseases for which the number of instruments was smaller than 10. There were 49 disease and risk factor pairs included in the analysis in total. We then performed the HEIDI-outlier analysis to remove pleiotropic outliers, and run the reverse GSMR analysis to detect effects of diseases on risk factors. The threshold p-value corrected for multiple testing was 1.0×10 -3 at a family-wise error rate (FWER) of 0.05. The results are shown in Supplementary Table 13.

Supplementary Note 8: Caveats in interpreting the GSMR results
First, if the exposure is a composite trait that comprises multiple sub-phenotypes, we could not rule out the possibility that the effect of exposure on disease is driven by one of the subphenotypes. For instance, we have identified from the GSMR analysis that EduYears had effects on many diseases (Fig. 6). A conservative interpretation is that these are the effects of the genetic component of EduYears (e.g. cognitive ability and personality) on health outcomes. If we express EduYears=g + e where g is the genetic component of EduYears and e is the residual component that includes environmental influence, then the SNPs identified from GWAS for EduYears are those associated with g rather than e, meaning that the GSMR analysis for EduYears was performed on g rather than e and thus did not provide any evidence whether e also has effects on diseases. Therefore, strictly speaking, the causative associations identified in this study are not definitive and need to be confirmed by follow-up randomized controlled trials (RCTs) in the future, if practical. Second, the effect of a risk factor on disease can be non-linear (e.g. the relationship between BMI and mortality is a U-shaped curve 17 , suggesting that both underweight and overweight are risk factors of death) whereas we used a linear approximation to estimate the effect because of the limited information that we had access to from GWAS summary data. Therefore, the ! "# estimates need to be interpreted with caution at extremes.
Third, although we have identified a large number of associations, we would expect that associations of small effect size would be missed in our study (e.g. the instrument for SBP, SBP, was based on only 28 SNPs). The power can be improved in the future with GWAS results based on larger sample sizes. Fourth, our analyses ignored age-specific and sex-specific effects because of the lack of data from age-and sex-stratified analyses. Lastly, we have shown in a previous study that the SMR test-statistic is slightly deflated due to the use of a Taylor series expansion to

Dependent instruments
Expected −log 10 (P −value) Observed −log 10 (P −value) Figure 2 Estimate of bxy from GSMR vs. logOR from logistic regression. We described in the method section that bxy is logOR of risk factor (x) on disease (y). We sought to confirm this by simulation (see Supplementary Note 1 for details). We compared GSMR estimate of bxy with logOR estimated from a logistic regression analysis of y01 (disease status on the observed scale) against x in one sample. The result shows that bxy estimate is highly consistent with logOR, meaning that we can interpret exp(bxy) as odds ratio. Note that in order to demonstrate the equivalence between bxy and logOR we did not simulate a confounding factor in this case. In reality, if the association between x and y is partly confounded by unobserved confounding factors, the GSMR estimate of bxy from two samples will be smaller than logOR estimated from logistic regression in one sample. outcome) to infer causality by a maximum likelihood approach (referred to as Pickrell-ML). In panel (b), the statistical power is defined as the proportion of simulations in which the relative likelihood of the best non-causal model compared to the best causal model is smaller than 0.05.
Because the hypothesis tested in the Pickrell-ML method is very different hypothesis from those in the other methods, the simulation result above needs to be interpreted with great caution.
It is also of note that we did not include the bivariate LD score regression (LDSC) method 18 in the comparison because 1) the LDSC method aims to estimate the genetic correlation between two phenotypes rather than the directional effect of one phenotype on another; 2) it assumes a polygenic model for the traits and utilizes variability in LD scores across SNPs to estimate the genetic variance and covariance, and therefore has only been applied to all genome-wide SNPs. The LD score of a SNP instrument is defined as the sum of LD r 2 between the target instrument and all instruments (including the target) 19 . with that for DC while the standard error for the former is larger than that for the latter, suggesting that bxy for DC approximately equals that for DS and that it is more powerful to detect the causative effect on a risk factor on general health status using DC than using DS.

Data set Phenotype BMI HDL-c LDL-c TG SBP
Community