Integrative analysis of Mendelian randomization and Bayesian colocalization highlights four genes with putative BMI-mediated causal pathways to diabetes

Genome-wide association studies have identified hundreds of single nucleotide polymorphisms (SNPs) that are associated with BMI and diabetes. However, lack of adequate data has for long time prevented investigations on the pathogenesis of diabetes where BMI was a mediator of the genetic causal effects on this disease. Of our particular interest is the underlying causal mechanisms of diabetes. We leveraged the summary statistics reported in two studies: UK Biobank (N = 336,473) and Genetic Investigation of ANthropometric Traits (GIANT, N = 339,224) to investigate BMI-mediated genetic causal pathways to diabetes. We first estimated the causal effect of BMI on diabetes by using four Mendelian randomization methods, where a total of 76 independent BMI-associated SNPs (R2 ≤ 0.001, P < 5 × 10−8) were used as instrumental variables. It was consistently shown that higher level of BMI (kg/m2) led to increased risk of diabetes. We then applied two Bayesian colocalization methods and identified shared causal SNPs of BMI and diabetes in genes TFAP2B, TCF7L2, FTO and ZC3H4. This study utilized integrative analysis of Mendelian randomization and colocalization to uncover causal relationships between genetic variants, BMI and diabetes. It highlighted putative causal pathways to diabetes mediated by BMI for four genes.

Of our particular interest is the underlying causal mechanisms of diabetes. In this study, we aim to (i) couple MR with Bayesian colocalization to explore whether BMI is a mediator in the genetic causal pathways to diabetes; (ii) investigate the performance of two Bayesian colocalization methods, COLOC and eCAVIAR 23,25 . COLOC estimates how likely there is a shared causal SNP in a genetic test region for a pair of traits, by assuming there exists at most one causal SNP in the region for either trait, while eCAVIAR allows for multiple causal SNPs.
In particular, we exploit the summary results of two independent large-scale GWASs: BMI from the Genetic Investigation of ANthropometric Traits (GIANT) consortium 10 and diabetes from the UK Biobank 31 (round 1), by using BMI-associated SNPs as instruments in MR analysis to estimate causal effect of BMI on diabetes. There were several types of diabetes measures in the UK Biobank project. This work focused specifically on "diabetes diagnosed by doctor" with data collected from a touchscreen question "Has a doctor ever told you that you have diabetes?" (https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=2443). If there is evidence for a statistically significant causal effect, we then further investigate whether there are shared causal SNPs between BMI and diabetes using COLOC, and therefore gain insights into the underlying mechanisms of diabetes (Fig. 1).

Results
Higher BMI causes increased risk of diabetes. We used summary statistics of 76 independent BMI associated SNP instruments (R 2 ≤ 0.001, P < 5 × 10 −8 ) and applied four existing methods (inverse variance weighted estimation, weighted median estimation, MR-Egger regression and MR-RAPS) in our MR analysis. All the MR results consistently showed evidence for a positive causal effect of BMI on diabetes (Table 1), which is agreement with existing literature [2][3][4][5][6] . Table 1 consists of estimated odds ratios (ORs), confidence intervals (CIs) and corresponding p-values. The estimated ORs are in the range of (1.038, 1.058) and none of the 95% CIs include value 1. The estimated ORs are fairly precise, with the widest 95% CI (1.015, 1.102) from MR-Egger. Our MR results are strongly suggestive of a causal relationship between BMI and risk of diabetes. For example, in the MR-RAPS method (estimated OR:1.048, 95% CI: (1.042, 1.054)), the odds of an individual being diagnosed with diabetes will increase by 4.8% per 1-SD (or 4.5 kg/m 2 ) increase in BMI. The estimated intercept from MR-Egger (estimate: −0.001, 95% CI: (−0.002, 0.001)), not significantly different from zero, suggests that the null hypothesis of zero average horizontal pleiotropic effect is not rejected. The left panel of Figure S1 is scatter plot of the estimated coefficients in the regression analysis of BMI and diabetes on the 76 independent SNPs included as instruments in our MR analysis. The magnitudes of the slopes of the regression lines correspond to the logarithms of the estimated ORs from the first three MR methods in Table 1.
In our MR analysis, as shown at the bottom of the left panel of Figure S1, we identified an outlier SNP rs7903146 (within gene TCF7L2). This SNP was already found to be an outlier and a horizontal pleiotropic instrument in other studies on BMI-diabetes causal relationships 3,9 , which was supported by previous findings that it   was associated with both fasting glucose and BMI 10,32 . We then performed a sensitivity analysis by excluding the outlier. The results without this SNP in Table S1 and Figure S1 Table 2); one region in chromosome 3 suggests two distinct causal SNPs, one for BMI only and the other for diabetes only (posterior probability of distinct causal SNPs PP 3 = 1; Table 2). In each of the five regions, we further calculated the posterior probability of each SNP being causal to both of the traits (PP4_Both, Fig. 2). Each region/panel comprises three sets of the posterior probabilities on log10 scale at the SNP level: causal to BMI only (PP1_BMI, top), causal to diabetes only (PP2_Diabetes, middle) and causal to both BMI and diabetes (PP4_Both, bottom). For the four regions showing evidence for colocalization (Panels (a)-(d)), SNPs with the maximum PP4_Both are most likely to be causal to both BMI and diabetes, and are thus selected as the candidate causal SNPs. In the region chr3:185324933-186022133 with evidence for distinct causal SNPs (PP 3 = 1, Panel (e)), two independent SNPs are likely to be candidate causal SNPs. The results from another Bayesian colocalization method eCAVIAR for the above five regions are also presented in Fig. 2. In Panels (a-d), the results of COLOC (circles) and eCAVIAR (dots) are almost identical in identifying a shared causal SNP between BMI and diabetes. It is clear that if a SNP is most likely to be causal to BMI (highest PP1_BMI), it is also most likely to be causal to diabetes (highest PP2_Diabetes), and consequently most likely to be a shared causal SNP (highest PP4_Both). In panel (e) we observe two distinct causal SNPs, one to BMI and the other to diabetes from both COLOC and eCAVIAR. However, evidence for the two distinct SNPs being most likely to be causal to BMI and diabetes (PP4_Both) from eCAVIAR is much weaker than that from COLOC. Thus, if there exists one shared causal SNP, there is little difference between COLOC and eCAVIAR, at both the region and the SNP levels. In this case, the two approaches can be regarded as alternatives. From Table 2 and panel (e) in Fig. 2, the SNP rs9816226 (nearest gene ETV5) is most likely to be causal to BMI only and rs1470580 (within gene IGF2BP2) to diabetes only. We have identified the same SNPs that are most likely to be shared causal signals from the two approaches, although at different degrees according to different values of PP4_Both.
Both of the approaches (MR + COLOC and MR + eCAVIAR) have highlighted shared causal SNPs of BMI and diabetes for four genes (TFAP2B, TCF7L2, FTO and ZC3H4). In combination with the MR results showing evidence for a causal effect of BMI on diabetes and no horizontal pleiotropy on average, the observed causal effects of these genes on the risk of diabetes might be mediated through changes in BMI.

Discussion
Interestingly, estimated causal effect of BMI on the risk of diabetes in our MR analysis were smaller but more precise than those reported in the literature. This could be partly explained by (i) the number of instruments. For example, we included 76 instruments in our analysis. However, Fall et al. used a single instrument in their study which could not separate mediation from pleiotropy 4 ; (ii) sample size. We used GWAS summary data for diabetes from the UK Biobank on the basis of ~ 330,000 individuals which was almost five times the sample size www.nature.com/scientificreports www.nature.com/scientificreports/ in Corbin's study 3 ; (iii) genotyping platforms. The summary results in this study were from recent GWASs which used more advanced genotyping technology which provides more accurate genotype measures.
MR, COLOC and eCAVIAR use summary statistics from large-scale GWASs. However, as they were designed for different purposes, each of which alone is insufficient to investigate mediators on the genetic causal pathways. We have applied an integrative approach by combining MR with COLOC to investigate BMI-mediated genetic causal pathways to diabetes. The five regions highlighted in COLOC have also been analyzed using eCAVIAR. Our results have consistently shown that if one shared causal SNP is present, then MR + COLOC and MR + eCAVIAR could serve as alternatives. However, where there is evidence for two distinct causal SNPs in a region, eCAVIAR has shown much weaker evidence for each of the distinct SNPs being causal to both BMI and diabetes. This might be because when multiple causal SNPs exist, assuming a single causal SNP results in an underestimation of the posterior probability of a shared causal SNP in eCAVIAR 25 . Thus, if there exists at most one causal SNP to either trait, one may prefer COLOC as it provides relatively robust results. In addition to providing the posterior probability of colocalization, COLOC also computes the likelihood of distinct causal SNPs in a test region, which is another advantage comparing with eCAVIAR when there are two distinct causal SNPs in a test region.
In our analysis, both MR + COLOC and MR + eCAVIAR have consistently shown evidence for causal effects on diabetes mediated by BMI for four genes (TFAP2B, TCF7L2, FTO and ZC3H4). Previous studies have suggested that TFAP2B is associated with BMI in both Europeans and African Americans 2,11 . This gene has been found associated with waist circumference which is known to be correlated with BMI and may modify the effect of dietary fat intake on weight loss and waist reduction 12,33 . Evidence for its association with diabetes in the UK population and a candidate contributor of the susceptibility to diabetes has also been reported 13 . Thus, the TFAP2B gene may play an important role of the underlying mechanisms of diabetes, where BMI is a mediator.
Rs7903146, located in the intronic region of the TCF7L2 gene, is associated with BMI and diabetes in the European population 10,14 . It was an outlier instrument in our study and in the literature 3 . However, there was little change in estimated causal effect of BMI on diabetes after removing this SNP in our sensitivity analysis. It was identified as a potential pleiotropic but weak instrument. Thus, the MR analysis performed in this study without this SNP provided higher precision in the estimates 3 .
FTO is associated with BMI and diabetes 2,33-37 . In addition, rs9939609 in this gene has been identified to affect diabetes through BMI in the UK population 15 . This SNP is in high linkage disequilibrium with the SNP rs1558902 (R 2 = 0.918) highlighted in our analysis. However, its association with diabetes is reported to be partly independent of its effect on BMI in the Scandinavian population and in east and south Asians [38][39][40] . Such discrepancy might come from the heterogeneity across populations 41,42 .
Rs3810291 (in ZC3H4) has been associated with BMI in European populations 43 . Recent research has further shown that it is associated with both BMI and diabetes in children 2 . Our analysis indicates that this SNP may causally affect diabetes through BMI in the Europeans.
In region chr3:185324933-186022133, there may exist two independent causal SNPs: rs9816226 (nearest gene: ETV5) for BMI and rs1470580 (in IGF2BP2) for diabetes. Previous GWASs have shown that genetic variation of ETV5 is predictive of BMI in multiple populations including the Europeans while IGF2BP2 predictive of the onset of diabetes in European populations and a Chinese Han population 34,[44][45][46] . Thus, IGF2BP2 may causally affect diabetes but not through BMI.
The MR + COLOC approach in our study assumes that the two non-overlapping samples of the two GWASs are from the same population. We have used the summary statistics from the UK Biobank (British population) and GIANT studies (individuals of all ancestries). One possible limitation of our study would be that if the individuals of the two studies came from difference populations and/or partly overlap, our results may be biased due to violations of the assumption. The unavailability of the individual level data in this study has hindered us from testing the plausibility of this assumption.
Our analysis using MR + COLOC has highlighted, indirectly, four genes with putative BMI-mediated causal pathways to diabetes. These findings, however, need further validation investigations (e.g. mediation analysis using individual level data).
Another limitation is GWAS results of diabetes we downloaded from the Neale lab. They used linear regressions rather than logistic regressions for binary phenotypes, although bias from such model misspecification is not an important issue in our study thanks to a large number of diabetes cases (~ 17,000) and rare variants excluded as instruments.

conclusion
In this paper, we have recommended an analytical approach MR + COLOC to investigate causal pathways to diabetes. The causal effects of four genes on diabetes were found mediated by BMI. Both COLOC and eCAVIAR have highlighted the same SNPs that are most likely to be shared causal signals when there is a single shared causal SNP. For a specific study, if we believe there exists at most one causal SNP for either trait or wish to test whether there exists two distinct causal SNPs in a genetic region, COLOC is recommended. If there are multiple shared causal SNPs, one may however prefer eCAVIAR.
The approach applied in the present study takes forward strengths of both MR and colocalization, which can be used as an indirect approach of investigating genetic causal pathways where a mediator conveys the genetic effects. This approach provides new insights into causal mechanisms of diabetes that could be further validated in other studies and ultimately help in the development of new and effective treatments. (2020) 10:7476 | https://doi.org/10.1038/s41598-020-64493-4 www.nature.com/scientificreports www.nature.com/scientificreports/

Methods
Study design and data. Figure 1 depicts the workflow of our study design to investigate putative BMImediated causal pathways to diabetes. We used publicly available GWAS summary data of BMI and diabetes. BMI summary data (including major and minor alleles and allele frequencies on 2,554,637 SNPs from 339,224 individuals of all ancestries, estimated effects of allele dose and their standard errors and corresponding p-values) were downloaded from http://portals.broadinstitute.org/collaboration/giant/index.php/GIANT_consor-tium_data_files#GWAS_Anthropometric_2015_BMI. The same summary statistics of diabetes on 10,894,596 SNPs from 336,473 unrelated individuals from the UK were downloaded from https://www.dropbox.com/ s/41q5uj1sa6v99y0/2443.assoc.tsv.gz?dl=0.

Investigation of causal relationship between BMI and diabetes using Mendelian randomization.
First, we used MR to test if BMI causally affects diabetes, where BMI associated SNPs were used as the instrumental variables (IVs). The assumptions of MR are represented by a directed acyclic graph in Fig. 3. An IV is: i) associated with the exposure (an arrow pointing from IV to BMI), ii) independent of the confounders (no arrow between IV and confounders) and iii) independent of the outcome, conditioning on the exposure and the confounders (no arrow pointing directly from IV to diabetes). The third condition assumes that there exists no direct effect of the IV on the outcome, i.e., no horizontal pleiotropy 47 , which can be relaxed, for example, in MR-Egger regression. A direct arrow from IV to diabetes is present in Fig. 3 because MR-Egger regression was included in our MR analysis. Of the four MR methods applied in this study, MR robust adjusted profile score (MR-RAPS) was designed to reduce weak instrument bias 29 .
MR requires that SNP instruments are mutually independent. We first filtered the 2,042 BMI associated SNPs (P < 5 × 10 −8 ) from the GIANT GWAS results by clumping on the MR-Base platform 48 to ensure that the instruments in MR analysis were independent of each other. That is, the SNPs in LD (R 2 ≥ 0.001) were clumped together and only the one with the lowest p-value was retained. This led to 76 independent BMI-associated SNPs (Table S2) included as the IVs in MR analysis using four existing MR methods (inverse variance weighted estimation, weighted median estimation, MR-Egger regression and MR-RAPS) in R package TwoSampleMR, https:// mrcieu.github.io/TwoSampleMR/. The SNP rs7903146 was detected as an outlier instrument, which was removed in our sensitivity analysis. Leave-one-out analysis was further carried out by leaving one instrument out at a time in MR analysis, to investigate the influence of individual instrumental SNPs on estimated causal effect.
Identification of shared causal genes between BMI and diabetes using COLOC. Since the IVs are required to be associated with BMI but not necessarily in a causal way, one cannot decide whether such associations are causal. To investigate genetic causal pathways to diabetes, we applied COLOC to detect shared causal SNPs between the two traits: BMI and diabetes. This approach assumes that: (1) in each test region, there exists at most one causal SNP for either trait; (2) the probability that a SNP is causal is independent of the probability that any other SNP in the genome is causal; (3) all causal SNPs are genotyped or imputed and included in analysis. According to these assumptions, there are five mutually exclusive hypotheses for each test region: (1) there is no causal SNP for either trait (H 0 ); (2) there is one causal SNP for trait 1 only (H 1 ); (3) there is one causal SNP for trait 2 only (H 2 ); (4) there are two distinct causal SNPs, one for each trait (H 3 ); and (5) there is a causal SNP common to both traits (H 4 ). Our primary interest lies in the last hypothesis H 4 -colocalization. Support for each of the hypotheses is quantified by the posterior probability (PP), denoted by PP 0 , PP 1 , PP 2 , PP 3 , and PP 4 accordingly. These PPs were calculated from the priors and the approximate Bayes factors. We set the prior probability of each SNP that is causal to either of the traits to 1 × 10 −4 (i.e. one in 10,000 SNPs in the genome are causal to either trait) and causal to both traits to 1 × 10 −6 (i.e. one in 100 SNPs in the genome causal to one trait are causal to both traits). We used the GWAS summary statistics of BMI and diabetes to approximate the Bayes factors. COLOC treats trait 1 and trait 2 as two outcomes in no particular order. Thus, it has no capacity for examining relationships between BMI and diabetes.
To define test regions in COLOC, we included the 76 independent BMI-associated SNPs from our MR analysis and 52 independent diabetes-associated SNPs (P < 5 × 10 −8 ). Each of these 128 SNPs and their neighbor SNPs (distance within 200 kb on https://genome.ucsc.edu GRCh37/hg19) were used to define a test region. After . Directed acyclic graph (DAG) of Mendelian randomization analysis using SNPs as instrumental variables (IVs). An IV satisfies three assumptions: (1) it is associated with BMI, i.e., there is an arrow from IV to BMI; (2) it is independent of confounders (both observed and unobserved), i.e., there is no arrow between IV and confounders; (3) it is independent of diabetes conditioning on BMI and confounders (known as no horizontal pleiotropy), i.e., there is no arrow between IV and diabetes. This last assumption, however can be relaxed, for example, in MR-Egger regression. A direct arrow from IV to diabetes is present in this figure because MR-Egger regression was included in our MR analysis.