Bayesian integration of genetics and epigenetics detects causal regulatory SNPs underlying expression variability

The standard expression quantitative trait loci (eQTL) detects polymorphisms associated with gene expression without revealing causality. We introduce a coupled Bayesian regression approach—eQTeL, which leverages epigenetic data to estimate regulatory and gene interaction potential, and identifies combination of regulatory single-nucleotide polymorphisms (SNPs) that explain the gene expression variance. On human heart data, eQTeL not only explains a significantly greater proportion of expression variance but also predicts gene expression more accurately than other methods. Based on realistic simulated data, we demonstrate that eQTeL accurately detects causal regulatory SNPs, including those with small effect sizes. Using various functional data, we show that SNPs detected by eQTeL are enriched for allele-specific protein binding and histone modifications, which potentially disrupt binding of core cardiac transcription factors and are spatially proximal to their target. eQTeL SNPs capture a substantial proportion of genetic determinants of expression variance and we estimate that 58% of these SNPs are putatively causal.

: Mixing rate of eQTeL with and without block sampler. ( Note: regulatory and interaction priors were removed for this exposition). The block sampler leverages information of Linkage disequilibrium (LD) blocks to choose sparse set of SNPs within each LD block. In the current example, there are two (identical) SNPs within each LD block. The sampler without block sampler are more often stuck at previously selected SNPs in consecutive MCMC iterations compared to the block sampler. This problem will exponentially increase with growing number of SNPs in LD block. On the other hand, block sampler chooses subset of SNPs with a LD from their full posterior distribution in each iteration independently using a MH sampler. Relatively higher number of combinations of SNPs will be explored by block sampler. The block sampler chooses comparatively better subset of SNPs since it explores relatively larger fraction of the model space.  Fig 2). SNPs from eQTeL were selected using posterior probability > 0.5. The figure shows (5 fold) crossvalidated explained variance and correlation between predicted expression using alleles of identified SNPs for each methods.  Fig  2). SNPs from eQTeL were selected using posterior probability > 0.5. The figure shows (10 fold) cross-validated correlation between predicted expression using alleles of identified SNPs for each method.  Fig. 6: Comparison of recall-rate of different methods (controlled for overall effective sparsity). eQTeL-high is eeSNPs with high regulatory potential (above 75 quantile). eQTeL-low is eeSNPs with low regulatory potential in lower 25% quantile.      Number of SNPs were controlled for each method (as in Fig 2). SNPs from eQTeL were selected using posterior probability > 0.5. The figure shows (10 fold) cross-validated correlation between predicted expression using alleles of identified SNPs for each method. Proportion of causal SNP detected by eQTeL TF Fig. 14: Proportion of causal SNPs detected by eQTeL: Highly putatively causal were identified SNPs using difference in association between best-associated SNP and second-associated SNP for each gene. Y axis shows mammalian TF motifs that are preferetially disrupted by causal SNPs. For each of these motifs, proportion of causal SNPs among eeSNPs was estimated using ratio of relative enrichment (over background) of motif disruption score ( differential binding score between major allele and minor allele of SNP) between eeSNPs and causal SNPs.

Sampling γ parameters accounting for Linkage Disequilibrium
We estimated linkage disequilibrium block using PLINK [4], by using default setting of SNPs within 200kb. The effects of SNPs in Linkage Disequilibrium are dependent on each other because the SNP alleles are highly correlated. Gibbs or Metropolis-Hastings samplers that ignore the LD structure of SNPs can get stuck in local minima while failing to explore high probability combinations of γ (Fig. S15). To overcome these poor mixing properties, we devise a block MCMC sampler that explicitly uses LD-block information to sample from the posterior probability of a LD-block i.e.
where, γ LD and γ −LD are γ of set of SNPs respectively within and outside the LD-block. The resulting sampler mixed much faster (Fig. S15) by exploring high probability models in a hierarchical fashion: we use a Gibbs sampler to sample highly-probable combinations of LD blocks and within these sampled LD block, and then a Metropolis-Hasting sampler is used to sample a sparse combination of SNPs that explain expression variance.

Sampling α and θ parameters
We follow the latent variable Gibbs sampling strategy of [5] to sample the logistic regression parameters α. Specifically, we can sample latent variables from a Pólya-gamma distribution, and then sample α from a normal distribution, and Ω being a diagonal matrix of the w i 's. Then, for each SNP i and gene j, the regulatory-interaction potential θ ij is sampled from its posterior distribution as where φ(γ) = π γ π 1−γ 0 . If θ were estimated based only on whether the corresponding SNP was an expression-regulator (i.e based on value of γ ), then the resulting estimation of regulatory-interaction potential would be equivalent to supervised learning. On the other hand, if θ were sampled on posterior that depended only on current estimate of α and not on γ, the resulting estimation be equivalent to clustering. eQTeL, however, uses both in its posterior sample and therefore induces a semi-supervised clustering of genomic regions into interacting regulator and neutral regions. This approach to model θ induces a semi-supervised clustering of genomic-region into interacting-regulators and noninteracting-regulators, since each MCMC iteration produces a sample of θ ij for each SNP that depends on its γ ij in addition to its current estimate of regulatory and interaction potentials.

Inference of β, σ 2 and c
For simplifying the notations, in the section we only consider subset of SNPs which were selected by the model so that X represents X γ (this is n × q matrix, where n is number of samples and q is total number of SNP selected in the model). The generative model for β, σ 2 and c are: For Zellner's g-prior ν is usually assumed to be zero. β, σ 2 and c are sampled from the full posterior distribution as:

Convergence of sampler
Convergence of the MCMC sampler was assessed by running 10 independent chains and diagnostics of MCMC chain was performed using R-package "coda". In general, we found that the Markov chains converge within 5000 iterations of the sampler.

Initialization
We use univariate-eQTL to initialize different parameter of the eQTeL model.

Supplementary Note 2: Further investigation into the reasons for eQ-TeL's performance gain
In this paper, we have chosen to compare performance of eQTeL against eqtnminer since it is the only method that mostly explicitly incorporated epigenomic data in eQTL as opposed to traditional eQTL approaches. First eqtnminer estimates Bayesian factor (likelihood of association) of each SNP, assuming at most one SNP per gene to be causal; this assumption can be limiting, because it cannot identify combination of SNPs that jointly explain the expression variance. It then estimates posterior probability of each SNP to be causal regulator by modeling prior probability as a function of epigenetic data. However, eqtnminer parameter estimation relies on maximizing a likelihood function, which is prone to get stuck in local maxima due to correlation among different types of epigenetic data (demonstrated in supplementary note 6 and Fig S9). Further, they do not explicitly model relative weights of genetic and epigenetic factors in determining causality of SNPs. Another approach by Lee et al. [2], does not have the limiting assumption of single causal SNP per gene but it does not incorporate epigenomic data, making comparison infeasible. Recently, Lappalainen et al. [6] uses Matrix-eQTL (essentially a univariate eQTL method) to find associated SNPs, and estimates the proportion of causal SNPs by comparing their epigenomic profiles with that of the most associated SNP per gene as a gold standard (which is a strong assumption). Since they do not explicitly identify causal SNPs amongst associated SNP (the only estimate proportion of causal SNPs), this method is not directly comparable with our method.
To assess performance of eQTeL, we also chose LASSO as a representative of multivariate regression eQTL approaches, because of its good performance and scalability to larger datasets. Other approaches to date [7,8,9] that identify causal variants in GWAS, but not in eQTL studies and therefore are not directly comparable.
eQTeLs performance gain is potentially due to two main factors (i) integration of epigenetic data, (ii) allowing multiple causal variants per gene (cite http://www.ncbi.nlm.nih.gov/pubmed/25104515). In quantifying the relative contribution of each of these factors, we note that the mean correlation between actual and predicted eQTeL-predicted gene expression, when a single causal SNP per gene is allowed, is 0.154. This correlation improved substantially to 0.289 when 5 causal SNPs per gene are allowed in eQTeL ( Fig S8). However, in the absence of epigenomic data, i.e., when using standard LASSO, we do not see any such performance gain, and in general, the performance is substantially worse than that for eQTeL. This strongly suggests that allowing multiple SNP per gene is useful in identifying regulatory SNP specifically when functional information is used.
Another advantage of eQTeL is that it models heterogeneity in epigenetic signatures of expression regulators. eQTeL is a hierarchical Bayesian model as opposed to empirical Bayes model. Unlike empirical Bayes, hyper-parameters of model are drawn from unparameterized distributions. For this reason in eQTeL all parameters are estimated using MCMC sampling and EM approximation was not required. Empirical prior models [10,2,11,12] assumes a single signature for all regulators and therefore cannot account heterogeneity in the type of regulators of different genes. The eQTeL accommodates such heterogeneity because it allows variation in parameter combinations.
Supplementary Note 3: Other methods for comparison 3

.1 Eqtnminer
The software tool related to Gaffney et. al. was downloaded from http: // eqtnminer.sourceforge.net. For each of the comparative analysis, the ini-tial set of SNPs per gene was kept same for both eqtnminer and eQTeL for fair comparison. We obtained Bayesian factor for each SNPs using eqtnminer. The parameters to calculate epigenetic prior were estimated using maximizing equation (9) of Veyrieras et. al. The parameters were initialized as recommended by Veyrieras et. al.
To generate Fig 2, we controlled for total number of SNPs selected by eQTeL and eqtnminer. To do so, we sorted SNPs based on eqtnminer prior probability and selected top 2428 SNPs. As Gaffney et. al. recommend the eqtnminer for single SNP per gene, we compared the performance of eqtnminer in main manuscript (Fig 5, 6 and 8) using single SNP per gene for footprint enrichment, allele-specificity and ChiA-PET enrichment analyses. We repeated that analysis by controlling for number of SNPs per gene between eQTeL and eqtnminer; the eQTeL still outperform eqtnminer in that case. To generate Fig S8, for each gene we selected N (= 1,2, 3 and 5) top SNP(s) based on eqtnminer posterior probability.

LASSO
R-package GLMNET was used for L1 regulaizer multivariate regression (LASSO). LASSO estimates effect size (regression coefficient), for the SNP included in the model. We used 10-fold cross validation to estimate the hyper-parameter (lambda, regularization parameter). For each of the comparative analysis, the initial set of SNPs per gene was kept the same for both LASSO and eQTeL for fair comparison.
To generate Fig 2, we controlled for total number of SNPs selected by eQTeL and LASSO. To do so, we sorted SNPs based on absolute value of effect size estimated by LASSO-selected top 2428 SNPs. To generate Fig S8, for each gene we selected N (=1,2,3 and 5) top SNP(s) based on absolute value of estimated effect size estimated by LASSO.

Epigenetic-only model
In simulation study, α parameters were learned, in supervised manner, by using enhancers as training example. Bayesian logistic regression [5] was used to learn α. Based on learned α, SNPs were sorted based on their regulatory potential.

Known-epigenetic-prior-eQTeL
Known-epigenetic-prior-eQTeL, is a version of eQTeL (for simulation study only) where instead of estimating α, the α used to generate regulatory po-tential for simulation study was used. Thus it is a theoretically best model for eQTeL.

Variable selection method
Variable selection model was implemented by modifying eQTeL model as follows: (a) informative prior was changed to uninformative priors. (b) hierarchical sampling SNP (based on LD block) was switched off; each SNP were processed sequentially, similar to Liang et. al. [13].

Lirnet
Lirnet was downloaded from (http://homes.cs.washington.edu/~suinlee/ lirnet/). Because of computational limitation of lirnet (it takes 13 days of CPU processing in a 64 core machine to process 200 genes), this analysis was limited to 200 random genes. Hyper-parameter of the model was set by crossvalidation as recommended in Lee et. al. [2]. For comparing the performance of Lirnet with eQTeL we ran eQTeL with same set of 200 genes. Figure 13 demonstrates that eQTeL outperforms Lirnet in terms of explained variance and prediction accuracy ( we controlled for number of SNPs selected by each methods). Figure 8 also demonstrates that higher fraction of of SNPs detected by eQTeL overlaps with footprints, suggesting eeSNPs are more likely to be functional compared to SNPs detected by Lirnet.