Adjusting for genetic confounders in transcriptome-wide association studies improves discovery of risk genes of complex traits

Many methods have been developed to leverage expression quantitative trait loci (eQTL) data to nominate candidate genes from genome-wide association studies. These methods, including colocalization, transcriptome-wide association studies (TWAS) and Mendelian randomization-based methods, all suffer from a key problem: when assessing the role of a gene in a trait using its eQTLs, nearby variants and the genetic components of other genes' expression may be correlated with these eQTLs and have direct effects on the trait, acting as potential confounders. Our extensive simulations showed that existing methods fail to account for these 'genetic confounders', resulting in severe inflation of false positives. Our new method, causal-TWAS (cTWAS), borrows ideas from statistical fine-mapping and allows us to adjust for all genetic confounders. cTWAS showed calibrated false discovery rates in simulations, and its application to several common traits discovered new candidate genes. In conclusion, cTWAS provides a robust statistical framework for gene discovery.


Supplementary Notes

EM algorithm for estimating the prior parameters
We first restate our full model as described in Methods. We have a linear model with $K$ groups of explanatory variables:
$$y = \sum_{j=1}^{p} X_j \beta_j + \epsilon, \qquad \epsilon \sim N(0, \sigma^2 I),$$
where $X_j$, $1 \le j \le p$, is the $j$-th explanatory variable and $j \in M_k$ denotes that it belongs to group $k$.
The effect sizes $\beta_j$ follow spike-and-slab prior distributions with group-specific parameters. Let $\gamma_j$ be an indicator of whether $X_j$ has a non-zero effect:
$$\gamma_j \sim \text{Bernoulli}(\pi_k), \qquad \beta_j \mid \gamma_j = 1 \sim N(0, \sigma_k^2), \qquad \beta_j \mid \gamma_j = 0 = 0, \qquad j \in M_k.$$
For simplicity, we assume $\sigma^2$ is given (see Methods). Our goal is to estimate the prior parameters $\theta = \{\pi_k, \sigma_k^2, 1 \le k \le K\}$. We use the Expectation-Maximization (EM) algorithm to estimate these parameters, treating the effect sizes $\beta = (\beta_1, \cdots, \beta_p)$ and the configuration of all indicators $\gamma = (\gamma_1, \cdots, \gamma_p)$ as missing data. The complete data log-likelihood function is given by:
$$\log P(y, \beta, \gamma \mid X, \theta) = \log P(\gamma \mid \theta) + \log P(y, \beta \mid X, \gamma, \theta) = \log P(\gamma \mid \theta) + \log P(\beta \mid \gamma, \theta) + \log P(y \mid X, \beta).$$
Note that the last term does not depend on the prior parameters $\theta$, so we can ignore it in the EM algorithm.
In the E-step, we take the expectation of the complete log-likelihood function over $\beta, \gamma \mid \theta^{(t)}$, where $\theta^{(t)}$ are the parameters from step $t$:
$$Q(\theta \mid \theta^{(t)}) = E_{\beta, \gamma \mid \theta^{(t)}} \log P(y, \beta, \gamma \mid X, \theta).$$
We can simplify this function. First, we note:
$$E_{\beta, \gamma \mid \theta^{(t)}} \log P(\gamma \mid \theta) = \sum_k \sum_{j \in M_k} \left[ \alpha_j^{(t)} \log \pi_k + (1 - \alpha_j^{(t)}) \log(1 - \pi_k) \right],$$
where $\alpha_j^{(t)} = P(\gamma_j = 1 \mid \theta^{(t)}, X, y)$ is the posterior inclusion probability of variable $j$ at the $t$-th iteration. Next,
$$E_{\beta, \gamma \mid \theta^{(t)}}(\gamma_j) = \alpha_j^{(t)}. \tag{10}$$
Given that $\beta_j = 0$ when $\gamma_j = 0$, the second term in the expectation is simply 0. For the first term, we plug in the normal density $\beta_j \sim N(0, \sigma_k^2)$:
$$E_{\beta, \gamma \mid \theta^{(t)}} \log P(\beta \mid \gamma, \theta) = \sum_k \sum_{j \in M_k} \alpha_j^{(t)} \left[ -\frac{1}{2} \log(2\pi\sigma_k^2) - \frac{\tau_j^{2,(t)}}{2\sigma_k^2} \right] + \text{const},$$
where $\tau_j^{2,(t)} = E(\beta_j^2 \mid \gamma_j = 1, X, y, \theta^{(t)})$ is the second moment of the posterior effect size distribution of $\beta_j$ given the current parameter estimate $\theta^{(t)}$ and that $\gamma_j = 1$.
Putting all these together, we have:
$$Q(\theta \mid \theta^{(t)}) = \sum_k \sum_{j \in M_k} \left[ \alpha_j^{(t)} \log \pi_k + (1 - \alpha_j^{(t)}) \log(1 - \pi_k) \right] + \sum_k \sum_{j \in M_k} \alpha_j^{(t)} \left[ -\frac{1}{2} \log(2\pi\sigma_k^2) - \frac{\tau_j^{2,(t)}}{2\sigma_k^2} \right] + \text{const}.$$
In the M-step, we find $\theta$ that maximizes $Q(\theta \mid \theta^{(t)})$. We begin with the update rule for $\pi_k$, noting that only the first term contains $\pi_k$:
$$\frac{\partial Q}{\partial \pi_k} = \sum_{j \in M_k} \left[ \frac{\alpha_j^{(t)}}{\pi_k} - \frac{1 - \alpha_j^{(t)}}{1 - \pi_k} \right] = 0.$$
Solving this equation gives the update rule of $\pi_k$:
$$\pi_k^{(t+1)} = \frac{1}{|M_k|} \sum_{j \in M_k} \alpha_j^{(t)},$$
where $|M_k|$ is the number of variables in group $k$. Similarly, setting the derivative of $Q$ with respect to $\sigma_k^2$ to zero leads to the update rule for $\sigma_k^2$:
$$\sigma_k^{2,(t+1)} = \frac{\sum_{j \in M_k} \alpha_j^{(t)} \tau_j^{2,(t)}}{\sum_{j \in M_k} \alpha_j^{(t)}}.$$
The computation of $\alpha_j^{(t)}$ and $\tau_j^{2,(t)}$ at each iteration will be described below. We perform the E and M steps until the change of each prior parameter is less than $10^{-5}$ in one iteration or the maximum number of iterations is reached.
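As a concrete illustration, the M-step updates above can be sketched as follows. This is a minimal NumPy sketch with made-up toy inputs, not the cTWAS implementation; the function and variable names are ours.

```python
import numpy as np

def m_step(alpha, tau2, groups, K):
    """One M-step: update (pi_k, sigma2_k) for each of K groups, given the
    PIPs (alpha) and posterior second moments (tau2) from the E-step."""
    pi = np.zeros(K)
    sigma2 = np.zeros(K)
    for k in range(K):
        mask = groups == k
        # pi_k: mean PIP over variables in group k
        pi[k] = alpha[mask].mean()
        # sigma2_k: PIP-weighted average of posterior second moments
        sigma2[k] = np.sum(alpha[mask] * tau2[mask]) / np.sum(alpha[mask])
    return pi, sigma2

# toy example with two groups (e.g., group 0 = genes, group 1 = variants)
alpha = np.array([0.9, 0.1, 0.02, 0.05])
tau2 = np.array([0.5, 0.3, 0.01, 0.02])
groups = np.array([0, 0, 1, 1])
pi, sigma2 = m_step(alpha, tau2, groups, K=2)
```

In practice these updates would be iterated with the E-step until the parameters converge.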
Calculation of $\alpha_j$ and $\tau_j^2$ under the single effect approximation
In the EM iteration, we need to compute the PIP $\alpha_j^{(t)}$ and the posterior second moment $\tau_j^{2,(t)}$ of each variable (the index $(t)$ will be dropped below for simplicity). To simplify, we analyze each block one at a time, assuming each block has at most a single variable with non-zero effect. We further assume that each single block explains a minimal proportion of the variance of $y$, so we set $\sigma^2 = 1$ in the calculation. Let $\gamma^j$ be the configuration where $\gamma_j = 1$ and all other indicators are 0, and $\gamma^0$ be the null configuration where $\gamma_j = 0$ for all variables. Let $p_j$ be the prior inclusion probability of variable $j$, which depends on the group variable $j$ belongs to: $p_j = \pi_k$ when $j \in M_k$. The prior probabilities of the configurations are given by:
$$P(\gamma^j \mid \theta) = p_j \prod_{i \neq j} (1 - p_i), \qquad P(\gamma^0 \mid \theta) = \prod_i (1 - p_i).$$
For variable $j$, we define its Bayes factor $B_j$ as the probability of the data under $\gamma^j$ vs. under $\gamma^0$. Based on the Wakefield formula [1], it is given by:
$$B_j = \frac{P(y \mid X, \gamma^j, \theta)}{P(y \mid X, \gamma^0, \theta)} = \sqrt{\frac{s_j^2}{s_j^2 + \sigma_k^2}} \exp\left( \frac{\hat\beta_j^2}{2 s_j^2} \cdot \frac{\sigma_k^2}{s_j^2 + \sigma_k^2} \right),$$
where $\sigma_k^2$ is the prior variance of effect size for variables in group $k$, $\hat\beta_j$ is the estimated effect size of variable $j$, and $s_j$ is its standard error. They are given by:
$$\hat\beta_j = \frac{X_j^T y}{X_j^T X_j}, \qquad s_j^2 = \frac{\sigma^2}{X_j^T X_j}.$$
The posterior inclusion probabilities depend on the priors and the BFs of the variables:
$$\alpha_j = P(\gamma^j \mid X, y, \theta) = \frac{P(y \mid X, \gamma^j, \theta) P(\gamma^j \mid \theta)}{P(y \mid X, \gamma^0, \theta) P(\gamma^0 \mid \theta) + \sum_{j'} P(y \mid X, \gamma^{j'}, \theta) P(\gamma^{j'} \mid \theta)} = \frac{\frac{p_j}{1 - p_j} B_j}{1 + \sum_{j'} \frac{p_{j'}}{1 - p_{j'}} B_{j'}}. \tag{27}$$
Following the derivation in the SuSiE paper, the posterior distribution of $\beta_j \mid \gamma_j = 1$ follows $N(\mu_{1j}, \sigma_{1j}^2)$, where
$$\sigma_{1j}^2 = \left( \frac{1}{s_j^2} + \frac{1}{\sigma_k^2} \right)^{-1}, \qquad \mu_{1j} = \frac{\sigma_{1j}^2}{s_j^2} \hat\beta_j,$$
so we have the second moment of the posterior distribution:
$$\tau_j^2 = \mu_{1j}^2 + \sigma_{1j}^2. \tag{28}$$
When using summary statistics in the form of Z-scores, we replace $\hat\beta_j$ with $\hat z_j / \sqrt{n}$, where $\hat z_j$ is the Z-score of variable $j$. $\alpha_j$ can still be derived by Equation 27, where $B_j$ is given by
$$B_j = \sqrt{\frac{1}{1 + \sigma_k^2}} \exp\left( \frac{\hat z_j^2}{2} \cdot \frac{\sigma_k^2}{1 + \sigma_k^2} \right),$$
where $\sigma_k^2$ is now the variance of the Z-scores for variables in group $k$.
$\tau_j^2$ can still be derived by Equation 28, where $\mu_{1j} = \sigma_{1j}^2 \hat z_j$. We provide some justifications of the single effect assumption. In statistical fine-mapping, several methods have been developed to incorporate functional information of variants to set the prior inclusion probabilities. In such models, it has been found that for the purpose of estimating parameters of the prior distributions, the single effect approximation is often sufficient [2]. Furthermore, in our implementation, we prune large blocks that are likely to contain multiple causal variables in the parameter estimation step. Finally, our simulations showed that the parameters estimated under the single-effect assumption were generally accurate (Figure 2 of the main text).
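The single-effect E-step calculations in the Z-score parameterization can be sketched as follows. This is a minimal NumPy sketch under the assumptions above ($\sigma^2 = 1$, at most one causal variable per block); function and variable names are ours, not from the cTWAS software.

```python
import numpy as np

def single_effect_posterior(z, prior_prob, prior_var):
    """PIP (alpha_j) and posterior second moment (tau2_j) for one block,
    assuming at most a single causal variable. z are Z-scores, prior_prob
    the per-variable prior inclusion probabilities p_j, prior_var the
    prior variance of Z-scale effects for each variable's group."""
    z, p, s2 = np.asarray(z), np.asarray(prior_prob), np.asarray(prior_var)
    # Wakefield log Bayes factor for each variable (Z-score form)
    log_bf = 0.5 * np.log(1.0 / (1.0 + s2)) + 0.5 * z**2 * s2 / (1.0 + s2)
    # prior odds times BF; normalize against the null configuration (BF = 1)
    log_odds = np.log(p / (1.0 - p)) + log_bf
    w = np.exp(log_odds - log_odds.max())
    denom = np.exp(-log_odds.max()) + w.sum()
    alpha = w / denom
    # posterior of beta_j given gamma_j = 1: N(mu1, s1^2)
    s1_sq = s2 / (1.0 + s2)
    mu1 = s1_sq * z
    tau2 = mu1**2 + s1_sq
    return alpha, tau2

z = np.array([5.0, 1.0, 0.2])
alpha, tau2 = single_effect_posterior(z, prior_prob=np.full(3, 0.01),
                                      prior_var=np.full(3, 10.0))
```

The log-scale normalization guards against overflow when a variable has a very large Z-score.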
In the implementation of our method, we used the single effect regression (SER) algorithm from SuSiE to compute $\alpha_j$ and $\tau_j^2$. When $p_j \ll 1$, the PIP and the second moment of the posterior distribution under the SuSiE SER should be similar or identical to those from our model. In more detail, we use $p_j$ as prior probabilities in SuSiE and $\pi_0 = 1 - \sum_j p_j$ as the null weight. The PIP under the SuSiE SER, $\alpha'_j$, is given by
$$\alpha'_j = \frac{p_j B_j}{\pi_0 + \sum_{j'} p_{j'} B_{j'}}.$$
For the approximation, we used $\pi_0 = 1 - \sum_j p_j \approx \prod_j (1 - p_j)$ under the assumption that $p_j \ll 1$. The posterior distribution of $\beta_j \mid \gamma_j = 1$ under the SuSiE SER is the same as in our model.
Similarly, we used the SER algorithm in SuSiE-RSS to compute $\alpha_j$ and $\tau_j^2$ when the input is summary statistics $(\hat z, R)$.

Initialization and selection of single effect blocks
To have a good chance of working with blocks satisfying the single effect assumption, we compute the prior probability of a block having at most one causal variable, denoted as $p_{\text{single effect}}$. It is given by:
$$p_{\text{single effect}} = \prod_j (1 - p_j) + \sum_j p_j \prod_{i \neq j} (1 - p_i), \tag{31}$$
where the products and the sum are over the variables in the block. Note that this probability considers both cases: where a block has no causal variable, and where it has just one. We choose blocks with $p_{\text{single effect}}$ above a threshold, 0.8 in our simulation and real data analysis, to be used in prior parameter estimation. The effect of this block selection step is that large blocks, which are likely to contain more than one signal a priori, are filtered out.
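The block-selection probability above can be sketched in a few lines. This is an illustrative NumPy sketch (the function name is ours), assuming all $p_i < 1$:

```python
import numpy as np

def p_single_effect(p):
    """Prior probability that a block has at most one causal variable,
    given per-variable prior inclusion probabilities p (Equation 31)."""
    p = np.asarray(p, dtype=float)
    p_none = np.prod(1.0 - p)              # no causal variable in the block
    # exactly one: sum_j p_j * prod_{i != j} (1 - p_i)
    p_one = np.sum(p * p_none / (1.0 - p))
    return p_none + p_one

# a small block easily passes the 0.8 threshold ...
small = p_single_effect(np.full(100, 0.001))
# ... while a very large block with the same per-variable prior may not
large = p_single_effect(np.full(5000, 0.001))
```

This illustrates why the selection step preferentially filters large blocks: with many variables, the prior probability of two or more signals grows even when each $p_j$ is tiny.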
To initialize the EM algorithm described in "Inference of the individual level model" and "Inference of the summary statistics model", we use the same prior parameters for genes and variants. The thinning procedure is applied at the beginning of the EM algorithm if specified. We then run 3 iterations of the EM algorithm using all LD blocks, and use the result as an initial estimate of $p_j$. Then, we select single effect blocks based on $p_{\text{single effect}}$ given by Equation (31) and use these blocks for 30 more iterations of the EM algorithm.

Computation of marginal associations ẑ and correlation matrix R
When variable $j$ is a gene, we can compute its Z-score $\hat z_j$ following the S-PrediXcan formula [3]. Specifically, suppose gene $j$ has $m$ SNPs with nonzero weights in its expression prediction model. Let $w_j = (w_{j1}, w_{j2}, \ldots, w_{jm})$ be their standardized weights, $\hat z_j^s = (\hat z_{j1}, \hat z_{j2}, \ldots, \hat z_{jm})^T$ be the Z-scores of the associations of the variants with the phenotype, and $R_j^s$ be the $m \times m$ correlation matrix, i.e. LD matrix, of these variants. Then $\hat z_j$ is essentially the weighted sum of the Z-scores of the $m$ variants:
$$\hat z_j = \frac{w_j \hat z_j^s}{\sqrt{w_j R_j^s w_j^T}}.$$
We also need $R$, the correlation matrix of all variables. We denote its element at row $i$ and column $j$ as $R_{ij} = \text{Cor}(X_i, X_j)$, where $X_i$ and $X_j$ are the $n$-dimensional vectors ($n$ is the sample size) of variables $i$ and $j$. When both variables $i$ and $j$ are variants, $R_{ij}$ is given by the variant LD matrix. When one of the variables, for example $X_j$, is an imputed gene, we can write it as $X_j = X_j^s w_j^T$, where $X_j^s$ is an $n \times m$ matrix of genotypes for the $m$ variants in gene $j$'s prediction model. $R_{ij}$ is then given by:
$$R_{ij} = \frac{R_{i,j}^s w_j^T}{\sqrt{w_j R_j^s w_j^T}},$$
where $R_{i,j}^s$ is the $m$-dimensional vector of correlations between variant $i$ and each of the $m$ variants in the prediction model of gene $j$. When both variables $i$ and $j$ are genes,
$$R_{ij} = \frac{w_i R_{i,j}^s w_j^T}{\sqrt{w_i R_i^s w_i^T} \sqrt{w_j R_j^s w_j^T}},$$
where $X_i^s$ is the genotype matrix and $w_i$ is the vector of standardized weights for the variants in gene $i$'s prediction model, $R_i^s$ is the correlation matrix of $X_i^s$, and $R_{i,j}^s$ is now the correlation matrix between the variants in the prediction models of gene $i$ and gene $j$.
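These summary-statistics computations reduce to a few quadratic forms. The following is a minimal NumPy sketch with toy numbers (function names are ours, not from S-PrediXcan or cTWAS):

```python
import numpy as np

def gene_zscore(w, z_snp, R_snp):
    """Gene-level Z-score from standardized weights w, variant Z-scores
    z_snp, and the variant LD matrix R_snp (S-PrediXcan-style formula)."""
    return w @ z_snp / np.sqrt(w @ R_snp @ w)

def gene_gene_cor(w_i, w_j, R_i, R_j, R_ij):
    """Correlation between two imputed genes; R_ij is the cross-LD matrix
    between the variants of the two prediction models."""
    num = w_i @ R_ij @ w_j
    return num / (np.sqrt(w_i @ R_i @ w_i) * np.sqrt(w_j @ R_j @ w_j))

# toy example: a gene with two variants in its prediction model
w = np.array([0.6, 0.4])
z = np.array([4.0, 3.0])
R = np.array([[1.0, 0.5], [0.5, 1.0]])
z_gene = gene_zscore(w, z, R)
```

The denominator $\sqrt{w R w^T}$ is the standard deviation of the imputed expression, which standardizes the weighted Z-score; a gene correlated with itself correctly yields a correlation of 1.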

Speeding up computation
Parameter estimation is the most time-consuming step. This is especially the case when variants are densely genotyped. We developed a "thinning" procedure to speed up this step. Specifically, we randomly select a fraction of SNPs in the single effect blocks, and only use these variants in the EM update. When most selected blocks have < 1,000 variants, the EM algorithm can be run in one or two hours. We have found in our simulations that this thinning step does not affect the accuracy of results. In the final step of analyzing individual blocks, the estimated prior inclusion probability for variants from the thinning procedure is scaled back for the original variant data. This is done by multiplying the prior probability by the fraction of selected variants, since the signal concentrated on the thinned variants is spread back over the full set. The other parameters are not affected by this procedure. To reduce the computational burden of the final step, we can analyze an individual block with the original variants only when results using the thinned variants show strong signals for gene effects, e.g., the maximum PIP of genes in the block > 0.8. Our software allows the user to specify the desired thinning parameter (the fraction of variants selected) in the parameter estimation step and the PIP threshold used to determine which blocks to analyze with original variants in the final step.

Connection with colocalization methods
We show that under certain conditions, cTWAS reduces to colocalization analysis. Specifically, we assume there is a single causal gene in a region being studied, and that gene has a single causal eQTL variant (known as the eQTN in the literature). Let $X$ be the expression of the causal gene, $\tilde X$ be its cis-genetic component, and $G_j$ be the genotype of the eQTN. We have:
$$\tilde X = G_j w_j,$$
where $w_j$ is the weight of $G_j$ in the prediction model of $X$. We also assume that $G_j$ acts on the phenotype only through the gene $X$, thus its direct effect $\theta_j = 0$. Our model of the phenotype $y$ is given by:
$$y = \tilde X \beta + \sum_{i \neq j} G_i \theta_i + \epsilon,$$
where $\beta$ and the $\theta_i$ are the causal effects of the gene and the variants, respectively. Also note that both $\beta$ and the $\theta_i$'s follow spike-and-slab priors with prior inclusion probabilities $\pi_G$ (gene effect) and $\pi_V$ (variant effect), respectively.
We can now plug $\tilde X = G_j w_j$ into the model of the phenotype:
$$y = G_j (w_j \beta) + \sum_{i \neq j} G_i \theta_i + \epsilon. \tag{37}$$
We see now that the gene effect $\beta \neq 0$ if and only if the effect of variant $j$, $w_j \beta \neq 0$. So our problem reduces to a fine-mapping problem, where we determine which variant(s) have non-zero coefficients. The difference from standard fine-mapping is that the prior inclusion probabilities vary across variants. The eQTN variant has prior probability $\pi_G$, while other variants have prior $\pi_V$, generally smaller than $\pi_G$. So the model would effectively perform fine-mapping, favoring the eQTN variant.
The model we described is effectively Enloc, a method for colocalization analysis [4]. We give a short description of Enloc here. We denote $d_j$ as an indicator of whether $G_j$ is an eQTL, and $\gamma_j$ as an indicator of whether $G_j$ is a causal variant of the phenotype in GWAS. Colocalization of eQTL and GWAS trait in a region of interest thus means that there is some variant $j$ in the region where $d_j = 1$ and $\gamma_j = 1$. Enloc performs colocalization analysis in several steps. It first performs fine-mapping analysis of eQTL data, estimating $P(d_j = 1)$ for all variants. In the next step, Enloc assesses how often the eQTNs are also causal variants of a complex phenotype from GWAS. This step estimates the conditional distribution $P(\gamma_j \mid d_j)$. In general, we expect $P(\gamma_j = 1 \mid d_j = 1) \gg P(\gamma_j = 1 \mid d_j = 0)$. The enrichment of causal variants among eQTNs is reflected by the parameter $\alpha_1$ in Enloc, defined as the log odds ratio:
$$\alpha_1 = \log \frac{P(\gamma_j = 1 \mid d_j = 1)}{1 - P(\gamma_j = 1 \mid d_j = 1)} - \log \frac{P(\gamma_j = 1 \mid d_j = 0)}{1 - P(\gamma_j = 1 \mid d_j = 0)}.$$
Enloc uses data across the genome to estimate $\alpha_1$. In the final step, Enloc performs fine-mapping of the GWAS trait in any region of interest, using the prior probabilities of the $\gamma_j$'s estimated in the previous step.
Because eQTNs generally have higher priors, this allows Enloc to favor eQTNs in fine-mapping of the trait GWAS. The results of Enloc are expressed as $P(d_j = 1, \gamma_j = 1 \mid D, \hat\alpha_1)$, where $D$ denotes all the data we have. From this description of Enloc, we can see that the reduced cTWAS model, Equation 37, is almost equivalent to Enloc. Both methods perform fine-mapping of causal variants of a phenotype, favoring eQTNs. In fact, the prior probabilities of the two models are related by:
$$P(\gamma_j = 1 \mid d_j = 1) = \pi_G, \qquad P(\gamma_j = 1 \mid d_j = 0) = \pi_V.$$
To see this, we note that under cTWAS, the prior probability of a non-eQTN variant being a causal variant of the GWAS trait is $\pi_V$. The prior probability of an eQTN being a causal variant is just the prior probability that its target gene is causal, which is $\pi_G$. We can now express $\alpha_1$, the key Enloc parameter, in terms of $\pi_G$ and $\pi_V$:
$$\alpha_1 = \log \frac{\pi_G}{1 - \pi_G} - \log \frac{\pi_V}{1 - \pi_V} \approx \log \frac{\pi_G}{\pi_V}, \tag{40}$$
where we use the approximation that $\pi_G$ and $\pi_V$ are usually small, so $1 - \pi_G \approx 1$ and $1 - \pi_V \approx 1$. There is one subtle difference between this reduced cTWAS model and Enloc. Enloc uses a common parameter for the prior variance of effect sizes for all variants, including eQTNs. cTWAS, in contrast, allows the prior variance to differ between gene effects (hence eQTL effects) and variant effects. In our simulations, we have found that using a common prior variance for gene and variant effects leads to inflated PIPs (data not shown).
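The mapping between cTWAS priors and the Enloc enrichment parameter is a one-liner; a small sketch with hypothetical prior values (names are ours) makes the approximation in Equation 40 concrete:

```python
import math

def enloc_alpha1(pi_g, pi_v):
    """Enloc enrichment parameter alpha_1 implied by cTWAS priors:
    the log odds ratio between eQTN and non-eQTN causal priors."""
    logit = lambda p: math.log(p / (1.0 - p))
    return logit(pi_g) - logit(pi_v)

# e.g., if genes are a priori 100x more likely causal than variants,
# alpha_1 is close to log(100)
a1 = enloc_alpha1(pi_g=1e-2, pi_v=1e-4)
```

For small priors the exact log odds ratio and the approximation $\log(\pi_G/\pi_V)$ agree to within a few percent.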
Having shown the near equivalence of Enloc and the reduced cTWAS model, we can see that cTWAS is also related to coloc [5]. coloc performs model selection, comparing several models: $H_1$, there is an eQTL in the region being analyzed but no causal variant for GWAS; $H_2$, there is a causal variant for the GWAS trait but no eQTL; $H_3$, there are distinct eQTL and GWAS causal variants; and $H_4$, the same causal variant affects both expression and phenotype. The goal of coloc is largely about estimating the posterior probability of $H_4$, i.e. $P(H_4 \mid D)$, where $D$ denotes the eQTL and GWAS data we have. This probability is related to the Enloc model by:
$$P(H_4 \mid D) \approx \sum_j P(d_j = 1, \gamma_j = 1 \mid D).$$
Having shown the connection between coloc and Enloc, we can also relate the key parameters under the two models. In coloc, the prior probabilities of the four models are determined by the prior inclusion probabilities of variants: $p_1$, the prior of a variant being an eQTN only; $p_2$, the prior of a variant being a causal variant of the GWAS trait only; and $p_{12}$, the prior of a variant being causal to both expression and the GWAS trait. Whether coloc detects a colocalization, i.e. $H_4$ is chosen, is controlled largely by these prior parameters, $p_1$, $p_2$, $p_{12}$. In the Enloc paper, the authors show that Enloc is equivalent to coloc with the parameters under the two methods related by (see Equation 10 of Enloc):
$$\alpha_1 \approx \log \frac{p_{12}}{p_1 p_2},$$
where we again assume small prior probabilities. We can also see that the coloc parameters are related to the cTWAS parameters. In fact, $p_2$ is the prior of a variant being causal to the GWAS trait, so $p_2 \approx \pi_V$; and $p_{12}/p_1$ is approximately the conditional probability that a variant is causal to the phenotype given that it is an eQTN. As we have explained above, this is $P(\gamma_j = 1 \mid d_j = 1)$, which is $\pi_G$ under the reduced cTWAS model. So the enrichment parameter under coloc becomes:
$$\alpha_1 \approx \log \frac{p_{12}}{p_1 p_2} \approx \log \frac{\pi_G}{\pi_V}.$$
This is the same as Equation 40 we have derived, linking $\alpha_1$ in Enloc to the prior parameters of cTWAS.
From these derivations, we see that under the simple scenario of a single causal gene and a single eQTN, cTWAS is equivalent to Enloc and coloc. Compared to coloc, both Enloc and the reduced cTWAS model have the advantage that all the key prior parameters are estimated from genome-wide data. In contrast, coloc sets the parameters $p_1$, $p_2$, $p_{12}$ somewhat arbitrarily, and many have reported that coloc results are sensitive to these parameters. Of course, the equivalence of cTWAS with colocalization methods only applies in this simple scenario. In general, cTWAS has the advantage that it allows multiple causal genes and multiple eQTNs.

Running other methods on simulation data
We ran FUSION (https://github.com/gusevlab/fusion_twas) on the same simulated GWAS summary statistics data and the same gene prediction models. We ran coloc (https://cran.r-project.org/web/packages/coloc/index.html), as provided as part of the FUSION package, and used PP4 > 0.8 as the cutoff. We ran FOCUS (https://github.com/bogdanlab/focus, v0.6) under the default settings and used PIP > 0.8 as the cutoff. For SMR with the HEIDI filter (https://yanglab.westlake.edu.cn/software/smr/), we used the significant cis-eQTL marginal association statistics provided by GTEx v7 as input, for genes with FUSION prediction models. These include significant variants within ±1Mb of the transcription start site (TSS) of each gene. For SMR (v1.0.3), we used Benjamini-Hochberg (B-H) adjusted p-values < 0.05 to select significant genes, and we also required $p_{\text{HEIDI}} > 0.05$ for the HEIDI filter. For MR-JTI (https://github.com/gamazonlab/MR-JTI), we used the full eQTL marginal association statistics provided by GTEx v7 after LD pruning ($r^2 = 0.2$) as input. These include all variants within ±1Mb of the TSS of each gene. We analyzed genes with FUSION B-H adjusted p-values < 0.05 using MR-JTI and required a Bonferroni-corrected p < 0.05 to determine significance. For PMR-Egger (https://github.com/yuanzhongshang/PMR, v1.0), we used the full eQTL marginal association statistics provided by GTEx v7 as input, for genes with FUSION prediction models. We restricted the data to variants within ±100kb of the body of each gene. Based on correspondence with the author of PMR-Egger, we regularized the LD matrices for these variants as $0.8R + 0.2I$ to avoid matrix decomposition errors. We used B-H adjusted p < 0.05 to select significant genes for PMR-Egger. For MRLocus (v0.0.25), we used the full eQTL marginal association statistics provided by GTEx v7 as input. We restricted our analysis to genes with FUSION B-H adjusted p-values < 0.05. For these genes, we used the recommended settings from the MRLocus paper. These include clumping variants with LD $r^2 > 0.1$ and retaining only those with eQTL p < 0.001. After clumping and running the colocalization model, we trimmed eSNPs with LD $r^2 > 0.1$, prioritizing those with the highest eQTL significance. To control for multiple testing, we used a local false sign rate (LFSR) < 0.1. The LFSR is analogous to the local false discovery rate (LFDR), but reflects confidence in the sign of an effect rather than in the effect being non-zero [6].

Detailed simulation results for PMR-Egger
While all competing methods suffered from high false positive rates in our simulations, PMR-Egger performed particularly poorly. Its false positive rate was even higher than standard TWAS (FUSION). This is quite unexpected, given that the objective of PMR-Egger is to avoid false positives under TWAS. We thus investigated the simulation results for PMR-Egger in more detail to understand the reasons for its poor performance.
We begin with a short review of PMR-Egger. PMR-Egger analyzes one gene at a time. Let $X$ be the expression of a gene of interest, and $\tilde X$ be its cis-genetic component. PMR-Egger assumes that the trait $y$ depends on $\tilde X$, as well as the pleiotropic effects of nearby variants, whose genotypes are denoted as $Z_y$ ($p$-dimensional). We have:
$$y = \mu_y + \tilde X \alpha + Z_y \gamma + \epsilon_y, \tag{44}$$
where $\mu_y$ is the mean trait value, $\alpha$ is the gene effect, and $\gamma$ the $p$-dimensional effect sizes of nearby variants. PMR-Egger also accounts for the uncertainty of $\tilde X$ through another model linking observed expression $X$ with the genotypes of all nearby variants under a polygenic assumption. The model shown in Equation 44, by itself, is not identifiable, because of the collinearity of $\tilde X$ and $Z_y$: $\tilde X$ is effectively a linear combination of variant genotypes. Thus PMR-Egger makes an extra assumption, known as Egger regression in Mendelian Randomization, that all variants have identical pleiotropic effects:
$$\gamma_1 = \gamma_2 = \cdots = \gamma_p = \gamma_0.$$
The goal of PMR-Egger is to test if $\alpha = 0$, while allowing $\gamma_0 \neq 0$. When we first ran PMR-Egger with default settings, we noted that it did not return results for a large percentage of genes (83.4% on average for the high-PVE simulations) due to errors decomposing LD matrices. Tuning the shrinkage parameter, as suggested by the software, did not substantially reduce the number of decomposition errors. Upon discussion with the author, we performed an alternative regularization of the LD matrices, regularizing them as $0.8R + 0.2I$, where $R$ is the original LD matrix and $I$ the identity matrix. The number of decomposition errors was substantially reduced, although a substantial fraction of genes still had errors (18.6% on average for the high-PVE simulations).
Among the genes that were significant by PMR-Egger, the proportion of false positives was very high (85.9% on average for the high-PVE simulations), higher than FUSION. To understand this discrepancy, we focused on a single simulation from the high-PVE scenario and the 6,454 genes which had both FUSION and PMR-Egger results. We observed that the $\log_{10}$ p-values for FUSION and PMR-Egger are generally correlated (r = 0.567), but there are outlier genes which are insignificant by FUSION yet highly significant by PMR-Egger (Fig. S19A). Many of these outlier genes are confounded by nearby causal variants with relatively large effect sizes.
As an example, we visualized the locus containing the gene ICA1, which was insignificant by FUSION, but had very significant p-values ($\sim 10^{-29}$) by PMR-Egger (Fig. S19B). At this locus, there is a single causal variant with a large effect size, and no causal gene. As expected, the causal variant shows the strongest association in the locus (Fig. S19B, top, the red vertical bar). The FUSION expression prediction model of ICA1 has two variants, both of which are near (within 72kb of) the causal variant (Fig. S19B, middle), but are not in LD with the causal variant ($R^2 = 0.022$ for the strongest correlation). In the eQTL analysis of ICA1, the two variants in the prediction model show strong associations, as expected, but the causal variant is also nominally significant (Fig. S19B, bottom). Because PMR-Egger uses all cis-variants of a gene to predict expression under a polygenic assumption, it is likely that the prediction model of PMR-Egger includes some contribution of the causal variant. This induces a correlation of the predicted expression of ICA1 with the simulated trait. On the other hand, the null model of PMR-Egger, where $\alpha = 0$, attempts to explain the data using an identical effect for all cis-variants of ICA1. However, this is a mismatch with the data, as the association is simply due to a single causal variant with a large effect. Taken together, we believe these two facts (the correlation of predicted expression and the trait due to PMR-Egger's expression prediction model, and the poor fit of the data by the null model) explain the false positive finding by PMR-Egger in this case. The FUSION method, on the other hand, avoided this mistake. The two variants of ICA1 are not in high LD with the causal variant, and have weak associations with the trait in GWAS (Fig. S19B, top). As a result, the imputed expression using FUSION is not highly correlated with the trait.
Many other genes that are discordant between FUSION and PMR-Egger are qualitatively similar. These results suggest that PMR-Egger is susceptible to false positives when (1) the underlying expression prediction model is sparse with only a few causal variants, and (2) there is a nearby large-effect causal variant acting on the phenotype directly.

Prediction model of gene expression.
We used expression prediction models for 49 GTEx [7] tissues from PredictDB [8] [3]. We used the mashr-based prediction models, which borrow information across tissues to improve prediction accuracy in any specific tissue [9]. These prediction models were built using fine-mapped variants and are sparse, with a maximum of 5 eQTLs per gene. The models were constructed using in-sample LD from GTEx, and the covariance between pairs of variants within each gene is reported alongside the models. We included only protein-coding genes in our analyses. Variants included in the models but missing in the GWAS summary statistics or LD reference were given zero weight.
Total heritability and percentage of heritability mediated by genes using MESC.
To assess parameter estimates from cTWAS, we computed the total heritability and the percentage of heritability mediated by genes using MESC [10] for comparison. We used the same summary statistics as in the cTWAS analysis, and we used GTEx v8 expression scores for individual tissues available via the MESC repository. These expression scores are derived from the same GTEx dataset as the PredictDB models. Note that expression scores for the "Kidney Cortex" tissue were not available from MESC, although this tissue is included in PredictDB and our other analyses.
Silver standard and previously reported candidate genes of complex traits.
To evaluate our findings, we compared the genes detected using cTWAS to curated lists of known or candidate genes for each trait. For LDL, we combined curated gene lists from two previous studies [11] [12], for a total of 69 LDL-related genes. These are genes that cause Mendelian disease or are drug targets, are in the KEGG "cholesterol metabolism" pathway, or are otherwise documented in the literature. For IBD, we used a list of 26 genes curated from the literature, reported in the paper describing the activity-by-contact (ABC) model [13]. We refer to the gene lists for LDL and IBD as "silver standard" gene lists throughout the text based on the strength of their evidence. In supplemental materials, we also refer to "previously reported candidate genes" for SCZ and SBP. These are gene lists that are primarily supported by GWAS data rather than literature or experimental evidence. For SCZ, we used a list of 120 genes prioritized as SCZ-related based on proximity to GWAS signals, fine-mapping results and eQTL evidence [14]. For SBP, we used a list of 53 genes, supported by TWAS, gene expression, drug targets, pathway analysis and rare variant studies, reported in a large trans-ethnic GWAS (Table 3 of the paper) [15].
Assessing the novelty of cTWAS detected genes.
If a gene was not included in the list of silver standard or previously reported genes for a trait, and was not the nearest gene to the lead variant in a genome-wide significant locus, we considered the gene a "novel" finding. We defined "nearest" genes as those with start or end positions nearest, among all protein-coding genes, to the lead variant of a genome-wide significant locus. To define independent lead variants for the GWAS, we performed the following procedure. First, we selected the most significant variant as a lead variant, and we removed all other variants within 500kb. Then, we iterated these selection and removal steps until no genome-wide significant variants remained. We then identified the protein-coding genes nearest to each lead variant in the resulting list.
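The iterative lead-variant selection above amounts to greedy distance-based clumping; a minimal sketch with toy data (function name and the 5e-8 threshold choice are ours):

```python
def lead_variants(variants, window=500_000, threshold=5e-8):
    """Greedy selection of independent lead variants: repeatedly take the
    most significant genome-wide significant variant and drop all variants
    within `window` bp of it. `variants` is a list of (position, p-value)."""
    remaining = [v for v in variants if v[1] < threshold]
    leads = []
    while remaining:
        pos, p = min(remaining, key=lambda v: v[1])  # most significant
        leads.append((pos, p))
        # drop the lead itself and everything within the window
        remaining = [v for v in remaining if abs(v[0] - pos) > window]
    return leads

variants = [(1_000_000, 1e-12), (1_200_000, 1e-9), (3_000_000, 1e-10),
            (3_100_000, 2e-7)]
leads = lead_variants(variants)
```

Here the variant at 1.2Mb is absorbed by the stronger lead at 1Mb, and the sub-threshold variant at 3.1Mb is never considered, leaving two independent leads.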
Gene Ontology (GO) analysis of candidate genes.
We performed enrichment analysis to characterize the genes detected by cTWAS for LDL and IBD. We used Enrichr [16] to identify Gene Ontology terms in the Biological Process, Molecular Function, or Cellular Component domains that are enriched for genes detected by cTWAS. Enrichr uses the Benjamini-Hochberg procedure to control the FDR in each domain. We used a threshold of FDR < 0.05 for significant enrichment. We report the full enrichment analysis results using Enrichr across all domains for LDL and IBD (Supplementary Table 3; Supplementary Table 10). We compared the results from cTWAS genes with similar enrichment analysis results using the silver standard genes for LDL and IBD (Supplementary Table 4; Supplementary Table 11). We also compared these results with enrichment results from MAGMA [17] [18], a method that performs enrichment analysis based on GWAS data, using default settings (Supplementary Table 5; Supplementary Table 12). When reporting the IBD results (Supplementary Table 8), we annotated detected genes with the enriched GO terms (using cTWAS genes, silver standard genes, or MAGMA) that each gene is a member of. For all three GO annotations, a maximum of 5 significant terms per gene were shown, ordered by odds ratios (cTWAS, silver standard) or p-values (MAGMA).
To visualize the GO biological process terms enriched for LDL cTWAS genes (Fig. 5b), we removed redundant terms to improve clarity. To do this, we first ranked all GO terms by their significance (p-values). A term was considered redundant if all the cTWAS genes in that term had already been included in a more significant GO term.
To visualize the GO biological process terms enriched for IBD cTWAS genes (Fig. 6e), we used the "Weighted Set Cover" tool in WebGestalt [19] to remove redundant GO terms. To further simplify the visualization, we omitted GO terms whose detected genes were all included in other terms identified by Weighted Set Cover. We also report the full enrichment analysis results using WebGestalt for IBD (Supplementary Table 9).
Analyzing the HPR and POLK / HMGCR loci using other methods.
We analyzed the genes at the HPR and POLK / HMGCR loci using coloc, SMR, and FOCUS. We applied these methods as described in the "Running other methods on simulation data" subsection of the Methods, with the following exception. For SMR, rather than using an FDR-based significance threshold, which requires genome-wide p-values, we used a local Bonferroni significance threshold to account for the number of genes tested at each locus.
To examine the POLK / HMGCR locus, we used fine-mapping results for LDL from the UK Biobank generated by PolyFun-SuSiE [20] (downloaded from http://data.broadinstitute.org/alkesgroup/polyfun_results/biochemistry_LDLdirect.txt.gz). Specifically, PolyFun specifies prior probabilities for fine-mapping by estimating functional enrichments for a broad set of coding, conserved, regulatory and LD-related annotations. Then SuSiE was applied using the estimated priors with $L = 10$ effects per locus.
To annotate the functions of variants and their putative target genes, we used additional datasets. H3K4me1 ChIP-seq data from human liver samples were downloaded from the ENCODE portal (ENCODE ID: "ENCFF323TFF"). Promoter Capture Hi-C (PC-HiC) data from human liver were obtained from a previous study [21]. Liver activity-by-contact (ABC) data were also obtained from a previous study [22].
Novel candidate genes from application of cTWAS on several complex traits.
The application of cTWAS to several complex traits suggested interesting and novel candidate genes. In the case of LDL, we identified two genes, ACVR1C and INHBB, in the signaling pathway of activin, a signaling molecule in the TGF-beta superfamily. ACVR1C is a receptor of activin A and INHBB is a subunit of activin B. Both activin A and B regulate lipolysis in hepatocytes and adipocytes and alter lipid composition [23] [24]. In an exome-wide association study, a loss-of-function variant of ACVR1C was associated with metabolic phenotypes such as triglyceride and high-density lipoprotein levels [25]. We identified another signal transduction gene, the protein kinase PRKD2, as a candidate gene affecting LDL. PRKD2 deficiency in mice triggers hyperinsulinemia, metabolic disorders and dysregulation of LDL [26].
Our study of IBD revealed a number of interesting candidate IBD risk genes (Table 1). IFNGR2 is a receptor of interferon-gamma, a proinflammatory cytokine whose activation may result in intestinal lesions [27]. IRF8 is an important regulator of multiple immune processes, ranging from antigen presentation to response to cytokines [28]. IRF8 deficiency in an experimental model of colitis resulted in more severe inflammation [29]. CCR5 is a chemokine receptor, best known for its role in HIV infection. CCR5 blockade inhibited leukocyte trafficking and reduced mucosal inflammation in murine colitis [30]. TYMP encodes thymidine phosphorylase, and TYMP mutations cause Mitochondrial Neurogastrointestinal Encephalomyopathy (MNGIE), a rare recessive disease [31]. The disease causes gastrointestinal symptoms and is often misdiagnosed as celiac or Crohn's disease [31]. LSP1 is a regulator of T-cell migration, and deletion of the LSP1 gene has been implicated in rheumatoid arthritis, an autoimmune disease [32].