Adjusting for genetic confounders in transcriptome-wide association studies improves discovery of risk genes of complex traits

Zhao, Siming; Crouse, Wesley; Qian, Sheng; Luo, Kaixuan; Stephens, Matthew; He, Xin

doi:10.1038/s41588-023-01648-9

Technical Report
Open access
Published: 26 January 2024

Nature Genetics volume 56, pages 336–347 (2024)

Abstract

Many methods have been developed to leverage expression quantitative trait loci (eQTL) data to nominate candidate genes from genome-wide association studies. However, these methods, including colocalization, transcriptome-wide association studies (TWAS) and Mendelian randomization-based methods, all suffer from a key problem: when assessing the role of a gene in a trait using its eQTLs, nearby variants and genetic components of other genes’ expression may be correlated with these eQTLs and have direct effects on the trait, acting as potential confounders. Our extensive simulations showed that existing methods fail to account for these ‘genetic confounders’, resulting in severe inflation of false positives. Our new method, causal-TWAS (cTWAS), borrows ideas from statistical fine-mapping and allows us to adjust for all genetic confounders. cTWAS showed calibrated false discovery rates in simulations, and its application to several common traits discovered new candidate genes. In conclusion, cTWAS provides a robust statistical framework for gene discovery.

Main

Genome-wide association studies (GWAS) have identified many loci associated with a range of human traits^1,2. To translate these associations into knowledge of causal genes and molecular mechanisms³, researchers have often used expression quantitative trait loci (eQTL) data, which associate variants with gene expression. In the popular transcriptome-wide association studies (TWAS)^4,5, researchers build predictive models of gene expression from cis-genetic variants, and then test for associations between predicted (‘imputed’) expression and a trait. TWAS thus identifies candidate genes and the likely cell/tissue contexts, and requires only summary statistics. Because of these benefits, TWAS has become widely used to convert GWAS associations into candidate genes⁶. The framework is also applicable to other molecular traits, such as RNA splicing, or chromatin features, further broadening its utility⁷.

A central question in TWAS is whether the identified genes have causal effects on the phenotype. A simple analysis suggests this is not always the case (Fig. 1a). In one scenario, a noncausal gene, X, has an eQTL, G, that is in linkage disequilibrium (LD) with the eQTL of a nearby causal gene X′. This creates a noncausal association of the genetic component of X with the trait. In another scenario, G is in LD with a nearby causal variant, $G'$, which acts on the trait directly, for example, by altering the protein-coding sequence of a nearby gene, again creating a noncausal association of the genetic component of X with the trait. These scenarios are known as ‘horizontal pleiotropy’, a key challenge facing TWAS⁶.

**Fig. 1: Overview of the cTWAS method.**

Alternative methods to jointly analyze eQTL and GWAS data face similar challenges. Colocalization methods test whether gene expression and a trait are affected by the same causal variant^8,9. However, colocalization may still report false-positive findings. This may happen when the eQTL variant G of a gene and a nearby causal variant $G'$ have high LD, as shown in Fig. 1a, thus effectively indistinguishable; or the eQTL variant G has pleiotropic effects on both expression and the trait, without a causal relationship between the two¹⁰. Mendelian randomization (MR) is another strategy to nominate causal genes, treating eQTLs of a gene as instrumental variables (IVs)¹¹. However, the potential pleiotropic effects of instruments and their LD with nearby causal variants violate the key assumption of MR. Several methods such as transcriptome-wide Mendelian randomization (TWMR)¹² and MR-joint-tissue imputation (JTI)¹³ attempted to address this issue by using a heterogeneity filter to remove variants that violate the MR assumption. However, in practice, genes often have only one or few cis-eQTLs (IVs), making the detection of heterogeneity difficult. Lastly, methods such as FOCUS¹⁴ and TWMR¹² jointly analyze multiple genes in a region. While these methods mitigated the challenge due to nearby genes (Fig. 1a, left), they largely failed to account for direct effects of nearby variants (Fig. 1a, right).

Multiple lines of evidence suggest that the scenarios creating possible false-positive findings are common. First, in TWAS and colocalization analysis, it is common to find multiple candidate genes at a single locus, with most genes likely noncausal^6,15. Second, at a biochemical level, coregulation of genes by the same regulatory elements is common¹⁶. Third, eQTLs are pervasive in the genome. In GTEx, half of all common variants are eQTLs in at least one tissue¹⁷, suggesting that chance associations (LD) between eQTLs of noncausal genes and causal variants are probably common¹⁸. All this evidence points to a critical need for better control of false discoveries in TWAS and other eQTL-based analyses.

Here we propose a new statistical framework to address the limitations of existing methods. Our approach can be viewed as a generalization of TWAS, which we term ‘causal-TWAS’ (cTWAS). The fundamental problem of TWAS is that when assessing the association of the imputed expression of any gene (the ‘focal gene’) with a trait, nearby genes and variants may confound this relationship. We refer to them as ‘genetic confounders’ to distinguish them from the environmental confounders that are a common focus in the literature of MR. This reasoning suggests a conceptually simple solution—we should include the tested genes and all genetic confounders in the same model. In practice, implementing this strategy is complicated by high correlations among all these variables, which creates an identifiability challenge. Our key intuition is that causal signals in a genomic region affecting a phenotype of interest, whether via gene expression or variants, are likely sparse. This motivates a Bayesian variable selection model, which has been widely used in statistical fine-mapping^19,20,21,22. Our approach, cTWAS, generalizes standard fine-mapping methods by including imputed gene expression and genetic variants in the same regression model. In realistic simulations and applications to real data, cTWAS greatly reduces the number of false discoveries from TWAS, colocalization and MR-based methods, laying a foundation for the reliable discovery of causal genes from GWAS.

Results

Overview of the cTWAS model

We start with a formal description of standard TWAS²³. We assume genetic variants, denoted as G, affect the expression of a gene, X, which affects a trait y (Fig. 1b). Both X and y could be affected by unobserved environmental variable(s) U, such as diet. We introduce $\tilde{X}$ to denote the cis-genetic component of X. Importantly, the genetic variants G act on y only through $\tilde{X}$. Under this model, the regression coefficient of $\tilde{X}$, with y as the response, would give the causal effect of X on the trait. The confounder U is not a concern here because U contributes only to the nongenetic component of X. In formal language, the path from $\tilde{X}$ to y through U has a collider, X, which blocks the association. By a similar analysis, TWAS is also robust to ‘reverse causality’, where y affects X (Supplementary Fig. 1). In such a case, X is a collider in the paths from $\tilde{X}$ to y.

Unfortunately, the key assumption underlying TWAS, that G is not associated with y through other paths, is often violated. In the example discussed in Fig. 1a, $\tilde{X}$ may become correlated with y through a nearby variant $G'$ or through the genetic component of a nearby gene $X'$. These are technically known as backdoor paths, leading to possible false discoveries by TWAS (Supplementary Fig. 1).

To control for all potential confounders, cTWAS jointly models the dependence of phenotype on all imputed genes, and all variants, with their effect sizes denoted as β_j for genes and θ_m for variants, respectively (Fig. 1c). Joint estimation of all these parameters would then lead to causal effect estimates. In practice, to simplify computation, we partition the genome into disjoint blocks, with imputed expression and variants independent across blocks, and perform the analysis block-by-block.

The potentially high correlations among the variables in cTWAS pose a new challenge. To address this, we assume that in any genomic region, causal effects, whether they are from genes or variants, are sparse. The problem then becomes similar to standard fine-mapping, where one aims to identify a small number of likely causal variants among many correlated ones. Additional intuitions help explain that the model can potentially learn gene effects despite collinearity. While most variants are nonfunctional, gene expression traits should be more likely to have causal effects a priori. Also, a causal gene may have multiple eQTLs, each of which would be associated with the trait. Thus, a single gene effect would be a more parsimonious explanation of data, compared with several independent variant effects.

We thus fit cTWAS using the statistical machinery developed for fine-mapping. We assume sparse prior distributions of the gene and variant effects (Fig. 1c) and use an empirical Bayes strategy to estimate these prior parameters. With the estimated parameters, we infer likely causal genes and variants in each block, using SuSiE, a state-of-the-art fine-mapping method^19,20. The results of cTWAS are expressed as posterior inclusion probabilities (PIPs) of genes and variants, representing the probabilities that genes or variants have nonzero effects (Fig. 1d). While cTWAS is formulated in terms of individual-level data, we have also derived a version based on summary statistics (Fig. 1d; Methods).
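To make the role of the sparse priors concrete, the following minimal sketch computes PIPs in a region under the simplifying assumption that exactly one variable (gene or variant) is causal. It is not SuSiE and not the cTWAS implementation; it uses Wakefield-style approximate Bayes factors computed from marginal z-scores, and the function name, the shrinkage parameter `r` and the example priors are all hypothetical choices made for illustration.

```python
import math


def single_effect_pips(z_scores, prior_weights, r=0.5):
    """PIPs when exactly one variable in the region is assumed causal.

    z_scores: marginal z-scores of the candidate genes and variants.
    prior_weights: prior probability (up to a constant) of each being causal.
    r: W / (V + W), the share of prior effect variance W relative to the
       sampling variance V; a hypothetical tuning value in (0, 1).
    """
    if not 0.0 < r < 1.0:
        raise ValueError("r must lie strictly between 0 and 1")
    # log approximate Bayes factor (alternative vs. null) for each variable
    log_bf = [0.5 * math.log(1.0 - r) + 0.5 * z * z * r for z in z_scores]
    log_post = [math.log(w) + b for w, b in zip(prior_weights, log_bf)]
    # normalize on the log scale to avoid overflow for large z-scores
    top = max(log_post)
    unnorm = [math.exp(v - top) for v in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# One gene and two variants with identical association signals (as under
# perfect LD); the gene prior is 50 times the variant prior.
pips = single_effect_pips([5.0, 5.0, 5.0], [0.01, 0.0002, 0.0002])
print([round(p, 4) for p in pips])
```

With identical association evidence, the PIPs are simply proportional to the priors, so the enrichment of gene effects over variant effects is what lets the model favor the gene in this toy example.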

The cTWAS model generalizes and unifies a number of existing methods (Discussion). We show that under a simple scenario where genes have only single eQTL variants, cTWAS reduces to colocalization methods (Supplementary Notes). Similar to TWAS, cTWAS can also be viewed as a two-stage MR method²⁴, where cis-genetic expression is used as the IV. However, cTWAS accommodates horizontal pleiotropy through the inclusion of effects from variants and other genes. Lastly, while our primary goal is gene discovery, the learned prior parameters allow us to estimate the proportion of heritability attributable to gene expression. This application is related to several other methods^25,26.

cTWAS controls false discoveries in simulation studies

We designed realistic simulations to assess the performance of cTWAS. Previous studies often simulate individual regions, and these regions usually contain causal genes. In our simulations, we created genome-wide data across all regions, under realistic genetic parameters from previous studies²⁵. In particular, the proportion of heritability mediated through eQTLs is relatively low, so many regions may have causal variants, but not causal genes. Specifically, we used genotype data of variants with a minor allele frequency of >0.05 from ∼45k samples of White British ancestry from the UK Biobank²⁷, and imputed gene expression using the prediction models from GTEx by FUSION⁵. We varied prior probabilities for genes and single-nucleotide polymorphisms (SNPs) being causal, and prior effect size variances, with a total of ten settings. We focused on two representative settings in the main results here, where the proportion of trait variance explained by gene expression was set at 10% (high gene PVE setting, where PVE stands for ‘proportion of variance explained’) or 4% (low gene PVE).

We first assessed the accuracy of parameter estimation. cTWAS estimated parameters were generally close to true values (gene results in Fig. 2a, and variant results in Supplementary Fig. 2). In practice, what matters most is the ratio of prior probability of gene effects to that of variant effects. This ‘enrichment’ parameter determines the extent to which the model favors gene versus variant effects. Although cTWAS slightly underestimated some prior parameters under some settings, the estimated enrichment remains accurate (Fig. 2a). Finally, cTWAS accurately estimated the proportion of trait variance explained by the gene effects (Fig. 2a). We also found that PIPs of genes computed by cTWAS are well-calibrated (Fig. 2b and Supplementary Fig. 2). Good calibration means that at PIP > 0.9, we would expect at least 90% of genes above the threshold to be causal genes. Calibration is especially good at the high PIP range (90% or higher), which is what matters most in practice.
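A calibration check of this kind can be sketched as follows, assuming simulated data where the true causal status of each gene is known. The function name, the equal-width binning and the toy inputs are illustrative assumptions, not the procedure used to produce Fig. 2b.

```python
def pip_calibration(pips, is_causal, n_bins=10):
    """Compare mean PIP with the observed fraction of causal genes per bin.

    Returns one tuple per non-empty bin:
    (bin_low, bin_high, mean_pip, observed_fraction_causal, n_genes).
    """
    bins = [[] for _ in range(n_bins)]
    for pip, causal in zip(pips, is_causal):
        # PIP = 1.0 is placed in the last bin
        idx = min(int(pip * n_bins), n_bins - 1)
        bins[idx].append((pip, bool(causal)))
    rows = []
    for idx, members in enumerate(bins):
        if not members:
            continue
        n = len(members)
        mean_pip = sum(p for p, _ in members) / n
        observed = sum(c for _, c in members) / n
        rows.append((idx / n_bins, (idx + 1) / n_bins, mean_pip, observed, n))
    return rows


# Toy input: four high-PIP genes (three truly causal) and two low-PIP genes.
toy_pips = [0.95, 0.95, 0.95, 0.95, 0.05, 0.05]
toy_truth = [True, True, True, False, False, False]
for row in pip_calibration(toy_pips, toy_truth):
    print(row)
```

PIPs are well calibrated when the observed fraction in each bin is close to (or, for a conservative method, above) the mean PIP of that bin.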

**Fig. 2: Parameter estimation and PIP calibration in simulations.**

cTWAS successfully removed many noncausal genes with highly significant associations in standard TWAS (Fig. 3a). We systematically compared the performance of cTWAS with other methods, including the standard TWAS implemented by FUSION⁵, coloc²⁸, MR-based methods (summary-data-based Mendelian randomization (SMR) with HEIDI filter¹¹, MR-JTI¹³, PMR-Egger²⁴ and MRLocus²⁹) and FOCUS¹⁴, a multigene analysis method. Despite using stringent statistical thresholds, all these methods suffered from high false-positive rates (Fig. 3b). In contrast, cTWAS controlled the proportions of false discoveries in all settings (Fig. 3b). The power of cTWAS is somewhat lower, especially in the low gene PVE setting (Fig. 3b). This may reflect the fact that the cTWAS threshold is somewhat conservative. Indeed, despite a threshold of PIP > 0.8, the actual false discovery proportions (FDPs) were well below 20% (Fig. 3b). We also assessed the methods using a different metric: the power of a method at a given FDP. cTWAS again outperformed other methods (Supplementary Fig. 3). Somewhat unexpectedly, the MR-based methods performed similarly to or worse than other methods. We thus performed an additional investigation of false positives in one of these methods, PMR-Egger (Supplementary Notes).
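The two quantities compared here, the FDP among selected genes and the power to recover causal genes, can be computed from simulated truth as in the sketch below. The function name and the toy values are hypothetical; only the PIP > 0.8 threshold comes from the text.

```python
def fdp_and_power(pips, is_causal, threshold=0.8):
    """False discovery proportion and power at a PIP threshold.

    A gene is 'discovered' when its PIP exceeds the threshold.
    FDP = noncausal discoveries / all discoveries (0.0 if none are made).
    Power = causal discoveries / all causal genes (NaN if none exist).
    """
    discovered = [bool(c) for p, c in zip(pips, is_causal) if p > threshold]
    n_discovered = len(discovered)
    n_true = sum(discovered)
    n_causal = sum(bool(c) for c in is_causal)
    fdp = (n_discovered - n_true) / n_discovered if n_discovered else 0.0
    power = n_true / n_causal if n_causal else float("nan")
    return fdp, power


# Toy example: three discoveries, one of them false; three causal genes total.
fdp, power = fdp_and_power(
    [0.99, 0.90, 0.85, 0.50, 0.10],
    [True, True, False, True, False],
)
print(f"FDP = {fdp:.3f}, power = {power:.3f}")
```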

**Fig. 3: Comparison of cTWAS with other methods in simulations.**

We illustrated, with two examples, how cTWAS removed false positives. In the first example, the region has a single causal effect in Gene 1. However, because of LD, two noncausal genes (Genes 2 and 3) also showed strong associations with the trait (Fig. 3c, top). cTWAS correctly identified Gene 1 as the true signal, and assigned low PIPs to the two other genes (Fig. 3c, bottom). In contrast, coloc assigned a high probability of colocalization to the noncausal Gene 3 (coloc PP4 = 0.995). In the second example, the causal signal in the region is an SNP, but it is in LD with the eQTL of Gene 1, creating a significant association of Gene 1 with the trait (Fig. 3d, top). cTWAS was able to correctly identify the SNP effect as the causal signal and assigned low PIP to Gene 1 (Fig. 3d, bottom). Coloc again gave a high probability of colocalization to Gene 1 (PP4 = 0.8).

Finally, we investigated whether cTWAS is robust to different simulation settings. We added a setting where the trait heritability was considerably lower, with PVE of variants 0.1–0.2, and PVE of genes 0.01–0.1. cTWAS was able to estimate the parameters accurately, produce calibrated PIPs and outperform other methods (Supplementary Figs. 4 and 5). Next, we used a different definition of LD blocks³⁰ in running cTWAS. The resulting PIPs are calibrated and highly correlated with those from our default setting (Supplementary Fig. 6). Lastly, we sampled the effect sizes of causal genes and variants from mixtures of several normal distributions. These distributions better capture the ‘long tails’ of effect size distributions, that is, some genes or SNPs have especially large effect sizes. We found that the gene effect enrichment was still accurately estimated, and PIPs were well-calibrated (Supplementary Fig. 7).

cTWAS accurately identified causal genes of LDL cholesterol

We applied cTWAS to GWAS of low-density lipoprotein (LDL) cholesterol from the UK Biobank³¹. We used the expression prediction models from GTEx³² liver in PredictDB^4,33. After harmonizing eQTL data with the UK Biobank LD panel (Methods), we included 9,881 protein-coding genes in the analysis. Using the summary level GWAS data, cTWAS estimated that genes were 62 times more likely than variants to be causal for LDL a priori (Supplementary Fig. 8a). Genetic variants and imputed expression together explained 5.6% of the variation of LDL (total heritability), of which 22.7% was attributable to expression. These estimates are in line with the 8.3% estimate for total heritability using LD score regression³⁴ and 33.5% of mediated heritability through expression using MESC²⁵. The somewhat lower estimates of cTWAS may result from its assumption of sparse causal effects.
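As a worked reading of these numbers, and assuming the 22.7% is a fraction of the 5.6% total, the implied proportions of LDL variance explained by imputed expression and by variants are approximately:

$$\mathrm{PVE}_{\mathrm{gene}}\approx 0.056\times 0.227\approx 0.013,\qquad \mathrm{PVE}_{\mathrm{variant}}\approx 0.056\times (1-0.227)\approx 0.043.$$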

cTWAS identified 35 genes with PIP > 0.8 (Supplementary Table 1). In contrast, standard TWAS identified 215 genes at a Bonferroni-corrected threshold of 0.05. Following an earlier strategy to assess these results³⁵, we used 69 known LDL-related genes as the positive set (‘silver standard’)^13,36, and nearby ‘bystander’ genes within 1 Mb as the negative set. We limited our analysis to 46 imputable genes of 69 silver standard genes and 539 imputed bystander genes (Supplementary Table 2). cTWAS has a precision of 75% (6 of 8, Fig. 4a), greatly outperforming standard TWAS, which has a precision of 31% (19 of 61).
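The precision figures above follow from restricting the called genes to those labeled as silver standard (positives) or bystanders (negatives). A minimal sketch of that bookkeeping is shown below; the function name and the placeholder gene identifiers are hypothetical, and only the counts (6 of 8, 19 of 61) come from the text.

```python
def precision(called, positives, negatives):
    """Precision among called genes that carry a positive or negative label.

    Called genes that are in neither set are ignored, mirroring an
    evaluation restricted to silver standard and bystander genes.
    """
    called = set(called)
    true_pos = len(called & set(positives))
    false_pos = len(called & set(negatives))
    labeled = true_pos + false_pos
    return true_pos / labeled if labeled else float("nan")


# Placeholder identifiers: six labeled positives and two bystanders are
# called, plus one unlabeled gene that does not enter the calculation.
positives = {f"P{i}" for i in range(1, 7)}
negatives = {"N1", "N2", "N3"}
called = positives | {"N1", "N2", "U1"}
print(precision(called, positives, negatives))  # 6 of 8

# The reported counts give the same percentages directly.
print(round(6 / 8, 3), round(19 / 61, 3))
```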

**Fig. 4: cTWAS avoids false discoveries of candidate genes of LDL cholesterol.**

To illustrate how cTWAS avoided false positives, we examined two loci in detail. The first locus contains five genes substantially associated with LDL by TWAS, including HPR and four other genes. cTWAS identified a single candidate, HPR (PIP = 1.000), while giving no evidence (PIP < 0.01) to all other genes (Fig. 4b). Literature evidence suggests that HPR, a haptoglobin-related protein that binds hemoglobin and apolipoprotein-L³⁷, is the likely causal gene at this locus. For comparison, we also ran a few other methods (Supplementary Fig. 9). Coloc reported modest evidence of colocalization for HPR (PP4 = 0.64). SMR missed HPR and reported two other genes instead. While FOCUS gave high PIP to HPR, it also reported additional high PIP genes. The extra candidate genes from SMR and FOCUS have no obvious connections with the biology of LDL. This example shows that cTWAS avoids false positives due to confounding with nearby gene expression.

The second locus has three genes strongly associated with LDL by TWAS (Fig. 4c, top). A recent method, MR-JTI¹³, highlighted POLK, DNA polymerase κ, as the potential causal gene at this locus, and proposed a connection between DNA repair and regulation of LDL. The associations of the three genes, however, were much weaker than some nearby variants. Indeed, cTWAS selected several variants as causal signals while giving little evidence to all three genes (Fig. 4c, bottom). Other popular methods (coloc, SMR and FOCUS) all gave modest or strong support of POLK as the risk gene (Supplementary Fig. 10).

To better understand these results, we inspected the fine-mapping results of PolyFun, which uses functional information of variants to improve fine-mapping³⁸. PolyFun identified two credible sets in the region, both of which are inside or close to the gene HMGCR, whose expression was not imputable in our data (Fig. 4d). All these variants are far from the three TWAS genes (>200 kb). In addition, promoter-capture Hi-C (PC-HiC) and the activity-by-contact score in the liver provided no evidence linking these variants to POLK. Instead, the top variant, rs12916 (PIP = 0.99) is within the 3′ UTR of HMGCR, and 1,310 bp away from a chromatin loop interacting with the HMGCR promoter (Fig. 4d). Consistent with these results, HMGCR is an enzyme for cholesterol synthesis and the target of statin, a key drug for reducing LDL levels³⁹. All the evidence thus points to HMGCR, instead of POLK, as the causal gene in this region. This example demonstrates that by controlling nearby genetic variants, cTWAS is able to avoid false-positive genes.

We systematically evaluated the sources of false-positive findings from standard TWAS. We call a gene a likely false positive if it is significant under TWAS (Bonferroni threshold), but PIP < 0.5 under cTWAS. These cases were classified into ‘confounding by genes’ or ‘confounding by variants’ depending on whether the low PIPs of these genes were driven by nearby genes or variants (Methods). The majority of 83 false-positive genes (75%; Fig. 4e) were driven by confounding variants. These results show that the greatest risk of TWAS is not shared eQTLs among nearby genes but the correlation of genes with nearby variants whose effects are not manifested as eQTLs.
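The rule for flagging likely false positives can be written down directly. The sketch below assumes the Bonferroni correction is taken over the number of genes tested by TWAS at a family-wise level of 0.05; the function name, argument layout and gene labels are hypothetical.

```python
def likely_false_positives(twas_pvalues, ctwas_pips, alpha=0.05, pip_cutoff=0.5):
    """Genes significant under TWAS (Bonferroni) but with cTWAS PIP < cutoff.

    twas_pvalues: dict mapping gene -> TWAS P value.
    ctwas_pips: dict mapping gene -> cTWAS PIP.
    Only genes present in both dicts are considered.
    """
    # Assumption: the correction uses the number of TWAS-tested genes.
    bonferroni = alpha / len(twas_pvalues)
    return sorted(
        gene
        for gene, p_value in twas_pvalues.items()
        if gene in ctwas_pips
        and p_value < bonferroni
        and ctwas_pips[gene] < pip_cutoff
    )


twas = {"A": 1e-10, "B": 1e-10, "C": 0.01, "D": 0.5}
pips = {"A": 0.95, "B": 0.02, "C": 0.30, "D": 0.00}
print(likely_false_positives(twas, pips))  # genes B and C are flagged
```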

To seek new insights into the genetics of LDL, we evaluated the functions of 35 genes with cTWAS PIP > 0.8 (Fig. 5a). Only six of these genes were in the curated silver standard genes, and 20 were not the nearest genes of GWAS lead variants (Fig. 5a). The 35 genes were enriched for multiple cholesterol-related Gene Ontology (GO) Biological Process terms (false discovery rate (FDR) < 0.05; Fig. 5b, 13 nonredundant terms shown; Supplementary Table 3). Compared with the GO enrichment results from silver standard genes (Supplementary Table 4) and GWAS gene set analysis using MAGMA (Supplementary Table 5), several GO terms from cTWAS genes were new, including ‘peptidyl-serine phosphorylation’ and ‘activin receptor signaling pathway’, highlighting the importance of signal transduction in LDL regulation. Activin signaling, in particular, regulates metabolic processes including lipolysis and energy homeostasis^40,41. The cTWAS genes associated with the two terms include well-known LDL genes, such as CSNK1G3, TNKS and GAS6, as well as new and promising genes such as ACVR1C, an activin receptor, and PRKD2 (Fig. 5c, Supplementary Fig. 8b and Supplementary Notes). In the cases of ACVR1C and PRKD2, no nearby variant reaches genome-wide significance.

**Fig. 5: Candidate genes and pathways for LDL discovered by cTWAS.**

While cTWAS reduced false positives and identified promising LDL candidate genes, its power seemed low, identifying 6 of 69 silver standard genes (Fig. 4a). To understand why, we categorized the outcome of cTWAS for all 69 genes (Fig. 5d). Many silver standard genes had no significant GWAS association signals nearby (26.1%, 18 of 69), no imputable liver expression (33.3%, 23 of 69) or insignificant TWAS associations (13.0%, 9 of 69). These results suggest that to improve the power of cTWAS, and eQTL-based methods in general, it is necessary to improve the power of GWAS and the power of eQTL studies, and include more trait-related tissues/cell types (Discussion).

cTWAS discovered candidate genes of several common traits

We applied cTWAS to GWAS summary statistics of inflammatory bowel disease (IBD), systolic blood pressure (SBP) and schizophrenia (SCZ). We used the expression prediction models of protein-coding genes from PredictDB^4,33,42 across 49 tissues in GTEx³². These models borrowed information across tissues to improve prediction accuracy⁴³. The number of imputed genes ranged from 6,591 to 11,985 across tissues (Supplementary Fig. 11). We ran cTWAS analysis in each tissue separately. We summarized the results below, with an emphasis on IBD as a representative trait.

We first assessed the parameters learned by cTWAS. The prior probability of a gene being causal ranged from 0.17% to 2.16% across tissue–trait pairs (Supplementary Fig. 12 and Supplementary Table 6). For example, for IBD, the top tissue is whole blood, with an estimated 1.54% of genes being causal. The estimated proportions of heritability explained by the genetic components of expression were generally small, for example, for IBD, from 4% to 15% (Fig. 6a and Supplementary Table 6). These estimates were in line with estimated values from MESC (Fig. 6a, Supplementary Fig. 13 and Supplementary Table 7).

**Fig. 6: cTWAS analysis of IBD and other traits using all GTEx tissues.**

We next assessed the number of high-confidence genes at PIP > 0.8 (Supplementary Fig. 14). In the top two tissues per trait, cTWAS identified 13–26 genes (Fig. 6b). In general, the number of cTWAS genes was much smaller than those from standard TWAS (Supplementary Fig. 15). For instance, for IBD, while TWAS reported 68–125 genes across 49 tissues, cTWAS identified 0–17 genes (Supplementary Fig. 15). These results show that only a small proportion of genes found by TWAS are likely causal.

To increase the power, we grouped related tissues into ‘tissue groups’ and took the union of genes across tissues within a group (Fig. 6c and Supplementary Fig. 16). The top tissue groups include trait-relevant tissues, such as ‘digestive’ tissue for IBD (Fig. 6c), ‘cardiovascular’ for SBP and ‘central nervous systems’ for SCZ (Supplementary Fig. 16). The number of discovered genes in the top tissue group per trait ranged from 37 (IBD) to 48 (SBP), highlighting the increased power of discovery from multiple tissues. We also assessed the novelty of the found genes. In the case of IBD, most cTWAS genes were not in the curated genes for IBD⁴⁴, and not the nearest protein-coding genes of lead genome-wide significant GWAS variants (Fig. 6c).
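Taking the union of high-PIP genes within each tissue group can be sketched as below. The function name, the tissue labels and the group assignment are placeholders for illustration; the actual grouping of GTEx tissues is the one shown in Fig. 6c and Supplementary Fig. 16.

```python
def genes_by_tissue_group(pips_by_tissue, tissue_to_group, threshold=0.8):
    """Union of genes with PIP above the threshold across tissues in a group.

    pips_by_tissue: dict mapping tissue -> {gene: PIP} from per-tissue runs.
    tissue_to_group: dict mapping tissue -> tissue group label.
    Returns a dict mapping tissue group -> set of discovered genes.
    """
    groups = {}
    for tissue, gene_pips in pips_by_tissue.items():
        hits = {gene for gene, pip in gene_pips.items() if pip > threshold}
        groups.setdefault(tissue_to_group[tissue], set()).update(hits)
    return groups


# Placeholder tissues and genes.
pips_by_tissue = {
    "tissue_1": {"g1": 0.95, "g2": 0.40},
    "tissue_2": {"g2": 0.85, "g3": 0.10},
    "tissue_3": {"g4": 0.99},
}
tissue_to_group = {"tissue_1": "group_a", "tissue_2": "group_a", "tissue_3": "group_b"}
print(genes_by_tissue_group(pips_by_tissue, tissue_to_group))
```

In this toy input, g2 is missed in one tissue but recovered in another, which is how pooling tissues within a group can raise the number of discoveries.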

We found that most cTWAS genes were identified in a small number of tissues (Supplementary Fig. 17). For instance, for 56 IBD genes found in the ‘blood/immune’ or ‘digestive’ tissue groups, 57% were found, at a relaxed threshold of PIP > 0.5, in five or fewer tissues (Fig. 6d). One caveat in interpreting these findings is that the power of discovery is low, so cTWAS may underestimate the number of tissues for discovered genes.

We examined specific genes found by cTWAS (see Supplementary Data for all traits). We focused our analysis here on 56 IBD candidate genes at PIP > 0.8, in the following two biologically relevant tissue groups: digestive and blood/immune (Supplementary Table 8). At a more stringent PIP > 0.9, 30 genes were found (Table 1). The set of 56 genes included well-known IBD genes, such as TNFSF15, CARD9, RNF186, ITGAL and ATG16L1. Gene set enrichment analysis revealed IBD-related GO terms (Supplementary Table 9). Using Weighted Set Cover⁴⁵, we identified four nonredundant GO terms, including ‘cytokine production’ and ‘defense response’ (Fig. 6e).

Table 1 IBD genes detected by cTWAS in the blood/immune and digestive tissue groups

We highlight some new genes found by cTWAS. Many of these genes, namely, IFNGR2, FOSL2, STAT3, FCGR2A, IRF8 and ZFP36L2 (Supplementary Note) are located within known IBD-associated loci and have immune functions. cTWAS also identified new genes in the loci whose associations fall below the standard GWAS cutoff. Some of these genes, including UBE2W (Fig. 6f), TYMP, LSP1 and CCR5 (Supplementary Fig. 18 and Supplementary Note), have IBD-related functions. For example, UBE2W is a ubiquitin-conjugating enzyme. Ubiquitination is a post-translational modification that controls multiple steps in autophagy, a key process implicated in IBD. Indeed, UBE2W knockdown mice showed mucosal injuries, and its overexpression ameliorated the severity of experimental colitis, a model of IBD⁴⁶.

Discussion

Expression QTL data are commonly used to nominate candidate genes for complex traits. Existing methods for such analysis, however, are susceptible to false-positive findings. Our approach generalizes the TWAS model by jointly modeling the effects of all gene expression traits and genetic variants in a region. Through simulations and applications to several GWAS traits, we showed that cTWAS reduced false findings and discovered a number of candidate genes for these traits, highlighting its potential as a powerful gene discovery tool.

cTWAS is related to existing methods but has several key advantages. When the gene of interest has a single causal eQTL, and the gene is the only causal gene in a locus, cTWAS reduces to colocalization analysis (Supplementary Note)^8,28,47. Colocalization, however, typically focuses on individual variants, yet cTWAS uses imputed gene expression, which combines the effects of multiple variants. While colocalization has been generalized⁴⁸, it does not explicitly account for the combined effects of variants. cTWAS can also be viewed as a generalization of FOCUS, which uses a similar fine-mapping framework, but includes mostly gene effects, with a very simple model of variant effects. As our results showed (Fig. 4e), confounding by nearby variants is a much more common source of false discoveries. cTWAS is also related to some MR-based methods. PMR-Egger²⁴ jointly models the effect of a gene on a phenotype and the potential pleiotropic effects of variants. This model, however, analyzes one gene at a time, and its treatment of pleiotropy is overly simplified, assuming all genetic instruments of a gene have identical pleiotropic effects. TWMR¹² uses multivariate MR to jointly infer the causal effects of multiple genes in a locus. However, it does not explicitly model the pleiotropic effects from variants.

The power of cTWAS is somewhat limited (Fig. 5d). This probably reflects the fact that cis-genetic components of expression explain relatively low proportions of heritability²⁵ (Fig. 6a). One explanation is that most complex traits probably have genetic components from multiple tissues, while our analysis was limited to one tissue at a time. Indeed, combining results across multiple tissues increased the power of cTWAS (Fig. 6c). Another explanation is that regulatory variants may act in specific cell types, developmental stages or conditions (for example, stimulation), and are missed by current eQTL studies. Ongoing efforts to map eQTLs across various cell types and in disease-related conditions would mitigate this challenge and improve the power of cTWAS. Lastly, we note that cTWAS can be applied to other types of molecular QTL data, for example, splicing or chromatin accessibility QTLs, which may explain a large fraction of heritability missed by eQTLs⁴⁹.

We discuss possible directions for further development. First, it is relatively straightforward to include more tissues or cell types in cTWAS. This can be done by including multiple groups of imputed expression traits, with different priors for different groups. This may increase the power to detect causal genes and help identify the ‘causal contexts’ of these genes. Second, we treated imputed expression levels as given. It may be helpful to account for imputation errors in the model⁵⁰. Third, cTWAS assumes that eQTL and GWAS samples are from the same population ancestry. An important direction is to extend cTWAS to multiple ancestries. Lastly, it would be interesting to generalize the model to allow joint analysis of multiple types of molecular QTL data.

In conclusion, by modeling genetic variants and imputed gene expression jointly, cTWAS accounts for pleiotropic effects and LD, creating a robust framework for detecting causal genes. With the large amount of molecular QTL datasets available and being generated, cTWAS promises to translate genetic associations of diseases into knowledge of risk genes, disease mechanisms and potential therapeutic targets.

Methods

Model of individual-level data

Let y be the quantitative phenotype, assumed to be standardized, of an individual. We assume that y depends on imputed gene expressions and variant genotypes of the individual. We denote by X_j the expression of gene j, by $\tilde{{X}_{j}}$ its cis-genetic component and by G_m the genotype of variant m. We assume that ${\tilde{X}}_{j}$ is given, imputed from a pretrained expression prediction model, and we ignore imputation errors and uncertainty. We have the following regression model:

$$y=\mathop{\sum}\limits_{j}{\beta }_{j}\tilde{{X}_{j}}+\mathop{\sum}\limits_{m}{\theta }_{m}{G}_{m}+\epsilon ,$$

(1)

where β_j and θ_m are the effect sizes of gene expression j and the variant m, respectively. ϵ is a normally distributed error term, that is, ϵ ∼ N(0, σ²), and is assumed to be independent across individuals. In practice, we standardize both $\tilde{{X}_{j}}$ and G_m to make the variance equal to 1 for all the genes and variants.

To obtain the imputed expressions, we use existing expression prediction models. Specifically, the imputed expression of a gene j is defined as ∑_lw_jlG_l, where G_l is the genotype of variant l, and w_jl is the weight of the lth variant in gene j’s expression prediction model. We assume that these weights are given at the standardized scale, that is, the weights were derived using standardized variant genotypes. This is the case for the FUSION expression models (http://gusevlab.org/projects/fusion/). When the provided weights are not on the standardized scale, for example, from PredictDB (https://predictdb.org/), these weights must be scaled. This can be done by multiplying the weights by genotype variances from the LD reference.
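The imputation step can be illustrated with a minimal Python sketch. It assumes that the weights are already on the standardized-genotype scale, and the toy genotypes, weights and function names are hypothetical rather than taken from the cTWAS software.

```python
import numpy as np

def standardize(a):
    """Center each column and scale it to unit variance."""
    a = np.asarray(a, dtype=float)
    return (a - a.mean(axis=0)) / a.std(axis=0)

def impute_expression(genotypes, weights):
    """Imputed expression of gene j = sum_l w_jl * G_l, then standardized.

    genotypes: (n_samples, n_variants) allele counts
    weights:   (n_variants, n_genes) prediction weights, assumed to be on
               the standardized-genotype scale
    """
    g_std = standardize(genotypes)
    return standardize(g_std @ weights)

# Toy data (hypothetical): 200 individuals, 5 variants, 2 genes
rng = np.random.default_rng(0)
G = rng.binomial(2, 0.3, size=(200, 5))
W = rng.normal(size=(5, 2))
X_tilde = impute_expression(G, W)
```

The final standardization mirrors the convention above that every imputed expression has variance 1.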

We specify different prior distributions of gene effects β_j’s, and variant effects θ_m’s. To describe these priors, we note that our model is a special case of a more general regression model, where explanatory variables come from multiple groups with different distributions of effect sizes.

We write the general model with K groups of explanatory variables as

$$y=\mathop{\sum }\limits_{k=1}^{K}\,\mathop{\sum}\limits_{j\in {M}_{k}}{\beta }_{j}{X}_{j}+\epsilon ,$$

(2)

where X_j is the jth explanatory variable and j ∈ M_k denotes that it belongs to group k. In our case, the model has two groups of variables, imputed gene expressions and genetic variants. For simplicity of notation, we will use this general model in our following discussions. We assign a spike-and-slab prior distribution for the effect of variable j, with group-specific prior parameters. Specifically, when j ∈ M_k, we denote by γ_j an indicator of whether X_j has a nonzero effect:

$$\begin{array}{rcl}{\gamma }_{j}& \sim &\,{{\mbox{Bernoulli}}}\,({\pi }_{k})\\ {\beta }_{j}| {\gamma }_{j}=1& \sim &N(0,{\sigma }_{k}^{2})\\ {\beta }_{j}| {\gamma }_{j}=0& \sim &{\delta }_{0}.\end{array}$$

(3)

Here δ₀ is the Dirac delta function (a point mass at zero), π_k = P(γ_j = 1∣j ∈ M_k) is the prior probability of the jth variable from group k being causal to the trait (nonzero effect) and ${\sigma }_{k}^{2}$ is the prior variance of the effect size of causal variables in the group k.
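As an illustration of the prior in Eq. (3), the following Python sketch draws indicators and effects for two groups. The group sizes and parameter values are toy choices for illustration and are not estimates from this study.

```python
import numpy as np

def sample_spike_and_slab(group, pi, sigma2, rng):
    """Draw (gamma_j, beta_j) from the group-specific prior of Eq. (3).

    group:  integer group label k for each variable j
    pi:     pi[k], prior probability of a nonzero effect in group k
    sigma2: sigma2[k], prior variance of nonzero effects in group k
    """
    group = np.asarray(group)
    pi = np.asarray(pi, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    gamma = rng.random(group.size) < pi[group]        # Bernoulli(pi_k)
    slab = rng.normal(0.0, np.sqrt(sigma2[group]))    # N(0, sigma_k^2)
    beta = np.where(gamma, slab, 0.0)                 # point mass at 0 otherwise
    return gamma, beta

rng = np.random.default_rng(1)
group = np.array([0] * 1000 + [1] * 100000)  # 0 = genes, 1 = variants (toy sizes)
gamma, beta = sample_spike_and_slab(
    group, pi=[0.05, 1e-3], sigma2=[0.02, 0.001], rng=rng
)
```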

Inference of the individual-level model

The inference has two main steps. In the first step, we estimate the prior parameters $\bf{\theta} =\{{\pi }_{k},{\sigma }_{k}^{2},k\in \{1,2\}\}$ for the two groups, gene effects and variant effects. In the second step, we use the estimated θ to compute the PIP of each variable, defined as the posterior probability of γ_j = 1 given all the data and parameters.

The parameter estimation is done by maximum likelihood. Let y_n×1 be the data of the response variable, where n is the sample size. Let X_n×p = [X₁X₂…X_p] be the data of all the p explanatory variables. The likelihood of our model is given by

$$L(\bf{\theta} ;\bf{X},\bf{y},\sigma )=P(\bf{y}| \bf{X},\bf{\theta} ,\sigma )=\mathop{\sum}\limits_{{\bf{\Gamma}}}P(\bf{y}| \bf{X},\bf{\Gamma},\theta ,\sigma )P(\bf{\Gamma }| \bf{\theta} ),$$

(4)

where Γ = [γ₁, γ₂, …, γ_p] represents the ‘configuration’ of the causal (nonzero effect) status of all variables. We note that σ is the standard deviation of the error term ϵ, and is assumed to be given (see below). To maximize the likelihood, we use the expectation-maximization (EM) algorithm. In the E-step, we obtain the expectation of the log-likelihood over Γ, ${{\mathbb{E}}}_{{{{\bf{\Gamma }}}}}\log P({{{\bf{X}}}},{{{\bf{y}}}},{{{\bf{\Gamma }}}}| \bf{\theta }^{(t)},\sigma )$, where θ^(t) is the parameter value in the t-th iteration. In the M-step, we update θ^(t) using the following rules to maximize the expectation from the E-step (Supplementary Note):

$${\pi }_{k}^{(t+1)}=\frac{1}{| \bf{M}_{k}| }\mathop{\sum}\limits_{j\in \bf{M}_{k}}{\alpha }_{j}^{(t)}$$

(5)

$${\sigma }_{k}^{2,(t+1)}=\frac{{\sum }_{j\in \bf{M}_{k}}{\alpha }_{j}^{(t)}\cdot {\tau }_{j}^{2,(t)}}{{\sum }_{j\in \bf{M}_{k}}{\alpha }_{j}^{(t)}},$$

(6)

where ∣M_k∣ is the number of variables in group k, ${\alpha }_{j}^{(t)}=P({\gamma }_{j}=1| \bf{X},\bf{y},{\bf{\theta} }^{(t)},\sigma )$ is the PIP of variable j given data and current parameter values θ^(t) and ${\tau }_{j}^{2,(t)}={\mathbb{E}}({\,\beta }_{j}^{2}| {\gamma }_{j}=1,\bf{X},\bf{y},{\bf{\theta}}^{(t)},\sigma )$ is the second moment of the posterior effect size of variable j, given that it is a causal variable. The update rules have simple interpretations. The new parameter ${\pi }_{k}^{(t+1)}$ is simply the average PIP of all variables in the group k and the new ${\sigma }_{k}^{2,(t+1)}$ is the PIP-weighted average of the second moments of the posterior effect sizes.
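The M-step updates in Eqs. (5) and (6) can be written in a few lines of Python. This is a minimal sketch that takes the PIPs and posterior second moments as given inputs; the function name and the toy numbers are illustrative.

```python
import numpy as np

def m_step(alpha, tau2, group, n_groups):
    """Update (pi_k, sigma_k^2) from PIPs and posterior second moments.

    alpha: alpha_j, current PIP of each variable
    tau2:  tau_j^2, E[beta_j^2 | gamma_j = 1, data] for each variable
    group: integer group label of each variable (0..n_groups-1)
    """
    alpha, tau2, group = map(np.asarray, (alpha, tau2, group))
    size = np.bincount(group, minlength=n_groups)                    # |M_k|
    sum_alpha = np.bincount(group, weights=alpha, minlength=n_groups)
    sum_alpha_tau2 = np.bincount(group, weights=alpha * tau2, minlength=n_groups)
    pi_new = sum_alpha / size                 # Eq. (5): mean PIP in group k
    sigma2_new = sum_alpha_tau2 / sum_alpha   # Eq. (6): PIP-weighted mean of tau_j^2
    return pi_new, sigma2_new

alpha = np.array([0.9, 0.1, 0.5, 0.0, 0.5])
tau2 = np.array([2.0, 1.0, 4.0, 3.0, 2.0])
group = np.array([0, 0, 1, 1, 1])
pi_new, sigma2_new = m_step(alpha, tau2, group, n_groups=2)
```

In this toy input, group 0 has two variables with PIPs summing to 1, so its updated π is 0.5.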

Computing α_j and ${\tau }_{j}^{2}$ at the t-th iteration (we drop the superscript t from here on for simplicity) using all variables in the genome is computationally challenging. To reduce the computational burden, we divide the genome into LD blocks using LDetect⁵¹, with variants approximately independent between blocks. We assign a gene to an LD block if all SNPs in its expression prediction model fall into that block. If the variants of the prediction model of any gene span multiple LD blocks, we merge all such blocks into a new block. We then compute α_j and ${\tau }_{j}^{2}$ of the variables in each block independently, while still using all variables in the genome to update the parameters using Eqs. (5) and (6).
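The block-merging rule can be sketched with a small union-find structure in Python. This is an illustrative implementation under the assumption that each gene is summarized by the set of block indices containing its model SNPs; it is not the code used by cTWAS.

```python
def merge_blocks(n_blocks, gene_blocks):
    """Merge LD blocks that are spanned by a single gene's prediction model.

    n_blocks:    number of original LD blocks (indexed 0..n_blocks-1)
    gene_blocks: dict gene -> set of block indices containing its model SNPs
    Returns (block_to_region, gene_to_region) using merged-region labels.
    """
    parent = list(range(n_blocks))

    def find(b):
        while parent[b] != b:
            parent[b] = parent[parent[b]]  # path halving
            b = parent[b]
        return b

    for blocks in gene_blocks.values():
        blocks = sorted(blocks)
        for b in blocks[1:]:
            parent[find(b)] = find(blocks[0])  # union with the gene's first block

    block_to_region = [find(b) for b in range(n_blocks)]
    gene_to_region = {g: find(min(bs)) for g, bs in gene_blocks.items()}
    return block_to_region, gene_to_region

block_to_region, gene_to_region = merge_blocks(
    5, {"geneA": {0}, "geneB": {1, 2}, "geneC": {2, 3}, "geneD": {4}}
)
```

Here geneB spans blocks 1 and 2 and geneC spans blocks 2 and 3, so blocks 1 to 3 end up in one merged region.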

Even within a single block, there may still be hundreds to thousands of variables. This makes it difficult to compute α_j and ${\tau }_{j}^{2}$, as it requires marginalization over Γ. To address this challenge, we first note that our problem is now reduced to a standard fine-mapping or Bayesian variable selection problem, with different prior distributions of the effects of different variables. Therefore, we borrow from the fine-mapping literature to compute α_j and ${\tau }_{j}^{2}$ (refs. ¹⁹,²⁰; see Supplementary Note for details).

After we estimate the prior parameters, we apply SuSiE¹⁹, a fine-mapping method, on all variables, including both genes and variants, in each block. Note that all blocks, including the large blocks pruned in the parameter estimation step, will be analyzed. In applying SuSiE, we set the prior probability and prior effect variance of each variable, using the estimated parameters of the group (genes or variants) that this variable belongs to. We allow multiple causal variables by setting L = 5 in SuSiE and assigning null weight as 1 − ∑_jp_j. SuSiE will then return PIPs of all genes and variants in each LD block.
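The per-variable prior settings described above can be assembled as follows. This Python sketch only builds the prior vectors and the null weight 1 − ∑_j p_j; how they are passed to SuSiE is not shown, the parameter values are illustrative, and the sketch assumes that ∑_j p_j < 1 within the block.

```python
import numpy as np

def susie_priors(is_gene, pi_gene, pi_variant, sigma2_gene, sigma2_variant):
    """Per-variable prior inclusion probability and effect variance for one block."""
    is_gene = np.asarray(is_gene, dtype=bool)
    p = np.where(is_gene, pi_gene, pi_variant)               # p_j
    prior_var = np.where(is_gene, sigma2_gene, sigma2_variant)
    null_weight = 1.0 - p.sum()                              # 1 - sum_j p_j
    return p, prior_var, null_weight

# Toy block: 3 genes followed by 500 variants (illustrative parameter values)
is_gene = np.array([True] * 3 + [False] * 500)
p, prior_var, null_weight = susie_priors(is_gene, 0.01, 2e-4, 20.0, 10.0)
```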

Model of summary statistics

The summary data would include the effect size estimates of variants, ${\hat{\beta }_{j}}$, and their standard errors s_j, as well as the LD between all pairs of variants, denoted as the matrix R. The effect sizes can be standardized, denoted as ${\hat{z}}_{j}={\hat{\beta }}_{j}/{s}_{j}$. Given that the summary data have only variant information, our model would first need to expand the summary data to include gene information. Specifically, we compute the marginal association of each imputed gene with the GWAS trait, and the correlation of any gene with all other genes and all the variants. These calculations are described in the Supplementary Note. Once computed, we will have marginal associations of all variables, including genes and variants, $\hat{{{{\bf{z}}}}}$, and their correlation matrix R. These data would be the input of our analysis.
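The exact expressions for the gene-level quantities are given in the Supplementary Note and are not reproduced here. As a hedged illustration, the Python sketch below uses the standard summary-statistic TWAS form, which is an assumption of this sketch: with weights w on the standardized-genotype scale, the gene z score is w^T ẑ / sqrt(w^T R w) and the correlation of the gene with each variant is R w / sqrt(w^T R w).

```python
import numpy as np

def gene_z_and_correlations(w, z_snp, R_snp):
    """Marginal z score of one imputed gene and its correlation with each variant.

    w:     prediction weights on the standardized-genotype scale, shape (p,)
    z_snp: variant z scores, shape (p,)
    R_snp: variant LD (correlation) matrix, shape (p, p)
    """
    w = np.asarray(w, dtype=float)
    sd_gene = np.sqrt(w @ R_snp @ w)     # SD of the imputed expression
    z_gene = (w @ z_snp) / sd_gene
    r_gene_snp = (R_snp @ w) / sd_gene   # corr(imputed expression, each variant)
    return z_gene, r_gene_snp

R = np.array([[1.0, 0.5], [0.5, 1.0]])
z = np.array([3.0, 1.0])
z_gene, r = gene_z_and_correlations([1.0, 0.0], z, R)
```

With a single-variant model, as in this toy call, the gene inherits the z score and LD pattern of that variant.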

Following the literature²⁰,⁵², and particularly, the summary statistics version of SuSiE (SuSiE-RSS)²⁰, we have the following model of $\hat{\bf{z}}$:

$$\hat{\bf{z}}| \bf{z},\bf{R} \sim \bf{N}_{p}(\bf{R}\bf{z},\bf{R}),$$

(7)

where z = (z₁, z₂, …, z_p) denotes the ‘standardized’ true effect sizes. We use the same spike-and-slab prior for z_j—when the variable j belongs to the group k

$$\begin{array}{rcl}{\gamma }_{j}& \sim &\,{{\mbox{Bernoulli}}}\,({\pi }_{k})\\ {z}_{j}| {\gamma }_{j}=1& \sim &N\left(0,{\sigma }_{k}^{2}\right)\\ {z}_{j}| {\gamma }_{j}=0& \sim &{\delta }_{0}.\end{array}$$

(8)

Again, we denote $\bf{\theta} =\{{\pi }_{k},{\sigma }_{k}^{2}\}$ the prior parameters and Γ the causal configuration. We estimate the prior parameters θ by MLE. This can be done with the same algorithm used for the individual-level model. Specifically, following SuSiE-RSS²⁰, the likelihood function under the individual-level data can be rewritten in terms of sufficient statistics and ${s}_{j}^{2}$. Then, if we make the following substitutions, the likelihood of the individual-level model would be identical to that of the summary statistics model. Specifically, we change β = (β₁, β₂, …, β_p) to z, X^TX to R, X^Ty to $\hat{\bf{z}}$, y^Ty to 1 and n to 1. Also, the prior model of z in the summary statistics model is the same as the prior model of β in the individual-level model. Therefore, we can use the same EM algorithm and the update rules to estimate θ. The update rules follow Eqs. (5) and (6), where the PIP of variable j is now defined as ${\alpha }_{j}^{(t)}=P({\gamma }_{j}=1| \hat{\bf{z}},\bf{R},\bf{\theta }^{(t)})$ and the second moment of the posterior effect ${\tau }_{j}^{2,(t)}={\mathbb{E}}({z}_{j}^{2}| {\gamma }_{j}=1,\hat{\bf{z}},{{{{\bf{R}}}}},\bf{\theta }^{(t)})$.

Once the parameters were estimated, we followed the same procedure as in the individual-level model to obtain PIPs of all variables, except that SuSiE-RSS is used in fine-mapping.

Estimating proportions of phenotypic variance explained by variants and genes

We assume that all the explanatory variables and the response variable in the regression model are standardized, with a variance equal to 1. Then the proportion of variance explained (PVE) by a single variable, j, is simply ${\beta }_{j}^{2}\cdot {{{\rm{Var}}}}(\bf{X}_{j})/{{{\rm{Var}}}}(\bf{y})={\beta }_{j}^{2}$. Assuming that we use the z scores in the summary statistics model, the effect size is related to the z score by ${\beta }_{j}={z}_{j}/\sqrt{n}$, where n is the sample size. So, on average, the PVE of a causal variable in group k (variant or gene) is ${\mathbb{E}}({\beta }_{j}^{2})={\mathbb{E}}({z}_{j}^{2})/n={\sigma }_{k}^{2}/n$, where ${\sigma }_{k}^{2}$ is the prior variance of the effect size in group k on the z-score scale. The expected number of variables with nonzero effects in the group k is π_k ⋅ ∣M_k∣, where π_k is the prior inclusion probability and ∣M_k∣ is the group size. Putting this together, the PVE by the group k is given by

$${{{{\rm{PVE}}}}}_{k}={\sigma }_{k}^{2}\cdot {\pi }_{k}\cdot | \bf{M}_{k}| \cdot {n}^{-1}.$$

(9)

This equation is used to compute PVE from estimated parameters using both simulated and real data.
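As a worked example of Eq. (9), the Python sketch below evaluates the PVE of a gene group and a variant group. All parameter values are illustrative and are not estimates reported in this study.

```python
def group_pve(sigma2_k, pi_k, group_size, n):
    """Eq. (9): PVE_k = sigma_k^2 * pi_k * |M_k| / n (sigma_k^2 on the z-score scale)."""
    return sigma2_k * pi_k * group_size / n

# Illustrative values only
pve_gene = group_pve(sigma2_k=20.0, pi_k=0.01, group_size=10_000, n=100_000)
pve_variant = group_pve(sigma2_k=10.0, pi_k=2e-4, group_size=5_000_000, n=100_000)
share_gene = pve_gene / (pve_gene + pve_variant)
```

With these inputs the gene group explains 0.02 of the phenotypic variance and the variant group 0.1, so genes account for one-sixth of the total PVE of the two groups.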

Simulation procedure

In our simulations, we used the following data. (1) Genotype data. We randomly selected 80,000 samples from UK Biobank. We then kept only ‘White British’ samples and removed samples with missing information, mismatches between self-reported and genetic sex, or ‘outliers’ as defined by UK Biobank. We also removed any individuals who have close relatives in the cohort. This resulted in a cohort of n = 45,087 samples. We used SNPs from chromosome (chr) 1 to chr 22 and selected those with a minor allele frequency of >0.05 and at least 95% genotyping rate. After filtering, 6,228,664 SNPs remained and were used in our analysis. (2) Gene expression prediction models. We used the GTEx v7 Adipose tissue dataset, which contains 8,021 genes with expression models. We used the LASSO weights from the FUSION website (http://gusevlab.org/projects/fusion/).

We first impute gene expression for all samples using the prediction models. SNP genotypes are harmonized between the expression prediction model and UK Biobank genotypes so that the reference and alternate alleles match. SNPs in the FUSION prediction models but not in UK Biobank, about 13% of all, were not used in imputing gene expression. We then sample the causal genes and SNPs under given prior inclusion probabilities π_k’s and then sample their effect sizes accordingly using the prior variance parameter ${\sigma }_{k}^{2}$. We then simulate y under the model defined in Eq. (1). The prior parameters ${\pi }_{k},{\sigma }_{k}^{2}$ were chosen to reflect the genetic architecture in real data. In particular, it was estimated that gene expression mediates about 10–20% of trait heritability²⁵. In addition, studies using rare variants for complex traits suggested that about 5% of protein-coding genes are likely causal⁵³. Given these considerations, we set the prior probability for SNPs to 10⁻⁴ or 2.5 × 10⁻⁴, and PVE of SNPs to 0.3 or 0.5. For the genes, we set the prior probability to 0.015 or 0.05 and PVE of genes from 0.02 to 0.1.
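The sampling procedure can be sketched in Python as follows. The sketch makes two assumptions that are not stated in the text: the prior variance on the individual-level scale is obtained by inverting PVE_k = σ_k² ⋅ π_k ⋅ ∣M_k∣ for standardized variables, and the residual variance is set to one minus the total target PVE. The toy matrices stand in for standardized imputed expression and genotypes.

```python
import numpy as np

def simulate_trait(X, G, pi, pve, rng):
    """Simulate y from Eq. (1) with spike-and-slab effects.

    X, G: standardized imputed expression (n, J) and genotypes (n, M)
    pi:   (pi_gene, pi_variant), prior causal probabilities
    pve:  (pve_gene, pve_variant), target PVE of each group
    """
    y_genetic = np.zeros(X.shape[0])
    effects = []
    for data, pi_k, pve_k in zip((X, G), pi, pve):
        size = data.shape[1]
        sigma2_k = pve_k / (pi_k * size)  # invert PVE_k = sigma_k^2 * pi_k * |M_k|
        causal = rng.random(size) < pi_k
        b = np.where(causal, rng.normal(0.0, np.sqrt(sigma2_k), size), 0.0)
        y_genetic += data @ b
        effects.append(b)
    # Assumption of this sketch: residual variance is the part not targeted by the PVEs
    noise = rng.normal(0.0, np.sqrt(1.0 - sum(pve)), X.shape[0])
    return y_genetic + noise, effects

rng = np.random.default_rng(2)
n, J, M = 500, 50, 2000
X = rng.normal(size=(n, J))   # toy stand-in for imputed expression
G = rng.normal(size=(n, M))   # toy stand-in for standardized genotypes
y, (beta, theta) = simulate_trait(X, G, pi=(0.1, 0.01), pve=(0.1, 0.3), rng=rng)
```

In any single draw the realized PVE varies around the target because the number of causal variables and their effects are random.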

To test whether our method is robust to mis-specified priors for causal gene effects, we also simulated causal gene effects under a mixture of normal distributions. We used equal mixtures of four normal distributions, each with mean 0 and standard deviations with ratios of 1:2:4:8. That is, for gene j, the prior distribution of the causal effect size follows ${\beta }_{j}| {\gamma }_{j}=1 \sim {\sum }_{\omega \in \{1,2,4,8\}}{\pi }^{\prime }N(0,\omega {\sigma }^{\prime 2})$. The prior probability of being a causal gene is, therefore, $4{\pi }^{\prime }$ and the causal effect size variance is $15{\sigma }^{\prime 2}$. The prior probability of being a causal gene and the PVE of genes were set to values as described above.

To run cTWAS, we tested the association of each SNP with the trait y to obtain summary statistics of SNPs ${\hat{\bf{z}}}_{\rm{SNP}}$. We randomly selected 2,000 samples from the cohort to calculate the SNP genotype correlation matrix or LD matrix ${\hat{\bf{R}}}_{{{{\rm{SNP}}}}}$. We then ran the summary statistics version of cTWAS under each simulation setting with ${\hat{\bf{z}}}_{\rm{SNP}}$, ${\hat{\bf{R}}}_{{{{\rm{SNP}}}}}$ and expression prediction models as input. The software harmonizes SNP genotypes for ${\hat{\bf{z}}}_{\rm{SNP}}$, ${\hat{\bf{R}}}_{{{{\rm{SNP}}}}}$ and expression prediction models, so that the reference and alternate alleles match. To further reduce the computational burden in estimating parameters, we only used one in every ten SNPs (SNP thinning) in the EM algorithm. When calculating PIPs, we first ran SuSiE-RSS with L = 5 in each LD block with thinned SNPs. For each block with maximum gene PIP > 0.8, we reran SuSiE-RSS with L = 5 with the original SNPs to get the final gene PIPs.
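The thinning and two-pass logic can be sketched in Python. The text does not say how the retained SNPs were chosen, so the regular spacing used here is an assumption, and the block names and PIP values are hypothetical.

```python
import numpy as np

def thin_snps(n_snps, keep_every=10):
    """Indices of the SNPs kept when using one in every `keep_every` SNPs."""
    return np.arange(0, n_snps, keep_every)

def blocks_to_rerun(gene_pips_by_block, threshold=0.8):
    """Blocks whose maximum gene PIP (from the thinned run) exceeds the threshold."""
    return [
        block
        for block, pips in gene_pips_by_block.items()
        if len(pips) > 0 and max(pips) > threshold
    ]

kept = thin_snps(95)
rerun = blocks_to_rerun({"block1": [0.05, 0.92], "block2": [0.30], "block3": []})
```

Only the blocks returned by `blocks_to_rerun` would be fine-mapped again with the full SNP set.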

GWAS summary statistics

The LDL and SBP summary statistics were from the UK Biobank, computed by the Rapid GWAS project³¹ using Hail⁵⁴. These summary statistics were downloaded from the IEU OpenGWAS project⁵⁵ using GWAS IDs ‘ukb-d-30780_irnt’ (LDL) and ‘ukb-a-360’ (SBP). Both LDL and SBP summary statistics were based on the White British subpopulation of the UK Biobank, with sample sizes of n = 343,621 and n = 317,754, respectively. The IBD summary statistics were from the International IBD Genetics Consortium⁵⁶, computed by meta-analysis using METAL⁵⁷. These summary statistics were obtained from IEU OpenGWAS using GWAS ID ‘ebi-a-GCST004131’. IBD includes cases of both Crohn’s disease and ulcerative colitis. The IBD summary statistics were based on nonoverlapping samples of European ancestry with a combined sample size of n = 59,957. The SCZ summary statistics were from the Psychiatric Genetics Consortium and the CardiffCOGS study⁵⁸, computed by meta-analysis using METAL⁵⁷. These summary statistics were obtained from the authors via the link provided in the manuscript. The SCZ summary statistics were based on nonoverlapping samples of primarily European ancestry with a combined sample size of n = 105,318.

LD reference data

We computed the LD reference panel of common biallelic variants using the White British subpopulation of the UK Biobank. This panel is an in-sample reference for GWAS summary statistics from the Rapid GWAS project³¹. First, we used plate and well information from the genotyping to unambiguously identify over 99% (357,654 of 361,194) of the samples used in the Rapid GWAS project in our data. To ease computation, we randomly selected 10% of these samples to serve as the LD reference panel⁵⁹. We also limited our panel to common autosomal variants with MAF > 0.01 in the Rapid GWAS analyses. Then, we computed correlations between all pairs of variants within each of 1,700 approximately independent regions. These regions are assumed to have low LD between them and are based on previously identified regions⁵¹ that could be lifted over from hg37 to hg38 positions. The final LD reference panel consists of 1,700 correlation matrices and contains 9,309,375 variants. This LD reference panel was used when analyzing all traits, including those that were not measured in the White British subpopulation of the UK Biobank.

Harmonization of GWAS data and expression prediction models to LD reference. We restricted our analyses to variants that were non-missing in the GWAS summary statistics, expression prediction models, and LD reference panel. To ensure consistency between these three datasets, we performed two harmonization procedures. The objective of harmonization was to ensure that the reference and alternate alleles of each variant are defined consistently across all three datasets²⁰. In our case, we must harmonize both the GWAS z scores and the eQTL prediction models to our LD reference, and we use a different harmonization procedure for each. These procedures are based in part on previous work³³. To describe the two procedures, it is necessary to define several cases of inconsistencies that can occur in either dataset. The first case is a variant with its reference and alternate alleles ‘flipped’ with respect to the LD reference. The GWAS z scores or eQTL weights in the prediction model of the flipped variants should have their signs reversed to be consistent with the LD reference. The second case is a variant that has had its strand ‘switched’ with respect to the LD reference (for example, a variant that is G/A in the LD reference but C/T in the other dataset). In this case, the reference and alternate alleles are the same, just named using different strands. The z scores or weights of switched variants should not be changed, as they are already consistent with the LD reference. The third case is a variant that is ‘ambiguous’ as to whether it is flipped or switched. This occurs when the two alleles of a variant are also complementary base pairs (A ↔ T or G ↔ C substitutions). For example, consider a variant that is A/T in the LD reference and T/A in the other dataset. It is unclear whether this variant is flipped or switched with respect to the LD reference (both result in T/A), and it is therefore ambiguous whether the signs of the z scores or weights should be reversed. We say that variants are ‘unambiguous’ when they do not involve substitutions of complementary base pairs.
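The three cases can be made concrete with a small Python classifier for single-nucleotide variants. The labels ‘switched_and_flipped’ and ‘mismatch’ are additions of this sketch for completeness and are not categories defined in the text.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def classify_variant(ref_ld, alt_ld, ref_other, alt_other):
    """Compare a variant's alleles in another dataset with the LD reference.

    Returns 'ambiguous', 'match', 'flipped', 'switched',
    'switched_and_flipped' or 'mismatch'.
    """
    if COMPLEMENT[ref_ld] == alt_ld:
        return "ambiguous"               # A/T or C/G: strand cannot be inferred
    if (ref_other, alt_other) == (ref_ld, alt_ld):
        return "match"
    if (ref_other, alt_other) == (alt_ld, ref_ld):
        return "flipped"                 # reverse the sign
    comp = (COMPLEMENT[ref_other], COMPLEMENT[alt_other])
    if comp == (ref_ld, alt_ld):
        return "switched"                # keep the sign
    if comp == (alt_ld, ref_ld):
        return "switched_and_flipped"    # reverse the sign
    return "mismatch"                    # alleles are incompatible

example = classify_variant("G", "A", "C", "T")   # the G/A versus C/T case
```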

To harmonize the z scores from GWAS summary statistics, we first identified all inconsistencies in reference and alternate alleles between the z scores and the LD reference. Next, we resolved all unambiguous cases of flipped and switched alleles, reversing the sign of z scores that were flipped and taking no action for switched alleles. Then, we imputed the z scores for ambiguous variants using all unambiguous variants in each of the LD regions⁶⁰. If the sign of the imputed z score did not match the sign of the observed z score, we used the sign of the imputed z score, reversing the sign of the observed z score. Note that we did not perform the procedure to resolve ambiguous variants when analyzing LDL or SBP, as both the summary statistics and LD reference panel are derived from UK Biobank data.
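The sign rules for z scores can be summarized in a short Python sketch. The imputed z score is taken as a given input, because the imputation method itself⁶⁰ is not reproduced here, and the status labels are illustrative names.

```python
def harmonize_z(z_observed, status, z_imputed=None):
    """Return the GWAS z score expressed relative to the LD reference alleles.

    status:    'match', 'switched', 'flipped' or 'ambiguous'
    z_imputed: z score imputed from unambiguous variants (ambiguous case only)
    """
    if status in ("match", "switched"):
        return z_observed
    if status == "flipped":
        return -z_observed
    if status == "ambiguous":
        if z_imputed is None:
            return z_observed  # resolution step skipped (e.g. in-sample LD reference)
        # Keep the observed magnitude but take the sign of the imputed z score
        return -z_observed if z_imputed * z_observed < 0 else z_observed
    raise ValueError(f"unknown status: {status}")

z_fixed = harmonize_z(2.5, "ambiguous", z_imputed=-1.0)
```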

To harmonize the eQTL prediction models, we first identified all inconsistencies in reference and alternate alleles between the prediction models and the LD reference. Next, we resolved all unambiguous cases of flipped and switched alleles, reversing the sign of weights that were flipped and taking no action for switched alleles. These steps to resolve unambiguous variants are the same as in the z-score harmonization procedure. To resolve ambiguous variants, we leveraged correlations between ambiguous and unambiguous variants in both our LD reference panel and the LD panel used to construct the PredictDB models. PredictDB reports the covariance between pairs of variants within each gene prediction model. For gene prediction models that include both ambiguous variants and unambiguous variants, we computed the sum of correlations between each ambiguous variant and the unambiguous variants in the prediction model, using both our LD reference panel and the LD used for the prediction models. If the sign of the total correlation in the LD reference of the prediction models did not match the sign of the total correlation in our LD reference panel, we reversed the sign of the prediction model weights for the ambiguous variant. If the total correlation in the LD reference was equal to zero, then we set the weight of the ambiguous variant to zero, as these ambiguous variants did not have any unambiguous variants in the same LD region. For gene prediction models that include only a single ambiguous variant and no unambiguous variants, we left the sign of the prediction model weight unchanged; the resulting gene z score may have an incorrect sign, but the magnitude of the z score will be correct. We excluded gene prediction models with multiple ambiguous variants and no unambiguous variants, as their gene z scores could be incorrect in both sign and magnitude. Such exclusions were infrequent, affecting less than 1% of liver genes in the LDL analysis (94 of 11,502 genes with prediction models).
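The rule for one ambiguous variant in a model that also contains unambiguous variants can be sketched in Python. The sketch interprets ‘the total correlation in the LD reference’ as the total in our LD reference panel, which is an assumption, and it does not cover models without unambiguous variants.

```python
import numpy as np

def resolve_ambiguous_weight(weight, r_ours, r_model):
    """Resolve the sign of one ambiguous variant's weight in a gene model.

    r_ours:  correlations between the ambiguous variant and the model's
             unambiguous variants in our LD reference panel
    r_model: the same correlations in the LD used to build the prediction model
    """
    total_ours = float(np.sum(r_ours))
    total_model = float(np.sum(r_model))
    if total_ours == 0.0:
        return 0.0       # no usable LD information: drop the variant
    if np.sign(total_model) != np.sign(total_ours):
        return -weight   # the two panels disagree: reverse the sign
    return weight

w_fixed = resolve_ambiguous_weight(0.4, r_ours=[0.3, 0.2], r_model=[-0.25, -0.3])
```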

Performing cTWAS analysis in real data

We used the following cTWAS settings when analyzing real data. For parameter estimation, we used the default procedure for selecting the starting values of the EM algorithm. We then performed 30 iterations of the EM algorithm assuming L = 1 effect (at most a single causal effect) in each region, using variants that were thinned to 10% to reduce computation. For computing PIPs of genes and variants, we used thinned variants and assumed L = 5 (at most five causal effects) in each region. For regions with maximum gene PIP > 0.8, we recomputed PIPs using all variants, with L = 5. For this final step, we allowed a maximum of 20,000 variants in a region to reduce computation; if the maximum number of variants was exceeded, we randomly selected 20,000 variants to include. Unless specified otherwise, we used the threshold PIP > 0.8 for declaring significant genes.

Evaluating methods in distinguishing silver standard and bystander genes for LDL

Following previous studies³⁵, we assessed the performance of TWAS and cTWAS on real data by comparing their ability to distinguish LDL silver standard genes from other nearby genes. We defined a set of ‘bystander’ genes that were within 1 Mb of a silver standard gene. These bystander genes were considered the negative set. We limited our analysis to 46 of 69 silver standard genes with imputed expression after harmonization, and the 539 imputed bystander genes that are near these genes. Next, we determined if these silver standard and bystander genes were significant by TWAS (Bonferroni) or cTWAS (PIP > 0.8). Then, we computed the precision of each method as follows: (number of detected silver standard genes)/(number of detected silver standard genes + number of detected bystander genes).
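The precision calculation can be written directly from this definition. The gene labels in the Python sketch below are hypothetical placeholders.

```python
def precision(detected, silver, bystander):
    """Precision = detected silver / (detected silver + detected bystander)."""
    detected = set(detected)
    n_silver = len(detected & set(silver))
    n_bystander = len(detected & set(bystander))
    total = n_silver + n_bystander
    return n_silver / total if total else float("nan")

# Hypothetical gene labels, for illustration only
silver = {"S1", "S2", "S3"}
bystander = {"B1", "B2", "B3", "B4"}
p = precision({"S1", "S2", "B3", "other"}, silver, bystander)
```

Genes outside both sets, such as `other` here, do not enter the calculation.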

Classifying TWAS false-positive genes for LDL by source of confounding

To better understand how TWAS generated false-positive findings, we classified TWAS false positives as primarily due to confounding by variants or confounding by genes. We defined TWAS false positives as genes that were significant by TWAS (Bonferroni) but had PIP < 0.5 by cTWAS. To categorize these false-positive genes, we first assigned them to credible sets. These credible sets were reported by cTWAS, using the default SuSiE setting, which means that only credible sets with sufficient ‘purity’ are reported (that is, all variables in a credible set are highly correlated, r > 0.5). If a false-positive gene was not included in any credible set but was highly correlated (r > 0.5) with at least one variant or gene in a credible set, that false-positive gene was also assigned to the credible set. After assigning a total of 83 false-positive genes to credible sets, for each assigned gene, we summed the PIPs of all other genes and variants in its credible set to obtain total PIPs for confounding genes and variants. If the total gene PIP was higher than that of the variants, we classified the gene as confounded by genes; otherwise, as confounded by variants.
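The final classification step can be sketched in Python, assuming the assignment of the gene to a credible set has already been made. The variable names and PIP values are hypothetical.

```python
def classify_false_positive(gene, credible_set):
    """Label a TWAS false-positive gene by the dominant source of confounding.

    credible_set: list of (name, kind, pip) for every variable in the credible
                  set the gene was assigned to; kind is 'gene' or 'variant'
    """
    gene_pip = sum(
        pip for name, kind, pip in credible_set if kind == "gene" and name != gene
    )
    variant_pip = sum(pip for name, kind, pip in credible_set if kind == "variant")
    if gene_pip > variant_pip:
        return "confounded by genes"
    return "confounded by variants"

cs = [
    ("GENE1", "gene", 0.02),
    ("GENE2", "gene", 0.55),
    ("rs1", "variant", 0.20),
    ("rs2", "variant", 0.15),
]
label = classify_false_positive("GENE1", cs)
```

The PIP of the false-positive gene itself is excluded, so only the other genes and the variants in the credible set compete.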

Summarizing cTWAS results using tissue groups

To aid the interpretation of cTWAS findings, we grouped related tissues into ‘tissue groups’ and summarized the findings within these groups. We used previously defined tissue groups that assigned 37 of 49 tissues to one of 7 tissue groups²⁵. We then took the union of genes detected at PIP > 0.8 in any tissue within each tissue group, and we used these combined lists of detected genes for downstream analyses.
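Taking the union of detected genes within each tissue group can be sketched as follows in Python. The tissue and group names are placeholders and do not correspond to the published grouping²⁵.

```python
def genes_by_tissue_group(pips, tissue_to_group, threshold=0.8):
    """Union of genes with PIP > threshold across the tissues of each group.

    pips:            dict tissue -> dict gene -> PIP
    tissue_to_group: dict tissue -> tissue group (ungrouped tissues are absent)
    """
    detected = {}
    for tissue, gene_pips in pips.items():
        group = tissue_to_group.get(tissue)
        if group is None:
            continue  # tissue not assigned to any group
        hits = {g for g, pip in gene_pips.items() if pip > threshold}
        detected.setdefault(group, set()).update(hits)
    return detected

pips = {
    "tissue1": {"G1": 0.95, "G2": 0.40},
    "tissue2": {"G2": 0.85},
    "tissue3": {"G3": 0.90, "G1": 0.10},
    "tissue4": {"G4": 0.99},   # not assigned to a group
}
tissue_to_group = {"tissue1": "groupA", "tissue2": "groupB", "tissue3": "groupB"}
detected = genes_by_tissue_group(pips, tissue_to_group)
```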

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

Genotype data from UK Biobank are available through the UK Biobank data access process (http://www.ukbiobank.ac.uk/register-apply/). Gene prediction models for the GTEx v7 Adipose tissue dataset are available from the FUSION website (http://gusevlab.org/projects/fusion/). Publicly available summary statistics for LDL, SBP and IBD were obtained from the IEU OpenGWAS project (https://gwas.mrcieu.ac.uk/) using GWAS IDs ‘ukb-d-30780_irnt’ (LDL), ‘ukb-a-360’ (SBP) and ‘ebi-a-GCST004131’ (IBD). Publicly available summary statistics for SCZ from the Psychiatric Genetics Consortium and the CardiffCOGS study were obtained from http://walters.psycm.cf.ac.uk/. Publicly available prediction models for 49 GTEx tissues were obtained from PredictDB (https://predictdb.org/post/2021/07/21/gtex-v8-models-on-eqtl-and-sqtl/).

Code availability

Our software is available at https://xinhe-lab.github.io/ctwas/. Code related to analyses performed in this study can be accessed at https://github.com/xinhe-lab/ctwas-paper and https://zenodo.org/doi/10.5281/zenodo.10373122 (ref. ⁶¹).

References

Visscher, P. 10 years of GWAS discovery: biology, function, and translation. Am. J. Hum. Genet. 101, 5–22 (2017).
Article CAS PubMed PubMed Central Google Scholar
Uffelmann, E. et al. Genome-wide association studies. Nat. Rev. Methods Prim. 1, 59 (2021).
Article CAS Google Scholar
Tam, V. Benefits and limitations of genome-wide association studies. Nat. Rev. Genet. 20, 467–484 (2019).
Article CAS PubMed Google Scholar
Gamazon, E. A gene-based association method for mapping traits using reference transcriptome data. Nat. Genet. 47, 1091–1098 (2015).
Article CAS PubMed PubMed Central Google Scholar
Gusev, A. Integrative approaches for large-scale transcriptome-wide association studies. Nat. Genet. 48, 245–252 (2016).
Article CAS PubMed PubMed Central Google Scholar
Wainberg, M. Opportunities and challenges for transcriptome-wide association studies. Nat. Genet. 51, 592–599 (2019).
Article CAS PubMed PubMed Central Google Scholar
Baca, S. C. et al. Genetic determinants of chromatin reveal prostate cancer risk mediated by context-dependent gene regulation. Nat. Genet. 54, 1364–1375 (2022).
Article CAS PubMed PubMed Central Google Scholar
Hormozdiari, F. Colocalization of GWAS and eQTL signals detects target genes. Am. J. Hum. Genet. 99, 1245–1260 (2016).
Article CAS PubMed PubMed Central Google Scholar
Wen, X., Lee, Y., Luca, F. & Pique-Regi, R. Efficient integrative multi-SNP association analysis via deterministic approximation of posteriors. Am. J. Hum. Genet. 98, 1114–1129 (2016).
Article CAS PubMed PubMed Central Google Scholar
Hukku, A., Sampson, M., Luca, F., Pique-Regi, R. & Wen, X. Analyzing and reconciling colocalization and transcriptome-wide association studies from the perspective of inferential reproducibility. Am. J. Hum. Genet. 109, 825–837 (2022).
Article CAS PubMed PubMed Central Google Scholar
Zhu, Z. Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nat. Genet. 48, 481–487 (2016).
Article CAS PubMed Google Scholar
Porcu, E. Mendelian randomization integrating GWAS and eQTL data reveals genetic determinants of complex and clinical traits. Nat. Commun. 10, 3300 (2019).
Article PubMed PubMed Central ADS Google Scholar
Zhou, D. A unified framework for joint-tissue transcriptome-wide association and Mendelian randomization analysis. Nat. Genet. 52, 1239–1246 (2020).
Article CAS PubMed PubMed Central Google Scholar
Mancuso, N. Probabilistic fine-mapping of transcriptome-wide association studies. Nat. Genet. 51, 675–682 (2019).
Article CAS PubMed PubMed Central Google Scholar
Liu, S. Illuminating links between cis-regulators and trans-acting variants in the human prefrontal cortex. Genome Med. 14, 133 (2022).
Article PubMed PubMed Central Google Scholar
Fishilevich, S. et al. GeneHancer: genome-wide integration of enhancers and target genes in GeneCards. Database (Oxford) bax028 (2017).
Urbut, S., Wang, G., Carbonetto, P. & Stephens, M. Flexible statistical methods for estimating and testing effects in genomic studies with multiple conditions. Nat. Genet. 51, 187–195 (2019).
Article CAS PubMed Google Scholar
Liu, B., Gloudemans, M., Rao, A., Ingelsson, E. & Montgomery, S. Abundant associations with gene expression complicate GWAS follow-up. Nat. Genet. 51, 768–769 (2019).
Article CAS PubMed PubMed Central Google Scholar
Wang, G., Sarkar, A., Carbonetto, P. & Stephens, M. A simple new approach to variable selection in regression, with application to genetic fine mapping. J. R. Stat. Soc. Ser. B Stat. Methodol. 82, 1273–1300 (2020).
Article MathSciNet Google Scholar
Zou, Y., Carbonetto, P., Wang, G. & Stephens, M. Fine-mapping from summary data with the ‘sum of single effects’ model. PLoS Genet. 18, e1010299 (2022).
Article CAS PubMed PubMed Central Google Scholar
Guan, Y. & Stephens, M. Bayesian variable selection regression for genome-wide association studies and other large-scale problems. Ann. Appl. Stat. 5, 1780–1815 (2011).
Article MathSciNet Google Scholar
Carbonetto, P. & Stephens, M. Scalable variational inference for Bayesian variable selection in regression, and its accuracy in genetic association studies. Bayesian Anal. 7, 73–108 (2012).
Article MathSciNet Google Scholar
Wright, S. Correlation and causation. J. Agric. Res. 20, 557–585 (1921).
Google Scholar
Yuan, Z. Testing and controlling for horizontal pleiotropy with probabilistic Mendelian randomization in transcriptome-wide association studies. Nat. Commun. 11, 3861 (2020).
Article PubMed PubMed Central ADS Google Scholar
Yao, D., O’Connor, L., Price, A. & Gusev, A. Quantifying genetic effects on disease mediated by assayed gene expression levels. Nat. Genet. 52, 626–633 (2020).
Zhou, X., Im, H. & Lee, S. CORE GREML for estimating covariance between random effects in linear mixed models for complex trait analyses. Nat. Commun. 11, 4208 (2020).
Sudlow, C. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015).
Giambartolomei, C. Bayesian test for colocalization between pairs of genetic association studies using summary statistics. PLoS Genet. 10, e1004383 (2014).
Zhu, A. et al. MRLocus: identifying causal genes mediating a trait through Bayesian estimation of allelic heterogeneity. PLoS Genet. 17, e1009455 (2021).
McManus, J. N., Lovelett, R. J., Lowengrub, D. & Christensen, S. A unifying statistical framework to discover disease genes from GWASs. Cell Genom. 3, 100264 (2023).
Rapid GWAS Project. www.nealelab.is/uk-biobank (2018).
GTEx Consortium. The Genotype-Tissue Expression (GTEx) project. Nat. Genet. 45, 580–585 (2013).
Barbeira, A. Exploring the phenotypic consequences of tissue specific gene expression variation inferred from GWAS summary statistics. Nat. Commun. 9, 1825 (2018).
UKBB Heritability. Heritability of >4,000 traits & disorders in UK Biobank. nealelab.github.io/UKBB_ldsc/index.html (2022).
Ndungu, A., Payne, A., Torres, J., Van De Bunt, M. & McCarthy, M. A multi-tissue transcriptome analysis of human metabolites guides interpretability of associations based on multi-SNP models for gene expression. Am. J. Hum. Genet. 106, 188–201 (2020).
Forgetta, V. An effector index to predict target genes at GWAS loci. Hum. Genet. 141, 1431–1447 (2022).
Guthrie, P. Complexity of a complex trait locus: HP, HPR, haemoglobin and cholesterol. Gene 499, 8–13 (2012).
Weissbrod, O. Functionally informed fine-mapping and polygenic localization of complex trait heritability. Nat. Genet. 52, 1355–1363 (2020).
Stancu, C. & Sima, A. Statins: mechanism of action and effects. J. Cell. Mol. Med. 5, 378–387 (2001).
Yogosawa, S., Mizutani, S., Ogawa, Y. & Izumi, T. Activin receptor-like kinase 7 suppresses lipolysis to accumulate fat in obesity through downregulation of peroxisome proliferator-activated receptor γ and C/EBPα. Diabetes 62, 115–123 (2013).
Hashimoto, O. et al. Activin E controls energy homeostasis in both brown and white adipose tissues as a hepatokine. Cell Rep. 25, 1193–1203 (2018).
Barbeira, A. Exploiting the GTEx resources to decipher the mechanisms at GWAS loci. Genome Biol. 22, 49 (2021).
Barbeira, A. Integrating predicted transcriptome from multiple tissues improves association detection. PLoS Genet. 15, e1007889 (2019).
Nasser, J. Genome-wide enhancer maps link risk variants to disease genes. Nature 593, 238–243 (2021).
Liao, Y., Wang, J., Jaehnig, E., Shi, Z. & Zhang, B. WebGestalt 2019: gene set analysis toolkit with revamped UIs and APIs. Nucleic Acids Res. 47, W199–W205 (2019).
Wang, S. et al. UBE2W improves the experimental colitis by inhibiting the NF-κB signaling pathway. Dig. Dis. Sci. 67, 5529–5539 (2022).
Wen, X., Pique-Regi, R. & Luca, F. Integrating molecular QTL data into genome-wide genetic association analysis: probabilistic assessment of enrichment and colocalization. PLoS Genet. 13, e1006646 (2017).
Wallace, C. A more accurate method for colocalization analysis allowing for multiple causal variants. PLoS Genet. 17, e1009440 (2021).
Xiang, R. et al. Gene expression and RNA splicing explain large proportions of the heritability for complex traits in cattle. Cell Genom. 3, 100385 (2023).
Shi, X., Yang, C. & Liu, J. Using collaborative mixed models to account for imputation uncertainty in transcriptome-wide association studies. Methods Mol. Biol. 2212, 93–103 (2021).
Berisa, T. & Pickrell, J. Approximately independent linkage disequilibrium blocks in human populations. Bioinformatics 32, 283–285 (2016).
Zhu, X. & Stephens, M. Bayesian large-scale multiple regression with summary statistics from genome-wide association studies. Ann. Appl. Stat. 11, 1561 (2017).
Satterstrom, F. K. et al. Large-scale exome sequencing study implicates both developmental and functional changes in the neurobiology of autism. Cell 180, 568–584 (2020).
King, D. Hail-is/hail. GitHub. github.com/hail-is/hail/releases/tag/0.2.13 (2019).
Elsworth, B. et al. The MRC IEU OpenGWAS data infrastructure. Preprint at bioRxiv https://zenodo.org/doi/10.1101/2020.08.10.244293v1 (2020).
De Lange, K. Genome-wide association study implicates immune activation of multiple integrin genes in inflammatory bowel disease. Nat. Genet. 49, 256–261 (2017).
Willer, C., Li, Y. & Abecasis, G. METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics 26, 2190–2191 (2010).
Pardiñas, A. Common schizophrenia alleles are enriched in mutation-intolerant genes and in regions under strong background selection. Nat. Genet. 50, 381–389 (2018).
Benner, C. Prospects of fine-mapping trait-associated genomic regions by using summary statistics from genome-wide association studies. Am. J. Hum. Genet. 101, 539–551 (2017).
Pasaniuc, B. Fast and accurate imputation of summary statistics enhances evidence of functional enrichment. Bioinformatics 30, 2906–2914 (2014).
Zhao, S. & Crouse, W. Code and scripts used in cTWAS analysis. Zenodo https://zenodo.org/doi/10.5281/zenodo.10373123 (2023).


Acknowledgements

This work was supported by the National Institutes of Health (NIH) under grants R01MH110531 (to X.H.), R01HG010773 (to X.H.) and R01HG002585 (to M.S.), and by a research grant from the Center for Quantitative Biology at the Geisel School of Medicine at Dartmouth, funded through NIH grant P20GM130454 (to S.Z.). We thank H. Im (University of Chicago), X. Liu (University of Chicago) and other members of the He and Stephens groups for helpful comments on the work and the manuscript.

Author information

These authors contributed equally: Siming Zhao, Wesley Crouse.

Authors and Affiliations

Department of Biomedical Data Science, Dartmouth College, Hanover, NH, USA
Siming Zhao
Department of Human Genetics, University of Chicago, Chicago, IL, USA
Siming Zhao, Wesley Crouse, Sheng Qian, Kaixuan Luo, Matthew Stephens & Xin He
Dartmouth Cancer Center, Lebanon, NH, USA
Siming Zhao
Department of Statistics, University of Chicago, Chicago, IL, USA
Matthew Stephens

Authors

Siming Zhao
Wesley Crouse
Sheng Qian
Kaixuan Luo
Matthew Stephens
Xin He

Contributions

X.H. conceived the idea and supervised the project. S.Z., X.H. and M.S. developed the method and algorithm. S.Z. and W.C. implemented the software and performed the analyses. K.L. and S.Q. tested the software, performed the analyses and verified the reported results. S.Z., W.C., M.S. and X.H. wrote the manuscript.

Corresponding authors

Correspondence to Siming Zhao, Matthew Stephens or Xin He.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Genetics thanks Seunggeun Lee, Ruidong Xiang, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Note and Supplementary Figs. 1–19.

Reporting Summary

Peer Review File

Supplementary Tables

Supplementary Tables 1–12.

Supplementary Data

cTWAS results of all analyzed genes across all tissues for IBD, SCZ, and SBP. Results are stored separately for each combination of traits and tissues. Column legends for each set of results are the same as in Supplementary Table 1.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Zhao, S., Crouse, W., Qian, S. et al. Adjusting for genetic confounders in transcriptome-wide association studies improves discovery of risk genes of complex traits. Nat Genet 56, 336–347 (2024). https://doi.org/10.1038/s41588-023-01648-9


Received: 20 December 2022
Accepted: 14 December 2023
Published: 26 January 2024
Issue Date: February 2024
DOI: https://doi.org/10.1038/s41588-023-01648-9