Introduction

Transcriptome-wide association study (TWAS) is a valuable analysis strategy for identifying genes that influence complex traits and diseases through genetic regulation of gene expression1,2,3,4,5. Researchers have successfully deployed TWAS analyses to identify risk genes for complex human diseases, including Alzheimer’s disease6,7,8, breast cancer9,10,11, ovarian cancer12,13, and cardiovascular disease14,15. A typical TWAS consists of two separate stages. In Stage I, TWAS acquires individual-level genetic and expression data from relevant tissues available in a reference dataset like the Genotype-Tissue Expression (GTEx) project16,17 or the North American Brain Expression Consortium18, and trains multivariable regression models on the reference data, treating gene expression as the outcome and SNP genotype data (typically cis-SNPs near the test gene) as predictors, to determine genetically regulated expression (GReX). The SNP effect sizes estimated by these Stage I GReX regression models are, in a broad sense, effect sizes of expression quantitative trait loci (eQTLs). Stage II of TWAS then uses these trained eQTL effect sizes to impute GReX within an independent GWAS of a complex human disease or trait. One can then test for association between the imputed GReX and phenotype, which is equivalent to a gene-based association test taking these eQTL effect sizes as the corresponding test SNP weights19,20,21.

For Stage I of TWAS, a variety of training tools exist for fitting GReX regression models using reference expression and genetic data, including PrediXcan19, FUSION20, and TIGAR22. While these methods all employ different techniques for model fitting, they all require individual-level reference expression and genetic data to estimate eQTL effect sizes for TWAS. Therefore, these methods cannot be applied to emerging reference summary-level eQTL results such as those generated by the eQTLGen23 and CommonMind24 consortia, which provide eQTL effect sizes and p values relating individual SNPs to gene expression. The development of TWAS methods that can utilize such summary-level reference data is valuable to permit the applicability of the technique to broader analysis settings. Moreover, as TWAS power increases with increasing reference sample size25, TWAS using summary-level reference datasets can lead to enhanced performance compared to using individual-level reference datasets since the sample sizes of the former often are considerably larger than the latter. For example, the sample size of the summary-based eQTLGen reference sample is 31,684 for blood, whereas the sample size of the individual-level GTEx V6 reference is only 338 for the same tissue. Consequently, TWAS analysis leveraging the summary-based eQTLGen dataset as a reference can likely provide insights into the genetic regulation of complex human traits.

In this work, we propose a framework that can use summary-level reference data to train GReX regression models required for Stage I of TWAS analysis. Our method is motivated by a variety of published polygenic risk score (PRS) methods26,27,28,29,30,31 that can predict phenotype in a test dataset using summary-level SNP effect size estimates and p values based on single SNP tests from an independent reference GWAS. We can adapt these PRS methods for TWAS since eQTL effect sizes are essentially SNP effect sizes resulting from a reference “GWAS” of gene expression. Thus, our predicted GReX in Stage II of TWAS is analogous to the PRS constructed based on training GWAS summary statistics of single SNP-trait association. Here, we adapt four representative summary-data-based PRS methods for TWAS analysis: p value thresholding with linkage disequilibrium (LD) clumping (P+T)26, the frequentist LASSO32 regression-based method lassosum27, the nonparametric Bayesian Dirichlet Process Regression (DPR) model-based33 method SDPR29, and PRS-CS28, a Bayesian multivariable regression method with continuous shrinkage (CS) priors. We apply each of these PRS methods to first train eQTL effect sizes based on a multivariable regression model from summary-level reference eQTL data (Stage I), and subsequently use these eQTL effect sizes (i.e., eQTL weights) to impute GReX and then test GReX-trait association in an independent test GWAS (Stage II).

As we will show, the PRS method with optimal performance for TWAS depends on the underlying genetic architecture for gene expression. Since the genetic architecture of expression is unknown a priori, we maximize the performance of TWAS over different possible architectures by proposing a TWAS framework called OTTERS (Omnibus Transcriptome Test using Expression Reference Summary data). OTTERS first constructs individual TWAS tests and p values using eQTL weights trained by each of the PRS techniques outlined above, and then calculates an omnibus test p value using the aggregated Cauchy association test34 (ACAT-O) with all individual TWAS p values (Fig. 1). OTTERS is applicable to both summary-level and individual-level test GWAS data within Stage II TWAS analysis.

Fig. 1: OTTERS framework.

OTTERS estimates cis-eQTL weights from eQTL summary data and reference LD panel using four imputation models (Stage I), and conducts ACAT-O test to combine gene-based association test p values from individual methods with individual/summary-level test GWAS data (Stage II).

In subsequent sections, we first describe how to use the PRS methods on summary-level reference eQTL data in Stage I TWAS, and then describe how we can use the resulting eQTL weights to perform Stage II TWAS using OTTERS. We then evaluate the performance of individual PRS methods and OTTERS using simulated expression and real genetic data based on patterns observed in real datasets. Interestingly, when we assume individual-level reference data are available, we observe that OTTERS outperforms the popular FUSION20 approach across all simulation settings considered. Many of the individual PRS methods also outperform FUSION in these settings. We then apply OTTERS to blood eQTL summary-level data (n = 31,684) from the eQTLGen consortium23 and GWAS summary data of cardiovascular disease from the UK Biobank (UKBB)35. By comparing OTTERS results to those of FUSION20 using individual-level GTEx reference data of whole blood tissue, we demonstrate that OTTERS using large summary-level reference datasets and multiple gene expression imputation models can successfully reveal potential risk genes missed by FUSION based on smaller individual-level reference datasets and only one model. Finally, we conclude with a discussion.

Results

Method overview

For the standard two-stage TWAS approach, Stage I estimates a GReX imputation model using individual-level expression and genotype data available from a reference dataset, and then Stage II uses the eQTL effect sizes from Stage I to impute gene expression (GReX) in an independent GWAS and test for association between GReX and phenotype. GReX for test samples can be imputed from individual-level genotype data and eQTL effect size estimates. When individual-level GWAS data are not available, one can instead use summary-level GWAS data for TWAS by applying the TWAS Z-score statistics proposed by FUSION20 and S-PrediXcan36 (see details in Methods).
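As a rough illustration of the summary-level test mentioned above, the sketch below computes a gene-level Z-score in the weighted-burden form commonly attributed to FUSION, \(Z_{TWAS} = \mathbf{w}^{T}\mathbf{Z} / \sqrt{\mathbf{w}^{T}\mathbf{\Sigma}\mathbf{w}}\), where \(\mathbf{w}\) are eQTL weights, \(\mathbf{Z}\) are single-SNP GWAS Z-scores, and \(\mathbf{\Sigma}\) is a reference LD matrix. The function name, argument names, and toy inputs are hypothetical; the exact statistics used in this work are those given in Methods.

```python
import math
import numpy as np

def twas_z_from_summary(weights, gwas_z, ld):
    """Gene-level TWAS Z-score and two-sided p value from summary data.

    weights : eQTL weights for standardized genotypes (length m)
    gwas_z  : single-SNP GWAS Z-scores for the same m SNPs
    ld      : m x m SNP correlation (LD) matrix from a reference panel
    """
    w = np.asarray(weights, dtype=float)
    z = np.asarray(gwas_z, dtype=float)
    sigma = np.asarray(ld, dtype=float)
    grex_var = float(w @ sigma @ w)  # variance of the imputed GReX
    if grex_var <= 0:
        raise ValueError("all-zero weights or invalid LD matrix")
    z_twas = float(w @ z) / math.sqrt(grex_var)
    # Two-sided normal p value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z_twas) / math.sqrt(2.0))
    return z_twas, p_value

# Toy input (assumed): two uncorrelated SNPs, only the first has a weight.
z_twas, p_value = twas_z_from_summary([1.0, 0.0], [2.0, 0.0], np.eye(2))
```

With a single weighted SNP the gene-level statistic reduces to that SNP's GWAS Z-score, which is a convenient sanity check.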

Since eQTL summary data are analogous to GWAS summary data where gene expression represents the phenotype, we can follow the idea from PRS methods to estimate the eQTL effect sizes based on a multivariable regression model using only marginal least squares effect estimates and p values (based on a single variant test) from the eQTL summary data as well as a reference LD panel from samples of the same ancestry26,27,28,29. Although PRS methods in general are applicable to TWAS Stage I, we consider only four representative methods: P+T26, the frequentist lassosum27, the nonparametric Bayesian SDPR29, and the Bayesian PRS-CS28 (see details in Methods).
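Of the four methods, P+T is the simplest to illustrate. The sketch below is a minimal greedy version, assuming marginal eQTL effect estimates, their p values, and a reference LD correlation matrix for the cis-SNPs of one gene. The function name and the clumping threshold `r2_max` are hypothetical choices for illustration, not settings taken from this work, and real clumping tools also restrict comparisons to a physical window.

```python
import numpy as np

def p_plus_t_weights(beta, pvals, ld, p_threshold=0.05, r2_max=0.5):
    """Greedy LD clumping followed by p value thresholding.

    SNPs are visited from most to least significant. A SNP is kept if its
    p value is below p_threshold and its squared correlation with every
    SNP already kept is below r2_max. Kept SNPs retain their marginal
    effect estimate as the eQTL weight; all other weights are zero.
    """
    beta = np.asarray(beta, dtype=float)
    pvals = np.asarray(pvals, dtype=float)
    r2 = np.asarray(ld, dtype=float) ** 2
    kept = []
    for j in np.argsort(pvals):
        if pvals[j] >= p_threshold:
            break  # remaining SNPs are even less significant
        if all(r2[j, k] < r2_max for k in kept):
            kept.append(int(j))
    weights = np.zeros_like(beta)
    idx = np.array(kept, dtype=int)
    weights[idx] = beta[idx]
    return weights

# Toy input (assumed): SNPs 0 and 1 are in strong LD, SNPs 2 and 3 are independent.
ld = np.eye(4)
ld[0, 1] = ld[1, 0] = 0.95
w = p_plus_t_weights([0.30, 0.29, 0.10, 0.05], [1e-6, 1e-5, 0.01, 0.2], ld)
```

In this toy case SNP 1 is clumped away by SNP 0 and SNP 3 fails the p value threshold, so only SNPs 0 and 2 receive nonzero weights.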

As shown in Fig. 1, OTTERS first trains GReX imputation models per gene g using the P+T, lassosum, SDPR, and PRS-CS methods, each of which infers cis-eQTL weights using cis-eQTL summary data and an external LD reference panel of the same ancestry (Stage I). Once we derive cis-eQTL weights for each training method, we can impute the respective GReX using that method and perform the respective gene-based association analysis in the test GWAS dataset. We thus derive a set of TWAS p values for gene g, one per training method. We then use these TWAS p values to create an omnibus test using the ACAT-O34 approach that employs a Cauchy distribution for inference (see details in Supplementary Methods). We refer to the p value derived from the ACAT-O test as the OTTERS p value. The ACAT-O34 approach has been widely used in hypothesis testing to combine multiple testing methods for the same hypothesis37,38,39, and has been shown to be an effective way to leverage different test methods to increase power while still controlling type I error. Adding TWAS p values based on additional PRS methods to the ACAT-O test could further improve power at the cost of additional computation.
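The omnibus step can be sketched with the standard Cauchy combination rule: each p value is mapped to a Cauchy variate, the variates are averaged, and the average is mapped back to a p value. The sketch assumes equal weights across methods and p values strictly between 0 and 1; the function name is hypothetical, and the exact form used by OTTERS is the one given in Supplementary Methods.

```python
import math

def acat(pvalues, weights=None):
    """Combine p values with the Cauchy combination (ACAT) rule."""
    p = list(pvalues)
    if not p or any(not (0.0 < pj < 1.0) for pj in p):
        raise ValueError("p values must lie strictly between 0 and 1")
    w = [1.0] * len(p) if weights is None else list(weights)
    total = sum(w)
    # Weighted average of tan((0.5 - p) * pi), a standard Cauchy variate under the null
    t = sum(wj * math.tan((0.5 - pj) * math.pi) for wj, pj in zip(w, p)) / total
    # Upper-tail probability of a standard Cauchy at t
    return 0.5 - math.atan(t) / math.pi

# Toy input (assumed): one method is highly significant, three are not.
p_omnibus = acat([1e-8, 0.5, 0.5, 0.5])
```

In this toy case the combined p value is roughly four times the smallest input, which reflects how the rule is dominated by the most significant method while paying a modest price for the number of methods combined.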

Simulation study

We used real genotype data from 1894 whole-genome sequencing (WGS) samples from the Religious Orders Study and Rush Memory and Aging Project (ROS/MAP) cohort40,41 and Mount Sinai Brain Bank (MSBB) study42 for simulation. We divided 14,772 genes into five groups according to gene length, and randomly selected 100 genes from each group (500 genes in total). We randomly split samples into 568 training (30%) and 1326 testing samples (70%) to mimic a relatively small sample size in the real reference panel for training gene expression imputation models. From the real genotype data, we simulated six scenarios with two different proportions of causal cis-eQTL, \(p_{causal} = (0.001, 0.01)\), as well as three different proportions of gene expression variance explained by causal eQTL, \(h_e^2 = (0.01, 0.05, 0.1)\).

We generated gene expression of gene \(g\) (\(\mathbf{e}_g\)) using the multivariable regression model \(\mathbf{e}_g = \mathbf{X}_g \mathbf{w} + \boldsymbol{\epsilon}_g\), where \(\mathbf{X}_g\) represents the standardized genotype matrix of the randomly selected causal eQTL of gene \(g\), and \(\boldsymbol{\epsilon}_g \sim N(0, (1 - h_e^2)\mathbf{I})\). We generated the eQTL effect sizes \(\mathbf{w}\) from \(N(0, 1)\) and then re-scaled these effects to ensure that the expression variance explained by all causal variants is \(h_e^2\). We generated 10 replicates of gene expression per scenario. For each simulated gene expression, we then generated 10 sets of GWAS Z-scores to perform a total of 50,000 TWAS simulations. We generated the GWAS Z-scores from a multivariate normal distribution with \(\mathbf{Z} \sim MVN\left(\mathbf{\Sigma}_g \mathbf{w} \sqrt{n_{gwas} h_p^2}, \mathbf{\Sigma}_g\right)\)38, where \(\mathbf{w}\) is the true causal eQTL effect sizes, \(\mathbf{\Sigma}_g\) is the correlation matrix of the standardized genotype \(\mathbf{X}_g\) from test samples, \(n_{gwas}\) is the assumed GWAS sample size, and \(h_p^2\) denotes the amount of phenotypic variance explained by simulated \(\mathbf{GReX} = \mathbf{X}_g \mathbf{w}\) (see Methods). We set \(h_p^2 = 0.025\). To calibrate power, we considered \(n_{gwas}\) = (200K, 300K, 400K, 500K) for scenarios with \(h_e^2 = 0.01\), \(n_{gwas}\) = (25K, 50K, 75K, 100K) for scenarios with \(h_e^2 = 0.05\), and \(n_{gwas}\) = (10K, 20K, 30K, 40K) for scenarios with \(h_e^2 = 0.1\).
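The generative steps above can be sketched in a few lines of NumPy. This is a toy version under stated assumptions: genotypes are drawn at random rather than taken from WGS data, the effect-size vector is padded with zeros for non-causal SNPs, the Z-score mean follows the formula as written with whatever scaling of \(\mathbf{w}\) the caller supplies, and the function names and toy dimensions are hypothetical.

```python
import numpy as np

def simulate_expression(X, p_causal, h2_e, rng):
    """Simulate expression e = X w + eps with Var(X w) rescaled to h2_e."""
    n, m = X.shape
    n_causal = max(1, int(round(p_causal * m)))
    causal = rng.choice(m, size=n_causal, replace=False)
    w = np.zeros(m)                      # zero weight for non-causal SNPs
    w[causal] = rng.standard_normal(n_causal)
    w *= np.sqrt(h2_e / np.var(X @ w))   # rescale so GReX variance is h2_e
    noise = rng.normal(0.0, np.sqrt(1.0 - h2_e), size=n)
    return X @ w + noise, w

def simulate_gwas_z(w, ld, n_gwas, h2_p, rng):
    """Draw Z ~ MVN(ld @ w * sqrt(n_gwas * h2_p), ld)."""
    mean = ld @ w * np.sqrt(n_gwas * h2_p)
    return rng.multivariate_normal(mean, ld)

# Toy data (assumed): 400 samples, 20 SNPs with allele frequency 0.3.
rng = np.random.default_rng(1)
G = rng.binomial(2, 0.3, size=(400, 20)).astype(float)
X = (G - G.mean(axis=0)) / G.std(axis=0)   # standardized genotypes
ld = np.corrcoef(X, rowvar=False)
expr, w = simulate_expression(X, p_causal=0.1, h2_e=0.1, rng=rng)
z = simulate_gwas_z(w, ld, n_gwas=10_000, h2_p=0.025, rng=rng)
```

The toy `p_causal` of 0.1 is chosen only so that a 20-SNP example has two causal variants; the scenarios in this study use 0.001 and 0.01 over much larger cis regions.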

In Stage I of our TWAS analysis, we applied P+T (0.001), P+T (0.05), lassosum, SDPR, and PRS-CS methods to estimate eQTL weights using eQTL summary data and the reference LD of training samples. In Stage II of the TWAS, we used the estimated eQTL weights and the simulated GWAS Z-scores to conduct a gene-based association test. In addition to gene-based association tests based on eQTL weights per training method, we further constructed the corresponding OTTERS p values. We evaluated the performance of the training methods in the test samples by comparing test \(R^2\), the squared Pearson correlation coefficient between imputed GReX and simulated gene expression. We evaluated TWAS power as the proportion of 50,000 repeated simulations with TWAS p value \(< 2.5 \times 10^{-6}\) (the genome-wide significance threshold adjusting for testing 20K independent genes).
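Both evaluation metrics are simple to compute. The sketch below assumes arrays of imputed GReX, simulated expression, and TWAS p values are already available; the function names are hypothetical.

```python
import numpy as np

def squared_pearson_r2(imputed_grex, expression):
    """Test R^2: squared Pearson correlation of imputed GReX and expression."""
    r = np.corrcoef(imputed_grex, expression)[0, 1]
    return float(r * r)

def twas_power(pvalues, alpha=0.05, n_genes=20_000):
    """Proportion of simulations passing the Bonferroni threshold alpha / n_genes."""
    threshold = alpha / n_genes          # 0.05 / 20,000 = 2.5e-6
    return float(np.mean(np.asarray(pvalues, dtype=float) < threshold))

# Toy input (assumed): two of four simulated p values pass the threshold.
power = twas_power([1e-7, 1e-3, 1e-9, 0.5])
```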

As shown in Fig. 2, the Stage I training method with optimal test \(R^2\) and TWAS power depended on the underlying genetic architecture of gene expression (\(p_{causal}\)) as well as gene expression heritability (\(h_e^2\)). When true cis-eQTLs were sparse (\(p_{causal}\) = 0.001) and gene expression heritability was small (\(h_e^2\) = 0.01), the P+T (0.05) method had the highest TWAS power among all individual methods. When gene expression heritability was low (\(h_e^2\) = 0.01), P+T (0.001) and lassosum had the lowest power. When gene expression heritability increased (\(h_e^2\) = 0.05 or 0.1) within this sparse eQTL model, P+T (0.001) and PRS-CS were generally the optimal methods. For a less sparse model with \(p_{causal}\) = 0.01, SDPR and PRS-CS generally performed best among the individual methods. Relative to individual methods, we found that combining the TWAS p values from the PRS training methods in our OTTERS framework obtained the highest power across all scenarios.

Fig. 2: Test R2 (A) and TWAS power (B) comparison in simulation studies.

Various scenarios with proportions of true causal cis-eQTL \(p_{causal} = (0.001, 0.01)\) and gene expression heritability \(h_e^2 = (0.01, 0.05, 0.1)\) were considered in the simulation studies. The distribution of test R2 in 5000 simulations per method per scenario is presented as box plots (A). The median is shown as a black bar. The lower and upper hinges correspond to the 25th and 75th percentiles. Whiskers extend from each hinge to the most extreme value no further than 1.5 times the interquartile range from the hinge. Data beyond the end of the whiskers are plotted individually. The GWAS sample size on the x-axis of panel B was chosen with respect to \(h_e^2\) values. The proportion of phenotype variance explained by gene expression (\(h_p^2\)) was set to 0.025. TWAS was conducted using simulated GWAS Z-scores.

To evaluate the type I error of the individual PRS methods along with OTTERS, we picked one simulated replicate per gene from the scenario with \(h_e^2 = 0.1\) and \(p_{causal} = 0.001\), simulated \(2 \times 10^{3}\) phenotypes from \(N(0, 1)\), and permuted the eQTL weights for TWAS to perform a total of \(10^{6}\) null simulations. OTTERS was well calibrated in the tails of the distribution, as shown by quantile-quantile (Q-Q) plots of TWAS p values in Supplementary Fig. S1. We also observed that OTTERS had well-controlled type I error for stringent significance levels between \(10^{-4}\) and \(2.5 \times 10^{-6}\) (Supplementary Table S1), which are typically utilized in TWAS. For more modest significance thresholds (\(\alpha = 10^{-2}\)), we noted that OTTERS had a slightly inflated type I error rate. This modest inflation is consistent with the findings of the original ACAT-O work, which showed that the Cauchy-distribution-based approximation that ACAT-O employs might not be accurate for larger p values when the correlation among tests is strong34. This suggests that modest OTTERS p values should be interpreted with caution.

We also compared the performance of our individual PRS training methods to those of FUSION, assuming individual-level reference data were available for the latter method to train GReX models. As shown in Fig. 2A, we interestingly observed that our training methods yielded similar or improved test \({R}^{2}\) compared to FUSION in this situation, with SDPR and PRS-CS outperforming FUSION across all simulation settings. Comparing TWAS power, we found that OTTERS outperformed FUSION by a considerable margin in our simulations (Fig. 2B). These simulation results suggest that, while we developed OTTERS based on PRS training methods to handle summary-level reference data, OTTERS can still improve TWAS power when individual-level reference data are available. This is likely because OTTERS accounts for multiple possible models of genetic architectures of gene expression assumed by the different PRS training methods.

GReX imputation accuracy in GTEx V8 blood samples

To evaluate the imputation accuracy of P+T (0.001), P+T (0.05), lassosum, SDPR, and PRS-CS methods in real data, we applied these training methods to summary-level eQTL reference data from the eQTLGen consortium23 with n = 31,684 blood samples, to train GReX imputation models for 16,699 genes. For test data, we downloaded the transcriptomic data of 315 blood tissue samples that are in GTEx V8 but were not part of GTEx V6 (as GTEx V6 samples contributed to the reference eQTLGen consortium summary data). For these 315 samples, we compared imputed GReX to observed expression levels. We considered trained imputation models with test \({R}^{2} \, > \, 0.01\) as “valid” models, as suggested by previous TWAS methods20,43. We also compared the imputation accuracy of these five training models to those using FUSION based on a smaller individual-level training dataset (individual-level GTEx V6 reference dataset; see Methods). For such models, we compared the test \({R}^{2}\) for genes that had test \({R}^{2} > 0.01\) by at least one training method.

We observed that PRS-CS obtained the most “valid” GReX imputation models with test \(R^2 > 0.01\). Among 16,699 tested genes, PRS-CS obtained “valid” GReX imputation models for 10,337 genes, compared to 9816 genes by P+T (0.001) (5.0% fewer valid genes than PRS-CS), 9662 genes by P+T (0.05) (6.5% fewer), 8718 genes by lassosum (15.7% fewer), 9670 genes by SDPR (6.5% fewer), and 4704 genes by FUSION (54.5% fewer) (Table 1). Among the “valid” GReX imputation models obtained by each method, the ones trained by PRS-CS had the highest median test \(R^2\). The P+T (0.001) method obtained the second most “valid” GReX imputation models with the second largest median test \(R^2\), as compared to P+T (0.05), lassosum, and SDPR (Table 1). We note that the performance of the PRS-CS method was not sensitive to the global-shrinkage parameter (Supplementary Fig. S2).

Table 1 Test R2 in n = 315 whole blood tissue samples from GTEx V8

By comparing test \({R}^{2}\) per “valid” GReX imputation model by PRS-CS versus the other methods (Fig. 3), we observed that PRS-CS had the best overall performance for imputing GReX as it provided the most “valid” models with higher GReX imputation accuracy compared to P+T methods, lassosum, SDPR, and FUSION. Comparing the test \({R}^{2}\) among the other four training methods, we observed that these two P+T methods obtained similar test \({R}^{2}\) per “valid” model. Meanwhile, the test \({R}^{2}\) per valid model varied widely among the P+T methods, lassosum, and SDPR (Supplementary Fig. S3), suggesting that none of these four were optimal across all genes and their performance likely depended on the underlying unknown genetic architecture. These results are consistent with our simulation results.

Fig. 3: Test R2 by PRS-CS versus P+T (0.001), P+T (0.05), lassosum, SDPR and FUSION.

Test R2 by PRS-CS versus P+T (0.001) (A), P+T (0.05) (B), lassosum (C), SDPR (D), and FUSION (E) with 315 GTEx V8 test samples, with different colors denoting whether test R2 > 0.01 only by PRS-CS (red), only by the y-axis method (green), or both methods (blue). Genes with test R2 > 0.01 by at least one method were included in the plot.

TWAS of cardiovascular disease

Using the eQTL weights trained by P+T (0.001), P+T (0.05), lassosum, SDPR, and PRS-CS methods with the eQTLGen23 reference data and reference LD from GTEx V8 WGS samples44, we applied our OTTERS framework to the summary-level GWAS data of cardiovascular disease from UKBB (n = 459,324, case fraction = 0.319)35 (see Methods). We performed TWAS of cardiovascular disease for 16,678 genes. First, for each gene, we obtained TWAS p values per individual training method (P+T (0.001), P+T (0.05), lassosum, SDPR, and PRS-CS). Second, we performed genomic control45 for TWAS test statistics generated under each specific training model, by scaling all test statistics to ensure that the median test p value equals 0.5. Last, in the follow-up ACAT-O test, we only considered genes with test GReX \(R^2 > 0.01\) by at least one PRS training method in additional GTEx V8 samples. We combined the adjusted p values across all five training models using ACAT-O to obtain our OTTERS test statistics and p values. Genes with OTTERS p values < \(2.998 \times 10^{-6}\) (Bonferroni corrected significance level) were identified as significant TWAS genes for cardiovascular risk.
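The genomic control step can be sketched as follows, assuming the TWAS statistics for one training method are available as Z-scores. The inflation factor is the median squared Z-score divided by the median of a 1-degree-of-freedom chi-square distribution (about 0.455), and dividing each Z-score by the square root of that factor makes the median p value equal to 0.5. The function name is hypothetical, and the sketch always rescales, since the text does not say whether rescaling was applied only when the factor exceeds 1.

```python
import math
import statistics

def genomic_control(z_scores):
    """Rescale Z-scores so that the median two-sided p value equals 0.5."""
    chi2 = [z * z for z in z_scores]
    # Median of chi-square(1 df) is the squared 75th percentile of N(0, 1)
    null_median = statistics.NormalDist().inv_cdf(0.75) ** 2
    lam = statistics.median(chi2) / null_median
    if lam <= 0:
        raise ValueError("median test statistic is zero; cannot rescale")
    adjusted = [z / math.sqrt(lam) for z in z_scores]
    # Two-sided normal p value: erfc(|z| / sqrt(2))
    pvalues = [math.erfc(abs(z) / math.sqrt(2.0)) for z in adjusted]
    return lam, adjusted, pvalues

# Toy input (assumed): five TWAS Z-scores from one training method.
lam, z_adj, p_adj = genomic_control([0.5, -1.0, 2.0, 3.0, -0.2])
```

The Bonferroni threshold quoted above follows from the number of tested genes: 0.05 / 16,678 is approximately \(2.998 \times 10^{-6}\).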

In total, we identified 40 significant TWAS genes by using OTTERS. To identify independently significant TWAS genes, we calculated the \(R^2\) (squared Pearson correlation) between the GReX predicted by PRS-CS for each pair of genes. For a pair of genes with predicted GReX \(R^2 > 0.5\), we only kept the gene with the smaller TWAS p value as the independently significant gene. OTTERS obtained 38 independently significant TWAS genes (Table 2 and Fig. 4), compared to 17 independently significant genes by P+T (0.001), 11 by P+T (0.05), 10 by lassosum, 41 by SDPR, and 12 by PRS-CS. Among these 38 independent TWAS risk genes identified by OTTERS, gene RP11-378A13.1 (OTTERS p value = \(9.78 \times 10^{-9}\)) was not within 1 Mb of any known GWAS risk locus with genomic-control corrected p value < \(5 \times 10^{-8}\) in the UKBB summary-level GWAS data. This gene RP11-378A13.1 was also identified as a significant TWAS risk gene in blood tissue for systolic blood pressure, high cholesterol, and cardiovascular disease by FUSION1.
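One way to implement the pairwise rule above is a greedy pass over significant genes ordered by p value. This is an assumed reading of the procedure rather than a confirmed one: a gene is kept only if its predicted GReX has \(R^2 \le 0.5\) with every gene already kept. The function name and toy values are hypothetical.

```python
import numpy as np

def independent_genes(pvalues, grex, r2_max=0.5):
    """Greedy selection of independently significant genes.

    pvalues : dict mapping gene name -> TWAS p value
    grex    : dict mapping gene name -> predicted GReX vector (same samples)
    """
    kept = []
    for gene in sorted(pvalues, key=pvalues.get):   # most significant first
        r2 = [np.corrcoef(grex[gene], grex[k])[0, 1] ** 2 for k in kept]
        if all(v <= r2_max for v in r2):
            kept.append(gene)
    return kept

# Toy input (assumed): genes A and B have highly correlated GReX; C does not.
pvals = {"A": 1e-10, "B": 1e-8, "C": 1e-7}
grex = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [1.1, 2.0, 2.9, 4.2, 5.0],
    "C": [1.0, -1.0, 1.0, -1.0, 1.5],
}
selected = independent_genes(pvals, grex)
```

Here B is dropped in favor of the more significant A, while C is retained because its GReX is only weakly correlated with that of A.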

Table 2 Independent TWAS risk genes of cardiovascular disease identified by OTTERS

We compared our OTTERS results with the TWAS results shown on TWAS hub (see Data availability) obtained by FUSION using the same UKBB GWAS summary data of cardiovascular disease but using a smaller individual-level reference expression dataset from GTEx V6 (whole blood tissue, n = 338). Of the 38 independent genes that OTTERS identified from TWAS with eQTLGen reference data of whole blood, FUSION only identified 8 of these genes (CLCN6, PSRC1, RP11-378A13.1, CAMK1D, SIDT2, MTHFSD, NTN5, OPRL1) when using the GTEx V6 reference data of the same tissue. FUSION did identify 13 additional OTTERS genes (NPPA, CPEB4, NT5C2, TNNT3, C11orf49, CSK, FES, MBTPS1, ACE, MRI1, HAUS8, RPL28, CTSZ), when considering all available tissue types in GTEx V6 reference data. These genes were identified by FUSION when considering the GTEx V6 reference data of artery, thyroid, adipose visceral, nerve tibial tissues, etc. For example, the most significant gene FES (OTTERS p value = \(2.87\times {10}^{-32}\)) was identified by FUSION using GTEx reference data of artery tibial, thyroid, and adipose visceral omentum tissues, and was also identified as a TWAS risk gene for high blood pressure, which is strongly related to cardiovascular disease46.

Our OTTERS method also identified 17 genes (LINC01093, SERPINB6, CARMIL1, ZSCAN12P1, HCG4P7, HCG4P3, HLA-S, PSPHP1, LPL, PTP4A3, SLCO3A1, RALBP1, SULT2B1, EDN3, ZBTB46, FAM3B, MX1) that were not detected by FUSION using GTEx V6 data, where EDN3 (Endothelin 3, a member of the endothelin family) was shown to be active in the cardiovascular system and play an important role in the maintenance of blood pressure or generation of hypertension47.

By comparing OTTERS results with the ones obtained by individual methods (Table 2, Fig. 4 and Supplementary Fig. S4), we found that all individual methods contributed to the OTTERS results. For example, the gene LINC01093 was only identified by lassosum, while genes CPEB4, SIDT2, and ACE were only detected by PRS-CS and SDPR and the gene EDN3 was only identified by the P+T methods. To better understand the differences among individual methods, we plotted the eQTL weights estimated by P+T (0.001), P+T (0.05), lassosum, SDPR, and PRS-CS for three example genes that were only detected by one or two individual methods (Supplementary Figs. S5–S7). For these genes, we plotted the eQTL weights produced by each method, color coded with respect to \(-\log_{10}\)(GWAS p values) from the UKBB GWAS summary statistics and shape coded with respect to the direction of the UKBB GWAS Z-score statistics. Generally, significant TWAS p values were obtained by methods that estimated eQTL weights with relatively large magnitudes for SNPs with relatively more significant GWAS p values.

Fig. 4: Manhattan plot of TWAS results by OTTERS.

Manhattan plot of TWAS results by OTTERS using GWAS summary-level statistics of cardiovascular disease and imputation models fitted based on eQTLGen summary statistics. The x-axis represented the genomic position, and the y-axis represented –log10(p values). p values were the genomic-control corrected p values from the Z-score test from TWAS (two-sided). Independently significant TWAS risk genes were labeled.

In Supplementary Fig. S5, we showed the eQTL weights for gene SIDT2, which was a significant risk gene identified by both PRS-CS and SDPR, and had p values < \({10}^{-4}\) by other methods. Compared to lassosum, SDPR had more significant GWAS SNPs colocalized with eQTLs having relatively large weights in the test region, and PRS-CS had more non-significant GWAS SNPs colocalized with eQTLs having zero weights. Compared to the P+T methods, SDPR and PRS-CS based on a multivariate regression model modeled LD among all test SNPs, and thus estimated eQTL weights leading to significant TWAS findings. In Supplementary Fig. S6, we provided the results of gene EDN3, which was only identified by P+T methods (p values \(\le 9.15\times {10}^{-8}\)). Compared to P+T methods, SDPR (p value = \(5.9\times {10}^{-3}\)) and PRS-CS (p value = \(0.0158\)) had fewer significant GWAS SNPs colocalized with eQTLs that had relatively large weights in the test region, while lassosum (p value = \(8.6\times {10}^{-6}\)) assigned relatively large weights to more non-significant GWAS SNPs. In Supplementary Fig. S7, we provided results for gene LINC01093, which was only identified by lassosum. For this gene, SDPR and PRS-CS estimated near-zero weights for most test SNPs with significant GWAS p values in the test region. Most significant GWAS SNPs did not have eQTL test p values < 0.001 or 0.05, and were thus filtered out by P+T methods. lassosum was the only method that produced relatively large eQTL weights that colocalized with GWAS-significant SNPs.

These results were consistent with our simulation study results, demonstrating that the performance of different individual methods depended on the underlying genetic architecture. We do note that there were a handful of genes identified by an individual method that were not significant using OTTERS (Supplementary Table S2). Nonetheless, the omnibus test borrows strength across all individual methods, thus generally achieving higher TWAS power and identifying the group of most robust TWAS risk genes.

By examining the Q-Q plots of TWAS p values, we observed moderate inflation for all methods (Supplementary Fig. S8). Such inflation in TWAS results is not uncommon48,49,50, which could be due to similar inflation in the GWAS summary data and not distinguishing the pleiotropy and mediation effects for considered gene expression and phenotype of interest51 (Supplementary Fig. S9). We also observed a notable inflation in the GWAS p values of cardiovascular disease from UKBB (Supplementary Fig. S9), as we estimated the LD score regression52 intercept to be 1.1 from the GWAS summary data.

We did not consider directly comparing to FUSION in our above TWAS analyses of cardiovascular disease since we used the summary-level reference data eQTLGen. However, to assess the performance of OTTERS and FUSION in a real study where individual-level reference data are available, we performed an additional TWAS analysis of cardiovascular disease in the UK Biobank using the GTEx V8 data of 574 whole blood samples as the reference data. We trained OTTERS Stage I using cis-eQTL summary statistics obtained from these 574 GTEx V8 whole blood samples and reference LD from GTEx V8 WGS samples, and trained FUSION models using individual-level genotype data and gene expression data of the same 574 whole blood samples.

We tested TWAS association for 19,653 genes and identified genes with TWAS p values < \(2.53 \times 10^{-6}\) (Bonferroni corrected significance level) as significant TWAS genes. Training \(R^2 > 0.01\) was used to select “valid” GReX imputation models for TWAS (Supplementary Fig. S10). To identify independently significant TWAS genes, we calculated the \(R^2\) between the GReX predicted by lassosum for each pair of genes, since lassosum had the best training \(R^2\) (Supplementary Fig. S10). For a pair of genes with predicted GReX \(R^2 > 0.5\), we only kept the gene with the smaller TWAS p value as the independently significant gene. As a result, OTTERS obtained 34 independently significant TWAS genes, while FUSION identified 21 independently significant TWAS genes (Supplementary Fig. S11). A total of 14 genes were identified by both FUSION and OTTERS (Supplementary Table S3).

These results demonstrate the advantages of OTTERS in using multiple PRS training methods to account for the unknown genetic architecture of gene expression, which is consistent with our simulation results. These results also show the advantage of using eQTL summary data with a larger training sample size, as more independently significant TWAS genes were identified by using the eQTLGen summary reference data (38 vs. 34), even with a more stringent rule (test instead of training \(R^2 > 0.01\)) applied to select test genes with “valid” GReX imputation models.

Computational time

The computational time per gene of different PRS methods depends on the number of test variants considered for the target gene. Thus, we calculated the computational time and memory usage for four groups of genes whose numbers of test variants were <2000, between 2000 and 3000, between 3000 and 4000, and >4000, respectively. Among all tested genes in our real studies, the median number of test variants per gene is 3152, and the proportion of genes in each group is 10.3%, 33.4%, 34.5%, and 21.8%, respectively. For each group, we randomly selected ten genes on Chromosome 4 to evaluate the average computational time and memory usage per gene. We benchmarked the computational time and memory usage of each method on one Intel(R) Xeon(R) processor (2.10 GHz). The evaluation was based on 1000 MCMC iterations for SDPR and PRS-CS (default) without parallel computation (Supplementary Table S4). We showed that P+T and lassosum were computationally more efficient than SDPR and PRS-CS, whose speeds were impeded by the need for MCMC iterations. Between the two Bayesian methods, SDPR implemented in C++ uses significantly less time and memory than PRS-CS implemented in Python.

Discussion

Our OTTERS framework represents an omnibus TWAS tool that can leverage summary-level expression and genotype results from a reference sample, thereby robustly expanding the use of TWAS into more settings. To this end, we adapted and evaluated five different PRS methods assuming different underlying genetic models, including the relatively simple method P+T26 with two different p value thresholds (0.001 and 0.05), the frequentist method lassosum27, as well as the Bayesian methods PRS-CS28 and SDPR29 within our omnibus test for optimal inference. We note that additional PRS methods such as MegaPRS30 or PUMAS31 could also be implemented as additional OTTERS Stage I training methods. Higher TWAS power might be obtained by adding more PRS methods in OTTERS Stage I, at additional computational cost. We also note that the existing SMR-HEIDI53 method, which uses summary-level data from GWAS and eQTL studies to test for possible causal genetic effects on a trait of interest that are mediated through gene expression, could be used as an alternative to TWAS. However, the SMR method generally restricts the eQTLs considered, excluding those with eQTL p values larger than a certain threshold, e.g., 0.05.

In simulation studies, we demonstrated that the performance of each of these five PRS methods depended substantially on the underlying genetic architecture of gene expression, with the P+T methods generally performing better for sparse architectures and the Bayesian methods performing better for denser architectures. Consequently, since the genetic architecture of gene expression is unknown a priori, we believe this justifies the use of the omnibus TWAS test implemented in OTTERS in practice, as this test had near-optimal performance across all simulation scenarios considered. While we developed our methods with summary-level reference data in mind, we note that our prediction methods and OTTERS perform well (in terms of imputation accuracy and power) relative to existing TWAS methods like FUSION when individual-level reference data are available.

In our real data application using UKBB GWAS summary-level data, we compared OTTERS TWAS results using reference eQTL summary data from eQTLGen consortium to FUSION TWAS results using a substantially smaller individual-level reference dataset from GTEx V6. OTTERS identified 13 significant TWAS risk genes that were missed by FUSION using individual-level GTEx V6 reference data of blood tissue, suggesting that the use of larger reference datasets like eQTLGen in TWAS can provide additional findings. Interestingly, the genes missed by FUSION were instead detected using individual-level GTEx reference data of other tissue types that are more directly related to cardiovascular disease. By comparing OTTERS to FUSION when the same individual-level GTEx V8 reference data of whole blood samples were used, we still observed that OTTERS identified more risk genes than FUSION, which we believe is due to the former method accounting for the unknown genetic architecture of gene expression by using multiple regression methods to train GReX imputation models. These applied results were consistent with our simulation results.

Among all individual methods, P+T is the most computationally efficient method. The Bayesian methods SDPR and PRS-CS require more computation time than the frequentist method lassosum as the former set of methods require a large number of MCMC iterations for model fit. By comparing the performance of these five methods in terms of the imputation accuracy and TWAS power in simulations and real applications, we conclude that none of these methods was optimal across different genetic architectures. We found that all methods provided distinct and considerable contributions to the final OTTERS TWAS results. These results demonstrate the benefits of OTTERS in practice, since OTTERS can combine the strength of these individual methods to achieve optimal performance.

To enable the use of OTTERS by the public, we provide an integrated tool (see Code Availability) to (1) Train GReX imputation models (i.e., estimate eQTL weights in Stage I) using eQTL summary data by P+T, lassosum, SDPR, and PRS-CS; (2) Conduct TWAS (i.e., testing gene-trait association in Stage II) using both individual-level and summary-level GWAS data with the estimated eQTL weights; and (3) Apply ACAT-O to aggregate the TWAS p values from individual training methods. Since the existing tools for P+T, lassosum, SDPR, and PRS-CS were originally developed for PRS calculations, we adapted and optimized them for training GReX imputation models in our OTTERS tool. For example, we integrate TABIX54 and PLINK55 tools in OTTERS to extract input data per target gene more efficiently. We also enable parallel computation in OTTERS for training GReX imputation models and testing gene-trait association of multiple genes.

The OTTERS framework does have its limitations. First, training GReX imputation models takes ~20 min per gene on average across all five training models, which might be computationally challenging when studying eQTL summary data of multiple tissue types and ~20K genes genome-wide. Users might consider prioritizing the P+T (0.001), lassosum, and SDPR training methods, as these three provided complementary results in our studies. Second, the currently available eQTL summary statistics are mainly derived from individuals of European descent. OTTERS trains GReX imputation models on these eQTL summary statistics, and the resulting imputed GReX could consequently have attenuated cross-population predictive performance56. This might limit the transferability of our TWAS results across populations. Third, the OTTERS omnibus test does not provide the direction of the identified gene-phenotype associations; the direction should instead be read from the sign of the TWAS Z-score statistic of each training method. Last, even though the method can in principle integrate both cis- and trans-eQTLs with GWAS data, the computation time and the limited availability of summary-level trans-eQTL reference data remain the main obstacles. Our current OTTERS tool only considers cis-eQTL effects. Extension of OTTERS to enable cross-population TWAS and incorporation of trans-eQTL effects is part of our ongoing research but is out of the scope of this work.

Our OTTERS framework using large-scale eQTL summary data has the potential to identify more significant TWAS risk genes than standard TWAS tools that use smaller individual-level reference transcriptomic data and only a single regression method for training GReX imputation models. This tool provides the opportunity to leverage not only publicly available eQTL summary data of various tissues for conducting TWAS of complex traits and diseases, but also the emerging summary-level data of other types of molecular QTL such as splicing QTLs, methylation QTLs, metabolomics QTLs, and protein QTLs. For example, OTTERS could be applied to perform proteome-wide association studies using summary-level reference data of genetic-protein relationships such as those reported by the SCALLOP consortium57, and epigenome-wide association studies using summary-level reference data of methylation-phenotype relationships reported by the Genetics of DNA Methylation Consortium (GoDMC) (see Data availability). OTTERS would be most useful for the broad community of researchers who only have access to summary-level QTL reference data and summary-level GWAS data. The feasibility of integrating summary-level molecular QTL data and GWAS data makes our OTTERS tool valuable for wide application in current multi-omics studies of complex traits and diseases.

Methods

Traditional two-stage TWAS analysis

Stage I of TWAS estimates a GReX imputation model using individual-level expression and genotype data available from a reference dataset. Consider the following GReX imputation model from \(n\) individuals and \(m\) SNPs (multivariable regression model assuming linear additive genetic effects) within the reference dataset:

$$\mathbf{e}_g=\mathbf{X}_g\mathbf{w}+\boldsymbol{\epsilon}_g,\quad \boldsymbol{\epsilon}_g\sim N\left(0,\sigma_\epsilon^2\mathbf{I}\right).$$
(1)

Here, \(\mathbf{e}_g\) is a vector representing the expression levels of gene \(g\), \(\mathbf{X}_g\) is an \(n\times m\) matrix of genotype data of SNP predictors proximal to or within gene \(g\), \(\mathbf{w}\) is a vector of genetic effect sizes (referred to as eQTL effect sizes in the broad sense), and \(\boldsymbol{\epsilon}_g\) is the error term. We consider only cis-SNPs within 1 MB of the flanking 5’ and 3’ ends of the gene as genotype predictors coded within \(\mathbf{X}_g\)19,20,22. Once we configure the model in Eq. (1), we can employ methods like PrediXcan, FUSION, and TIGAR to fit the model and obtain estimates of eQTL effect sizes (\(\hat{\mathbf{w}}\)).

Stage II of TWAS uses the eQTL effect sizes (\(\hat{{{{{{\bf{w}}}}}}}\)) from Stage I to impute gene expression (GReX) in an independent GWAS and then test for association between GReX and phenotype. Given individual-level GWAS data with genotype data \({{{{{{\bf{X}}}}}}}_{{{{{{\rm{new}}}}}}}\) and eQTL effect sizes (\(\hat{{{{{{\bf{w}}}}}}}\)) from Stage I, the GReX for \({{{{{{\bf{X}}}}}}}_{{{{{{\rm{new}}}}}}}\) can be imputed by \(\widehat{{{{{{\bf{GReX}}}}}}}={{{{{{\bf{X}}}}}}}_{{{{{{\rm{new}}}}}}}\hat{{{{{{\bf{w}}}}}}}\). The follow-up TWAS would test the association between \(\widehat{{{{{{\bf{GReX}}}}}}}\) and phenotype \({{{{{\bf{y}}}}}}\) based on a generalized linear regression model, which is equivalent to a gene-based association test taking \(\hat{{{{{{\bf{w}}}}}}}\) as test SNP weights. When individual-level GWAS data are not available, one can apply FUSION and S-PrediXcan test statistics to summary-level GWAS data as follows:

$$Z_{g,\mathrm{FUSION}}=\frac{\sum_{j=1}^{J}\hat{w}_j Z_j}{\sqrt{\hat{\mathbf{w}}^{\prime}\mathbf{V}\hat{\mathbf{w}}}},\quad Z_{g,\mathrm{S\text{-}PrediXcan}}=\frac{\sum_{j=1}^{J}\hat{w}_j\hat{\sigma}_j Z_j}{\sqrt{\hat{\mathbf{w}}^{\prime}\mathbf{V}\hat{\mathbf{w}}}}$$
(2)

where \(Z_j\) is the single variant Z-score test statistic in GWAS for the jth SNP, \(j=1,\ldots,J\), for all test SNPs that have both eQTL weights with respect to the test gene \(g\) and GWAS Z-scores; \(\hat{\sigma}_j\) is the genotype standard deviation of the jth SNP; and \(\mathbf{V}\) denotes the genotype correlation matrix of the test SNPs in the FUSION Z-score statistic and the genotype covariance matrix in the S-PrediXcan Z-score statistic. In particular, \(\hat{\sigma}_j\) and \(\mathbf{V}\) can be approximated from a reference panel with genotype data of samples of the same ancestry, such as those available from the 1000 Genomes Project58. If \(\hat{\mathbf{w}}\) are standardized effect sizes estimated assuming standardized genotype \(\mathbf{X}_g\) and gene expression \(\mathbf{e}_g\) in Eq. (1), the FUSION and S-PrediXcan Z-score statistics are equivalent13. Otherwise, the S-PrediXcan Z-score should be applied to avoid false-positive inflation.
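As a minimal sketch of Eq. (2), assuming the eQTL weights, GWAS Z-scores, and reference LD matrix have already been aligned to the same SNPs and effect alleles, the two statistics could be computed as follows. The function names and toy numbers are illustrative assumptions and are not taken from the FUSION or S-PrediXcan software.

```python
import math

def quad_form(w, V):
    """Return w' V w for a weight list w and a square matrix V (list of rows)."""
    m = len(w)
    return sum(w[i] * V[i][j] * w[j] for i in range(m) for j in range(m))

def fusion_z(w, z, R):
    """FUSION-style Z-score: w are standardized weights, R is the LD correlation matrix."""
    numerator = sum(wj * zj for wj, zj in zip(w, z))
    return numerator / math.sqrt(quad_form(w, R))

def spredixcan_z(w, z, sd, V):
    """S-PrediXcan-style Z-score: sd are genotype SDs, V is the genotype covariance matrix."""
    numerator = sum(wj * sj * zj for wj, sj, zj in zip(w, sd, z))
    return numerator / math.sqrt(quad_form(w, V))

def two_sided_p(z):
    """Two-sided normal p value for a Z-score."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Toy inputs, made up for illustration only
z_gwas = [2.1, -0.4, 1.3]
w_std = [0.30, 0.00, -0.10]                      # standardized eQTL weights
R = [[1.0, 0.2, 0.1], [0.2, 1.0, 0.3], [0.1, 0.3, 1.0]]
sd = [0.5, 0.7, 0.6]                             # genotype standard deviations
V = [[R[i][j] * sd[i] * sd[j] for j in range(3)] for i in range(3)]
w_raw = [wj / sj for wj, sj in zip(w_std, sd)]   # weights on the unstandardized genotype scale

z_fusion = fusion_z(w_std, z_gwas, R)
z_spredixcan = spredixcan_z(w_raw, z_gwas, sd, V)
```

In this toy setting the two Z-scores agree up to floating-point error, which mirrors the equivalence stated above for standardized weights.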

TWAS Stage I analysis using summary-level reference data

We now consider a variation of TWAS Stage I to estimate cis-eQTL effect sizes \(\hat{{{{{{\bf{w}}}}}}}\) based on a multivariable regression model (Eq. (1)) from summary-level reference data. We assume that the summary-level reference data provide information on the association between a single genetic variant j (\(j=1,\ldots,m\)) and expression of gene g. This information generally consists of effect size estimates (\({\widetilde{w}}_{j},j=1,\ldots,m\)) and p values derived from the following single variant regression models:

$$\mathbf{e}_g=\mathbf{x}_j w_j+\boldsymbol{\epsilon}_j,\quad \boldsymbol{\epsilon}_j\sim N\left(0,\sigma_{\epsilon_j}^2\mathbf{I}\right),\quad j=1,\ldots,m.$$
(3)

Here, \(\mathbf{x}_j\) is an \(n\times 1\) vector of genotype data for genetic variant \(j\), and \(\boldsymbol{\epsilon}_j\) is the error term. Since eQTL summary data are analogous to GWAS summary data where gene expression represents the phenotype, we can estimate the eQTL effect sizes \(\hat{\mathbf{w}}\) using the marginal least squares effect estimates (\(\widetilde{w}_j,\,j=1,\ldots,m\)) and p values from the eQTL summary data as well as reference LD information of the same ancestry26,27,28,29. Although other summary-data-based PRS methods could also be applied within the TWAS Stage I framework, we only consider four representative methods as follows:

P+T: the P+T method selects eQTL weights by LD clumping and p value Thresholding26. Given a threshold \(P_T\) for p values and a threshold \(R_T\) for LD \(R^2\), we first exclude SNPs whose marginal p values from the eQTL summary data are greater than \(P_T\) or that are strongly correlated (LD \(R^2\) greater than \(R_T\)) with another SNP having a more significant marginal p value (or Z-score statistic value). For the remaining selected test SNPs, we use the marginal standardized eQTL effect sizes from the eQTL summary data as eQTL weights for TWAS in Stage II. We considered \(R_T=0.99\) and \(P_T\in\{0.001, 0.05\}\) in this paper and implemented the P+T method using PLINK 1.955 (see Code availability). We denote the P+T method with \(P_T\) equal to 0.001 and 0.05 as P+T (0.001) and P+T (0.05), respectively.
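The selection rule described above can be sketched as a greedy procedure. This is an illustration only, assuming a precomputed matrix of pairwise LD \(R^2\) values; the function name `p_plus_t` and the toy inputs are hypothetical, and the code is not the PLINK implementation used in this study.

```python
def p_plus_t(pvals, w_marginal, r2, p_t=0.001, r_t=0.99):
    """Return per-SNP eQTL weights after p value thresholding and greedy LD clumping.

    pvals, w_marginal: per-SNP lists of marginal p values and standardized effects.
    r2: square matrix (list of rows) of pairwise LD R^2 values.
    SNPs that are not selected receive a weight of zero.
    """
    m = len(pvals)
    # Visit SNPs from most to least significant
    order = sorted(range(m), key=lambda j: pvals[j])
    kept = []
    for j in order:
        if pvals[j] > p_t:
            break  # every remaining SNP is even less significant
        # Keep SNP j only if it is not in strong LD with a more significant kept SNP
        if all(r2[j][k] <= r_t for k in kept):
            kept.append(j)
    weights = [0.0] * m
    for j in kept:
        weights[j] = w_marginal[j]
    return weights

# Toy inputs, made up for illustration: SNP 0 and SNP 1 are in near-perfect LD
toy_p = [1e-5, 2e-4, 0.03, 1e-4]
toy_w = [0.20, 0.19, 0.05, -0.10]
toy_r2 = [
    [1.0, 0.995, 0.1, 0.0],
    [0.995, 1.0, 0.1, 0.0],
    [0.1, 0.1, 1.0, 0.2],
    [0.0, 0.0, 0.2, 1.0],
]
weights_strict = p_plus_t(toy_p, toy_w, toy_r2, p_t=0.001)
weights_lenient = p_plus_t(toy_p, toy_w, toy_r2, p_t=0.05)
```

With the stricter threshold, SNP 2 is dropped for its p value and SNP 1 is dropped for its LD with SNP 0; the more lenient threshold additionally retains SNP 2.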

Frequentist lassosum: with standardized \(\mathbf{e}_g\) and \(\mathbf{X}_g\), we can show that the marginal least squares eQTL effect size estimates from the single variant regression models (Eq. (3)) are \(\widetilde{\mathbf{w}}=\mathbf{X}_g^{\mathrm{T}}\mathbf{e}_g/n\) and that the LD correlation matrix is \(\mathbf{R}=\mathbf{X}_g^{\mathrm{T}}\mathbf{X}_g/n\). That is,

$$\mathbf{X}_g^{\mathrm{T}}\mathbf{e}_g=n\widetilde{\mathbf{w}} \quad \mathrm{and} \quad \mathbf{X}_g^{\mathrm{T}}\mathbf{X}_g=n\mathbf{R}.$$
(4)

By approximating \(n\mathbf{R}\) by \(n\mathbf{R}_s\) in the LASSO32 penalized loss function, where \(\mathbf{R}_s=(1-s)\mathbf{R}_r+s\mathbf{I}\) with a tuning parameter \(0 < s < 1\), a reference LD correlation matrix \(\mathbf{R}_r\) from an external panel such as the 1000 Genomes Project58, and an identity matrix \(\mathbf{I}\), the frequentist lassosum method27 can tune the LASSO penalty parameter and \(s\) using a pseudovalidation approach and then solve for the eQTL effect size estimates \(\hat{\mathbf{w}}\) by minimizing the approximated LASSO loss function, which requires no individual-level data (see details in Supplementary Methods).
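To illustrate how such a loss can be minimized from summary data alone, the sketch below applies coordinate descent to an objective of the form \(\mathbf{w}^{\mathrm{T}}\mathbf{R}_s\mathbf{w}-2\mathbf{w}^{\mathrm{T}}\widetilde{\mathbf{w}}+2\lambda\Vert\mathbf{w}\Vert_1\). This scaling of the objective, the function names, and the fixed values of \(\lambda\) and \(s\) are assumptions made for illustration; the pseudovalidation step is not shown, and this is not the lassosum package code.

```python
def soft_threshold(x, lam):
    """Soft-thresholding operator used in LASSO coordinate updates."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def lassosum_cd(w_tilde, R_ref, lam, s, n_iter=100):
    """Coordinate descent for w'R_s w - 2 w'w_tilde + 2*lam*||w||_1.

    w_tilde: marginal standardized effect estimates.
    R_ref: reference LD correlation matrix (list of rows, unit diagonal).
    lam, s: fixed penalty and shrinkage parameters (tuning is not shown here).
    """
    m = len(w_tilde)
    # Shrunken LD matrix R_s = (1 - s) * R_ref + s * I
    R_s = [[(1.0 - s) * R_ref[j][k] + (s if j == k else 0.0) for k in range(m)]
           for j in range(m)]
    w = [0.0] * m
    for _ in range(n_iter):
        for j in range(m):
            # Marginal effect minus the part explained by the other SNPs through LD
            partial = w_tilde[j] - sum(R_s[j][k] * w[k] for k in range(m) if k != j)
            w[j] = soft_threshold(partial, lam) / R_s[j][j]
    return w

# Toy example, made up for illustration: with no LD the solution is
# the soft-thresholded marginal effect of each SNP
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
w_hat = lassosum_cd([0.3, -0.05, 0.1], identity, lam=0.08, s=0.5)
```

Only the marginal estimates and an LD matrix enter the updates, which is why no individual-level expression or genotype data are needed.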

Bayesian SDPR: the Bayesian DPR method33, as implemented in TIGAR22, estimates \(\hat{\mathbf{w}}\) for the underlying multivariable regression model in Eq. (1) by assuming a normal prior \(N(0,\sigma_w^2)\) for \(w_j\) and a Dirichlet process prior59 \(DP(H,\alpha)\) for \(\sigma_w^2\) with base distribution \(H\) and concentration parameter \(\alpha\). SDPR29 assumes the same DPR model but can be applied to estimate the eQTL effect sizes \(\hat{\mathbf{w}}\) using only eQTL summary data (see details in Supplementary Methods).

Bayesian PRS-CS: the PRS-CS method28 assumes the following normal prior for \({w}_{j}\) and non-informative scale-invariant Jeffreys prior on the residual variance \({\sigma }_{\epsilon }^{2}\) in Eq. (1):

$$w_j \sim N\left(0,\frac{\sigma_\epsilon^2}{n}\psi_j\right),\quad p\left(\sigma_\epsilon^2\right)\propto \sigma_\epsilon^{-2},\quad \psi_j \sim \mathrm{Gamma}\left(a,\delta_j\right),\quad \delta_j \sim \mathrm{Gamma}\left(b,\phi\right),$$

where the local shrinkage parameter \(\psi_j\) has an independent gamma-gamma prior and \(\phi\) is a global shrinkage parameter controlling the overall sparsity of \(\mathbf{w}\). PRS-CS sets the hyperparameters \(a=1\) and \(b=1/2\) so that the prior density of \(w_j\) has a sharp peak around zero, which shrinks small effect sizes of potentially false eQTLs towards zero, as well as heavy, Cauchy-like tails, which exert little influence on eQTLs with larger effects. Posterior estimates \(\hat{\mathbf{w}}\) are obtained from the eQTL summary data (i.e., marginal effect size estimates \(\widetilde{\mathbf{w}}\) and p values) and the reference LD correlation matrix \(\mathbf{R}\) by a Gibbs sampler (see details in Supplementary Methods). We set \(\phi\) as the square of the proportion of causal variants in the simulation and as \(10^{-4}\) per gene in the real data application.

OTTERS framework

As shown in Fig. 1, OTTERS first trains GReX imputation models per gene g using the P+T, lassosum, SDPR, and PRS-CS methods, each of which infers cis-eQTL weights using cis-eQTL summary data and an external LD reference panel of similar ancestry (Stage I). Once we derive cis-eQTL weights for each training method, we can impute the respective GReX using that method and perform the respective gene-based association analysis in the test GWAS dataset using the formulas given in Eq. (2) (Stage II). We thus derive a set of TWAS p values for gene g, one p value for each training model that we applied. We then use these TWAS p values to create an omnibus test using the ACAT-O34 approach, which employs a Cauchy distribution for inference (see details in Supplementary Methods). We refer to the p value derived from the ACAT-O test as the OTTERS p value.
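The aggregation step can be illustrated with the Cauchy combination formula that underlies ACAT. The sketch below assumes equal weights across training methods, which may differ from the weighting used in the OTTERS tool, and omits the numerical safeguards for extremely small p values that production implementations typically include; the function name `acat` and the toy p values are illustrative.

```python
import math

def acat(pvals, weights=None):
    """Combine p values with the Cauchy combination test (equal weights by default)."""
    if weights is None:
        weights = [1.0 / len(pvals)] * len(pvals)
    # Each p value is mapped to a standard Cauchy variate; the weighted sum
    # is again Cauchy-distributed under the null hypothesis
    t = sum(w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, pvals))
    return 0.5 - math.atan(t / sum(weights)) / math.pi

# Toy TWAS p values for one gene from several training methods (made up)
method_pvals = [1e-6, 0.4, 0.7]
omnibus_p = acat(method_pvals)
```

A single very small p value dominates the combined statistic, so the omnibus test retains power when only one training method fits the underlying genetic architecture well.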

Marginal eQTL effect sizes

In practice, when training GReX imputation models using reference eQTL summary data, the marginal standardized eQTL effect sizes were approximated by \(\widetilde{w}_j \approx Z_j/\sqrt{\mathrm{median}(n_{g,j})}\), where \(Z_j\) denotes the corresponding eQTL Z-score statistic from the single variant test and \(\mathrm{median}(n_{g,j})\) denotes the median sample size across all cis-eQTLs for the target gene \(g\). Since summary eQTL datasets (e.g., eQTLGen) were generally obtained by meta-analysis of multiple cohorts, the sample size per test SNP could vary across the cis-eQTLs of the test gene. We therefore also took the median cis-eQTL sample size per gene as the sample size value required by the lassosum, SDPR, and PRS-CS methods, which ensures robust performance when applying these eQTL summary data-based methods.
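A minimal sketch of this approximation, using made-up Z-scores and per-SNP sample sizes for a single gene, is given below; the function name is hypothetical.

```python
import math
from statistics import median

def marginal_std_effects(z_scores, sample_sizes):
    """Approximate marginal standardized eQTL effects as Z / sqrt(median n).

    Returns the list of approximated effects and the median sample size,
    which is also the value passed to methods that need a single n per gene.
    """
    n_median = median(sample_sizes)
    effects = [z / math.sqrt(n_median) for z in z_scores]
    return effects, n_median

# Toy cis-eQTL summary statistics for one gene (illustrative values only)
toy_z = [4.0, -2.0, 1.0]
toy_n = [10000, 2500, 40000]   # per-SNP sample sizes vary after meta-analysis
toy_effects, toy_n_median = marginal_std_effects(toy_z, toy_n)
```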

LD clumping

We performed LD clumping with \(R_{T}\) = 0.99 for all individual methods in both simulation and real studies. Using PRS-CS as an example, we also showed that LD clumping does not affect the GReX imputation accuracy compared to no clumping in real data testing (Supplementary Fig. S12).

LD blocks for lassosum, PRS-CS, and SDPR

LD blocks were determined externally by ldetect60 for lassosum and PRS-CS, and internally for SDPR, whose procedure ensures that SNPs in one LD block do not have non-ignorable correlation (\(R^2\) > 0.1) with SNPs in other blocks.

Simulate GWAS Z-score

Given gene expression \(\mathbf{e}_g\) simulated from the multivariable regression model \(\mathbf{e}_g=\mathbf{X}_g\mathbf{w}+\boldsymbol{\epsilon}_g\) with standardized genotype matrix \(\mathbf{X}_g\) and \(\boldsymbol{\epsilon}_g\sim N\left(0,(1-h_e^2)\mathbf{I}\right)\), we assume GWAS phenotype data of \(n_{gwas}\) samples are simulated from the following linear regression model:

$$\mathbf{y}=h_p\left(\mathbf{X}_g\mathbf{w}\right)+\boldsymbol{\epsilon}_p,\quad \boldsymbol{\epsilon}_p\sim N\left(0,\mathbf{I}\right).$$

Conditioning on the true genetic effect sizes, the GWAS Z-score test statistics of all test SNPs follow a multivariate normal distribution, \(MVN\left(\boldsymbol{\Sigma}_g\mathbf{w}\sqrt{n_{gwas}h_p^2},\boldsymbol{\Sigma}_g\right)\), where \(\boldsymbol{\Sigma}_g\) is the correlation matrix of the standardized genotype \(\mathbf{X}_g\) from test samples, and \(h_p^2\) denotes the amount of phenotypic variance explained by the simulated \(\mathrm{GReX}=\mathbf{X}_g\mathbf{w}\)38. Thus, for a given GWAS sample size, we can generate GWAS Z-score statistic values from this multivariate normal distribution.
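One way to draw from this distribution is through a Cholesky factor of \(\boldsymbol{\Sigma}_g\). The sketch below assumes \(\boldsymbol{\Sigma}_g\) is positive definite (a real LD matrix may need regularization first) and uses hypothetical function names and toy inputs; it is not the simulation code used in this study.

```python
import math
import random

def cholesky(S):
    """Lower-triangular Cholesky factor L of a positive-definite matrix S (S = L L')."""
    m = len(S)
    L = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1):
            acc = S[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(acc) if i == j else acc / L[j][j]
    return L

def simulate_gwas_z(Sigma, w, n_gwas, h2_p, rng):
    """Draw one vector of GWAS Z-scores from MVN(Sigma w sqrt(n_gwas * h2_p), Sigma)."""
    m = len(w)
    scale = math.sqrt(n_gwas * h2_p)
    mean = [scale * sum(Sigma[j][k] * w[k] for k in range(m)) for j in range(m)]
    L = cholesky(Sigma)
    u = [rng.gauss(0.0, 1.0) for _ in range(m)]   # independent standard normals
    # mean + L u has covariance L L' = Sigma
    return [mean[j] + sum(L[j][k] * u[k] for k in range(j + 1)) for j in range(m)]

# Toy setting, made up for illustration: two SNPs in LD, one causal eQTL
toy_sigma = [[1.0, 0.5], [0.5, 1.0]]
toy_w = [0.1, 0.0]
toy_rng = random.Random(1)
toy_z = simulate_gwas_z(toy_sigma, toy_w, n_gwas=10000, h2_p=0.01, rng=toy_rng)
```

In this toy setting the expected Z-scores are 1.0 for the causal SNP and 0.5 for its LD partner, showing how LD propagates signal to non-causal SNPs.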

FUSION using individual-level reference data

To train GReX imputation models by FUSION with individual-level reference data, we trained the Best Linear Unbiased Predictor model61, Elastic-net regression62, LASSO regression32, and the single best eQTL model as implemented in the FUSION tool (see Code availability). Default settings were used to train GReX imputation models by FUSION in our simulation and real studies. LASSO regression was performed only for genes with positively estimated expression heritability. The eQTL weights of the best-trained GReX imputation model were used to conduct TWAS by FUSION.

GTEx V8 dataset

GTEx V8 dataset (dbGaP phs000424.v8.p2) contains comprehensive profiling of WGS genotype data and RNA-sequencing (RNA-seq) transcriptomic data across 54 human tissue types of 838 donors. The GTEx V8 WGS genotype data of all samples were used to construct reference LD in our studies. The GTEx V6 RNA-seq data of whole blood samples were used to train GReX imputation models by FUSION, and the GTEx V8 RNA-seq data of additional whole blood samples (n = 315) were used to test GReX imputation accuracy in our studies. GTEx V8 RNA-seq data of all whole blood samples (n = 574) were also used as reference data for comparing the performance of OTTERS and FUSION.

eQTLGen consortium dataset

The eQTLGen consortium23 dataset was generated based on meta-analysis across 37 individual cohorts (n = 31,684) including GTEx V6 as a sub-cohort. eQTLGen samples consist of 25,482 blood (80.4%) and 6202 peripheral blood mononuclear cell (19.6%) samples. We considered SNPs with minor allele frequency > 0.01, Hardy–Weinberg p value > 0.0001, call rate > 0.95, genotype imputation \({r}^{2}\) > 0.5 and observed in at least two cohorts23. We only considered cis-eQTL (within ±1 MB around gene transcription start sites) with a test sample size > 3000. As a result, we used cis-eQTL summary data of 16,699 genes from eQTLGen to train GReX imputation models for use in OTTERS in this study.

UK Biobank GWAS data of cardiovascular disease

Summary-level GWAS data of cardiovascular disease from UKBB (n = 459,324, case fraction = 0.319)35 were generated by BOLT-LMM based on the Bayesian linear mixed model per SNP63 with assessment center, sex, age, and squared age as covariates. Although BOLT-LMM was derived based on a quantitative trait model, it can be applied to analyze case–control traits and has a well-controlled false-positive rate when the trait is sufficiently balanced with a case fraction ≥10% and samples are of the same ancestry. The tested dichotomous cardiovascular disease phenotype includes a list of sub-phenotypes: hypertension, heart/cardiac problem, peripheral vascular disease, venous thromboembolic disease, stroke, transient ischemic attack (TIA), subdural hemorrhage/hematoma, cerebral aneurysm, high cholesterol, and other venous/lymphatic diseases.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.