## Introduction

Genome-wide association studies (GWASs) have shown that most disease-associated variants reside in noncoding regions1,2,3, raising challenges in biological interpretation and target gene identification4. These findings also lead to the hypothesis that many genetic variants can affect complex traits by regulating gene expression levels, which has motivated large-scale expression quantitative trait loci (eQTL) analyses5,6,7 and transcriptome-wide association studies (TWASs)8,9,10,11,12,13. TWASs integrate expression reference panels (eQTL studies with matched individual-level expressions and genetic data) with complex trait GWAS results to discover gene-trait associations. First, an expression reference panel is used to learn a per-gene expression prediction model by regressing assayed gene expression levels on cis-eQTL genotypes (i.e., single nucleotide polymorphisms (SNPs) within 1 megabase of the gene transcription start site and transcription end site). Second, statistical associations are estimated between predicted gene expression levels for GWAS samples and the trait of interest. TWASs have garnered interest within the human genetics community and have deepened our understanding of the genetic basis of many complex traits14,15.

Despite these encouraging findings, the size of the expression reference panels primarily determines the number of analyzable genes, and hence the power of TWASs. Analyzable genes are defined as genes with satisfactory gene expression prediction models (i.e., prediction accuracy R2 ≥ 0.01). For example, building expression prediction models with Genotype-Tissue Expression (GTEx) project v7p data yielded more than twice as many prediction models (i.e., analyzable genes) as were developed using GTEx v6p data16. For whole blood tissue, the number of analyzable genes increased from 2057 to 6006 solely owing to the increase in the size of the expression reference panel (from 338 samples17 to 369 samples16). Others have also observed that the number of analyzable genes can be significantly increased when using a slightly larger expression reference panel. For example, Zhou et al.13 show that among the 44 overlapping tissues in GTEx, the average number of analyzable genes increased from 4,570 (v6p) to 7,213 (v8) for one popular TWAS method, PrediXcan8, when the average sample size increased from 160 (v6p) to 332 (v8). More importantly, perhaps due to the small sample sizes of available expression reference panels, the current standard practice of TWASs is to only analyze genes with model performance R2 ≥ 0.01 (refs. 8,9,11). This practice may fail to capture genes with low expression heritability but large causal effect sizes on the trait of interest, as suggested in previous literature1. It is of great interest to construct more powerful gene expression prediction models, especially for genes with low expression heritability.

One potential approach to improving the power of TWASs is to combine individual-level expression reference panel data from several consortia or studies, thereby increasing the sample size of the expression reference panel. While this is straightforward, privacy concerns and subject consent can preclude access to individual-level expression reference panel data, making this approach challenging or practically infeasible. On the other hand, one may use summary-level expression panels (often publicly available) with much larger sample sizes to build expression prediction models. However, to date, there is limited exploration of how one can build expression prediction models using a summary-level expression panel.

In this work, we introduce the Summary-level Unified Method for Modeling Integrated Transcriptome (SUMMIT), a method that integrates summary-level expression reference panel data, derived from much larger sample sizes, with trait GWAS results to identify associated genes for the trait of interest. Specifically, we build gene expression prediction models for blood based on the eQTL summary-level data generated by the eQTLGen consortium6. To date, the eQTLGen consortium has conducted the largest meta-analysis involving 31,684 blood samples from 37 cohorts6, and the corresponding eQTL summary-level data have been made publicly available. Through simulation studies and analyses of GWAS summary statistics from 24 complex traits, we show that SUMMIT improves the accuracy of expression prediction in blood, successfully builds expression prediction models for genes with low expression heritability, and outperforms benchmark methods for identifying risk genes. Additionally, we conduct a case study on COVID-19 severity and identify 11 putatively causal genes.

## Results

### SUMMIT overview

We develop SUMMIT, which extends the conventional TWAS methods8,9,10,11,12 by leveraging eQTL summary-level data to predict expression levels. SUMMIT consists of three main steps. First, for each gene, we train expression prediction models using a penalized regression framework with eQTL summary-level data (e.g., eQTLGen6 with sample size of 31,684). Next, we test associations between the predicted gene expression levels and the trait of interest for each fitted expression prediction model with satisfactory performance (e.g., with R2 ≥ 0.005). Finally, as p-values from different gene expression prediction models can be correlated, we apply the Cauchy combination test18,19 to aggregate p-values from the fitted prediction models; the combined p-value quantifies the overall gene-trait association. The Cauchy combination test is a computationally efficient p-value combination method that provides an accurate p-value approximation for highly significant results (which are of interest) and does not require the correlation structure among the combined p-values to be estimated.
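The aggregation step can be illustrated with a minimal sketch of the Cauchy combination test. Equal weights across prediction models are assumed here, and the function name `cauchy_combine` and the small-value cutoffs are illustrative choices rather than details taken from the SUMMIT software.

```python
import math

def cauchy_combine(pvalues, weights=None):
    """Combine p-values with the Cauchy combination test.

    Assumption: equal weights unless `weights` is given (normalized to sum to 1).
    """
    if not pvalues:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p < 1.0) for p in pvalues):
        raise ValueError("p-values must lie strictly between 0 and 1")
    k = len(pvalues)
    if weights is None:
        weights = [1.0 / k] * k
    total = sum(weights)
    weights = [w / total for w in weights]

    stat = 0.0
    for w, p in zip(weights, pvalues):
        if p < 1e-15:
            # tan((0.5 - p) * pi) is approximately 1 / (p * pi) for tiny p;
            # this avoids losing p when subtracting it from 0.5.
            stat += w / (p * math.pi)
        else:
            stat += w * math.tan((0.5 - p) * math.pi)

    # Tail probability of a standard Cauchy variable at `stat`.
    if stat > 1e15:
        return 1.0 / (stat * math.pi)
    return 0.5 - math.atan(stat) / math.pi
```

For a single model the combined p-value equals the input p-value, and one very small p-value among several unremarkable ones still yields a small combined p-value, which is the behavior the test relies on.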

### Simulation results

In the simulation studies, we first evaluated the accuracy of the expression imputation models generated by SUMMIT and benchmark methods and the corresponding statistical power. Next, we studied the impact of sample size on expression prediction accuracy and TWAS power. We verified that SUMMIT recovered the information of the individual-level expression reference panel from summary-level data, and the improvement in expression prediction accuracy was adequately translated into a higher power of subsequent TWASs (Fig. 1).

First, we observed that SUMMIT performed better than two widely used competing methods, TWAS-fusion and PrediXcan, yielding a higher average imputation R2 with respect to different gene expression heritability values ($${h}_{e}^{2}$$) and proportions of causal SNPs (pcausal) (Fig. 2a). When $${h}_{e}^{2}=0.01$$ and pcausal = 0.2, the average imputation R2 of 1000 replications was estimated to be 0.693% by SUMMIT, showing 1735% improvement compared with PrediXcan and 305% improvement compared with TWAS-fusion. Importantly, such improvements in the expression prediction models result in consistently higher TWAS power under different sparsity levels (Fig. 2b). As a note, TWAS power is defined as the discovery rate of associations between predicted expression levels and phenotypic outcomes using simulated independent GWAS data. When $${h}_{e}^{2}=0.01$$ and pcausal = 0.2, the power of SUMMIT was 0.992 while those of PrediXcan and TWAS-fusion were 0.028 and 0.201, respectively. In addition, we observed that SUMMIT achieved higher average imputation R2 than Lassosum, a pipeline that is also capable of leveraging summary-level data.

The current standard practice of TWASs is to only analyze genes with imputation R2 ≥ 0.01 and not consider genes with lower prediction performance (i.e., genes with imputation R2 between 0.005 and 0.01). However, such genes may have larger causal effect sizes on the trait of interest1. To evaluate the performance of different methods under low heritability, we simulated data with $${h}_{e}^{2}=0.005$$. Figure 2a shows that SUMMIT achieved satisfactory performance under these scenarios. When $${h}_{e}^{2}=0.005$$ and pcausal = 0.2, SUMMIT estimated the average imputation R2 at 0.29%, which was much higher than the values yielded by TWAS-fusion (0.057%; 401% improvement) and PrediXcan (0.011%; 2460% improvement). This is because SUMMIT leverages summary-level eQTL data with a larger sample size. Furthermore, SUMMIT also achieved higher average imputation R2 than Lassosum because SUMMIT leverages the genetic distance to estimate the LD matrix and combines results from multiple penalties.

Next, we evaluated the impact of the sample size of the expression reference panel (Supplementary Fig. 2). As expected, the imputation R2 increased as the sample size increased. For the setting of $${h}_{e}^{2}=0.05$$ and pcausal = 0.2, when the sample size increased from 300 to 31,684, the average imputation R2 increased from 0 to 0.0474, highlighting the advantages of using a larger expression reference panel. Importantly, the imputation models became more stable (i.e., decreased in variance) as the sample size increased. Additionally, we confirmed that the imputation results from SUMMIT (average imputation R2: 0.0469) were highly similar to those from analyses of individual-level data (average imputation R2: 0.0474), confirming that SUMMIT can capture individual-level information from summary-level data.

Finally, we conducted confirmatory simulation studies (Fig. 2c) to verify that the gains in TWAS power came from an improved expression prediction accuracy. We varied N within (300, 600, 3000, 10,000, 31,684), and $${h}_{e}^{2}$$ within (0.005, 0.01, 0.1), and we set $${h}_{p}^{2}=0.2$$ and pcausal = 0.05. We observed that the TWAS power and prediction accuracy were highly correlated. As the sample size of the expression reference panel increased, the expression prediction models became more accurate, leading to higher TWAS power. Notably, due to the setup (i.e., the two-sample framework) of the simulations, the gains in the sample size of the expression reference panel could only interact with the TWAS power through better prediction models. The results were similar for pcausal = 0.01 (Supplementary Fig. 7).

To assess the potential impact of genetic architecture, we considered two additional randomly selected genes, and the results were similar (Supplementary Figs. 3–6). Furthermore, we ran the simulations 5,000,000 times (5000 runs for each of 1000 computed weights) under the null hypothesis to evaluate the Type 1 error rates, confirming that all methods maintained well-controlled Type 1 error rates (Supplementary Fig. 8).

In summary, these results demonstrate the potential of SUMMIT for building expression prediction models and conducting subsequent association studies, especially for genes with low expression heritability.

### SUMMIT improves the expression imputation accuracy

We compared the accuracy of the expression prediction models developed using SUMMIT and five benchmark methods (Lassosum, MR-JTI, TWAS-fusion, PrediXcan, and UTMOST) for whole blood tissue. We trained the SUMMIT and Lassosum models with eQTLGen summary data, and the other four benchmark methods were trained with GTEx data. For a fair comparison, we compared the number of genes with estimated R2 ≥ 0.01 and only focused on genes that appear in the eQTLGen summary data. The R2 values for MR-JTI, TWAS-fusion, PrediXcan, and UTMOST were based on cross validation and were provided by the original authors, and the R2 values for SUMMIT and Lassosum were calculated based on the additional subjects in the GTEx version 8 data, who were not included in the meta-analysis of eQTLGen and thus can be viewed as an independent external dataset. Compared with the benchmark methods, Lassosum (8249 genes), MR-JTI (9576 genes), TWAS-fusion (5411 genes), PrediXcan (7512 genes), and UTMOST (7236 genes), SUMMIT developed satisfactory prediction models for more genes (9749 genes with R2 ≥ 0.01). Importantly, SUMMIT could build prediction models for the majority (8936 out of 12,230; 73.1%) of genes that could be analyzed by any of the benchmark methods (Fig. 3a). In addition, SUMMIT was able to establish prediction models for 1836 additional genes that were ignored by benchmark methods that leveraged individual-level data, showing consistent improvement by using a large training dataset. Furthermore, compared with Lassosum, SUMMIT achieved marginally higher prediction accuracy in different quantiles (T ≈ 0.017 and p ≈ 0.077, by one-sided Kolmogorov-Smirnov test). Compared with the other four benchmark methods, SUMMIT achieved significantly higher prediction accuracy in different quantiles (MR-JTI: T ≈ 0.080 and p < 2.2 × 10−16; PrediXcan: T ≈ 0.089 and p < 2.2 × 10−16; TWAS-fusion: T ≈ 0.240 and p < 2.2 × 10−16; and UTMOST: T ≈ 0.076 and p < 2.2 × 10−16; all by one-sided Kolmogorov-Smirnov test).

### SUMMIT identifies more associations than competing methods

To evaluate the performance in identifying significant associations, we applied SUMMIT to the GWAS summary statistics of 24 traits (Ntotal ≈ 5,600,000 without adjusting for sample overlap across studies, Supplementary Data 1) and compared the results with those of the benchmark methods (for all genes with R2 ≥ 0.01). The association results for SUMMIT are summarized in Supplementary Data 1. While SUMMIT analyzed all genes with R2 ≥ 0.005 and applied Bonferroni correction accordingly, we focused on the genes with R2 ≥ 0.01 for a fair comparison (Fig. 3b). Compared with the benchmark methods, SUMMIT identified more associations for each trait analyzed, showing 50% improvement compared with Lassosum (T = 334.5 and p ≈ 0.013; by the one-sided paired Wilcoxon rank test), 69% improvement compared with MR-JTI (T = 349 and p ≈ 0.005; one-sided), 108% improvement compared with TWAS-fusion (T = 362 and p ≈ 0.002; one-sided), 91% improvement compared with PrediXcan (T = 335 and p ≈ 0.005; one-sided), and 63% improvement compared with UTMOST (T = 343 and p ≈ 0.008; one-sided).

Because different methods test different sets of genes, we also compared the methods over a common set of 3980 genes that could be analyzed by all the methods (Fig. 3c). Again, SUMMIT maintained an edge over the competing methods, showing 16% improvement compared with the second-best-performing method in terms of association pairs identified, Lassosum.

Importantly, SUMMIT was applicable in analyzing genes with low expression heritability (0.005 ≤ R2 < 0.01), which have been largely ignored by benchmark methods. Out of the 11,585 genes with R2 ≥ 0.005, 1836 had a testing R2 between 0.005 and 0.01. For these 1836 genes, we identified 659 gene-trait associations (Fig. 3b). In comparison, for the remaining 9749 genes, we identified 3339 gene-trait associations, indicating that genes with relatively smaller R2 may be as important as those with larger R2. This finding is in line with the fact that genes with low expression heritability have substantially larger causal effect sizes on complex traits1.
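As a rough worked comparison (simple arithmetic on the counts reported above, not an additional analysis), the number of identified gene-trait associations per analyzable gene is similar in the two groups:

$$\frac{659}{1836}\approx 0.359\qquad \text{versus}\qquad \frac{3339}{9749}\approx 0.342 .$$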

### SUMMIT achieves higher predictive power for identifying "silver standard" genes

We compared different methods in identifying the likely causal genes that mediate the associations between GWAS loci and traits of interest. Following Barbeira et al.20, we used a set of 1,258 likely causal gene-trait pairs curated by using the Online Mendelian Inheritance in Man (OMIM) database21 and a set of 29 gene-trait pairs based on rare variant results from exome-wide association studies22,23,24, which provide orthogonal information that is independent of the GWAS results. These genes are counted as “silver standard” genes. Both sets of gene-trait pairs can be found in Supplementary Data 2.

Figure 3d shows that SUMMIT yielded good sensitivity and specificity for identifying the silver standard genes and achieved the highest AUC (0.777) among all the methods compared. All methods achieved relatively good sensitivity and specificity, showcasing the potential predictive ability of TWAS-type methods to prioritize putative causal genes. At a Bonferroni-corrected significance threshold of 5.21 × 10−6, SUMMIT identified 69 genes in the silver standard gene list, whereas Lassosum, the second-best-performing method in terms of AUC, identified 60 (15% improvement). Again, perhaps due to the increase in the sample size of the expression reference panel, the methods based on the summary-level expression reference panel (i.e., SUMMIT and Lassosum) achieved a higher AUC than methods based on the individual-level expression reference panel. In summary, perhaps due to the improvement in the expression prediction models, SUMMIT achieved higher predictive power in terms of prioritizing likely causal genes.

As a note, including imputation models with testing R2 < 0.01 increased the burden of multiple testing. To study this, we evaluated SUMMIT’s performance for genes with R2 ≥ 0.01 under a less stringent p-value threshold (as models with R2 < 0.01 were excluded). We confirmed that the differences in the p-value threshold had only a negligible impact on SUMMIT in our real data analyses (Supplementary Fig. 9). SUMMIT identified 3399 gene-trait associations for genes with R2 ≥ 0.01 using the less stringent threshold and identified 3339 gene-trait associations for genes with R2 ≥ 0.01 when using the more stringent threshold.

### SUMMIT identifies risk genes for COVID-19 severity

We leveraged GWAS summary data from the COVID-19 host genetics initiative (HGI)25 to identify risk genes for COVID-19 severity. Using SUMMIT, we identified significant associations of 17 genes with COVID-19 severity (B2 outcome, which compares patients hospitalized with COVID-19 and controls) at a Bonferroni-corrected significance threshold of 4.33 × 10−6 (Fig. 4). In comparison, the competing methods PrediXcan, TWAS-fusion, UTMOST, and MR-JTI identified 1, 6, 2, and 1 significant genes, respectively (Supplementary Table 1). Of the 17 genes identified by SUMMIT, 11 were prioritized by the fine-mapping method FOGS (Table 1). We further validated these 11 genes using the A2 outcome, which compares very severe confirmed respiratory COVID-19 cases with population controls. Of them, 10 were validated at p < 0.05.

For some of these 11 putative causal genes related to COVID-19 severity, there is already prior knowledge supporting their potential links with COVID-19. To elaborate, SNP rs1015164, which lies near the antisense transcribed sequence RP11-24F11.2, has been associated with HIV set-point viral load26,27 and CD4+ T-cell counts. Such chemokine receptor-ligand interactions mediating the traffic of inflammatory cells and pathogen-associated immune responses could plausibly be related to COVID-19 severity. For FLT1P1, its expression has been reported to be positively associated with predicted neutrophil count28. This may mediate the genetic link between this gene and COVID-19 severity. Another identified gene, CCR5, is known to play a role in immune cell migration and inflammation. A study found that CCR5 blockade in critical COVID-19 patients induced decreased inflammatory cytokines, increased CD8 T cells, and decreased SARS-CoV-2 RNA in plasma29. For OAS1, both predicted and measured protein levels are inversely associated with COVID-19 susceptibility and severity, which is consistent with the current study’s findings30. Two of the other genes, namely, OAS3 and IFNAR2, were identified in our earlier work of COVID-19 TWASs using complementary methods and designs31.

## Discussion

By leveraging the summary-level expression reference panel with a much larger sample size, our method SUMMIT improved the prediction accuracy of built expression prediction models, which in turn increased the power of identifying risk genes for complex traits.

Through simulations and analyses of the GWAS results for 24 traits, we demonstrated the performance gain of SUMMIT over existing methods. Briefly, we demonstrated that SUMMIT improved the expression imputation accuracy (built more expression prediction models with R2 ≥ 0.01), identified more associations, and achieved higher power in identifying “silver standard” genes. Importantly, SUMMIT was applicable in analyzing genes with low expression heritability (R2 between 0.005 and 0.01), which have larger causal effect sizes on complex traits1 but have not been well captured by existing methods.

SUMMIT can be viewed as a type of gene-based Mendelian randomization (MR) and can provide valid causal interpretations when all genetic variants used in the expression prediction models (with nonzero weights) are valid instrumental variables32,33,34. However, with the widespread horizontal pleiotropy of genetic variables35, valid instrumental variable assumptions may be violated, and thus, we recommend that practitioners use multiple complementary methods jointly to identify likely causal genes. For example, we can apply fine-mapping approaches such as FOCUS36 and FOGS37 to further prioritize likely causal genes by modeling the linkage disequilibrium and correlation among TWAS signals. In addition to fine mapping, it can be useful to complement the TWAS/MR-type approaches with colocalization (in the sense of ref. 38), which aims to identify causal genetic variants for both gene expression and complex traits. Notably, the existence of colocalized genetic variants (especially those in the cis-acting region) implies that the same variants are responsible for variations in both expression and complex traits, indicating that a causal link between expression and complex traits may exist.

Both SUMMIT and Lassosum39 are motivated by the recent progress in the estimation of polygenic risk scores using summary-level GWAS data40,41. As a result, both Lassosum and SUMMIT construct the primary loss function using penalized regression. However, Lassosum and SUMMIT are different, and SUMMIT is tailored to eQTL summary statistics in the following respects. First, SUMMIT adds an additional step to estimate the LD matrix by utilizing genetic distance information. Second, Lassosum uses only the LASSO penalty, while SUMMIT considers five different types of penalties. As a result, we have confirmed that SUMMIT achieves much better performance in terms of prediction accuracy and subsequent statistical power in both simulations (Fig. 2) and real data analyses (Fig. 3). Additionally, SUMMIT shares similarities with CoMM-S47 as they both use summary-level eQTL data to identify gene-trait associations.

There are several limitations of the current study. First, the summary data of eQTLGen are for whole blood of subjects of European ancestry; thus, the built gene expression prediction models would be applicable only to blood tissue of European ancestry subjects. While SUMMIT can be applied equally to other tissues and ancestry, the corresponding summary eQTL data would be needed for such extensions. Second, several TWAS methods such as UTMOST11 and MR-JTI13 have been proposed to leverage expressions from other tissues or functional annotations to improve the prediction accuracy of expression prediction models. Functional annotation databases such as FAVOR42 may also provide prior information to downweight SNPs that may not contribute to gene expression. We expect that the number of analyzable genes could be increased further if we leveraged information from either other tissues or functional annotations. Third, similar to most existing TWAS methods, the results of SUMMIT imply causality only when valid instrumental variable assumptions are satisfied. A partial solution is to apply fine-mapping to prioritize likely causal genes. However, the robustness of SUMMIT would be significantly improved if we could relax these stringent valid instrumental variable assumptions. We leave this exciting topic for future research.

SUMMIT43 integrates summary-level eQTL data with GWAS summary statistics via advanced statistical methods. When combined with fine-mapping and functional validations, its findings may yield insights into the genetic basis of diseases and benefit the development of new therapeutic strategies.

## Methods

### Penalized regression model for expression prediction

Consider the following linear regression model for estimating the genetically regulated components of gene expression:

$$\mathbf{Y}=\sum_{j=1}^{p}w_{j}\mathbf{X}_{j}+\boldsymbol{\epsilon},$$
(1)

where Y is the N-dimensional vector of gene expression levels of a gene of interest (corrected for important covariates such as age, sex, and principal components of genotypes), $$\mathbf{X}=(\mathbf{X}_{1}^{\prime},\cdots,\mathbf{X}_{p}^{\prime})^{\prime}$$ is the N × p standardized genotype matrix of p cis-SNPs around the gene (within 1 Mb of the gene transcription start site and end site), the p-dimensional vector $$\mathbf{w}=(w_{1},\cdots,w_{p})^{\prime}$$ is the vector of cis-eQTL effect sizes, and ϵ is random noise with a mean of zero.

We estimate w using a penalized regression framework. Specifically, the objective function is

$$f(\mathbf{w})=\frac{(\mathbf{Y}-\mathbf{X}\mathbf{w})^{\prime}(\mathbf{Y}-\mathbf{X}\mathbf{w})}{N}+J_{\lambda}(\mathbf{w})=\frac{\mathbf{Y}^{\prime}\mathbf{Y}}{N}+\mathbf{w}^{\prime}\left(\frac{\mathbf{X}^{\prime}\mathbf{X}}{N}\right)\mathbf{w}-2\mathbf{w}^{\prime}\frac{\mathbf{X}^{\prime}\mathbf{Y}}{N}+J_{\lambda}(\mathbf{w}),$$
(2)

where Jλ(·) is a penalty term. Since the performance of different penalties may vary under different genetic architectures, we consider several penalties, including LASSO44, elastic net45, the minimax concave penalty (MCP)46, the smoothly clipped absolute deviation (SCAD)47, and MNet48. Note that the objective function (Equation (2)) depends on the data only through the marginal statistics $$\mathbf{X}^{\prime}\mathbf{Y}/N$$ and the linkage disequilibrium (LD) matrix $$\mathbf{X}^{\prime}\mathbf{X}/N$$ (plus a term that does not involve w), and does not require the individual-level data to be observed and stored. This allows us to build expression prediction models using eQTL summary-level data, which are computed using a much larger sample size. That is, we rewrite the objective function as

$$f(\mathbf{w})=\frac{\mathbf{Y}^{\prime}\mathbf{Y}}{N}+\mathbf{w}^{\prime}\mathbf{R}\mathbf{w}-2\mathbf{w}^{\prime}\mathbf{r}+J_{\lambda}(\mathbf{w}),$$
(3)

where $${{{{{{{\bf{r}}}}}}}}={{{{{{{\bf{X}}}}}}}}^{\prime} {{{{{{{\bf{Y}}}}}}}}/N=({r}_{1},\cdots \,,{r}_{p})^{\prime}$$ is a p-dimensional vector of standardized marginal effect size for cis-SNPs (i.e., correlation between cis-SNPs and gene expression levels), and $${{{{{{{\bf{R}}}}}}}}={{{{{{{\bf{X}}}}}}}}^{\prime} {{{{{{{\bf{X}}}}}}}}/N$$ is the LD matrix of the cis-SNPs. We use the z-scores provided in the summary-level eQTL dataset to estimate r (denoted by $$\tilde{{{{{{{{\bf{r}}}}}}}}}$$) and use a shrinkage estimator (illustrated below) with an LD reference panel (such as that of the 1000 Genomes Project49) to estimate R (denoted by $$\tilde{{{{{{{{\bf{R}}}}}}}}}$$). We add an L2 penalty term $$\theta {{{{{{{\bf{w}}}}}}}}^{\prime} {{{{{{{\bf{w}}}}}}}}$$ (where θ ≥ 0) to the objective function, which ensures a unique solution upon optimization. Note that $${{{{{{{\bf{Y}}}}}}}}^{\prime} {{{{{{{\bf{Y}}}}}}}}/N$$ does not depend on w and can be ignored when optimizing f. Thus, the final objective function that we optimize can be written as,

$$\tilde{f}(\mathbf{w})=\mathbf{w}^{\prime}\tilde{\mathbf{R}}\mathbf{w}-2\mathbf{w}^{\prime}\tilde{\mathbf{r}}+\theta \mathbf{w}^{\prime}\mathbf{w}+J_{\lambda}(\mathbf{w}).$$
(4)
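To make the role of the summary statistics concrete, the following minimal sketch checks numerically that the individual-level loss, after dropping the constant term Y′Y/N, equals the summary-level form w′Rw − 2w′r. The toy data and all function names are hypothetical, and the penalty Jλ is omitted because it does not depend on the data.

```python
def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def individual_loss(X, Y, w):
    """(Y - Xw)'(Y - Xw) / N, computed from individual-level data."""
    resid = [y - fit for y, fit in zip(Y, matvec(X, w))]
    return dot(resid, resid) / len(Y)

def summary_stats(X, Y):
    """Return r = X'Y / N and R = X'X / N."""
    n, p = len(Y), len(X[0])
    cols = [[row[j] for row in X] for j in range(p)]
    r = [dot(col, Y) / n for col in cols]
    R = [[dot(ci, cj) / n for cj in cols] for ci in cols]
    return r, R

def summary_loss(R, r, w, theta=0.0):
    """w'Rw - 2w'r + theta * w'w (Equation (4) without the penalty term)."""
    return dot(w, matvec(R, w)) - 2.0 * dot(w, r) + theta * dot(w, w)

# Hypothetical toy data: N = 4 samples, p = 2 SNPs.
X = [[1.0, -1.0], [-1.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
Y = [0.9, -1.1, 0.4, -0.2]
w = [0.3, -0.2]

r, R = summary_stats(X, Y)
lhs = individual_loss(X, Y, w) - dot(Y, Y) / len(Y)  # drop the constant Y'Y/N
rhs = summary_loss(R, r, w)                          # uses only r and R
```

Here `lhs` and `rhs` agree up to floating-point error, so any minimizer over w is the same whether it is computed from (X, Y) or from (r, R).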

The estimates $$\hat{{{{{{{{\bf{w}}}}}}}}}$$ can be obtained by the coordinate descent algorithm50, which solves the univariate penalized regression problem sequentially and iteratively. Briefly, suppose that $$({\hat{w}}_{1}^{(t)},\ldots,{\hat{w}}_{p}^{(t)})$$ are the coefficients in the t-th iteration of the coordinate descent algorithm. Define $${z}_{j}^{(t)}={\tilde{r}}_{j}-{\sum }_{l\ne j}{\tilde{R}}_{jl}{\hat{w}}_{l}^{(t)}.$$ When Jλ(w) is the LASSO penalty ($${J}_{\lambda }({{{{{{{\bf{w}}}}}}}})=\mathop{\sum }\nolimits_{j=1}^{p}\lambda|{w}_{j}|$$), we can update wj as

$$\hat{w}_{j}^{(t+1)}=\left\{\begin{array}{ll}\dfrac{z_{j}^{(t)}-\lambda }{1+\theta }&z_{j}^{(t)} > \lambda \\ \dfrac{z_{j}^{(t)}+\lambda }{1+\theta }&z_{j}^{(t)} < -\lambda \\ 0&\mathrm{otherwise}\end{array}\right.$$
(5)

for j = 1, …, p and t = 0, 1, … .

The convergence properties of the coordinate descent algorithm guarantee a local minimum for $$\hat{\mathbf{w}}$$50. We give the details of the optimization, including the choices of the initial starting values, λ, and θ, and the updating formulas for the other penalties, in Supplementary Note 1.
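The LASSO update in Equation (5) can be sketched as a short coordinate descent loop. This is an illustrative implementation under stated assumptions, not the SUMMIT code: the weights start at zero, each coordinate update uses the most recently updated values of the other coordinates (cyclic updates), the diagonal of the LD matrix is assumed to equal 1 as in Equation (5), and the function names, iteration limit, and tolerance are hypothetical.

```python
def soft_threshold_update(z, lam, theta):
    """Closed-form coordinate update of Equation (5)."""
    if z > lam:
        return (z - lam) / (1.0 + theta)
    if z < -lam:
        return (z + lam) / (1.0 + theta)
    return 0.0

def lasso_coordinate_descent(R, r, lam, theta=0.0, max_iter=1000, tol=1e-10):
    """Minimize Equation (4) with the LASSO penalty from r (p-vector) and R (p x p)."""
    p = len(r)
    w = [0.0] * p  # assumed starting values (the text defers this choice to its supplement)
    for _ in range(max_iter):
        max_change = 0.0
        for j in range(p):
            # Partial residual correlation z_j = r_j - sum_{l != j} R_jl * w_l
            z = r[j] - sum(R[j][l] * w[l] for l in range(p) if l != j)
            new_wj = soft_threshold_update(z, lam, theta)
            max_change = max(max_change, abs(new_wj - w[j]))
            w[j] = new_wj
        if max_change < tol:  # stop once a full sweep barely changes w
            break
    return w
```

With an identity LD matrix the loop reduces to soft-thresholding each marginal correlation once; with correlated SNPs the sweeps are repeated until the weights stabilize.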

### Estimating the standardized marginal effect size $$\tilde{r}$$ and LD matrix $$\tilde{R}$$

The standardized marginal effect size rj is often not provided in the eQTL summary-level data, but it can be approximated well by $$\tilde{r}_{j}=Z_{j}/\sqrt{N_{j}-1+Z_{j}^{2}}$$, where Zj and Nj are the z-score and sample size for cis-SNP j, respectively. The eQTL summary-level data combine the results from multiple cohorts, and thus the sample size for each SNP may vary. To obtain an unbiased estimate, we use the SNP-specific sample size Nj instead of the largest sample size (cohort size)51.
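The conversion from z-scores to standardized effect sizes is a one-line formula; a minimal sketch follows. The function name and the example inputs are hypothetical.

```python
import math

def standardized_effect(z, n):
    """Approximate the SNP-expression correlation from a z-score and SNP-specific sample size."""
    if n <= 1:
        raise ValueError("sample size must exceed 1")
    return z / math.sqrt(n - 1 + z * z)

# Hypothetical z-scores with SNP-specific sample sizes.
z_scores = [2.0, -5.5, 0.0]
sample_sizes = [101, 25000, 31684]
r_tilde = [standardized_effect(z, n) for z, n in zip(z_scores, sample_sizes)]
```

Because the denominator always exceeds |Z|, the result stays strictly inside (−1, 1), as a correlation should.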

The objective function (4) involves an estimated LD correlation matrix $$\tilde{{{{{{{{\bf{R}}}}}}}}}$$. Instead of using the sample correlation matrix estimated from a reference panel such as 1000 Genomes Project49 data, we use the shrinkage estimator of the LD matrix52,53,54, which stabilizes the results by shrinking the off-diagonal entries toward zero. Specifically, we first calculate the sample LD correlation matrix from a reference panel. Each entry in the LD correlation matrix is then multiplied by the factor $$\exp (-\frac{2{N}_{e}{c}_{ij}}{m})$$, where Ne is the effective population size, m is the sample size of the data for generating the genetic map, and cij is the genetic distance between sites i and j in centimorgans. The entries are set to zero if the factor $$\exp (-\frac{2{N}_{e}{c}_{ij}}{m})$$ is less than a prespecified threshold c. Following others52,53, we use the genetic distance generated from 1000 Genomes OMNI arrays with Ne = 11,400 and m = 183 and the prespecified threshold c is set to 1 × 10−3.
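The shrinkage step described above can be sketched as follows. The sketch assumes that the genetic distance cij is the absolute difference between the SNPs' genetic map positions in centimorgans, and it uses the parameter values quoted in the text as defaults; the function and argument names are hypothetical.

```python
import math

def shrink_ld(R, cm_positions, n_e=11400, m=183, cutoff=1e-3):
    """Shrink a reference-panel LD correlation matrix toward zero off the diagonal.

    R            : p x p sample LD correlation matrix (list of rows)
    cm_positions : genetic map position of each SNP in centimorgans (assumed input)
    """
    p = len(cm_positions)
    shrunk = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            c_ij = abs(cm_positions[i] - cm_positions[j])
            factor = math.exp(-2.0 * n_e * c_ij / m)
            # Entries whose factor falls below the cutoff are set to zero.
            if factor >= cutoff:
                shrunk[i][j] = R[i][j] * factor
    return shrunk

# Hypothetical example: three SNPs, the third far from the first two.
R = [[1.0, 0.8, 0.6], [0.8, 1.0, 0.7], [0.6, 0.7, 1.0]]
S = shrink_ld(R, cm_positions=[0.0, 0.01, 0.2])
```

The diagonal is unchanged because the distance of a SNP to itself is zero, nearby pairs are damped, and distant pairs are zeroed, which yields a sparser and more stable matrix.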

### Model training and evaluation

We trained our expression prediction models by using the cis-eQTL summary-level data from eQTLGen6, which consist of effect sizes of >11 million SNPs from 31,684 blood samples. Following PrediXcan8, SNPs in the vicinity of the given gene (within 1 Mbp of the gene transcription start site and end site) were used as the cis-genotype information. Furthermore, we filtered out all SNPs with minor allele frequency (MAF) < 0.01 and those that were nonbiallelic, ambiguous or not included in the HapMap3 SNP set8.

We used both genotype and gene expression data from the GTEx project (version V7, dbGaP Accession number phs000424.v7.p2, https://www.gtexportal.org/home/datasets)55 to select the tuning parameters. The processed gene expression values in whole blood (N = 369) were downloaded from the GTEx website. Briefly, the RPKMs in each sample were standardized and normalized by quantile transformation. The expression for each gene was further adjusted for sex, genotyping platform, 35 PEER factors, and three genotype-based principal components (PCs), and the residuals were used as the processed expression levels. We used the squared correlation between the predicted and observed expressions (that is, R2) to select the best tuning parameters. Notably, the subjects in GTEx v6 (N = 336; 1.1%) were meta-analyzed in eQTLGen6, and this overlap may result in suboptimal tuning parameters.
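The tuning criterion, the squared correlation between predicted and observed expression, can be sketched as below. The candidate weight vectors stand in for models fitted under different tuning parameters; the function names, the dictionary layout, and the toy data are hypothetical.

```python
def squared_correlation(pred, obs):
    """Squared Pearson correlation (R2) between predicted and observed expression."""
    n = len(obs)
    mean_p, mean_o = sum(pred) / n, sum(obs) / n
    sxy = sum((a - mean_p) * (b - mean_o) for a, b in zip(pred, obs))
    sxx = sum((a - mean_p) ** 2 for a in pred)
    syy = sum((b - mean_o) ** 2 for b in obs)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # a constant prediction carries no information
    return sxy * sxy / (sxx * syy)

def select_best_model(candidates, genotypes, observed):
    """Pick the candidate weight vector with the highest tuning-set R2.

    candidates : dict mapping a tuning-parameter label to a weight vector
    genotypes  : tuning-set genotype matrix (list of rows, one per subject)
    """
    best_label, best_r2 = None, -1.0
    for label, w in candidates.items():
        pred = [sum(g * wj for g, wj in zip(row, w)) for row in genotypes]
        r2 = squared_correlation(pred, observed)
        if r2 > best_r2:
            best_label, best_r2 = label, r2
    return best_label, best_r2

# Hypothetical tuning set: 4 subjects, 2 SNPs.
genotypes = [[0, 1], [1, 1], [2, 0], [1, 2]]
observed = [1.0, 3.0, 5.0, 3.0]
best = select_best_model({"a": [1.0, 0.0], "b": [0.0, 1.0]}, genotypes, observed)
```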

We used independent data from subjects who were included in GTEx v8 but not in GTEx v7 (N = 309) for external validation. Notably, these subjects were not meta-analyzed in eQTLGen and thus can be viewed as an independent dataset for external validation. Because genes with low expression heritability have been reported to have substantially larger causal effect sizes on complex traits1, we selected models with $$R^2$$ ≥ 0.005 instead of the commonly used criterion of $$R^2$$ ≥ 0.01. The threshold ($$R^2$$ ≥ 0.005) was justified by an informal theoretical investigation using a well-established statistical theory by Cramer56. Briefly, assuming a standard multiple regression model, Cramer56 showed that under the null hypothesis of β = 0, $$R^2$$ follows a beta distribution, i.e., $$R^{2} \sim \mathcal{B}((p-1)/2,(n-p)/2)$$. In SUMMIT, we used the eQTLGen summary-level data with n = 31,684, and the median number of SNPs with nonzero weights for each gene was p = 34, leading to $$R^{2} \sim \mathcal{B}(16.5,15825)$$ under the null hypothesis. The corresponding rejection region is approximately (0.00263, 1] under the transcriptome-wide significance level of α = 0.05/16,884 ≈ $$3.0\times 10^{-6}$$. The above derivation, however, ignores the impact of regularization induced by penalized regression. To account for the potential impact of regularization, we propose using a slightly conservative threshold of $$R^2$$ ≥ 0.005 for SUMMIT. Formally accounting for the regularization bias is nontrivial and requires additional assumptions, and we leave this topic for future research.
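The boundary of the rejection region can be checked numerically. The following is a minimal sketch that uses only the Python standard library: it integrates the $$\mathcal{B}(16.5, 15825)$$ density with Simpson's rule and finds the upper-tail quantile by bisection. The function names, the integration width, and the bisection bracket are illustrative assumptions tuned to this particular null distribution, not a general-purpose beta quantile routine.

```python
import math

def beta_upper_tail(x, a, b, width=0.01, steps=10000):
    """Approximate P(R2 > x) for R2 ~ Beta(a, b) by Simpson's rule on [x, x + width].

    Assumes the density is negligible beyond x + width, which holds here
    because the null distribution is concentrated near 0.001.
    """
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    h = width / steps
    total = 0.0
    for k in range(steps + 1):
        t = x + k * h
        log_pdf = log_norm + (a - 1.0) * math.log(t) + (b - 1.0) * math.log1p(-t)
        weight = 1.0 if k in (0, steps) else (4.0 if k % 2 else 2.0)
        total += weight * math.exp(log_pdf)
    return total * h / 3.0

def null_r2_threshold(n, p, alpha, lo=0.001, hi=0.01):
    """Bisection for the value x with P(R2 > x) = alpha under the null."""
    a, b = (p - 1) / 2.0, (n - p) / 2.0
    for _ in range(40):
        mid = 0.5 * (lo + hi)
        if beta_upper_tail(mid, a, b) > alpha:
            lo = mid   # tail still too heavy: move the threshold up
        else:
            hi = mid
    return 0.5 * (lo + hi)

alpha = 0.05 / 16884
threshold = null_r2_threshold(n=31684, p=34, alpha=alpha)
print(f"alpha = {alpha:.2e}, null R2 threshold = {threshold:.5f}")
```

Under these assumptions the computed boundary falls well below 0.005, which is consistent with treating $$R^2$$ ≥ 0.005 as a conservative cutoff.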

### Association analyses with individual expression prediction models

When individual-level GWAS data (genotype data $$\mathbf{X}_{\mathrm{new}}$$, phenotype $$\mathbf{P}_{\mathrm{new}}$$, and covariate matrix $$\mathbf{C}_{\mathrm{new}}$$) are available, one can apply a generalized linear regression model

$$f(E[\mathbf{P}_{\mathrm{new}}\mid \mathbf{X}_{\mathrm{new}},\mathbf{C}_{\mathrm{new}}])=\alpha \mathbf{C}_{\mathrm{new}}+\beta \mathbf{X}_{\mathrm{new}}\hat{\mathbf{w}}$$
(6)

to test $$H_0: \beta = 0$$, where f(·) is a link function, and $$\mathbf{X}_{\mathrm{new}}\hat{\mathbf{w}}$$ is the predicted genetically regulated expression of the GWAS samples.

When only summary-level GWAS data are available, one can apply a burden-type test:

$$\tilde{Z}=\mathbf{Z}\hat{\mathbf{w}}/\sqrt{\hat{\mathbf{w}}^{\prime}\mathbf{V}\hat{\mathbf{w}}},$$
(7)

where Z is the vector of z-scores for all cis-SNPs and V is the LD matrix of analyzed SNPs (which can be estimated by using a population reference panel such as that of the 1000 Genomes Project49).
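As an illustration of Eq. (7), the following minimal sketch computes the burden-type statistic from a vector of GWAS z-scores, a weight vector, and an LD matrix. The function names and the toy inputs are hypothetical, and the two-sided normal p-value is an assumption based on $$\tilde{Z}$$ being approximately standard normal under the null.

```python
import math

def burden_z(z, w, v):
    """Burden-type TWAS statistic Z'w / sqrt(w'Vw).

    z: GWAS z-scores of the cis-SNPs
    w: estimated SNP weights from the expression prediction model
    v: LD correlation matrix of the same SNPs (list of lists)
    """
    numerator = sum(zi * wi for zi, wi in zip(z, w))
    quad = sum(w[i] * v[i][j] * w[j]
               for i in range(len(w)) for j in range(len(w)))
    return numerator / math.sqrt(quad)

def two_sided_p(z_stat):
    """Two-sided p-value under a standard normal null (an assumption here)."""
    return math.erfc(abs(z_stat) / math.sqrt(2.0))

# Toy example with two uncorrelated SNPs.
z_twas = burden_z([2.0, 1.0], [0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]])
p_twas = two_sided_p(z_twas)
```

SNPs with zero weight do not contribute to either the numerator or the denominator, so only the SNPs retained by the prediction model matter in practice.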

### Association analyses with multiple expression prediction models

To further improve the power, we apply the Cauchy combination test18 to integrate information from the K models with $$R^2$$ ≥ 0.005. Specifically, we use the following test statistic:

$$T=\sum\limits_{j=1}^{K}\tilde{R}_{j}^{2}\tan \{(0.5-p_{j})\pi \},$$
(8)

where $$p_j$$ is the p-value for model j and $$\tilde{R}_{j}^{2}$$ is calculated by $$R_{j}^{2}/\sum\nolimits_{l=1}^{K}R_{l}^{2}$$. Under the null hypothesis, T approximately follows a standard Cauchy distribution, and the p-value can be calculated as $$0.5-\arctan (T)/\pi$$. Notably, we use $$\tilde{R}_{j}^{2}$$ as the weights when combining multiple expression prediction models because a larger $$\tilde{R}_{j}^{2}$$ indicates a better expression prediction model. The Cauchy combination test has been widely used in the human genetics community18,57 because the p-value approximation is accurate for highly significant results (which are of interest) and there is no need to estimate the correlation structure among the combined p-values.
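A minimal sketch of Eq. (8) follows. The function name and the example inputs are hypothetical; the weights are the normalized $$R^2$$ values described above.

```python
import math

def cauchy_combination(p_values, r2_values):
    """Combine per-model p-values with weights proportional to each model's R2."""
    total_r2 = sum(r2_values)
    weights = [r2 / total_r2 for r2 in r2_values]      # normalized weights sum to 1
    t_stat = sum(w * math.tan((0.5 - p) * math.pi)
                 for w, p in zip(weights, p_values))
    return 0.5 - math.atan(t_stat) / math.pi           # upper tail of a standard Cauchy

# Two models for the same gene: p-values and their R2 values (toy numbers).
p_combined = cauchy_combination([1e-8, 0.5], [0.01, 0.01])
```

A useful sanity check is that when all models report the same p-value, the combined p-value equals that common value regardless of the weights.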

One may be interested in the association direction for a specific gene of interest. For a majority of the significant genes identified by SUMMIT, all the expression prediction models yield the same association direction. When the expression prediction models provide conflicting association directions, we determine the association direction by majority voting. In the rare situation in which the number of models indicating positive associations is the same as the number of models indicating negative associations, we declare the association direction unknown.
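The voting rule can be written in a few lines. This sketch assumes that the direction implied by each model is read from the sign of its association z-score; the function name and return labels are hypothetical.

```python
def association_direction(z_scores):
    """Majority vote over the signs of per-model association z-scores."""
    positive = sum(1 for z in z_scores if z > 0)
    negative = sum(1 for z in z_scores if z < 0)
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return "unknown"   # tie between positive and negative votes

direction = association_direction([3.1, 2.4, -0.8])
```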

### Simulation study design

We conducted simulation studies to evaluate how the sample size of the expression reference panel impacts the expression prediction accuracy and the subsequent power of TWASs. Additionally, we evaluated whether using the summary-level eQTL data yielded similar performance to that of using the individual-level expression reference panel. Specifically, we used data from the UK Biobank and randomly chose genotype data from 31,684 (to match the sample size of the eQTLGen data) unrelated white British individuals as training data, genotype data from an additional 369 (to match the sample size of the GTEx v7 data) unrelated white British individuals as tuning data, and genotype data from an additional 10,000 unrelated white British individuals as test data. The imputed data of 877 cis-SNPs (with MAF > 1%, Hardy-Weinberg p-value > $$10^{-6}$$, and imputation “info” score > 0.4) of the arbitrarily chosen gene CHURC1 were used for our main simulations. We also considered several other randomly selected genes (Supplementary Figs. 3–6).

We simulated gene expression levels and phenotype values by $$\mathbf{E}_{g}=\mathbf{X}\mathbf{w}+\boldsymbol{\epsilon}_{e}$$ and $$\mathbf{Y}=\beta \mathbf{E}_{g}+\boldsymbol{\epsilon}_{p}$$, respectively. Here X is the standardized genotype matrix, w is the vector of SNP effect sizes, the scalar β is the association coefficient, $$\boldsymbol{\epsilon}_{e} \sim N(0,1-h_{e}^{2})$$, and $$\boldsymbol{\epsilon}_{p} \sim N(0,1-h_{p}^{2})$$, where $$h_{e}^{2}$$ and $$h_{p}^{2}$$ are the expression heritability (i.e., the proportion of gene expression variance explained by SNPs) and phenotypic heritability (i.e., the proportion of phenotypic variance explained by gene expression levels), respectively. We randomly selected a proportion $$p_{\mathrm{causal}}$$ of the SNPs to be causal and generated their effect sizes $$w_j$$ from N(0, 1). The effect sizes of the remaining noncausal SNPs were set to 0. We then rescaled the effect sizes w and β to achieve the targeted $$h_{e}^{2}$$ and $$h_{p}^{2}$$.
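The generative model can be sketched as follows. This is a toy illustration rather than the simulation pipeline used in the study: genotypes are drawn as independent SNPs with a fixed allele frequency (no LD) instead of real UK Biobank data, and the rescaling matches the targeted heritabilities to the empirical variances. All function names and default values are assumptions.

```python
import math
import random

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def standardize(column):
    mean = sum(column) / len(column)
    sd = math.sqrt(variance(column))
    return [(v - mean) / sd for v in column]

def simulate(n=500, n_snps=50, p_causal=0.1, h2_e=0.1, h2_p=0.2, seed=1):
    rng = random.Random(seed)
    # Toy genotypes: allele counts in {0, 1, 2} with allele frequency 0.3, no LD.
    cols = [standardize([sum(rng.random() < 0.3 for _ in range(2))
                         for _ in range(n)])
            for _ in range(n_snps)]
    # Pick the causal SNPs and draw their effect sizes from N(0, 1).
    n_causal = max(1, round(p_causal * n_snps))
    causal = set(rng.sample(range(n_snps), n_causal))
    w = [rng.gauss(0.0, 1.0) if j in causal else 0.0 for j in range(n_snps)]
    g = [sum(cols[j][i] * w[j] for j in causal) for i in range(n)]
    # Rescale w so that the genetic component has variance h2_e.
    scale = math.sqrt(h2_e / variance(g))
    w = [wj * scale for wj in w]
    g = [gi * scale for gi in g]
    e_g = [gi + rng.gauss(0.0, math.sqrt(1.0 - h2_e)) for gi in g]
    # Choose beta so that beta * E_g has variance h2_p.
    beta = math.sqrt(h2_p / variance(e_g))
    y = [beta * e + rng.gauss(0.0, math.sqrt(1.0 - h2_p)) for e in e_g]
    return w, g, e_g, y, beta

w, g, e_g, y, beta = simulate()
```

Note that `random.gauss` takes a standard deviation, so the noise variances $$1-h_{e}^{2}$$ and $$1-h_{p}^{2}$$ enter through their square roots.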

To evaluate the performance of the proposed SUMMIT method, we performed an association scan on the whole simulated training data (Eg, X) and computed the summary-level data (i.e., z-scores) using a linear regression. To study the impact of the sample size of the training data, we also built prediction models using training data of different sample sizes (300, 600, 3000, 10,000, 31,684). We compared SUMMIT with two widely used methods, PrediXcan8 and TWAS-fusion9. Furthermore, we investigated the idea of using a polygenic risk score method (e.g., Lassosum39) to train the expression prediction models. We trained models with PrediXcan and TWAS-fusion using individual-level data of 670 samples (to match the sample size of blood tissue in the GTEx v8 data). As a note, in addition to Lassosum, we only compared SUMMIT with PrediXcan and TWAS-fusion in simulations because all of these methods focus on single-tissue information. While leveraging cross-tissue information can further improve the performance as demonstrated in UTMOST11 and MR-JTI13, it is not our focus here, and thus, we did not compare cross-tissue methods such as UTMOST and MR-JTI in our simulations, leaving such interesting topics for future research.
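The association scan that produces the summary-level data amounts to one simple linear regression per SNP. A minimal sketch of the per-SNP z-score is given below; the function name and the toy data are hypothetical, and the t-statistic of the slope is used as the z-score, which is a close approximation at the sample sizes considered here.

```python
import math

def marginal_z(x, y):
    """t-statistic of the slope from regressing y on a single SNP x with an intercept."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    rss = sum((yi - y_mean - slope * (xi - x_mean)) ** 2
              for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)   # standard error of the slope
    return slope / se

# Toy data: allele counts for six individuals and their expression levels.
genotype = [0, 1, 2, 0, 1, 2]
expression = [0.1, 1.0, 2.2, -0.1, 0.9, 1.8]
z_snp = marginal_z(genotype, expression)
```

Applying this function to every cis-SNP of a gene yields the z-score vector that a summary-level method takes as input.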

We considered comprehensive scenarios that varied the proportion of causal SNPs $$p_{\mathrm{causal}}$$ (0.01, 0.05, 0.1, 0.2), expression heritability $$h_{e}^{2}$$ (0.005, 0.01, 0.1), and phenotypic heritability $$h_{p}^{2}$$ (0.1, 0.2, 0.5, 0.8). For each scenario, we repeated the simulations 1000 times. The statistical power was calculated as the proportion of the 1000 repeated simulations with a p-value less than the genome-wide significance threshold of $$0.05/20{,}000=2.5\times 10^{-6}$$.
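The power calculation reduces to counting replicates below the threshold. A minimal sketch, with a hypothetical function name and toy p-values:

```python
def empirical_power(p_values, n_tests=20000, alpha=0.05):
    """Proportion of simulation replicates with p-value below alpha / n_tests."""
    threshold = alpha / n_tests          # 0.05 / 20,000 = 2.5e-6
    hits = sum(1 for p in p_values if p < threshold)
    return hits / len(p_values)

# Four toy replicates; two fall below the threshold.
power = empirical_power([1e-7, 1e-3, 2e-6, 0.5])
```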

### Comparison with existing methods

We further compared SUMMIT with several TWAS methods, including Lassosum39, MR-JTI13, PrediXcan8, TWAS-fusion9, and UTMOST11, for whole blood tissue in the following respects. Lassosum is a polygenic risk score method that can be used to build expression prediction models with a summary-level reference panel. After building the expression prediction models, we apply the standard TWAS framework to obtain the results. PrediXcan uses Elastic Net to build gene expression prediction models; TWAS-fusion applies several methods, including BLUP, BSLMM, Elastic Net, LASSO, and TOP1, to build expression prediction models. MR-JTI and UTMOST leverage cross-tissue information when building gene expression prediction models. These four TWAS methods (MR-JTI, PrediXcan, TWAS-fusion, and UTMOST) are based on an individual-level expression reference panel, while our method SUMMIT and Lassosum are based on a summary-level expression reference panel.

First, we compared the prediction accuracy (in terms of $$R^2$$) estimated by different methods. Notably, while the prediction performances of the models developed using competing methods were estimated through cross-validation, the prediction performances of the models developed using SUMMIT and Lassosum were estimated in an external testing dataset. This difference may slightly favor the competing methods, such as PrediXcan and TWAS-fusion. The difference in $$R^2$$ across genes was tested by the one-sided Kolmogorov-Smirnov test, a nonparametric test that uses the largest distance between the empirical distribution functions to determine whether two distributions are equivalent.

Second, we compared different methods by analyzing GWAS summary statistics for 24 complex traits. The details of the 24 traits are summarized in Supplementary Data 1. We used the Bonferroni correction for each method with different significance thresholds as different methods have different numbers of analyzable genes. To make a fair comparison, we also evaluated a common gene set that can be analyzed by all methods and used the same Bonferroni-corrected significance threshold to determine the significant gene sets. The numbers of significant genes identified by the different methods were further compared by the Wilcoxon signed-rank test, which compares two matched samples to test whether their population mean ranks differ.

Third, as a TWAS can be viewed as a special case of Mendelian randomization58, we further compared different methods in terms of identifying the causal genes that mediate the associations between GWAS loci and the traits of interest. Following Barbeira et al.20, we curated a set of likely causal gene-trait pairs using information that was independent of the GWAS results. Briefly, we utilized the OMIM database21 and rare variant results from exome-wide association studies22,23,24, obtaining 1,287 gene-trait pairs. We used LDetect to partition the genome into approximately independent LD blocks59 and refined the gene-trait pairs by considering only the genes that were located in LD blocks with at least one genome-wide significant variant, leading to 148 likely causal gene-trait pairs (among 24 distinct traits). We compared different methods by the area under the receiver operating characteristic curve (AUC).
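For reference, the AUC can be computed directly from its rank interpretation: the probability that a randomly chosen likely causal gene-trait pair receives a higher score than a randomly chosen non-causal pair. The sketch below assumes that each gene-trait pair has a score for which larger values indicate stronger evidence (for example, the negative log p-value of the TWAS association); the function name and toy inputs are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparisons; ties count as 1/2.

    labels: 1 for likely causal gene-trait pairs, 0 otherwise.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example: the two causal pairs receive the two highest scores.
auc_toy = auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
```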

### Applications to COVID-19 GWAS data

To identify genes associated with COVID-19 severity, we applied SUMMIT-derived models to GWAS summary data from the COVID-19 HGI (Release 5 (January 2021))25. The detailed information on participating studies, quality control, and analyses is included on the COVID-19 HGI website (https://www.covid19hg.org/results/). Briefly, data from 9,986 hospitalized COVID-19 patients and 1,877,672 population controls were used in the current analyses. Hospitalized COVID-19 cases included patients who (1) had laboratory confirmed SARS-CoV-2 infection (RNA- and/or serology-based) and (2) were hospitalized due to corona-related symptoms. The controls are subjects who are not cases. Only individuals of European ancestry were included to ensure a homogeneous population structure for the analyses. A fixed-effect meta-analysis of the individual participating studies was performed and variants with imputation quality > 0.6 were retained.
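For readers unfamiliar with fixed-effect meta-analysis, the following minimal sketch shows the common inverse-variance weighted form for a single variant. It is an illustration only: the exact procedure used by the COVID-19 HGI is documented on its website, and the function name and toy inputs here are hypothetical.

```python
import math

def fixed_effect_meta(betas, ses):
    """Inverse-variance weighted fixed-effect meta-analysis for one variant.

    betas: per-study effect estimates; ses: their standard errors.
    Returns the pooled estimate, its standard error, and the z-score.
    """
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    return beta, se, beta / se

# Two studies with equal precision: the pooled estimate is their average.
beta_meta, se_meta, z_meta = fixed_effect_meta([0.2, 0.4], [0.1, 0.1])
```

The resulting per-variant z-scores are the kind of summary-level input that the burden-type test described above requires.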

We applied the fine-mapping method FOGS37 to prioritize likely causal genes for COVID-19 severity. We evaluated the associations of the identified genes with an additional COVID-19 phenotype. Briefly, we leveraged A2_ALL_eur (Europeans; 5,101 cases and 1,383,241 controls) to compare very severe confirmed respiratory COVID-19 vs. population controls.

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.