Gene-based association tests using GWAS summary statistics and incorporating eQTL

Although genome-wide association studies (GWAS) have been successfully applied to a variety of complex diseases and identified many genetic variants underlying complex diseases via single marker tests, there is still a considerable heritability of complex diseases that could not be explained by GWAS. One alternative approach to overcome the missing heritability caused by genetic heterogeneity is gene-based analysis, which considers the aggregate effects of multiple genetic variants in a single test. Another alternative approach is transcriptome-wide association study (TWAS). TWAS aggregates genomic information into functionally relevant units that map to genes and their expression. TWAS is not only powerful, but can also increase the interpretability in biological mechanisms of identified trait associated genes. In this study, we propose a powerful and computationally efficient gene-based association test, called Overall. Using extended Simes procedure, Overall aggregates information from three types of traditional gene-based association tests and also incorporates expression quantitative trait locus (eQTL) information into a gene-based association test using GWAS summary statistics. We show that after a small number of replications to estimate the correlation among the integrated gene-based tests, the p values of Overall can be calculated analytically. Simulation studies show that Overall can control type I error rates very well and has higher power than the tests that we compared with. We also apply Overall to two schizophrenia GWAS summary datasets and two lipids GWAS summary datasets. The results show that this newly developed method can identify more significant genes than other methods we compared with.

statistics asymptotically follow a multivariate normal distribution with the correlation matrix being the corrected LD matrix among the genetic variants in a gene.
To increase statistical power in identifying genes that are associated with complex diseases, PrediXcan 20 and transcriptome-wide association study 12,21 (TWAS) were developed by incorporating expression quantitative trait locus (eQTL) data into GWAS. As pointed out by Zhang et al. 15 , PrediXcan and TWAS can be viewed as a simple weighted linear combination of genetic variants with an eQTL-derived weight. In fact, the genetic architecture of complex traits is rarely known in advance and is likely to vary from one region to another across the genome and from one trait to another 15 . Therefore, only considering one single eQTL-derived weight, such as in PrediXcan and TWAS, may lose statistical power in identifying significant genes. Zhang et al. 15 developed an omnibus test (OT) using Cauchy combination method to integrate association evidence obtained by BT, SKAT, and SKATO based on GWAS summary data with multiple eQTL-derived weights. They showed that OT using multiple eQTL-derived weights had some potential advantages.
Inspired by the advantage of OT, in this paper, we propose a more powerful and computationally efficient method, called Overall, to aggregate the information from three types of traditional gene-based association tests (BT, SKAT, SKATO) with multiple eQTL-derived weights using GWAS summary statistics. To combine information from the three gene-based association tests, the Overall method utilizes the extended Simes procedure 5,22 . To apply the Overall method, we first need to estimate the correlation matrix among the three gene-based association tests with eQTL-derived weights under the null hypothesis. We provide an estimation method using a replication procedure 23,24 . The replication procedure only needs to be performed once to obtain the correlation matrix for each gene. Then, the p-values of Overall can be analytically computed without using permutations. To calculate the p-values of the three types of gene-based association tests (BT, SKAT, SKATO) using GWAS summary statistics with eQTL-derived weights, we use the "sumFREGAT" package in R (https:// cran.r-proje ct. org/ web/ packa ges/ sumFR EGAT/ index. html) 9 . Once we obtain the p-values of these three tests, the p-value of our proposed method can be easily calculated using its theoretical distribution. Extensive simulation studies show that Overall can control type I error rates well and has higher power than the comparison methods across most of the simulation settings. Similar to Zhang et al. 15 , we apply our method to two schizophrenia (SCZ) and two lipids trait (HDL) GWAS summary data sets. Compared with OT and other tests, the proposed method can identify more significant genes. More interestingly, some significant genes reported by GWAS catalog are only identified by our proposed method.

Statistical models and methods
Statistical models. Consider a set of M genetic variants in a gene. Let Z = (Z 1 , . . . , Z M ) T be an M × 1 vector of Z-scores of the M genetic variants. Note that the Z-scores is either directly provided by publicly available GWAS summary statistics or calculated from a GWAS individual-level genotype and phenotype data set. We are interested in testing the null hypothesis H 0 that none of the genetic variants in the gene is associated with a trait, whereas the alternative hypothesis is that at least one genetic variant in the gene is associated with a trait. Following Svishcheva et al. 9 , Gusev et al. 12 , and Yang et al. 25 , we assume Z = (Z 1 , . . . , Z M ) T ∼ MVN(0, R) under the null hypothesis, where R is the correlation matrix among Z , which can be estimated by LD among the genetic variants in the gene 12,13 . If individual-level data are not available, LD can be estimated using external reference panels (i.e., 1000 Genomes Project 16 ). However, if the sample size of a reference panel is small, LD may not be estimated correctly so that it will induce statistical noise (i.e., inflated type I error rates or large numbers of false positives) 17,18 . One way to correct the estimated LD is to use a regularization procedure by adding a statistical white Gaussian noise 9,19 . Let I M be an M × M identity matrix, and the corrected correlation matrix U can be defined as where a is a scalar tuning parameter which represents the coefficient of proportionality between the corrected correlation matrix U and the original R estimated using an external reference panel. The optimal tuning parameter a can be estimated by maximizing the log-likelihood function of the distribution of Z ∼ MVN(0, U) , that is, Then the corrected correlation matrix Û =âR + 1 −â I M . Therefore, under the null hypothesis, we consider Suppose that there are a total of K different eQTL-derived weights from gene expression data (i.e., Genotype-Tissue Expression (GTEx) project (https:// gtexp ortal. org/ home/)), denoted as Ŵ k = diag Ŵ k 1 , . . . ,Ŵ k M for k = 0, 1, . . . , K , where Ŵ 0 = diag(1, . . . , 1) represents a status without using any weights. In order to avoid the influence of the scale among genetic variants within each weight, we first standardize the eQTL-derived weights W k as W k m =Ŵ k m M m=1 Ŵ k m for m = 1, . . . , M . Based on the k th standardized weight W k , the weighted Z-score W k Z follows a multivariate normal distribution. That is, We extend the three types of gene-based association tests, BT 6 , SKAT 7 , and SKATO 8 , to incorporate the eQTL-derived weights based on GWAS summary statistics 9,26 . For the kth eQTL-derived weight, the three gene-based test statistics can be written as 1] log (L(Z : 0, U)) . Overall method. To aggregate information from these three gene-based association tests with multiple eQTL-derived weights, we develop a novel method, called Overall, which utilizes the extended Simes procedure 5,22 . Let p k BT , p k SKAT , p k SKATO be the p-values of BT, SKAT, SKATO with kth eQTL-derived weight, k = 0, 1, . . . , K , respectively, where k = 0 denotes a status without using any weights. Thus, there are a total of L = 3(K + 1) p-values from these three gene-based association tests with different weights. Let p (1) , . . . , p (L) be a sequence of the ascending p-values, where p (1) = min k=0,...,K p k BT , p k SKAT , p k SKATO and p (L) = max k=0,...,K p k BT , p k SKAT , p k SKATO . Overall combines these L p-values using the extended Simes procedure 5,22 , and the p-value of Overall is defined as where m e is the effective number of p-values among the L gene-based association tests with multiple weights, p (l) is the l th element of the ascending p-values, and m e(l) is the effective number of p-values among the top l association tests. We use a more robust measure to obtain the effective numbers m e and m e(l) , which was proposed by Li et al. 5 . The values of m e(l) and m e can be estimated as where i denotes the ith eigenvalue of the correlation matrix of p-values from L association tests with multiple weights (the estimation of will be discussed in the next section), I(·) is an indicator function. If the L association tests are independent, all eigenvalues i equal 1, and m e(l) = l for l = 1, . . . , L ; if the L association tests are perfectly dependent, then 1 = l which is the number of tests used to calculate m e(l) and the other eigenvalues equal 0. In this case, m e(l) = l − (l − 1) = 1 for l = 1, . . . , L.
The R codes and a sample data set for the implementation of Overall are available at github https:// github. com/ xuewe ic/ Overa ll.
Estimation of under the null hypothesis. To apply our proposed method, we need to estimate the correlation matrix of p-values under the null hypothesis. Since the exact correlations among all L gene-based association tests are unknown, we perform the estimation procedure with B replications. For each replicate b , b = 1, . . . , B , we implement the following two steps: Step 1: We first generate a new Z-score vector Z null under the null hypothesis. That is, Z null follows a multivariate normal distribution with mean 0 and variance-covariance matrix R , where R can be estimated by LD among the genetic variants in a gene using external reference panels (i.e., 1000 Genomes Project).
Step 2: We use the regularization procedure to obtain the corrected correlation matrix of Z-scores Û . Then, SKATO and the corresponding p-values p SKATO depend on the corrected correlation matrix Û , and the standardized eQTL-derived weights W k for k = 0, 1, . . . , K.
To estimate the correlation matrix of p-values used in the Overall method, we use the sample correlation matrix of the p-values obtained from the replications. We denote the sample correlation matrix of p-values as ˆ . For example, ˆ 12 is the (1,2)-element of ˆ which is the estimated correlation between BT and SKAT without using any weights. If we let p 0 BT = p be a B × 1 vector of the p-values of SKAT without using any weights obtained from the replications, then the sample correlation of p-values between these two tests is defined as ˆ 12 = cor p 0 BT , p 0 SKAT , where cor(·) is the sample correlation. The estimation procedure to estimate is independent of our proposed method, therefore we only need to perform this procedure once for each gene. After we estimate , the p-value of Overall can be computed analytically without using permutations.

Simulation studies
Materials and comparison methods. In our studies, we use four data sets to obtain the eQTL-derived weights downloaded from the functional summary-based imputation website (http:// gusev lab. org/ proje cts/ fusio n/# refer ence-funct ional data). The resources to obtain the four eQTL-derived weights are listed in Table 1.
For each eQTL data set, we use the weights estimated by the Best Linear Unbiased Prediction (BLUP) 27 . We compare our proposed method with three existing methods, OT 15 , S-PrediXcan 29 , and S-TWAS 12 . These three methods are all based on GWAS summary statistics and incorporate eQTL-derived weights. Here, we briefly introduce these three methods.
OT: For a total of K different eQTL-derived weights and the three gene-based association tests (BT, SKAT, SKATO), OT aggregates information from different weights and tests by using the Cauchy combination method 30 . The test statistic of OT is defined as BT π + tan 0.5 − p k SKAT π + tan 0.5 − p k SKATO π and the corresponding p-value of the test statistic can be approximated by S-PrediXcan: For a given eQTL-derived weight, provided by a matrix where σ m is the estimated standard deviation of the m th SNP in a gene and σ is the estimated standard deviation of the predicted expression of a gene. The p-value of S-PrediXcan can be computed as p k S-TWAS: For a given eQTL-derived weight, provided by a vector  Table S1 gives a summary of these 18 genes. We can see from Supplementary Table S1, the number of SNPs in a gene is ranging from 23 to 359 and the average per-SNP LD score in a gene is ranging from 12.72 to 170.85. We simulate a Z-score vector from a multivariate normal distribution with mean 0 and variance-covariance matrix R , Z ∼ MVN(0, R) , where R is the LD matrix of each gene which can be estimated using the 1000 Genomes Project (unrelated Europeans in 1000 Genomes in Phase 3; ftp:// ftp. 1000g enomes. ebi. ac. uk/ vol1/ ftp/). First, we use B = 10 4 replications to estimate under the null hypothesis, where the estimated matrix is denoted by ˆ . Then, we denote ˆ 0 as the correlation matrix of p-values by using B 0 replications. We vary the value of B 0 from 16 to 5000, and test the null hypothesis that the two correlation matrices, ˆ 0 and ˆ , are the same by using "lavaan" package (https:// CRAN.R-proje ct. org/ packa ge= lavaan) 31 . Supplementary Figure S1 shows that the p-values for the hypothesis testing in each gene are greater than 0.05 after B 0 = 1000 replications for all of the 18 genes. Therefore, we recommend using 1000 replications to obtain ˆ for each gene under the null hypothesis. Consequently, 1000 replications are used in the following sessions to evaluate the type I error rates and powers of Overall.
Type I error rates. To evaluate if our proposed method can control type I error rates, we perform simulations based on the aforementioned 18 genes. For each of the 18 genes, we generate Z-score vectors under the null hypothesis, Z ∼ MVN(0, R) , where R is the LD matrix of the gene estimated using the 1000 Genomes project. Then, we use the regularization procedure to obtain the corrected correlation matrix of Z-scores Û , and calculate the three types of gene-based association tests, BT, SKAT, and SKATO, with or without the four eQTL-derived weights (NTR, YFS, METSIM, CMC) based on the corrected correlation matrix Û . Finally, we apply our proposed method to combine the p-values using the estimated correlation matrix of p-values, ˆ , with 1000 replications. We generate simulated data to mimic real lipids data which we will use in "Real data analysis" section. Gene AGTRAP is associated with lipids trait HDL 15 , There are a total of 23 genetic variants in gene AGTRAP. The LD block structure of these 23 genetic variants is shown in Supplementary Fig. S2. Supplementary Figure S3 shows the estimated correlation matrix ˆ for this gene. We use 10 7 replications to evaluate type I error rates of Overall www.nature.com/scientificreports/ for gene AGTRAP at 5 × 10 −2 , 1 × 10 −2 , 1 × 10 −3 , 1 × 10 −4 ,1 × 10 −5 , and 1.75 × 10 −6 significance levels. With 10 7 replications, a Bonferroni corrected significance level of 1.75 × 10 −6 can be reached to obtain the empirical type I error rates (i.e., for 28,625 genes in the real data analysis section, the Bonferroni corrected significance level is 0.05/28625 = 1.75 × 10 −6 at 5% significance level). We further evaluate type I rates based on the other 17 genes. To save computational time, we use 2 × 10 5 replications to evaluate type I error rates of Overall for the 17 genes at significance levels of 1 × 10 −2 , 1 × 10 −3 , and 1 × 10 −4 . Table 2 and Supplementary Table S2 show the estimated type I error rates of Overall under various nominal significance levels for gene AGTRAP and the other 17 genes, respectively. From these tables, we can see that our proposed method can control type I error rates very well at different significant levels.

Power comparison.
To evaluate the performance of the Overall method, we use several simulations to compare the power of Overall with the power of OT, S-PrediXcan, S-TWAS, and three types of gene-based association tests with and without eQTL-derived weights. We use BEST to represent the test with the maximum power among the three traditional gene-based association tests with and without an eQTL-derived weight, S-TWAS.B and S-PrediXcan.B to represent the maximum power of S-TWAS and S-PrediXcan with each of the eQTL-derived weights, respectively. Following the simulation settings in Nagpal et al. 32 and Zhang et al. 15 , we generate individual-level genotypes, phenotypes, and different gene expression levels using the following steps: (1) The genotype data are generated using the haplotypes of a gene obtained from the 1000 Genomes Project reference panel. To generate the genotype of an individual, X g , we select two haplotypes according to the haplotype frequencies from the haplotype pool and then remove genetic variants with MAF < 0.05. (2) We consider K different weights derived from gene expression data which can be estimated using BLUP.
To generate a vector of weights, w k , for the kth gene expression level, we randomly select causal variants according to the proportion of causal variants, p causal . Then, the effect sizes for the kth gene expression levels and M causal causal variants can be generated from a standard normal distribution, w mk ∼ N(0, 1) for m = 1, . . . , M causal , where M causal = M × p causal ; otherwise, w mk = 0 . After we rescaled the weights to ensure the targeted expression heritability h 2 e , we generate the kth gene expression level by E k = X g w k + ε e with each element of random error ε e follows N 0, 1 − h 2 e . (3) Let E = (E 1 , . . . , E K ) be the matrix of gene expression levels. Phenotypes are generated by using a formula Y = Eβ + ε p with each element of random error ε p follows N 0, 1 − h 2 p , where β = (β 1 , . . . , β K ) T is a vector of genetic effect sizes which can be assigned based on the phenotypic heritability h 2 p . (4) The Z-score vector is estimated from individual-level genotype and phenotype data using beta coefficient and its standard deviation estimated based on the ordinary least squares method in linear regression.
In our simulation studies for power comparison, we consider two genes, AGTRAP and C3orf22, from the 18 genes used in the type I error evaluation and K = 4 and K = 20 eQTL-derived weights. AGTRAP contains 458 haplotypes for 23 genetic variants (11 common variants and 12 rare variants; MAF ranging from 0 to 0.39775); C3orf22 contains 295 haplotypes for 42 variants (18 common variants and 24 rare variants; MAF ranging from 0 to 0.43558). Supplementary Figure S2 shows the LD block structure of the 23 genetic variants at AGTRAP and the Table 2. Estimated type I error rates at different significance levels with 10 7 replications. The subscript denotes BT, SKAT, and SKATO using eQTL-derived weights; CMC, METSIM, NTR, and YFS indicate the resources to obtain the eQTL-derived weights. 0 indicates the methods without using eQTL-derived weights. We also consider two different directions of genetic effects: β 1 = · · · = β K (Scenario 1: Uni-directional effects) and β 1 = · · · = β K/2 = −β K/2+1 = · · · = −β K (Scenario 2: Bi-directional effects). For each simulation scenario, we vary the proportion of gene expression heritability  www.nature.com/scientificreports/ and the phenotypic heritability with different values of h 2 e and h 2 p . We consider the sample size to be 2000 (unless it is specified) and the power is calculated as the proportion of 1000 replications with p-value < 1.75 × 10 −6 . Figure 1 (Supplementary Fig. S4) show the power comparisons based on gene AGTRAP (and C3orf22) with K = 4 under the Uni-directional effects ( β 1 = β 2 = β 3 = β 4 ) with different p causal . We consider two settings here. First, we vary phenotypic heritability h 2 p with a fixed expression heritability h 2 e = 0.2 ( Fig. 1a and Supplementary Fig. S4a). Second, we vary the expression heritability h 2 e with a fixed phenotypic heritability h 2 p = 0.2 ( Fig. 1b and Supplementary Fig. S4b). Figure 2 (Supplementary Fig. S5) shows power comparisons based on gene AGTRAP (and C3orf22) under the Bi-directional effects ( β 1 = β 2 = −β 3 = −β 4 ) with different p causal for K = 4 . We also consider two simulation settings, power against the phenotypic heritability h 2 p with a fixed expression heritability h 2 e = 0.2 and power against the expression heritability h 2 e with a fixed phenotypic heritability h 2 p = 0.2 . The pattern of the power in Fig. 2 (Supplementary Fig. S5) is similar to what we observe in Fig. 1 (Supplementary Fig. S4). These figures show that (1) Overall and OT perform uniformly better than BEST, S-TWAS.B, and S-PrediXcan.B. We can see that Overall and OT boost power significantly due to integrating association evidence by different traditional tests and multiple eQTL-derived weights. Overall is slightly more powerful than OT in all of the scenarios.  Fig. S6).
Furthermore, we consider simulation settings with noise to the eQTL. We consider simulation settings by adding less noise to the eQTL from the most relevant tissues and more noise to those from the less relevant tissues. For the Uni-direction scenario, we consider the first study being the most relevant tissue, where To remove noise in LD matrix computed from a reference sample, we shrink the observed LD matrix toward an identity matrix with the shrinkage parameter estimated by maximum likelihood. To evaluate how well this regulation process performs, we compare the powers of three traditional gene-based association tests with and without eQTL-derived weights, OT, and Overall based on corrected and uncorrected LD structure. We use the same simulation settings as those in Supplementary Figs. S7 and S8. Supplementary Figure S10 shows the power comparison results based on gene C3orf22 under Uni-directional effects and Bi-directional effects with noise to eQTL. We can see that the powers of these tests based on corrected LD structure perform better than those based on uncorrected LD structure in most of the settings.

Real data analysis
To evaluate the performance of our proposed method, we apply Overall, OT, the three traditional tests with and without eQTL-derived weights, S-PrediXcan, and S-TWAS to the GWAS summary statistics data sets used in Zhang et al. 15 : two SCZ GWAS summary data sets and two lipid GWAS summary data sets. We estimate the LD between genetic variants using the 1000 Genomes Project reference panel 16 , and obtain the corrected matrix of Z-score after the regularization procedure. We consider four eQTL-derived weights estimated by the BLUP method using the resources listed in Table 1 (NTR, YFS, METSIM, CMC).
Application to the SCZ GWAS summary data. We consider two SCZ GWAS summary data sets, SCZ1 and SCZ2, which can be downloaded from the Psychiatric Genomics Consortium website (https:// www. med. unc. edu/ pgc/ resul ts-and-downl oads/) 33 . SCZ1 is a meta-analysis of SCZ GWAS data set with 13,833 cases and 18,310 controls. SCZ2 is a more recent and larger SCZ GWAS summary data set with 36,989 cases and 113,075 controls for partial validation 34 . In our real data analysis, we define a gene to include all of the SNPs from 20 kb upstream to 20 kb downstream of the gene and test the association between each gene and the trait. We consider all genes according to the GENCODE version 35 (GRCh37) human comprehensive gene annotation list which can be downloaded from the GENCODE website (https:// www. genco degen es. org/ human/ relea se_ 35lif t37. html). www.nature.com/scientificreports/ To make fair comparisons among all these weighted tests, the genetic variants are removed if there is at least one weight missing in the four eQTL-derived weights. After pruning, there are 26,575 genes in SCZ1 and 17,823 genes in SCZ2 left in our final analyses. Therefore, the Bonferroni corrected significance level for gene-based association analysis is defined as 0.05 divided by the number of genes. First, we apply BT, SKAT, and SKATO with and without an eQTL-derived weight, OT, Overall, S-PrediXcan, and S-TWAS to the SCZ1 and SCZ2 data sets. Table 3 (SCZ1 and SCZ2) shows the number of genes identified by each method for the SCZ data sets, respectively. As we can see in Table 3, Overall identifies more genes than all of the other methods for two SCZ Table 3. The numbers of genes identified by each method for the two SCZ data sets. The subscript denotes BT, SKAT, and SKATO using eQTL-derived weights; CMC, METSIM, NTR, and YFS indicate the resources to obtain the eQTL-derived weights. 0 indicates the methods without using any weights. SCZ1 indicates the number of genes identified by each method for SCZ1 data; SCZ2 indicates the number of genes identified by each method for SCZ2 data; SCZ overall indicates the number of overlapping genes identified by both SCZ1 and SCZ2 data sets; GWAS SCZ1 and GWAS SCZ2 indicate the numbers of genome-wide significant genes that are reported in the GWAS catalog and are also identified by each method for SCZ1 and SCZ2, respectively.   From Fig. 3, we can see that Overall identifies all of the genes identified by OT for SCZ1; for SCZ2, there are two genes identified by OT but failed to be identified by Overall; there are 66 and 24 genes identified only by Overall for SCZ1 data and SCZ2, respectively. We further investigate the 90 genes identified only by Overall for the SCZ data sets by searching the GWAS catalog (https:// www. ebi. ac. uk/ gwas/). Among the 66 genes for the SCZ1 data set, there are six genes reported in the GWAS catalog; among the 24 genes for the SCZ2 data set, there are six genes reported in the GWAS catalog (Table 4). We also use these two SCZ GWAS data sets for partial validation. Table 3 shows that there are 45 overlapping genes identified by Overall using SCZ1 and SCZ2 data sets and only 17 overlapping genes identified by OT using both SCZ1 and SCZ2 data sets. Furthermore, we search for genome-wide significant SNPs ( p < 5 × 10 −8 ) from the two SCZ GWAS summary data sets and consider the genes covering at least one genome-wide significant SNP from 20 kb upstream to 20 kb downstream of the gene. There are 63 genomewide significant genes for SCZ1, and 2422 genome-wide significant genes in SCZ2. Table 3 (GWAS SCZ1 and GWAS SCZ2 ) summarizes the numbers of genome-wide significant genes that are identified by each method for the two SCZ data sets. Among the 63 genome-wide significant genes for the SCZ1 data set, Overall identifies the largest number of genes, followed by SKAT 0 and SKATO 0 ; OT, S-PrediXcan NTR and S-TWAS NTR only identify 6 genes. Meanwhile, among 2422 genome-wide significant genes for SCZ2, Overall identifies 167 genes; OT identifies 166 genes; SKATO and SKATO 0 identify 153 genes; S-TWAS YFS and S-PrediXcan YFS only identify 58 and 72 genes respectively.
Application to the lipids GWAS summary data. We consider two lipids GWAS summary data sets, HDL1 and HDL2, which can be downloaded at the Center for Statistical Genetics (CSG) at the University of Michigan. HDL1 is a meta-analysis of HDL GWAS data set with about 100,000 samples downloaded at the website (http:// csg. sph. umich. edu/ willer/ public/ lipid s2010/) 44 . HDL2 is the follow-up data with about 189,000 samples for partial validation downloaded at the Global Lipids Genetics Consortium (http:// csg. sph. umich. edu/ willer/ public/ lipid s2013/) 45 . We perform the same analysis as we did in the previous section for the two SCZ GWAS summary data sets. After pruning and removing the genetic variants with missing weights, there are 17,389 genes in HDL1 and 16,917 genes in HDL2. Table 5 (HDL1 and HDL2) shows the number of genes identified by each method for the two lipids data sets, respectively. As we can see from Table 5, among the three traditional gene-based association tests with and without eQTL-derived weights, SKATO 0 and BT 0 identify the largest number of genes in HDL1 and HDL2, respectively; Among the four S-PrediXcan tests, S-PrediXcan YFS and S-PrediXcan CMC identify the largest number of genes in HDL1 and HDL2, respectively; for the four S-TWAS tests, S-TWAS YFS and S-TWAS CMC identify the largest number of genes in HDL1 and HDL2, respectively. For the HDL1 data set, Overall identifies the largest number of genes (249), followed by OT that identifies 233 genes; for the HDL2 data set, BT 0 identifies the largest number of genes (836), followed by Overall and OT, where Overall identifies 765 genes and OT identifies 688 genes. In Fig. 4, we compare genes identified by SKATO 0 , S-PrediXcan YFS , and S-TWAS YFS , along with Overall and OT for the HDL1 data set and genes identified by BT 0 , S-PrediXcan CMC , S-TWAS CMC , Overall, and OT for the HDL2 data set. Again, we observe that Overall identifies the largest number of genes for the HDL1 data set and the second most for the HDL2 data set; all genes identified by OT are also identified by Overall; 82 and 24 genes are identified only by Overall and OT for the HDL1 and HDL2 data sets, respectively; there are 13 and 6 genes only identified by Overall for the HDL1 and HDL2 data sets, respectively. We search the GWAS catalog (https:// www. ebi. ac. uk/ gwas/). Table 6 shows that five out www.nature.com/scientificreports/ of 13 genes identified only by Overall based on HDL1 data have been reported, and one out of 6 genes has been reported on HDL2 data in the GWAS catalog. We also use these two HDL GWAS data sets for partial validation by looking for the number of overlapping genes identified by both of the data sets (Table 5, HDL overlap ). There are 177 overlapping genes identified by Overall for both SCZ1 and SCZ2 data sets and 167 overlapping genes identified by OT for both SCZ1 and SCZ2 data sets. www.nature.com/scientificreports/ Same as the analyses for the SCZ GWAS summary data sets, we search for genome-wide significant SNPs ( p < 5 × 10 −8 ) from the two lipids GWAS summary statistics. There are 1911 genome-wide significant genes for HDL1 and 2682 genome-wide significant genes for HDL2. Table 5 (GWAS HDL1 and GWAS HDL2 ) summarizes the numbers of genome-wide significant genes that are identified by each method for the two lipids data sets. Among the 1911 genome-wide significant genes for the HDL1 data set, Overall identifies the largest number of genes (122), followed by OT (120), then SKAT 0 (104); S-TWAS YFS only identifies 29 genes and S-PrediXcan YFS identifies 31 genes. Meanwhile, among 2682 genome-wide significant genes for HDL2, Overall identifies the largest number of genes (192); OT and SKATO 0 identify 190 genes; S-TWAS METSIM and S-PrediXcan METSIM identify 112 and 118 genes. respectively.

Discussions
In this paper, we develop a powerful and computationally efficient method, Overall, for gene-based association studies using GWAS summary data. Overall aggregates information from three traditional types of gene-based association tests (BT, SKAT, SKATO) and also incorporates eQTL data. Both our simulation studies and real data analysis confirm that our proposed method can control type I error rates correctly and has very good performance compared with other comparison methods. In "Real data analysis", Overall identify more significant genes than other methods, and there are some genes reported by GWAS catalog which are only identified by Overall.
There are some advantages of our proposed method. First, Overall adaptively aggregates information from multiple gene-based association tests. Most combination tests (i.e., Fisher's combination test 61 ) assume that the p-values should be calculated from independent tests. To combine information from highly correlated gene-based association tests, Overall utilizes the extended Simes procedure 5,22 . It is shown that this procedure to combine multiple tests is stable and effective regardless of whether the tests are highly correlated 24,62 . Second, Overall is more powerful than the traditional gene-based association tests, some popular transcriptome association tests (i.e., S-PrediXcan 29 and S-TWAS 12 ), and other eQTL weighted combination tests (i.e., ominous test 15 ). By aggregating information from different tests and incorporating multiple eQTL-derived weights, Overall can achieve a higher statistical power under a variety of situation settings. Meanwhile, our simulation studies and real data analyses show that the extended Simes procedure is more powerful than the Cauchy combination method, especially if the proportion of causal variants in a gene is small. Third, the p-values of Overall can be analytically computed without using permutations, therefore, Overall is computationally efficient. Finally, using the regularization procedure to correct the estimated LD can reduce the potential statistical noise in the LD estimation if LD is estimated using a reference panel with small sample size. In addition, Overall can be easily applied to genetic association studies with either individual-level data or GWAS summary statistics.
In this paper, we combine three types of traditional gene-based association tests (BT, SKAT, SKATO). However, the combination procedure used in the paper is very general. Other more powerful gene-based association tests can also be combined using the same approach, such as some state-of-the-art methods (i.e., S-TWAS 12 , E-MAGMA 63 , and SMR 64 ).
In this current study, we utilize the weights derived from four single tissue gene expression studies (CMC, METSIM, NTR, YFS). Although the extended Simes procedure in Overall allows us to employ more eQTLderived weights from a number of studies (i.e., GTEx gene expression version 8 65 et al.), there is a possibility that the noise can be increased with the increment in the number of unrelated studies. Therefore, the power of the combination tests (i.e., Overall and OT) might be attenuated. Thus, to obtain the most robust identification of phenotypic associated genes in a real data analysis with the Overall method, we suggest incorporating eQTL datasets from the most relevant tissues to the phenotype. The last but the most important thing is that population stratification can be confounded association results 66,67 . Systematic minor allele frequency difference between transcriptomic studies of different cohorts and no matching between the estimated LD structure of Genomes Project with that in the study may increase the chances of false positive findings. Therefore, we need to eliminate false positive findings possibly caused by population stratification 68,69 . When applying the Overall method, the population of GWAS summary dataset, external reference panel (i.e., 1000 Genomes Project) used to estimate LD structure, and eQTL-derived weights should be consistent.
In this study, the computational time of the proposed method is acceptable even if the estimated correlation matrix of multiple tests is obtained by the replication procedure. Meanwhile, the estimation procedure is independent of gene-based association tests, therefore we only need to perform this procedure once for each GWAS summary dataset. For example, there are a total of 29,008 gene in the 1000 Genomes Project and we www.nature.com/scientificreports/ use 1000 replicates to estimate the correlation matrix of multiple tests for each gene. We perform this using the high-performance computing (HPC) cluster (Intel Xeon E5-2670 2.6 GHz, 16 GB RAM). The computational time for all genes is about 36 h CPU time with 500 nodes. Then, the p-value of the proposed method can be computed analytically which is independently performed in each GWAS summary dataset. The computational time for each GWAS dataset is about 1 h CPU time with 10 nodes.

Data availability
The data that support the findings of this study are publically available and the links to the data are provided in the article.