MetaXcan: Summary Statistics Based Gene-Level Association Method Infers Accurate PrediXcan Results

To gain biological insight into the discoveries made by GWAS and meta-analysis studies, eﬀective integration of functional data generated by large-scale eﬀorts such as the GTEx Project is needed. PrediXcan is a gene-level approach that addresses this need by estimating the genetically determined component of gene expression. These predicted expression traits can then be tested for association with phenotype in order to test for mediating role of gene expression levels. Furthermore, due to the polygenic nature of many complex traits, eﬀorts to aggregate multiple GWAS studies and conduct meta-analyses have successfully increased our ability to identify variants of small eﬀect sizes. To take advantage of the results generated by these eﬀorts and to avoid the problems associated with accessing and handling individual-level data (e.g. consent limitations, large computational/storage costs) we have developed an extension of PrediXcan. The new method, MetaXcan, infers the results of PrediXcan using only summary statistics from large-scale GWAS or meta-analyses. Here we show that the concordance between PrediXcan and MetaXcan is excellent when the right reference population is used ( R 2 > 0 . 95) and robust to population mismatches ( R 2 > 0 . 85). We provide open source local and web-based software for easy implementation


Introduction
Over the last decade, GWAS have been successful in identifying genetic loci that robustly associate with multiple complex traits. However, the mechanistic understanding of these discoveries is still limited, hampering the translation of this knowledge into actionable targets. Studies of enrichment of expression quantitative trait loci (eQTLs) among trait-associated variants [1,2] show the importance of gene expression regulation. Direct quantification of the contribution of different functional classes of genetic variants showed that 80% of phenotype variability (in 12 diseases) can be attributed to DNAase I hypersensitivity sites, further highlighting the importance of transcript regulation in determining phenotypes [3].
Many transcriptome studies have been conducted where genotype and expression levels are assayed for a large number of individuals [4][5][6][7]. The most comprehensive transcriptome dataset, in terms of tissues covered, is the GTEx Project, a large-scale effort where DNA and RNA are collected from multiple tissue samples from nearly 1000 deceased individuals and sequenced to high coverage [8]. This remarkable resource provides a comprehensive cross-tissue survey of the functional consequences of genetic variation at the transcript level.
To integrate knowledge generated from these large-scale transcriptome studies and shed light on disease biology, we developed PrediXcan [9], a gene-level association approach that tests the mediating effects of gene expression levels on phenotypes. This is implemented on GWAS/sequencing studies (i.e. studies with genome-wide interrogation of DNA variation and phenotypes) where transcriptome levels are imputed with models trained in measured transcriptome datasets (e.g. GTEx). These predicted expression levels are then correlated with the phenotype and provides the basis for a gene-level association test that addresses some of the key limitations of GWAS [9]. analysis studies are needed rather than individual level genotype and phenotype data.
We will show here that our new method, termed MetaXcan, is a fast, accurate, and efficient way to scale up implementation of PrediXcan and take advantage of the large sample sizes made available through meta-analysis of GWAS.

Results
We have derived an analytic expression that allows us to compute the outcome of PrediXcan using only summary statistics from genetic association studies. Details of the derivation are shown in the Methods section. In Figure 1, we illustrate the mechanics of MetaXcan in relation to traditional GWAS and our recently published PrediXcan method.
For both GWAS and PrediXcan, the input is the genotype matrix and phenotype vector. GWAS computes the regression coefficient of the phenotype on each marker in the genotype matrix and generates SNP-level results. PrediXcan starts by estimating the genetically-regulated component of the transcriptome (using weights from the publicly available PredictDB database) and then computes regression coefficients of the phenotype on each predicted gene expression level generating gene-level results.
MetaXcan, on the other hand, can be viewed as a shortcut that uses the output from a GWAS study to generate the output from PrediXcan. Since MetaXcan only depends summary statistics, it can effectively take advantage of large-scale meta analysis results, avoiding the computational and regulatory burden of handling large amounts of protected individual level data. Figure 2 shows the main analytic expression used by MetaXcan for the Z-score (effect size divided by its standard error) of the association between predicted gene expression and the phenotype. The input variables are the weights used to predict the expression of a given gene w lg , the variance and covariances of the markers included in the prediction of the expression level of the gene, and the GWAS coefficient for each marker. The last factor in the formula can be computed exactly in principle, but we would need some additional information that is unavailable in typical GWAS output. Fortunately, we have found that this factor is very close to 1 and dropping it from the formula does not affect the accuracy of the . CC-BY 4.0 International license peer-reviewed) is the author/funder. It is made available under a The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/045260 doi: bioRxiv preprint first posted online Mar. 23, 2016; Figure 1. This figure illustrates the MetaXcan method in relationship to GWAS and PrediXcan. Both GWAS and PrediXcan take genotype and phenotype data as input. GWAS computes the regression coefficients of Y ∼ X l using the model Y = X l b + , where Y is the phenotype and X l the individual dosage. The output is the table of SNP-level results. PrediXcan, in contrast, starts first by predicting/imputing the transcriptome. Then it calculates the regression coefficients of the phenotype Y on each gene's predicted expression T g . The output is a table of gene-level results. MetaXcan computes the gene-level association results using directly the output from GWAS.

MetaXcan formula
. CC-BY 4.0 International license peer-reviewed) is the author/funder. It is made available under a The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/045260 doi: bioRxiv preprint first posted online Mar. 23, 2016; Figure 2. MetaXcan formula. This plot shows the formula to infer PrediXcan gene-level association results using summary statistics. The different sets involved in input data are shown. The study set is where the regression coefficient between the phenotype and the genotype is obtained from. The training set is the reference transcriptome dataset where the prediction models of gene expression levels are trained. The reference set, in general 1000 Genomes, is used to compute the variances and covariances (LD structure) of the markers used in the predicted expression levels. Both the reference set and training set values are pre-computed and provided to the user so that only the study set results need to be provided to the software. The crossed out term was set to 1 as an approximation, since its calculation depends on generally unavailable data. We found this approximation to have negligible impact on the results.

results.
The approximate formula we will use is as follows: where • w lg is the weight of SNP l in the prediction of the expression of gene g, •β l is the GWAS regression coefficients for SNP l, • se(β l ) is standard error ofβ l , •σ l is the estimated variance of SNP l, and •σ g is the estimated variance of the predicted expression of gene g.
. CC-BY 4.0 International license peer-reviewed) is the author/funder. It is made available under a The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/045260 doi: bioRxiv preprint first posted online Mar. 23, 2016; The inputs are based, in general, on data from three different sources: • study set, • training set, • population reference set. The study set is the main dataset of interest from which the genotype and phenotypes of interest are gathered. The regression coefficients and standard errors are computed based on individual-level data from the study set. Training sets are the reference transcriptome datasets used for the training of the prediction models (GTEx, DGN, Framingham, etc.) thus the weights w lg are computed from this set.
Finally, the reference sets (e.g. 1000 Genomes) are used to derive variance and covariance (LD) properties of genetic markers, which will usually be different from the study sets.
In the most common use scenario, the user will only need to provide GWAS results using his/her study set. The remaining parameters are pre-computed, and download information can be found at the https://github.com/hakyimlab/MetaXcan resource.
Next we will show the performance of the method, measured as the concordance (R 2 ) between PrediXcan and MetaXcan results.

Performance in simulated data
We first compared MetaXcan and PrediXcan using simulated phenotypes generated from a normal distribution, using a single transcriptome model trained on Depression Genes and Network's (DGN) Whole Blood data set [4] downloaded from PredictDB (http://predictdb.org). As genotypes we used three ancestral subsets of the 1000 Genomes project: Africans (n=662), East Asians (n=504), and Europeans (n=503). Each set was taken in turn as reference and study set yielding a total of 9 combinations as shown in Figure 3. For each population combination, we computed PrediXcan association results for the simulated phenotype and compared them with results generated from our MetaXcan approach in a scatter plot. This allowed us to assess the effect of ancestral differences between study and reference sets.
As expected, when the study and reference sets are the same, the concordance between MetaXcan and PrediXcan is 100% whereas for sets of different ancestral origin the R 2 drops a few percentage points, with the biggest loss (down to 85%) when the study set is African and the reference set is Asian. This  confirmed that our formula works as expected and that the approach is robust to ethnic differences between study and reference sets.
Performance in cellular growth phenotype from 1000 genomes cell lines Next we tested with an actual cellular phenotype. Intrinsic growth, a cellular phenotype, was computed based on multiple growth asays for over 500 cell lines from the 1000 Genomes project [11]. We used a subset of values for Europeans (EUR), Africans (AFR), Asians (EAS) individuals.
We compared Z-scores for intrinsic growth generated by PrediXcan and MetaXcan for different combinations of reference and study sets, using whole blood prediction model trained in the DGN cohort.
The results are shown in Figure 4. Consistent with our simulation study, the MetaXcan results closely match the PrediXcan results. Again, the best concordance occurs when reference and study sets share similar continental ancestry while differences in population slightly reduce concordance. Compared to  the plots for the simulated phenotypes, the diagonal concordance is slightly lower than 1. This is due to the fact that more individuals were included in the reference set than in the study set, thus the study and reference sets were not identical for MetaXcan.

We show the comparison of MetaXcan and PrediXcan results for two diseases: Bipolar Disorder (BD)
and Type 1 Diabetes (T1D) from the WTCCC in Figure 5. Other disease phenotypes exhibited similar performance (data not shown). Concordance between MetaXcan and PrediXcan is over 95% in for both diseases (BD R 2 = 0.956 and T1D R 2 = 0.958). The very small discrepancies are explained by differences in allele frequencies and LD between the reference set (1000 Genomes) and the study set (WTCCC). Given this high concordance, we do not expect much improvement when using a reference set that is more similar to the study set. We verified this and, as expected, found that using control individuals from WTCCC as reference set improved the concordance only marginally (0.1%).
. CC-BY 4.0 International license peer-reviewed) is the author/funder. It is made available under a The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/045260 doi: bioRxiv preprint first posted online Mar. 23, 2016; Figure 5. Comparison of PrediXcan results and MetaXcan results for a Type I Diabetes study, and a Bipolar Disorder study. Study data was extracted from Wellcome Trust Case Control Consortium, and MetaXcan reference population were the European individuals from Thousand Genomes Project (same as in previous sections) -It is worth noting that the PrediXcan results for diseases were obtained using logistic regression whereas MetaXcan formula is based on linear regression properties. As observed before [12], when the number of cases and controls are relatively well balanced (roughly, at least 25% of cases and controls), linear regression approximation yields very similar results to logistic regression. This high concordance also shows that the approximation where we drop the term does not significantly affect the results.

Software
We make our software publicly available on a GitHub repository: https://github.com/hakyimlab/ MetaXcan. Instructions for obtaining the weights and covariances for different tissues can be found there.
A short working example can be found on the GitHub page; more extensive documentation can be found on the project's wiki page.
. CC-BY 4.0 International license peer-reviewed) is the author/funder. It is made available under a The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/045260 doi: bioRxiv preprint first posted online Mar. 23, 2016;

Discussion
Here we present MetaXcan, a scalable, accurate, and efficient method for integrating reference transcriptome studies to learn about the biology of complex traits and diseases. Our method extends PrediXcan, which maps genes to phenotypes by testing the mediating effects of gene expression levels. This is implemented by predicting gene expression levels and correlating these traits with phenotypes. MetaXcan is a shortcut that uses SNP-level association results and combines them to reproduce the results of PrediXcan, without the need to use individual level data.
MetaXcan shares most of the benefits of PrediXcan: a) it directly tests the regulatory mechanism through which genetic variants affect phenotype; b) it provides gene-level results which are better functionally characterized than genetic variants, easier to validate within model systems, and carry a smaller multiple testing burden; c) the direction of the effects are known, facilitating identification of therapeutic targets; d) reverse causality is largely avoided since predicted expression levels are based on germline variation, which are not affected by onset of disease; e) it can be systematically applied to existing GWAS studies; f) tissue-specific analysis can be performed using all the models we have made available through PredictDB (http://predictdb.org).
The difference between the reference sets (used to estimate LD and allele frequencies) and study set (used to compute GWAS/meta analysis summary statistics) is the main cause of the small differences between MetaXcan and PrediXcan results. We have shown here that even when the populations are quite different, the concordance is very high. Thus, MetaXcan is robust to ancestral differences between study and reference sets.
Even though the method was derived with linear regression in mind, in case-control designs, the approximation generates results that are in almost full concordance with exact results generated with PrediXcan and logistic regression.
Methods similar in spirit to PrediXcan have been reported [10]. Gusev et al also propose a method comparable to MetaXcan that is based only on summary statistics. Their method, called Transcriptome-Wide Association Study (TWAS), imputes the SNP level z-scores into gene level z-scores using the method Pasaniuc and others have published [13]. This approach is equivalent to predicting expression levels using BLUP/Ridge Regression, which has been shown to be suboptimal for prediction. This is due to the fact that the local architecture of gene expression traits is sparse so that highly polygenic models underperform . CC-BY 4.0 International license peer-reviewed) is the author/funder. It is made available under a The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/045260 doi: bioRxiv preprint first posted online Mar. 23, 2016; more sparse prediction models such as LASSO or Elastic Net with mixing parameters 0.5 or greater [14].
In contrast, MetaXcan is not restricted to one imputation or prediction scheme. It infer the results of PrediXcan using summary statistics through an analytic formula. Thus it can be applied to linear models based on SNP data.
In summary, we present an accurate and computationally efficient gene-level association method that integrates functional information from reference transcriptome dataset into GWAS and large scale metaanalysis results to inform the biology of complex traits.

Derivation of MetaXcan Formula
The goal of MetaXcan is to infer the results of PrediXcan using only GWAS summary statistics. Individual level data are not needed for this algorithm. We will define some notations for the derivation of the analytic expressions of MetaXcan.

Notation and Preliminaries
Y is the n-dimensional vector of phenotype for individuals i = 1, n. X l is the allelic dosage for SNP l.
T g is the predicted expression (or estimated GREx, genetically regulated expression).
We model the phenotype as linear functions of X l and T g Y = X l β l + η Y = T g γ g + , whereγ g andβ l are the estimated regression coefficients of Y regressed on T g and X l , respectively.γ g is the result (effect size for gene g) we get from PrediXcan whereasβ l is the result from a GWAS for SNP l.
We will denote as Var and Cov the operators that computes the sample variance and covariances, i.e.
. CC-BY 4.0 International license peer-reviewed) is the author/funder. It is made available under a The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/045260 doi: bioRxiv preprint first posted online Mar. 23, 2016; where X is the n × p matrix of SNP data andX is a n × p matrix where column l has the column mean of X l (p being the number of SNPS in the model for gene g).
With this notation, our goal is to infer PrediXcan results (γ g and its standard error) using only GWAS results (β l and se), estimated variances of SNPs (σ 2 l ), covariances between SNPs in each gene model (Γ g ), and prediction model weights w lg .
Next we list the properties and definitions used in the derivation: The proportion of variance explained by the covariate (T g or X l ) can be expressed as . CC-BY 4.0 International license peer-reviewed) is the author/funder. It is made available under a The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/045260 doi: bioRxiv preprint first posted online Mar. 23, 2016; Var(T g ) =σ 2 g can be computed aŝ where W g is the vector of w lg for SNPs in the model of g Calculation of regression coefficient γ ĝ γ g can be expressed aŝ Calculation of standard error of γ g Also from the properties of linear regression we know that In this equation, σ Y /n is not necessarily known but can be estimated using the analogous equation (7) for beta se(β l ) =σ Notice that the right hand side of (9) is dependent on the SNP l while the left hand side is not. This . CC-BY 4.0 International license peer-reviewed) is the author/funder. It is made available under a The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/045260 doi: bioRxiv preprint first posted online Mar. 23, 2016; equality will hold only approximately in our implementation since we will be using approximate values forσ 2 l , i.e. from reference population, not the actual study population.

Calculation of Z score
To assess the significance of the association, we need to compute the ratio of the effect size γ g and standard error se(γ g ), or Z score, with which we can compute the p value as p = 2 pnorm(−|Z g |) Zg =γ Based on results with actual and simulated data we have found that the last approximation does reduce power since the deviation is only noticeable when the correlation between the SNP or the predicted expression and the phenotype is large, i.e. large effect sizes. When the effects are large the loss of power is compensated by the large effect size.
. CC-BY 4.0 International license peer-reviewed) is the author/funder. It is made available under a The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/045260 doi: bioRxiv preprint first posted online Mar. 23, 2016;