Integrative transcriptome imputation reveals tissue-specific and shared biological mechanisms mediating susceptibility to complex traits

Transcriptome-wide association studies integrate gene expression data with common risk variation to identify gene-trait associations. By incorporating epigenome data to estimate the functional importance of genetic variation on gene expression, we generate a small but significant improvement in the accuracy of transcriptome prediction and increase the power to detect significant expression-trait associations. Joint analysis of 14 large-scale transcriptome datasets and 58 traits identify 13,724 significant expression-trait associations that converge on biological processes and relevant phenotypes in human and mouse phenotype databases. We perform drug repurposing analysis and identify compounds that mimic, or reverse, trait-specific changes. We identify genes that exhibit agonistic pleiotropy for genetically correlated traits that converge on shared biological pathways and elucidate distinct processes in disease etiopathogenesis. Overall, this comprehensive analysis provides insight into the specificity and convergence of gene expression on susceptibility to complex traits.


Weighted elastic net (WENet) model utilized in EpiXcan
The elastic net (ENet) linear regression model is implemented in PrediXcan 1 . Criterion can be written as and the ridge penalty, which can be formulated as the inner product of with respect to matrix , < , > W. In cases where α = 1 and λ ≠ 0, the method is reduced to Lasso. If λ= 0, the method is even more simplified as a standard regression model without penalties. If all the penalty factors are 1's, i.e., matrix is identity matrix, the model is reduced to standard ENet without penalty weights.

SNP priors
We first prepare eQTL statistics (computed with MatrixEQTL 4 ) and SNP annotations (extracted from REMC https://egg2.wustl.edu/roadmap/web_portal/). For each eQTL tissue, we use the matched REMC tissue to extract the corresponding annotations (Supplementary Data 8). We then provide them as input of qtlBHM 5 that utilizes a Bayesian hierarchical model to calculate priors, which is a measure of SNP causality. Priors are derived from chromHMM 6 and, for each tissue, SNPs in the same state are assigned the equivalent priors based on the chromHMM tracks in which they are located. The REMC tissues that match eQTL tissues and prior statistics for all tissues of this study, for each annotation category, are provided in Supplementary Data 8. SNP priors for a given dataset can be calculated using our pipeline at https://bitbucket.org/roussoslab/epixcan; for the SNP priors included in this study, we offer them in the predictor databases as a direct download at https://icahn.mssm.edu/EpiXcan.

Data-driven equation that rescales SNP priors to penalty factors
The higher the estimated SNP priors given by qtlBHM 5 , the higher the likelihood that the SNP has an important effect on gene expression. On the other hand, higher penalty factors in the WENet model denote a smaller effect on gene expression. Thus, optimal equations must be found to properly rescale priors to penalty factors. For this study, we developed a method based on Bézier curves employing a shifting-window strategy to approximate the data-driven rescaling function. Theoretically, we can have a different rescaling equation for each gene but, for simplicity and computational resource efficiency, for this study we opt to use one rescaling equation for all models of a given tissue. The steps of this method using the CMC tissue dataset as a template are as follows: 1) We perform PrediXcan and obtain the target R 2 CV for all genes. We then select 8 genes that are representative for different levels of R 2 CV that can be found in the study. For CMC we select the following genes (and provide the target R 2 CV for each one in parentheses): DDX11 (0.7631), ADAM15 (0.3029), C1orf112 (0.1497), C1RL (0.5098), ERBB3 (0.0204), ECT2L (0.0131), SEPT1 (0.0053), ZNF346 (0.0050) 2) We simulate 500 genotypes using HAPGEN2 7 . We use haplotypes from the 1000 Genomes Project 8 and a fine-scale recombination map 7 to simulate genotypes. We further filter the genotypes to include SNPs with MAF of at least 5%. We then keep the SNP structure for the cis SNPs of the 8 genes selected.
3) For each of the genes from (1), we perform simulations to select best rescaling function. All rescaling is based on quadratic Bézier curves. A n-th order Bézier curve is defined by a set of points, P0 through Pn, which are control points. The first (P0) and last (Pn) control points are usually called the starting and ending control points of the curve. All other points are intermediate control points that do not lie on the curve, however, they decide the shapes of the curve. A n-th order Bézier curve has n+1 control points, n-1 of which are intermediate control points -a brief introduction of Bézier curves is enclosed in the next section. We then: a. Define region of prior-to-penalty factor mapping. As shown in Supplementary Figure 23b, to map priors to penalty factors, we first define an area with x ∈ [0, maximal SNP prior], corresponding to the SNP priors and y ∈ [0, 1], corresponding to penalty factors.
b. Shifting-window policy. We then divide the rectangle region of prior-to-penalty factor mapping into several sub-windows. In each of the sub-windows, we have ranges of both the penalty factors and the priors. As described above, penalty factors should decrease with increasing values of priors, so that important SNPs can have a larger effect on transcriptomic imputation.
We set the upper bound of penalty factors at 1 (as in the ENet model employed by PrediXcan) to which the minimal value of priors will be mapped (Supplementary Figure 23b). Since we do not have a lower bound of the penalty factors to which the maximal priors will be mapped, we map the maximal prior to the lowest rescaled factor value (y2) in each sub-window starting from 0 and going all the way to 1 with step size 0.1 (step size is arbitrarily set) (e.g. in d. Perform simulations to assess performance for each rescaling equation. We apply all possible Bézier curves as rescaling candidates to compute the R 2 CV(simulation) for 100 times of gene-specific simulations. For each simulation we use the 500 simulated genotypes from (2) and assess performance (R 2 CV(simulation)) against simulated gene expression. Simulated gene expression is calculated by equation (2). Instead, we choose PrediXcan predictors as the effect estimates, and noise is normally distributed with a given standard deviation (SD=0.3). e. Select the best performing rescaling equation in each sub-window from (3b). The rescaling function with the highest improvement of R 2 CV(simulation) from (3d) is chosen as the optimal rescaling in the sub-window. We select for maximal improvement as determined by: In Supplementary Figure 26, we show the optimal rescaling equations for each sub-window for gene DDX11.
f. Select the best sub-window-specific performing rescaling equation from (3e) for each gene (from (1)). For every optimal rescaling equation from each sub-window (e.g. Supplementary Figure   26), we perform another 100 simulations as described in (3d). We select the best performing one based on Supplementary Equation 3, which is the optimal rescaling equation for the gene.
g. Select the best gene-specific performing rescaling equation to use for the tissue. For each of the 8 genes from (1), we can see the best performing rescaling equations (Supplementary Figure 27).
We perform EpiXcan using each of these rescaling equations and select the one that performs best based on R 2 CV improvement. Finally, to conserve computational resources, we skew the rescaling equation from CMC, to fit the maximal priors (Supplementary Data 8) for all other tissues (Supplementary Figure 28). Here, we provide a framework for data-driven adaptive rescaling. Depending on the needs of each study, researchers may opt to estimate from scratch tissue-specific rescaling equations, or even use gene-specific rescaling equations.

Bézier curves with interpolation functions
The n-th order Bézier curve is determined by n+1 control points, of which n-1 are intermediate control points.
For briefness, we list quadratic Bézier curves here and the function is given as As variable t varies from 0 to 1, y is a function of x with second order. t is an intermediate variable, which is used to define Bézier function. and , k=0,1,…,n are the coordinates of control points.
Note that in Supplementary Equation 4, x and y are denoted by separate functions with respect to variable t. We deduce a function of y with respect to x by eliminating t to calculate rescaled values, which are the penalty weights or factors that used in EpiXcan. Here x stores primitive priors, from which we obtain the penalty factors y.
As we stated earlier, more intermediate control points are necessary for higher order Bézier interpolations.
Theoretically, the more control points we use and the higher order of the interpolations, the higher accuracy will be achieved. To balance accuracy and computation complexity, we use quadratic (second order/degree) Bézier interpolations to approximate the rescaling equations and apply them to the EpiXcan approach. Selecting quadratic Bézier interpolation also has the benefit of not having to control for counter-intuitive increase of penalty factors with increase in priors that can happen in the middle of the curve with higher order equations.

Genotype preprocessing
We remove samples with call rate < 0.95, sex mismatch (genetic sex different from pedigree sex) and autosomal heterozygosity deviation (|Fhet|>0.2). We remove variants with call rate < 0.95, Hardy-Weinberg equilibrium (HWE) p value < 1.0×10 -6 . We identify related individuals using identity by descent analysis (IBD). One of each pair of related individuals (piHAT > 0.2) is removed at random. Since the entire STARNET cohort and the majority of the CMC cohort include individuals of EUR ancestry, we use genotype and gene expression information only from this population to train the models. For this, we merge the samples with 1000 genome EUR subset and do principal component analysis (PCA) using ~25,000 pruned and thinned variants. We plot first and second principal components and define an ellipsoid based on 1000G EUR samples (Supplementary Figure 29).
Those that lie 8 SD away from the center of this ellipsoid are considered as genetic outliers and removed (Supplementary Figure 29). For post imputation, we remove variants with INFO < 0.8, minor allele frequency (MAF) < 0.01, more than one alternative allele. We also remove ambiguous alleles (A/T, G/C), indels (insertions and deletions) and variants without RS identifier.

Estimating adjusted R 2
We compare the performance of EpiXcan and PrediXcan models using adjusted R 2 (for both R 2 CV and R 2 PP), which control for sample size of the training dataset (same for both methods) and for the number of predictors in the model (differs for each gene between methods). We group the training samples a priori prior to the cross validation and use the same groupings in both EpiXcan and PrediXcan.
We use Wilcoxon and one-sample sign tests for the statistical comparisons of adjusted R 2 CV between PrediXcan and EpiXcan models.

Implementations and comparisons with BSLMM and DPR
To compare the performance of EpiXcan, PrediXcan, BSLMM and DPR (VB and MCMC) methods, we utilize the CMC data for training and cross-validation performance (R 2 CV) comparisons and leverage the HBCC data set as an independent testing dataset to compare the R 2 PP (Supplementary Figure 8). We run BSLMM with its default parameters (e.g. SNP filtering with 1% MAF) enclosed in the GEMMA package 9 . For the DPR method 10

Enrichment of clinically significant genes
To compare the clinical significance of the gene-trait associations identified by EpiXcan and PrediXcan, we compile sets of known gene-trait associations by utilizing five different archives: 1. ClinVar 11 : a public archive of relationships among human sequence variation and phenotypes. We only keep the subset of entries that: a) have at least one current submission interpreting as pathogenic or likely pathogenic, b) provide a gene name, c) provide a phenotype. This dataset allows trait-specific gene associations that have high confidence but returns a limited number of genes for each trait. 3. SoftPanel 13 : a method for grouping diseases and related disorders for generation of customized diagnostic gene panels. For traits that have a corresponding ICD-10 number, we use the respective disorder or disorder group and extract the relevant gene sets. For traits that the latter extraction method does not yield any genes (either due to no ICD-10 classification equivalent or due to lack of genes identified with that method), we use the keyword-based search of SoftPanel which queries the OMIM database for keyword-matching disorders. The underlying design of the tool allows for even "softer" associations of the genes with the trait, thus providing a larger trait-specific list of genes when compared with the clinVar dataset and OMIM CS. 14 : this dataset contains gene-phenotype associations from mouse lines. From this, we can infer trait-specific gene associations for the respective human ortholog genes. Direct phenotype overlap with human traits is challenging, as, in most cases, the mouse phenotype is more descriptive and does not use names of human diseases, disorders or syndromes; therefore, phenotype categories are used to query this database. 15 : this dataset provides probabilities of loss of function intolerance (pLI) for each gene; the higher the pLI the higher the likelihood that this gene performs an essential function. This dataset does not provide trait-specific information but serves as an unbiased dataset to rank the "indispensability" of the genes. Of note is that the high pLI subset corresponds to pLI ≥ 0.9 (extreme loss of function intolerant genes) and is, by design, non-trait-specific (thus the higher number of observations in Figure 2d). p value is calculated with the one sample sign test against a theoretical median of 0 (H0: ̃= 0). The ratio is the number of Δ[z] measurements in favor of EpiXcan to the respective number for PrediXcan.

Assessment of computational drug repurposing (CDR) pipeline performance
To objectively assess the computational drug repurposing pipeline performance, we compare the predictions against sets of real-world indications from two different sources: disease modifying: "a drug that therapeutically changes the underlying or downstream biology of the disease", b) symptomatic: "a drug that treats a significant symptom of the disease" and c) non-indication: "a drug that neither therapeutically changes the underlying or downstream biology, nor treats a significant symptom of the disease". • For obesity, isoniazid is not an FDA indication.
• For coronary artery disease, iodixanol is used in CCTA diagnostically (radiographic contrast agent) and is not an FDA indication.
• For Crohn's disease, cyanocobalamin is used to treat malabsorption caused by Crohn's but is not an FDA indication for the disease.
• For type 2 diabetes mellitus: a) fenofibrate is mentioned because it was not shown to reduce coronary artery disease morbidity and mortality in patients with type 2 DM, b) mifepristone is not to be used for the treatment of type 2 diabetes mellitus unrelated to endogenous Cushing's syndrome • For ulcerative colitis, lidocaine is mentioned as one of the active ingredients for an FDA-approved interarticular joint kit that can also be used for intramuscular injection of triamcinolone acetonide for ulcerative colitis.

EpiXcan has better performance than PrediXcan
(1) Performance evaluation in brain tissue. Dorsolateral pre-frontal cortex (DLPFC) gene expression and CommonMind Consortium (CMC) genotype data are utilized as one of the training sets for our approach. Human brain collection core (HBCC) and Genotype-Tissue Expression (GTEx) brain tissue transcriptome data are used only for verification as test datasets (Supplementary Table 1). (2) Performance evaluation in cardiometabolic tissues. The Stockholm-Tartu Atherosclerosis Reverse Network Engineering Task (STARNET) dataset for seven tissues and the GTEx dataset for six tissues (same tissues as STARNET, excluding mammary artery) serve as training and test datasets, respectively and vice versa (Supplementary Table 1).
First, we use cross-validation to evaluate prediction performance. The majority of the reference panel genes (>90%) are contained in the EpiXcan-trained predictor database and the overall R 2 CV is better than PrediXcan trained PredictDB's (Figure 1, Supplementary Figures 3, 4) with significant pair-wise Wilcoxon test p value regarding all the datasets that we utilized (Supplementary Data 1). We list the numbers of genes with R 2 CV ≥ 0.01 from both models. Using 0.01 as the R 2 CV cut-off, we detect more genes with EpiXcan having good performance. In addition, EpiXcan has lower root-mean-square error (RMSE) values, further indicating increased performance (Supplementary Data 1).
Finally, we use independent test datasets to evaluate prediction performance and external model validity. We use the CMC dataset 22 to train the brain tissue model using both EpiXcan and PrediXcan. Afterwards, we first use the trained database of predictors (PredictDBs) to predict transcriptomes using HBCC genotype data 22 . We show that, when using EpiXcan, higher correlations between predicted and observed expression (in HBCC brain) are obtained (Figure 1; Supplementary Figure 5-7), with pairwise Wilcoxon test p value < 9.0 × 10 -16 (Supplementary Data 2). We then use the CMC-trained PredictDB's to predict GTEx brain tissue expression and compare it with observed expression values from 13 different brain regions in that cohort. For all the brain regions, EpiXcan improves prediction performance (Figure 1; Supplementary Figure 5, 6). Similarly, for cardiometabolic tissues we use 7 trained STARNET models to predict corresponding 6 GTEx transcriptomes and vice versa and, overall, observe better predictive correlation for EpiXcan-trained models (Figure 1;   Supplementary Figure 5, 7).

GTA colocalization property comparisons for EpiXcan and PrediXcan
To investigate whether the uniquely identified GTAs from EpiXcan still exhibit good co-localization properties as previously shown for PrediXcan 23 , we limit our analysis to the GTAs identified in our previous SMR study (p_SMR ≤ 0.05) 24 . We then classify them into GTAs with either good co-localization properties (p_HET ≥ 0.05) or not (p_HET < 0.05 rejecting the null hypothesis that there is a single causal variant affecting both gene expression and trait variation): For EpiXcan, we identify 189 GTAs that display good co-localization properties and 312 that do not. Similarly, for PrediXcan, we identify 135 GTAs that display good co-localization properties and 178 that do not. We find no significant difference in the ability of the two methods to uniquely identify GTAs with good co-localization properties (Pearson's χ 2 p value = 0.14).

Comparison of drug repurposing predictions with So et al.
So et al. 25 performed computational drug repurposing based on PrediXcan tissue gene expression prediction models from ten different GTEx brain regions to identify candidate therapeutic compounds. The five traits that are shared between our studies are: (1) AD = Alzheimer's Disease, (2) ADHD = Attention-Deficit/Hyperactivity Disorder, (3) ASD = Autism Spectrum Disorder, (4) Schizophrenia, and (5) Anxiety. The authors provided the top 100 compounds for each brain area predicted to normalize the gene-trait signature. For each trait, we combine all the predicted compounds from different brain regions in a single list containing all compounds that appear at least once. We then examine whether compounds that we predict to either normalize the gene-trait signature (Trait GReX antagonism) or induce a disease-like state (Trait GReX agonism) were included in this list or not (Supplementary Figure 22a). We notice that, out of all the common traits, only in schizophrenia are we more likely than not to find an agreement between our predictions and theirs (χ 2 p value = 0.026, OR = 1.31) and, looking back in Figure 3b, we see that the concordance of our predictions is higher when there are disproportionally more predictions originating from brain tissue (CMC). Since one of the main differences in our approach is that we leverage predictions from a more diverse set of tissues for our drug repurposing pipeline, we further explore this relationship as follows. By calculating enrichment scores (as in Figure 3b, described in Online Methods) for significant GTAs identified in brain tissue versus all other tissues pooled together (Supplementary Figure 22b), we show that the higher the brain tissue enrichment score, the higher the concordance between our predictions and theirs (Spearman's ρ = 1, p value < 2.2 × 10 -16 , Supplementary Figure   22c). Of note is that we exclude the traits "anxiety, case/control" and "anxiety, factor scores" from the analysis because they yield no significant GTAs in brain tissue in our study. Thus, we conclude (as summarized in the Discussion) that there is concordance in our computational drug repurposing pipeline findings especially when brain tissue prediction models contribute disproportionally more GTAs, such as in schizophrenia, but, overall, there is a high level of dissimilarity in our predictions.

Optimal coefficients in Supplementary Equation 2 are estimated by Supplementary Equation 6:
̂= argmin ℂ WENet ( , , ) Grouping effect of the WENet model is given in Theorem 1.
From Zou et al. 26 , we know the association between predicted gene expression and a trait, while taking in to consideration the linkage disequilibrium among SNPs.    EpiXcan is higher than that of PrediXcan across all the tissues considered. The adjusted R 2 CV is employed to assess the prediction performance calculated from the cross validation R 2 CV after adjusting for the number of SNPs used by the prediction model for each gene.
Expected R 2 CV corresponds to the null distribution. Statistical significance is evaluated using the pairwise Wilcoxon test.
Supplementary Figure 5. Comparison of model performance during prediction in independent datasets (∆ ). We used EpiXcan and PrediXcan models to predict expression levels in relevant brain and cardiometabolic independent datasets. We compare the adjusted predictive performance (R 2 PP) by estimating the ΔR 2 PP (EpiXcan R 2 PP minus PrediXcan R 2 PP). Across all datasets, the overall delta value is positive indicating that EpiXcan outperforms PrediXcan. p value from one-sample sign test is provided to compare whether the shift of the ΔR 2 PP values is different than zero (H0: ̃= 0). The numbers in parentheses indicate the occasions where ΔR 2 PP is higher in EpiXcan (ΔR 2 PP > 0; left number) and PrediXcan (ΔR 2 PP < 0; right number); ratio is estimated by dividing the occasions of ΔR 2 PP > 0 with ΔR 2 PP < 0. The red vertical line shows the mean of delta value. (c-d) Comparison of R 2 PP and imputation computational duration when training in CMC and predicting HBCC brain tissue.   Figure 3b) and, in this case, all other tissues except CMC are pooled together. The brain tissue enrichment scores for the evaluated traits are provided in the matrix and are used in c. Traits are ordered based on Ward hierarchical clustering (c) Scatter plot of brain tissue enrichment scores from b and log10OR from a. The data suggest that the higher the brain tissue enrichment score for the trait, the higher the likelihood that our results will be concordant with So et al. (Spearman's ρ = 1, p value < 2.2 × 10 -16 ). For the whole analysis, the traits "anxiety, case/control" and "anxiety, factor scores" are excluded because they yield no significant GTAs in CMC (brain tissue) in our study after adjusting the p values for multiple correction as described in the main text.

Supplementary Tables
Supplementary Table 1. Overview of gene expression and genotype datasets included in current study.
"Number of Genes" is the number of genes that are detectable in each dataset. "Predictor" indicates the datasets used to train PrediXcan and EpiXcan models. "Observed" indicates the datasets used to verify accuracy of predictions.