Isoform-level transcriptome-wide association uncovers genetic risk mechanisms for neuropsychiatric disorders in the human brain

Bhattacharya, Arjun; Vo, Daniel D.; Jops, Connor; Kim, Minsoo; Wen, Cindy; Hervoso, Jonatan L.; Pasaniuc, Bogdan; Gandal, Michael J.

doi:10.1038/s41588-023-01560-2

Article
Open access
Published: 30 November 2023

Arjun Bhattacharya^1,2,3,
Daniel D. Vo ORCID: orcid.org/0000-0003-2057-969X^4,5,
Connor Jops^4,5,
Minsoo Kim^6,7,
Cindy Wen^6,7,8,
Jonatan L. Hervoso⁸,
Bogdan Pasaniuc ORCID: orcid.org/0000-0002-0227-2056^3,8,9^na1 &
Michael J. Gandal ORCID: orcid.org/0000-0001-5800-5128^4,5,6,7,10^na1

Nature Genetics volume 55, pages 2117–2128 (2023)

Abstract

Methods integrating genetics with transcriptomic reference panels prioritize risk genes and mechanisms at only a fraction of trait-associated genetic loci, due in part to an overreliance on total gene expression as a molecular outcome measure. This challenge is particularly relevant for the brain, in which extensive splicing generates multiple distinct transcript-isoforms per gene. Due to complex correlation structures, isoform-level modeling from cis-window variants requires methodological innovation. Here we introduce isoTWAS, a multivariate, stepwise framework integrating genetics, isoform-level expression and phenotypic associations. Compared to gene-level methods, isoTWAS improves both isoform and gene expression prediction, yielding more testable genes, and increased power for discovery of trait associations within genome-wide association study loci across 15 neuropsychiatric traits. We illustrate multiple isoTWAS associations undetectable at the gene-level, prioritizing isoforms of AKT3, CUL3 and HSPD1 in schizophrenia and PCLO with multiple disorders. Results highlight the importance of incorporating isoform-level resolution within integrative approaches to increase discovery of trait associations, especially for brain-relevant traits.

Main

Recently, the number of genetic associations with complex traits identified by genome-wide association studies (GWAS) has increased considerably^1,2. However, translating these associations into concrete molecular mechanisms remains a great obstacle for the field. As GWAS hits predominantly localize within non-coding regions, often within large blocks of linkage disequilibrium (LD), a major challenge is prioritizing the underlying causal variant(s) and identifying their putative functional impact on nearby target genes. Numerous methods, including transcriptome-wide association studies (TWAS), have been developed to integrate population-level transcriptomic reference panels with GWAS summary statistics to prioritize genes at trait-associated loci^{3,4,5,6,7,8,9,10,11,12,13,14,15}. TWASs impute the cis-component of gene expression predicted by common variants into an association cohort, thereby reducing multiple comparisons and increasing interpretability by identifying a set of genes that may underlie the genetic association^3,4.

Previous integrative analyses have largely focused on total gene expression as the molecular outcome, and not the distinct transcript isoforms of a gene generated through alternative splicing, a tissue-specific gene regulatory mechanism present in ~90% of human genes that vastly expands the genome’s coding and regulatory potential^16,17,18,19. Compared with other tissues, brain-expressed genes are longer, contain more exons, and exhibit the most complex splicing pattern, contributing to the evolutionary and phenotypic complexity of the human brain^20,21,22,23. While Gencode v40 annotates 4.0±7.28 isoforms per gene (mean ± standard deviation), specific neuronal genes are individually known to have >1000 unique isoforms^24,25. Independent of gene expression, splicing dysregulation has been implicated in disease^{20,21,22,26,27,28}, especially for neuropsychiatric disorders^10,20,22,29. Local splicing events can be difficult to measure and integrate across multiple large-scale datasets. Splicing is often coordinated across a gene, yielding many non-independent features that increase the multiple testing burden. In contrast, transcript-isoform abundance can be rapidly estimated across large-scale RNA-sequencing (RNA-seq) datasets using pseudoalignment methods^30,31. Furthermore, in the brain, isoform-level expression changes have shown greater enrichment for schizophrenia (SCZ) heritability than gene or local splicing changes^{20,29,32,33,34}. However, to fully integrate transcript-isoform quantifications with GWASs, innovative computational methods are needed that jointly model the highly correlated isoforms of the same gene.

Here, we present isoform-level TWAS (isoTWAS), a flexible approach for complex trait mapping by integrating genetic effects on isoform-level expression with GWAS. Using simulations and data from the Genotype-Tissue Expression (GTEx) Project³⁵ and the PsychENCODE Consortium^20,22, we show that isoTWAS provides several advantages compared with gene-level methods. First, for transcriptomic prediction, the correlation between isoforms provides additional information unavailable when only gene-level expression is modeled. This leads to improved prediction accuracy³⁶ of >80% of individual isoforms, with a median of ~1.8- to 2.4-fold improvement, and of total gene expression by 25–70%. Second, this doubles the number of testable features in the trait mapping step. Third, divergent patterns of genetic effects across isoforms can be leveraged to provide a more granular hypothesis for a mechanism underlying the single-nucleotide polymorphism (SNP)–trait relationship. Finally, the isoTWAS framework jointly captures expression and splicing disease mechanisms while maintaining a well-controlled false discovery rate. Using GWAS data for 15 neuropsychiatric traits, isoTWAS greatly increases discovery of gene-level trait associations, uncovering associations at ~60% more GWAS loci compared to traditional gene-level TWAS. These results stress the need to shift focus to transcript isoforms to increase discovery of transcriptomic mechanisms underlying genetic associations with complex traits.

Results

The isoTWAS framework

isoTWAS prioritizes genes with transcript isoforms whose cis-genetic component of expression is significantly associated with a complex trait. We first jointly model the expression of distinct isoforms of a gene as a matrix while accounting for their pairwise correlation structure^3,4,24,35. Here, we assume that (1) local genetic variants directly modulate expression of an isoform and (2) the abundance of a gene is the sum of the abundance of its isoforms, computed as transcripts per million (TPM) (Extended Data Fig. 1a)^30,31,37,38. Integrating isoform-level expression into trait mapping may prioritize discoveries in disease mapping missed by gene-level integration, as in a setting where a gene has multiple isoforms but only one is associated with the trait (Fig. 1a). By modeling the genetic architectures of isoforms of a gene simultaneously, isoTWAS provides a deeper understanding of potential transcriptomic mechanisms that underlie genetic associations.
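Assumption (2) above can be illustrated with a minimal sketch. The isoform identifiers, TPM values and function name below are hypothetical and only show how isoform-level abundances would be summed into a gene-level abundance.

```python
from collections import defaultdict


def gene_tpm_from_isoforms(isoform_tpm, isoform_to_gene):
    """Sum isoform-level TPM values into gene-level TPM values."""
    gene_tpm = defaultdict(float)
    for isoform, tpm in isoform_tpm.items():
        gene_tpm[isoform_to_gene[isoform]] += tpm
    return dict(gene_tpm)


# Hypothetical isoform IDs and TPM values, for illustration only.
isoform_tpm = {"geneA-201": 12.5, "geneA-202": 3.0, "geneB-201": 7.25}
isoform_to_gene = {
    "geneA-201": "geneA",
    "geneA-202": "geneA",
    "geneB-201": "geneB",
}
gene_tpm = gene_tpm_from_isoforms(isoform_tpm, isoform_to_gene)
```

Under this assumption, a prediction of total gene expression can be obtained by summing the predicted expression of a gene's isoforms, which is how the "summed" isoTWAS predictions are described in later sections.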

**Fig. 1: Isoform-centric approach for complex trait mapping and prioritization of disease mechanisms at a genetic locus.**

The isoTWAS framework contains three steps (Fig. 1b). First, we build multivariate predictive models of isoform-level expression from all SNPs within 1 Mb in well-powered functional genomics training datasets (for example, GTEx³⁵ and PsychENCODE^20,22) using one of four multivariate penalized predictive frameworks^39,40,41,42. As a baseline for comparison, we modeled each individual isoform independently with univariate regularized regressions^4,41,43,44 (Methods). Model performance was assessed via 5-fold cross-validation (CV).
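The cross-validation bookkeeping in this first step can be sketched as follows. This is not the multivariate penalized model described above: a single-SNP ordinary least squares fit is swapped in as a stand-in so the example stays dependency-free, and CV R² is taken here as the squared correlation between observed and out-of-fold predicted expression, which is an assumption. All names and the simulated data are hypothetical.

```python
import random


def fit_ols(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    return mean_y - slope * mean_x, slope


def squared_correlation(a, b):
    """Squared Pearson correlation; 0.0 if either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    saa = sum((ai - mean_a) ** 2 for ai in a)
    sbb = sum((bi - mean_b) ** 2 for bi in b)
    sab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    return 0.0 if saa == 0 or sbb == 0 else sab ** 2 / (saa * sbb)


def cross_validated_r2(x, y, k=5, seed=0):
    """k-fold CV: predict each held-out fold from a model fit on the rest."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    predictions = [0.0] * len(x)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        intercept, slope = fit_ols([x[i] for i in train], [y[i] for i in train])
        for i in held_out:
            predictions[i] = intercept + slope * x[i]
    return squared_correlation(predictions, y)


# Simulated SNP dosages (0/1/2) and an isoform whose expression depends on them.
rng = random.Random(42)
dosage = [rng.choice([0, 1, 1, 2]) for _ in range(300)]
expression = [1.0 * g + rng.gauss(0.0, 1.0) for g in dosage]
r2 = cross_validated_r2(dosage, expression)
passes_cutoff = r2 > 0.01  # CV cutoff used for model inclusion in the text
```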

Second, we use these models to impute isoform expression into an external GWAS cohort and quantify the association with the target GWAS phenotype. If individual-level genotypes are available, isoform expression can be directly imputed as a linear combination of the SNPs in the models, and these associations can be estimated through appropriate regression analyses. If only GWAS summary statistics are available, imputation and association testing are conducted simultaneously through a weighted burden test⁴.
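Both routes can be sketched in a few lines. The sketch assumes the weighted burden statistic takes the standard summary-statistic TWAS form, Z = wᵀz / sqrt(wᵀΣw), where w holds the SNP weights of one isoform model, z the GWAS Z-scores and Σ the SNP correlation (LD) matrix. Function names and inputs are hypothetical.

```python
import math


def impute_expression(weights, dosages):
    """Individual-level route: imputed expression as a linear combination of SNPs."""
    return sum(w * g for w, g in zip(weights, dosages))


def burden_z(weights, gwas_z, ld):
    """Summary-statistic route: weighted burden Z-score, w'z / sqrt(w' LD w)."""
    n = len(weights)
    numerator = sum(w * z for w, z in zip(weights, gwas_z))
    variance = sum(
        weights[i] * ld[i][j] * weights[j] for i in range(n) for j in range(n)
    )
    return numerator / math.sqrt(variance)


def two_sided_p(z):
    """Two-sided P value of a standard normal Z-score."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical three-SNP model, GWAS Z-scores and LD matrix.
weights = [0.5, -0.2, 0.0]
gwas_z = [3.0, -1.0, 0.5]
ld = [[1.0, 0.3, 0.1], [0.3, 1.0, 0.2], [0.1, 0.2, 1.0]]
z_isoform = burden_z(weights, gwas_z, ld)
p_isoform = two_sided_p(z_isoform)
```

SNPs with zero weight drop out of both the numerator and the variance term, so only variants selected by the predictive model contribute to the test.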

Third, isoTWAS performs a stepwise hypothesis-testing procedure to account for multiple comparisons and control for local LD structure. Isoform-level P values are first aggregated to the gene-level using the aggregated Cauchy association test (ACAT)⁴⁵, where false discovery rates are controlled, and then individual isoforms of prioritized genes are subjected to post-hoc family-wise error control⁴⁶ (Extended Data Fig. 1b and Methods). After this step, a set of isoforms is identified whose cis-genetic components of expression are associated with the trait of interest⁴. For these isoforms, we apply a rigorous permutation test by permuting the SNP-to-isoform effects to generate a null distribution. This permutation test assesses how much signal is added by isoform expression, given the GWAS architecture of the locus, and controls for large LD blocks⁴. Lastly, we can perform isoform-level Bayesian fine mapping at loci with significant trait associations to identify the minimal credible set of isoforms that contains the ‘causal’ isoform and to assign individual posterior inclusion probabilities (Methods). isoTWAS is available as an R package⁴⁷.
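The first two stages of this procedure can be sketched as below. The sketch assumes equal ACAT weights, Benjamini-Hochberg adjustment for the gene-level false discovery rate, and a Bonferroni correction standing in for the post-hoc family-wise step (the procedure cited in the text may differ). Function names and example P values are hypothetical, and this is not the R package's implementation.

```python
import math


def acat(pvalues, weights=None):
    """Aggregated Cauchy association test for a list of P values."""
    if weights is None:
        weights = [1.0 / len(pvalues)] * len(pvalues)
    statistic = sum(
        w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, pvalues)
    )
    return 0.5 - math.atan(statistic / sum(weights)) / math.pi


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def stepwise_screen(isoform_p_by_gene, alpha=0.05):
    """Gene-level ACAT + FDR screen, then within-gene isoform correction."""
    genes = list(isoform_p_by_gene)
    gene_p = [acat(isoform_p_by_gene[g]) for g in genes]
    gene_q = benjamini_hochberg(gene_p)
    hits = {}
    for gene, q in zip(genes, gene_q):
        if q < alpha:
            pvals = isoform_p_by_gene[gene]
            # Bonferroni within the gene: a stand-in for the cited procedure.
            hits[gene] = [
                i for i, p in enumerate(pvals) if min(1.0, p * len(pvals)) < alpha
            ]
    return hits


# Hypothetical isoform-level P values for three genes.
isoform_p_by_gene = {
    "geneA": [1e-6, 0.4, 0.7],
    "geneB": [0.3, 0.6],
    "geneC": [0.02, 0.9],
}
hits = stepwise_screen(isoform_p_by_gene)
```

In this toy input only `geneA` survives the gene-level screen, and only its first isoform survives the within-gene correction; `geneC` has a nominally significant isoform but does not pass the gene-level false discovery threshold.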

Improved isoform and gene expression prediction

Previous work demonstrates that isoform-level quantifications from short-read RNA-seq, when propagated to the gene-level, can lead to more accurate gene expression estimates and differential expression inference^37,38. We therefore hypothesized that our multivariate SNP-based imputation of isoform expression, when aggregated to the gene level, would outperform traditional gene-level (for example, TWAS) models. To evaluate total gene expression predictions of TWAS and isoTWAS models across multiple genetic architectures, we conducted simulations across 22 different gene loci using European-ancestry reference data⁴⁸. At each gene locus, we controlled expression heritability and simulated 2–10 distinct isoforms, varying the proportion of causal isoform-level quantitative trait loci (isoQTLs; p_causal) and their sharing between isoforms (p_shared) (Methods and Fig. 2a).

**Fig. 2: IsoTWAS models predict gene expression with more accuracy than TWAS models in simulated data.**

For isoTWAS, multivariate elastic net⁴¹ demonstrated the greatest CV prediction of isoform expression across most simulation settings (Fig. 2b, Extended Data Fig. 2a and Supplementary Data 1). For total gene expression prediction, the optimal isoTWAS models in sum outperformed the optimal TWAS model, particularly at sparser isoQTL architectures, with median absolute increase in adjusted R² of 0.6–3.5% (Fig. 2c, Extended Data Fig. 2b and Supplementary Data 2). Performance gains decreased with denser isoQTL architectures, although we expect approximately 0.1–1% quantitative trait locus (QTL) sparsity (that is, 1–10 causal expression, or e-, and isoQTLs per gene or isoform)³⁵. In simulations, isoTWAS prediction of gene expression also increases as the proportion of shared non-zero effect SNPs across isoforms decreases (Fig. 2b,c, Extended Data Fig. 2b and Supplementary Data 2).

Next, we assessed predictive performance in GTEx data from 48 tissues (13 brain) with sufficient sample sizes (N > 100) for all genes with multiple expressed isoforms (Supplementary Table 1 and Methods). Altogether, we built predictive models for 50,000 to 80,000 isoforms across 8,000 to 12,000 unique genes per tissue that met CV cutoffs (Methods, Extended Data Figs. 3–5 and Supplementary Table 2).

We considered three criteria to evaluate the prediction of both the multivariate and isoform-centric approaches of isoTWAS: (1) the number of isoforms imputed using multivariate/univariate models with CV R² > 0.01, (2) the number of unique genes with >1 isoform imputed at CV R² > 0.01 and (3) the number of unique genes with total gene expression imputed at CV R² > 0.01 using isoTWAS (summed) or TWAS models. At the isoform level (criterion 1), through multivariate modeling, we trained 2.3- to 2.5-fold more models at CV R² > 0.01 across the 48 tissues, compared to univariate approaches (Fig. 3a). isoTWAS improved prediction for 79–82% of isoforms with a median ~1.8- to 2.4-fold increase in adjusted R² (Extended Data Fig. 3a,b and Supplementary Table 2). Concordant with simulations, multivariate elastic net outperformed other methods, indicating that leveraging the shared genetic architecture between isoforms aids in marginal prediction of each isoform (Extended Data Fig. 3c and Supplementary Table 2). Additionally, multivariate models were particularly powerful in brain tissues compared to other tissues in GTEx, showing significantly improved performance compared to univariate models (Fig. 3b; P = 0.011 from ordinary least squares regression of median percent increase in CV R² across tissue, adjusted for sample size). This suggests more shared isoQTL architecture in brain tissues than others, which isoTWAS leverages for improved prediction. These gains in prediction accuracy translate into increased power in trait association⁴⁹.

**Fig. 3: Multivariate isoform-level models outperform gene-level models in predicting total gene expression.**

At the gene level (criteria 2 and 3), isoTWAS increased the number of genes with testable models in the trait mapping step and improved prediction of total gene expression. The number of unique genes with >1 isoTWAS model at CV R² > 0.01 (inclusion criterion for isoTWAS trait mapping) was 1.9–2.5 times larger than the number of unique genes with TWAS models achieving CV R² > 0.01 for gene expression prediction (Fig. 3c, Extended Data Fig. 4a and Supplementary Table 2). For a given gene, isoTWAS models (summed) outperformed TWAS models in prediction of total gene expression by a median of 25–70% in CV (Extended Data Fig. 4b) with a 50–80% increase in the number of genes that are predicted at CV R² > 0.01 (Fig. 3d and Extended Data Fig. 5). We replicated these gains in total gene expression prediction using an independent, out-of-sample QTL dataset of adult cortex from PsychENCODE/AMP-AD (Methods). Multivariate isoTWAS models outperformed univariate TWAS models in predicting total gene expression, with a 15.2% median percent increase in adjusted R² when training in GTEx and testing in PsychENCODE/AMP-AD and 23.9% vice versa (Fig. 3e and Supplementary Table 3).

As genes differ in the number and expression patterns of their constituent isoforms, gene length, SNP density, quantification accuracy, and other relevant factors, we characterized their impact on isoTWAS performance (Methods, Supplementary Note, Extended Data Fig. 6 and Supplementary Data 3 and 4). We also evaluated the impact of reference transcriptome annotation fidelity by generating a synthetic dataset quantified using a reference annotation masking the dominant isoforms for a set of genes (Extended Data Fig. 3d). We discuss these evaluations in detail in Supplementary Note.

In total, as predictive performance is positively related to power to detect trait associations⁴⁹, both the increased number and accuracy of trainable imputation models using isoTWAS have strong implications for increased discovery⁴⁹.

Calibrated null and improved power across architectures

We next introduced GWAS data for complex traits into our simulation framework to benchmark the false positive rate (FPR) and power of isoTWAS (Methods). First, the FPR is controlled at 0.05 for isoform-level mapping using ACAT (Extended Data Fig. 7a and Supplementary Data 5). For a simulated trait, we modeled causal effect architectures for a genomic locus with 2–10 isoforms under three scenarios (Methods, Fig. 4 and Extended Data Fig. 7b): (1) where the true trait effect is from only total gene expression, (2) where there is only one ‘effect isoform’ with a non-zero effect on the trait and (3) where there are two effect isoforms with varying magnitudes of association. Scenario 1 showed clear increases in power for TWAS over isoTWAS, but this advantage decreased with increased causal proportion of isoQTLs and proportion of shared isoQTLs (Fig. 4a and Supplementary Data 6). For scenarios 2 and 3, as effects on the trait varied across isoforms of the same gene (Fig. 4b,c and Supplementary Data 7 and 8), isoTWAS showed clear increases in power over TWAS across most scenarios and causal effect architectures and particularly in settings with one effect isoform or two divergent effect isoforms. However, when the effect sizes of two effect isoforms converged, TWAS and isoTWAS demonstrated similar power (Fig. 4c).
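The false positive rate check can be sketched as a small null simulation. The sketch assumes isoform-level P values are independent and uniform under the null, which simplifies the correlated isoforms simulated in the paper, and it uses equal ACAT weights. The function names, seed and number of replicates are hypothetical.

```python
import math
import random


def acat(pvalues):
    """Aggregated Cauchy association test with equal weights."""
    statistic = sum(math.tan((0.5 - p) * math.pi) for p in pvalues) / len(pvalues)
    return 0.5 - math.atan(statistic) / math.pi


def null_false_positive_rate(n_genes=5000, alpha=0.05, seed=1):
    """Fraction of null genes (2-10 isoforms each) called significant by ACAT."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_genes):
        n_isoforms = rng.randint(2, 10)
        null_pvalues = [rng.random() for _ in range(n_isoforms)]
        if acat(null_pvalues) < alpha:
            false_positives += 1
    return false_positives / n_genes


fpr = null_false_positive_rate()
```

A well-calibrated test should give an estimate close to the nominal 0.05, up to sampling error in the simulation.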

**Fig. 4: IsoTWAS improves power to detect gene-trait associations in simulations, especially when genetic effects differ across isoforms.**

Finally, we assessed the performance of probabilistic fine mapping in identifying the true effect isoform in our simulation framework of genes with 5 or 10 isoforms (Methods, Extended Data Fig. 7c and Supplementary Data 9). The sensitivity of 90% credible sets (proportion of credible sets containing the true effect isoform) was undercalibrated, likely due to difficulties in fine mapping when QTL horizontal pleiotropy is high⁵⁰. With increasing proportions of shared isoQTLs, the sensitivity of 90% credible sets decreased and the mean set size increased. Our simulation results suggest that varied isoQTL architectures and isoform–trait effects for isoforms of the same gene are key features that influence power gains in isoform-centric modeling.
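How a 90% credible set and its sensitivity are computed can be sketched as follows. The sketch assumes posterior inclusion probabilities are rescaled to sum to one within a locus and that isoforms are added in decreasing order until the cumulative probability reaches the target level; actual fine-mapping software may handle a null model differently. Names and values are hypothetical.

```python
def credible_set(pips, level=0.9):
    """Smallest set of isoforms whose normalized PIPs sum to at least `level`."""
    total = sum(pips.values())
    ranked = sorted(pips.items(), key=lambda item: item[1], reverse=True)
    chosen, cumulative = [], 0.0
    for isoform, pip in ranked:
        chosen.append(isoform)
        cumulative += pip / total
        if cumulative >= level:
            break
    return chosen


def sensitivity(credible_sets, true_isoforms):
    """Proportion of credible sets that contain the true effect isoform."""
    hits = sum(
        1 for members, truth in zip(credible_sets, true_isoforms) if truth in members
    )
    return hits / len(credible_sets)


# Hypothetical posterior inclusion probabilities at one locus.
pips = {"iso1": 0.7, "iso2": 0.25, "iso3": 0.05}
cs90 = credible_set(pips)
```

When isoQTLs are heavily shared, probability mass spreads across isoforms, so more isoforms are needed to reach 90% and the set is more likely to miss the true one, matching the trend described above.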

Improved trait mapping across 15 neuropsychiatric GWAS

To explore our central hypothesis that isoform-centric multivariate prediction improves discovery for complex trait mapping, particularly for brain relevant traits, we next compared isoTWAS/TWAS trait mapping across 15 neuropsychiatric traits. To maximize discovery, we trained both isoTWAS and TWAS models using a large adult brain functional genomics reference panel (N = 2,115), composed of frontal cortex samples from PsychENCODE and AMP-AD Consortia^20,51, and using a developmental²² prefrontal cortex (N = 205) dataset (Methods, Fig. 5 and Extended Data Fig. 8). In the adult cortex, we trained models for 15,127 genes using isoTWAS passing the CV R² > 0.01 cutoff, compared to 14,283 genes using gene-level TWAS. In the developing cortex, despite a smaller sample size, 16,504 and 10,535 models for genes were successfully trained using isoTWAS and TWAS, respectively (Methods and Supplementary Table 1).

**Fig. 5: Isoform-level trait mapping increases discovery of genetic associations over gene-level trait mapping.**

We applied these models to perform trait mapping using summary statistics from 15 brain-related GWAS^{52,53,54,55,56,57,58,59,60,61,62,63,64,65,66} (Methods, Fig. 5a and Extended Data Fig. 8a) using the stepwise hypothesis-testing procedure (false discovery rate-adjusted P < 0.05 and within-locus permutation P_ACAT < 0.05). We detected more trait-associated genes with isoTWAS compared with TWAS, across adult (2,595 versus 1,589 genes) and developmental (4,062 versus 890 genes) reference panels, respectively (Extended Data Fig. 8b and Supplementary Data 10–13). Across both reference panels and all 15 traits, isoTWAS detected 3,436 unique gene and 5,377 unique isoform–trait associations (Extended Data Fig. 8c). Of the 1,335 genes with multiple isoform–trait associations, 661 genes exhibited distinct isoform-level associations in different directions.
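The within-locus permutation step can be sketched as below. It assumes that permuting SNP-to-isoform effects means shuffling the model weights across the SNPs at the locus and recomputing the weighted burden Z-score each time; this is one reading of the description, not the package's code. All names and inputs are hypothetical.

```python
import math
import random


def burden_z(weights, gwas_z, ld):
    """Weighted burden Z-score, w'z / sqrt(w' LD w)."""
    n = len(weights)
    numerator = sum(w * z for w, z in zip(weights, gwas_z))
    variance = sum(
        weights[i] * ld[i][j] * weights[j] for i in range(n) for j in range(n)
    )
    return numerator / math.sqrt(variance)


def permutation_p(weights, gwas_z, ld, n_perm=1000, seed=0):
    """Share of weight permutations with |Z| at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(burden_z(weights, gwas_z, ld))
    shuffled = list(weights)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(burden_z(shuffled, gwas_z, ld)) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_perm + 1)


# Hypothetical five-SNP locus with LD decaying as 0.5^|i-j|.
ld = [[0.5 ** abs(i - j) for j in range(5)] for i in range(5)]
gwas_z = [0.4, 1.1, 4.8, 1.3, 0.2]
weights = [0.0, 0.05, 0.6, 0.0, -0.02]
p_perm = permutation_p(weights, gwas_z, ld)
```

A small permutation P value indicates that the specific pairing of weights to SNPs, rather than the GWAS signal at the locus alone, drives the association.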

We next compared the performance of isoTWAS/TWAS in prioritizing candidate mechanisms within independent, high-confidence GWAS-significant loci⁶⁷. Across a combined 1,149 GWAS loci, isoTWAS identified significant associations within 323, compared with 201 detected by TWAS, a ~60% increase in discovery (Fig. 5b, Methods and Supplementary Table 4). Of the 287 GWAS loci identified for SCZ⁶⁸, isoTWAS prioritized genes within 70 and 86 unique loci across adult and developmental cortex, respectively, compared with 56 and 29 loci for TWAS (Fig. 5b). Furthermore, 96% of gene-level TWAS associations (193/201) were concordantly identified by isoTWAS. Likewise, the standardized effect sizes for significant gene- and isoform-level associations were highly correlated (r = 0.84, P < 2.2 × 10⁻¹⁶; Fig. 5c). Finally, to explore whether these isoTWAS-specific associations were capturing true disease signal, we compared the rate at which each method prioritized constrained genes (probability of loss-of-function intolerance, pLI ≥ 0.9; Supplementary Tables 5–8), which are known to be substantially enriched for disease associations⁶⁹. Across adult and developmental panels, respectively, isoTWAS prioritized 724 and 385 constrained genes compared to 106 and 200 with TWAS (Fisher’s exact test, adult: P = 0.048, developmental: P = 1.23 × 10⁻⁵). Altogether, isoTWAS not only recovers the vast majority of TWAS associations but also increases discovery of candidate GWAS mechanisms, particularly for genes intolerant to protein-truncating variation⁷⁰.
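The two headline proportions in this paragraph follow directly from the reported locus counts, as this short worked calculation shows:

```latex
\[
\frac{323 - 201}{201} = \frac{122}{201} \approx 0.61,
\qquad
\frac{193}{201} \approx 0.96
\]
```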

To investigate whether this increase in trait mapping discovery reflected true biological signal rather than test statistic inflation due to the increased number of tests (~4-fold increase in number of tests), we next compared the null distributions across methods for results (Extended Data Fig. 9). As the genomic inflation factor is not a reliable measure in TWAS settings⁷¹, we estimated inflation in gene-level test statistics using an empirical Bayes approach (Methods). There were no significant differences between TWAS and isoTWAS in the 95% credible intervals for test statistic inflation (Fig. 5d). Using a heuristic to estimate increases in effective sample size (Methods), we observed an approximate increase in effective sample size of 10–20% when using isoTWAS compared to TWAS (Fig. 5e and Supplementary Table 9). These analyses indicate that isoTWAS discovery is well-calibrated to the null and facilitates increased discovery in real data compared to gene-level TWAS.

We empirically compared probabilistic fine mapping⁵⁰ of results from isoTWAS and gene-level TWAS (Methods and Extended Data Fig. 8d). Here, we conducted fine mapping in loci with one or more significant trait-associated genes/isoforms (adjusted P < 0.05 and permutation P < 0.05) within 1 Mb of one another, termed risk regions. Overall, the mean number of genes in a risk region using TWAS was 3.15 compared to 3.90 using isoTWAS; the mean number of genes in a 90% credible set using TWAS was 1.33 compared to 1.25 using isoTWAS. On average, there were 1.54 isoforms per gene in a risk region and 1.27 isoforms per gene in a 90% credible set. Isoform-centric modeling presents unique challenges for fine mapping due to potentially high levels of horizontal pleiotropy and remains an important and open question for the field. Nevertheless, isoTWAS identified a comparable number of genes in risk regions compared with TWAS, and the combination of two-step trait mapping, permutation testing, and probabilistic fine mapping maintained narrow credible set sizes.

Lastly, we compared discovery using isoTWAS to discovery using local splicing-event-based trait mapping. For the developmental brain dataset, we calculated intron usage using LeafCutter⁷² and transformed these usage percentages to M-values⁷³. Then, for all introns mapped to a given gene, we used all SNPs within 1 Mb of a splicing event to predict its usage and mapped trait associations for these splicing events using isoTWAS’s multivariate framework (Methods). Overall, when aggregated to the gene-level, across 15 traits, we found that isoTWAS prioritized features at ~40% more independent GWAS loci (167 loci) than splicing-event-based trait mapping (119 loci), with 108 loci (90.7%) jointly identified (Fig. 5f), using the same developmental brain reference panel. Taken together, isoTWAS’s specific focus on modeling isoforms of a gene provided gains in trait association discovery over considering only total gene expression or intron usage.
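The transformation of intron usage to M-values can be sketched as below, assuming the usual definition of an M-value for a proportion, log2(p / (1 − p)). The clamping offset that keeps usages of exactly 0 or 1 finite is a hypothetical choice, not a value taken from the Methods.

```python
import math


def m_value(usage, offset=1e-3):
    """Map an intron usage proportion in [0, 1] to log2(usage / (1 - usage))."""
    # Clamp away from 0 and 1 so the logarithm stays finite (hypothetical offset).
    clamped = min(max(usage, offset), 1.0 - offset)
    return math.log2(clamped / (1.0 - clamped))


# Hypothetical usage proportions for four introns of one gene.
usages = [0.0, 0.25, 0.5, 0.8]
m_values = [m_value(u) for u in usages]
```

The transform spreads proportions near 0 and 1 over an unbounded scale, which tends to suit linear predictive models better than raw percentages.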

isoTWAS identifies trait associations undetectable by TWAS

Overall, isoTWAS prioritized dozens of candidate risk genes and mechanisms in the developing and adult brain for 15 neuropsychiatric traits. These isoTWAS-prioritized genes were enriched for relevant pathways consistent with the biology of the underlying trait: cell proliferation for brain volume (BV), calcium channel activity for SCZ and neuroticism (NTSM), and proteasome regulation in Alzheimer’s disease (ALZ) (Extended Data Fig. 10a). In the Supplementary Note, we discuss several examples of trait associations for which isoTWAS prioritized a highly constrained gene within a GWAS locus (Supplementary Tables 5–8)^{74,75,76,77,78,79}.

A main advantage of isoTWAS over TWAS is the identification of trait associations for isoforms of genes, where the gene itself is not associated with the trait. We illustrate several examples of isoTWAS-prioritized isoforms, all in the adult cortex, for genes with limited or distinct expression QTLs (Fig. 6, Extended Data Fig. 10b and Supplementary Data 14), with exon/intron structure shown in Supplementary Figs 1–4. First, we detected a SCZ association with ENST00000492957, an isoform of AKT3 (1q43-q44, pLI = 1), which encodes a serine/threonine-protein kinase that regulates cell life cycle (for example, growth, proliferation and survival). AKT3 has shown effects on anxiety, spatial-contextual memory, and fear extinction in mice, and loss-of-function of AKT3 causes learning and memory deficits^80,81. Within the GWAS locus, there was a strong overlapping isoQTL signal (P < 10⁻⁵⁰) but only one eQTL with P < 10⁻⁶, which is in low LD with the GWAS-significant SNPs (Fig. 6a). The lead isoQTL (rs4430311) showed a significant, negative association with ENST00000492957, but a nominally significant positive association with AKT3 expression. Interestingly, a different isoform of AKT3 (ENST00000681794) was prioritized in an association with BV, which also has a GWAS association at this locus (Extended Data Fig. 10b). The two distinct isoforms of AKT3 have distinct 3’ transcript structures, close to the lead isoQTL of ENST00000681794. These results suggest a complex role of AKT3 isoforms with brain-related traits to be explored further.

**Fig. 6: isoTWAS implicates isoforms of AKT3, CUL3, HSPD1, and PCLO in genetic associations with psychiatric traits.**

Similarly, we found a strong isoQTL signal for ENST00000409096 but a weak eQTL signal of its gene CUL3 in the 2q36.2 locus (pLI = 0.99), in another association with SCZ (Fig. 6b). CUL3 is involved in cell cycle regulation, protein trafficking and signal transduction, and its dysregulation is a potential mechanism for both SCZ and autism spectrum disorder (ASD) risk⁸². Next, isoform ENST00000678969 of HSPD1, encoding a mitochondrial heat shock protein, was associated with SCZ risk (pLI = 0.99, 2q33.1) and showed a similar pattern across GWAS, eQTL and isoQTL signals (Fig. 6c). HSPD1 is among multiple non-MHC immune genes implicated in SCZ and has roles in brain hypomyelination⁸³. Lastly, ENST00000423517, an isoform of PCLO, was associated with multiple traits in the cross-disorder (CDG) GWAS (meta-analysis of attention deficit hyperactivity disorder, bipolar disorder, major depression and SCZ, pLI = 1, 7q21.11). Again, we found a strong isoQTL but not eQTL signal, with the CDG risk allele negatively associated with isoform expression. PCLO is involved in the presynaptic cytoskeletal matrix, establishing active synaptic zones, and synaptic vesicle trafficking; rare variants of PCLO in diverse populations have been recently implicated in risk of SCZ and ASD^84,85. Altogether, these results highlight the importance of incorporating isoform-level regulation for prioritizing novel candidate GWAS risk mechanisms, as implemented in our isoTWAS framework.

Discussion

We present isoTWAS, a framework that integrates genetic and isoform-level transcriptomic variation with GWAS to identify gene expression-trait associations and prioritize a set of isoforms of the gene that best explains the association. We provide an extensive set of isoform-level predictive models^86,87,88 and software to train models and conduct isoform-level trait mapping with GWAS summary statistics⁴⁷.

isoTWAS presents several advantages over gene-level TWAS or univariate modeling of isoform expression. First, modeling expression at the isoform-level can detect isoQTL architectures that vary across isoforms and are not captured by gene-level eQTLs. Second, joint multivariate isoform-level modeling improved predictive accuracy of isoform and total gene expression. Third, aggregating isoform-level associations to the gene-level substantially increased power to detect trait associations. We attribute this increase in power to three features: (1) isoform-level modeling in isoTWAS increases the number of imputable genes by >2-fold, (2) isoTWAS models improve gene-level prediction up to 35% and (3) isoTWAS jointly models expression and splicing regulation, capturing underlying complex trait mechanisms. Finally, as genetic control of isoform expression is often more tissue- and cell-type-specific than eQTLs^26,35, we hypothesize that isoTWAS is more capable of uncovering context-specific trait associations.

Recent work has highlighted alternative splicing as a promising mechanism underlying complex traits not captured through eQTLs^20,22,26,89, as mapping genetic regulation at the exon- rather than gene-level often leads to more detected signal⁹⁰. However, most of these analyses focused on local splicing events or exon-level inclusion, rather than different isoforms of the same gene, which reflect the combined consequences of these splicing events. Local splicing events can be difficult to systematically measure and integrate across multiple large-scale datasets, which is necessary for achieving sufficient sample sizes to interrogate population-level allelic effects^20,21. Our results demonstrate that isoform-centric trait mapping with isoTWAS increases discovery by ~40% compared with a matched local splicing-event-based analysis, although these methods may recover some independent signal. Future work should integrate reference-guided and annotation-free approaches for isoform and local splicing quantification to develop nuanced mechanistic hypotheses for GWAS loci.

We conclude with limitations of and future considerations for isoTWAS. First, isoform-level expression quantifications are maximum-likelihood estimates, due to limitations of short-read RNA-seq. These estimates are guided by existing transcriptome annotations and thus are dependent on their completeness and accuracy. Further, dataset-specific sequencing factors will affect the accuracy of these estimates, especially sequencing depth, read length, and library preparation. The emergence of long-read sequencing platforms will be instrumental for improving tissue-specific reference transcriptome annotations, which, in turn, will improve isoTWAS. As these methods continue to gain scalability and cost-effectiveness, they will ultimately replace short-read sequencing and isoform estimation for population-scale datasets. isoTWAS is agnostic to the method of isoform expression quantification and will continue to be applicable as we approach the long-read sequencing era.

Second, although inferential replicates from RNA-seq quantification can provide measures of technical variation, they are not incorporated into the predictive models. Our analyses of prediction across inferential replicates suggest a methodological opportunity: leveraging these inferential replicates as a measure of quantification error to estimate the robustness of isoform prediction and the precision of SNP effects. A predictive model that estimates standard errors for SNP effects by model averaging across replicates may improve trait mapping by providing a prediction interval for imputed expression. Third, as isoform-level trait mapping is akin to differential transcript expression analysis, isoTWAS can be extended to analyses of genetically regulated transcript usage. However, it is unclear if the compositional nature of transcript usage data needs to be accounted for during prediction or trait mapping⁹¹. Lastly, isoTWAS can suffer from reduced power, inflated false positives and reduced fine-mapping sensitivity in the presence of SNP horizontal pleiotropy^92,93. For pathways that are not observed or accounted for in the reference expression panel and GWAS, accounting for horizontal pleiotropy may improve trait mapping. We motivate extensions of probabilistic fine mapping to reconcile pleiotropy for SNPs shared across models for multiple isoforms at the same genetic locus, as summary-statistic-based methods that control for horizontal pleiotropy are not yet effective⁹⁴.

isoTWAS provides a flexible framework to interrogate the transcriptomic mechanisms underlying genetic associations with complex traits and generate biologically meaningful and testable hypotheses about disease risk mechanisms. We emphasize a shift in focus from quantifications of the transcriptome on the gene-level to the transcript-isoform level to maximize discovery of transcriptome-centric genetic associations with complex traits.

Methods

Ethical approval

We use public data with previous ethical approval^{20,22,35,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66}, and our study did not need any specific approval.

Overview of isoTWAS

isoTWAS consists of three steps: (1) training predictive models of isoform expression, (2) imputing isoform-specific expression into a separate GWAS panel and (3) association testing between imputed expression and a phenotype (Fig. 1b). isoTWAS contrasts with TWAS as it models correlations between the expression of isoforms of the same gene. Further mathematical details are provided in Supplementary Methods.

Training predictive models of isoform expression

Model and assumptions

Assume a gene G has M isoforms with expression levels across N samples, with each sample having R inferential replicates. Let ${Y}_{G}^{* }$ be the N × M matrix of mean isoform expression (log-scale TPM) for the N samples and $M$ isoforms, using the expectation-maximization point estimates from a pseudo-mapping quantification algorithm, like Salmon or kallisto^30,31. We can jointly model isoform expression with a system of N × M × R equations. For sample $n\in \{1,\ldots ,N\}$, isoform $m\in \{1,\ldots ,M\}$ of gene G, and replicate $r\in \{1,\ldots ,R\}$, we have:

$${y}_{nmr}={\mathbf{x}}_{\mathbf{n}}{\mathbf{\beta}}_{\mathbf{m}}+{\epsilon}_{nmr},$$

(1)

where y_nmr is the expression of isoform m for the rth inferential replicate of sample n, x_n is the P-vector (vector of length P) of cis-genotypes in a 1 Mb window around gene G, β_m is the P-vector of genetic effects of the P genotypes on isoform expression, and ${\epsilon }_{{nmr}}$ is normally distributed random noise with mean 0 and variance ${\sigma }_{{nmr}}^{2}$. We standardize both the genotypes and the isoform expression to mean 0 and variance 1. As the SNP vector x_n does not differ across replicates, we assume that ${\epsilon }_{{nmr}}$ are independent and identically distributed across samples $n\in \{1,\ldots ,N\}$ and identically distributed across replicates $r\in \{1,\ldots ,R\}$. Accordingly, the point estimates of the SNP effects on isoform expression are not influenced by differences in expression across replications. Therefore, in matrix form, we consider the following predictive model:

$${Y}_{G}^{* }={X}_{G}{B}_{G}+{E}_{G}.$$

(2)

Here, X_G is the N × P matrix of genotype dosages, B_G is the P × M matrix of SNP effects on isoform expression and E_G is a matrix of random errors, such that ${vec}\left({E}_{G}\right)\sim {N}_{{NM}}(0,\varSigma ={\Omega }^{-1}\otimes {I}_{N})$. Σ represents the variance-covariance matrix in the errors (with precision matrix $\varOmega ={\Sigma }^{-1}$), following the above independence assumptions.

Estimating SNP effects on isoform expression

We apply five methods to estimate ${\hat{B}}_{G}$, the matrix of SNP effects on isoform expression. The first four are multivariate methods that model the isoforms jointly; the last method models each isoform separately using univariate methods. The goal of this SNP effect estimation is marginal prediction, that is, leveraging the correlation between isoforms to improve prediction of each isoform separately. The ${\hat{B}}_{G}$ matrix that gives the largest adjusted R² in 5-fold CV across the five methods is selected as the final model to predict isoform expression for a given gene. When interested in predicting gene-level expression from these predicted isoforms, isoTWAS trains an elastic net penalized linear regression that predicts gene-level expression from genetically predicted isoform-level expression; this model training is conducted across the same 5 folds to prevent data leakage⁹⁵. The five methods are as follows (Supplementary Methods):

(1) Multivariate elastic net (MVEnet) regression: This is an extension of elastic net, where the response is a matrix of correlated responses⁴¹. The absolute penalty is imposed on each coefficient by a group-lasso penalty on each vector of SNP effects across isoforms (rows of B_G). Accordingly, a SNP can only have a non-zero effect on an isoform if it has a non-zero effect on all isoforms.
(2) Multivariate LASSO regression with covariance estimation (MRCE): We adapt Rothman et al.’s proposed procedure to simultaneously and iteratively estimate both ${\hat{B}}_{G}$, the SNP effects matrix, and ${\hat{\varOmega }}$, the precision matrix⁴⁰. This procedure accounts for the correlation between isoforms but does not impose the group-lasso penalty as in MVEnet.
(3) Multivariate elastic net with stacked generalization (joinet): We use Rauschenberger and Glaab’s joinet method, which uses a two-step prediction⁴²: first, the design matrix of SNPs is used to generate a cross-validated prediction of each isoform, and second, the matrix of predicted isoform expression is used to predict each isoform.
(4) Sparse partial least squares (SPLS): This is an implementation of partial least squares with a sparsity penalty that attempts to find an optimal latent decomposition for the linear relationship between the matrix of isoform expression and the design matrix of SNPs. We use Chun and Keles’s implementation from the spls R package³⁹.
(5) Univariate FUSION: We disregard the correlation structure between isoforms and train three predictive models for each isoform separately: a univariate elastic net⁴¹, the best linear unbiased predictor (BLUP) in a linear mixed model⁴⁴, and SuSiE⁴³. The model with the largest adjusted R² out of these three models is outputted. This approach serves as a baseline measurement for prediction of each isoform independently.
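The selection rule described above (keep the method with the largest cross-validated adjusted R²) can be illustrated with a minimal Python sketch. It is not the isoTWAS implementation: it assumes out-of-fold predictions are already available for a single isoform, and it assumes the adjusted R² comes from regressing observed on predicted expression with one predictor. The function names and the method labels in the usage lines are hypothetical.

```python
def adjusted_r2(observed, predicted, n_predictors=1):
    """Adjusted R^2 from regressing observed expression on out-of-fold predictions.

    Assumption: a single predictor (the predicted expression), so R^2 is the
    squared Pearson correlation between observed and predicted values.
    """
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        return 0.0  # a constant vector carries no predictive information
    r2 = cov ** 2 / (var_o * var_p)
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)


def pick_best_method(observed, predictions_by_method):
    """Return the method name with the largest adjusted R^2, plus all scores."""
    scores = {name: adjusted_r2(observed, pred)
              for name, pred in predictions_by_method.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical out-of-fold predictions for one isoform from two methods.
observed = [0.0, 1.0, 2.0, 3.0, 4.0]
best, scores = pick_best_method(
    observed,
    {"method_a": [0.1, 0.9, 2.2, 2.8, 4.1], "method_b": [1.0, 0.0, 2.0, 1.0, 3.0]},
)
```

How the per-isoform scores are combined into a single choice of ${\hat{B}}_{G}$ for a gene is not specified in this sketch.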

Trait association and stepwise hypothesis testing

The tests of association in isoTWAS are like tests in differential transcript expression analyses, as TWAS tests of association are analogous to tests in differential gene expression analyses. isoTWAS and TWAS are distinct, as these methods consider imputed isoform and gene expression, respectively, as predicted by the trained expression models. If individual-level genotypes are available in the external GWAS panel, isoform expression can be directly imputed by multiplying the SNP weights from the predictive model with the genotype dosages in the GWAS panel. If only summary statistics are available, we adopt the weighted burden test from Gusev et al. with an ancestry-matched LD panel^4,93. Compared to TWAS, isoTWAS association testing involves an increased number of tests (~4 isoforms per gene)²⁴ and potential correlation in test statistics for isoforms of the same gene.
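The two routes described above can be sketched briefly. The sketch below assumes the weighted burden statistic takes the form Z = w'z / sqrt(w'Vw), where w are the SNP weights for an isoform, z are GWAS SNP Z-scores and V is the SNP correlation (LD) matrix from the reference panel. Function names and toy inputs are illustrative, not part of the isoTWAS package.

```python
from math import sqrt


def impute_expression(dosages, weights):
    """Individual-level route: imputed expression = genotype dosages x SNP weights."""
    return [sum(g * w for g, w in zip(row, weights)) for row in dosages]


def weighted_burden_z(weights, gwas_z, ld):
    """Summary-statistic route: Z = w'z / sqrt(w' V w), with V the LD matrix."""
    p = len(weights)
    numerator = sum(w * z for w, z in zip(weights, gwas_z))
    variance = sum(weights[i] * ld[i][j] * weights[j]
                   for i in range(p) for j in range(p))
    if variance <= 0:
        raise ValueError("weights must yield a positive variance under the LD matrix")
    return numerator / sqrt(variance)


# Toy example: two SNPs in moderate LD, both with positive weights.
ld = [[1.0, 0.5], [0.5, 1.0]]
z_isoform = weighted_burden_z([1.0, 1.0], [2.0, 2.0], ld)  # 4 / sqrt(3)
imputed = impute_expression([[0, 1], [2, 1]], [0.5, -0.2])
```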

We perform a two-step hypothesis-testing framework (Extended Data Fig. 1b)⁴⁶. In the first step, for every isoform with a trained model, we generate a test statistic using either linear regression for GWAS with individual-level genotypes or the weighted burden test for GWAS with only summary statistics⁴. Given the t test statistics ${T}_{1},\ldots ,{T}_{t}$ for isoforms for a gene, an omnibus test aggregates the t test statistics into a single P value for a gene. We benchmarked different omnibus tests in simulations, but the default omnibus test in isoTWAS is ACAT⁴⁵. We control for false discovery across all genes via the Benjamini-Hochberg procedure, but the Bonferroni procedure can also be applied for more conservative false discovery control. In the second step, for isoforms for genes with an adjusted omnibus P < 0.05, we employ Shaffer’s modified sequentially rejective Bonferroni procedure to control the within-gene family-wise error rate. At the end of these two steps, we identify a set of genes and their isoforms that are associated with the trait.
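The first (screening) step can be sketched as follows, assuming equal weights in the ACAT (Cauchy combination) statistic and the standard Benjamini-Hochberg step-up adjustment. The second step (Shaffer's procedure) is not implemented here, and the function names and toy P values are illustrative only. Production implementations of ACAT typically use an approximation for extremely small P values, which this sketch omits.

```python
from math import atan, pi, tan


def acat(p_values, weights=None):
    """Cauchy combination of P values; equal weights are assumed by default."""
    t = len(p_values)
    if weights is None:
        weights = [1.0 / t] * t
    stat = sum(w * tan((0.5 - p) * pi) for w, p in zip(weights, p_values))
    return 0.5 - atan(stat / sum(weights)) / pi


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def screen_genes(isoform_p_by_gene, alpha=0.05):
    """Step 1: one omnibus P value per gene, then FDR control across genes."""
    genes = list(isoform_p_by_gene)
    omnibus = [acat(isoform_p_by_gene[g]) for g in genes]
    adjusted = benjamini_hochberg(omnibus)
    return {g: q for g, q in zip(genes, adjusted) if q < alpha}


# Toy input: isoform-level P values for two hypothetical genes.
passing = screen_genes({"geneA": [0.001, 0.5], "geneB": [0.4, 0.6]})
```

Only isoforms of genes returned by the screening step would proceed to the within-gene confirmation step.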

Control for false positives within GWAS loci

In TWAS and related methods, association statistics have been shown to be well-calibrated under the null of no GWAS association. However, within loci harboring significant GWAS signal, false positive associations can result when eQTLs and GWAS coincide within overlapping LD blocks. To address this, we adopt two conservative approaches to control for type 1 error within GWAS loci, namely (1) permutation testing and (2) probabilistic fine mapping. The permutation testing approach, adopted from Gusev et al⁴, is a highly conservative test of the signal added by the SNP-transcript effects from the predictive models, conditional on the GWAS architecture of the locus. Briefly, we permute the SNP-transcript effects in the predictive models 10,000 times and generate a null distribution for the isoform test statistic. We use this null distribution to generate a permutation-based P value for the original test statistic for each isoform. Finally, we can apply isoform-level probabilistic fine mapping using methods from FOCUS⁵⁰ to generate a credible set of isoforms that explain the trait association at a locus. We only run isoform-level fine mapping for significantly associated isoforms in overlapping 1-Mb windows.
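A minimal sketch of the permutation test follows. It assumes that "permuting the SNP-transcript effects" means shuffling the weight vector across SNPs while holding the GWAS Z-scores and LD fixed, and it assumes the weighted burden form Z = w'z / sqrt(w'Vw). The add-one correction in the P value is a common convention rather than something stated above, and all names and inputs are illustrative.

```python
import random
from math import sqrt


def burden_z(weights, gwas_z, ld):
    """Weighted burden statistic Z = w'z / sqrt(w' V w)."""
    p = len(weights)
    numerator = sum(w * z for w, z in zip(weights, gwas_z))
    variance = sum(weights[i] * ld[i][j] * weights[j]
                   for i in range(p) for j in range(p))
    if variance <= 0:
        raise ValueError("weights must yield a positive variance under the LD matrix")
    return numerator / sqrt(variance)


def permutation_p(weights, gwas_z, ld, n_perm=10000, seed=0):
    """Permutation P value for |Z|, shuffling SNP weights across SNPs."""
    rng = random.Random(seed)
    observed = abs(burden_z(weights, gwas_z, ld))
    shuffled = list(weights)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(burden_z(shuffled, gwas_z, ld)) >= observed:
            exceed += 1
    # Add-one correction (assumed convention) keeps the P value above zero.
    return (exceed + 1) / (n_perm + 1)


# Toy locus: five independent SNPs, one strong GWAS signal carrying all weight.
identity = [[1.0 if i == j else 0.0 for j in range(5)] for i in range(5)]
p_perm = permutation_p([1.0, 0.0, 0.0, 0.0, 0.0],
                       [5.0, 0.1, 0.2, -0.1, 0.3], identity, n_perm=2000)
```

In this toy locus the permuted statistic matches the observed one whenever the single non-zero weight lands on the lead SNP, so the permutation P value is close to 1/5.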

Simulation framework

We adopt techniques from Mancuso et al.’s twas_sim protocol⁹⁶ to simulate multivariate isoform expression based on randomly simulated genotypes and environmental random noise. First, for n samples, we generate a matrix of genotypes for the SNPs within 1 Mb of 22 different genes (1 per chromosome) using an LD reference panel of European subjects from the 1000 Genomes Project⁴⁸.

Next, we generate a matrix of SNP-isoform effects across different causal SNP proportions p_c, numbers of isoforms t, and p_s proportion of the SNP-isoform effects being shared across isoforms of the same gene. We then add two matrices of random noise U and $\epsilon$. The first matrix U represents non-cis-genetic effects on isoforms that are correlated between samples and isoforms; we control the proportion of variance explained in isoform expression attributed to U using a parameter σ_h. The second matrix $\epsilon$ is a matrix of random noise that is independent for each isoform, such that ${\epsilon }_{i}\sim N\left(0,{\sigma }_{e}^{2}I\right)$ where ${\sigma }_{e}^{2}=1-{\sigma }_{h}-{h}_{g}^{2}$. We generate 10,000 simulations for each configuration of the simulation parameters, varying $n\in \left\{200,500\right\}$, ${p}_{c}\in \{0.001,0.01,0.05\}$, ${h}_{g}^{2}\in \left\{0.05,0.10,0.25\right\}$, ${p}_{s}\in \{0,0.5,1\}$, and ${\sigma }_{h}\in \{0.1,0.25\}$. Further details are provided in Supplementary Methods and summarized in Fig. 2.
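The variance partition above can be illustrated with a simplified single-isoform sketch. It departs from the described framework in ways that should be read as assumptions: genotypes are drawn independently (no LD reference panel), the shared component U is reduced to one per-sample vector, and the genetic and shared components are rescaled to their target sample variances. All names and parameter values in the usage lines are illustrative.

```python
import random
from math import sqrt


def sample_variance(values):
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def scale_to_variance(values, target_var):
    """Center values and rescale them so the sample variance equals target_var."""
    mean = sum(values) / len(values)
    var = sample_variance(values)
    if var == 0:
        raise ValueError("cannot rescale a constant vector")
    factor = sqrt(target_var / var)
    return [(v - mean) * factor for v in values]


def simulate_isoform(genotypes, effects, shared_noise, h2_g, sigma_h, rng):
    """Expression = genetic part (h2_g) + shared noise (sigma_h) + independent noise."""
    sigma_e2 = 1.0 - sigma_h - h2_g  # variance left for independent noise
    if sigma_e2 <= 0:
        raise ValueError("h2_g + sigma_h must be below 1")
    raw_genetic = [sum(g * b for g, b in zip(row, effects)) for row in genotypes]
    genetic = scale_to_variance(raw_genetic, h2_g)
    shared = scale_to_variance(shared_noise, sigma_h)
    independent = [rng.gauss(0.0, sqrt(sigma_e2)) for _ in genotypes]
    return [g + u + e for g, u, e in zip(genetic, shared, independent)]


rng = random.Random(1)
n_samples, n_snps = 500, 20
# Simplification: independent biallelic SNPs with allele frequency 0.3.
genotypes = [[(rng.random() < 0.3) + (rng.random() < 0.3) for _ in range(n_snps)]
             for _ in range(n_samples)]
effects = [rng.gauss(0.0, 1.0) if j < 2 else 0.0 for j in range(n_snps)]  # 2 causal SNPs
shared_noise = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
expression = simulate_isoform(genotypes, effects, shared_noise,
                              h2_g=0.10, sigma_h=0.25, rng=rng)
```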

We also generate traits under three distinct scenarios, with a continuous trait with heritability ${h}_{t}^{2}\in \{\mathrm{0.01,0.05,0.10}\}$ and a GWAS sample size of 50,000 (Supplementary Methods):

(1) Only gene-level expression has a non-zero effect on trait. Here, we sum the isoform expression to generate a simulated gene expression. We randomly simulate the effect size and scale the error to ensure trait heritability.
(2) Only one isoform has a non-zero effect on the trait. Here, we generate a multivariate isoform expression matrix with 2 isoforms and scale the total gene expression value such that one isoform (called the effect isoform) makes up ${p}_{g}\in \left\{0.10,0.30,0.50,0.70,0.90\right\}$ proportion of total gene expression. We then generate effect size for one of the isoforms and scale the error to ensure trait heritability.
(3) Two isoforms with different effects on traits. Here, we generate a multivariate isoform expression matrix with 2 isoforms that make up equal portions of the total gene expression. We then generate an effect size of α for one isoform and p_e α for the other isoform, such that ${p}_{e}\in \{-1,-0.5,-0.2,0.2,0.5,1\}$. We then scale the error to ensure trait heritability.
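The phrase "scale the error to ensure trait heritability" can be made concrete with a short sketch of the third scenario. It assumes heritability is defined as var(genetic) / (var(genetic) + var(error)), with the error rescaled to hit that ratio in the sample. The sample size, effect sizes and function names are illustrative and much smaller than the GWAS sample size described above.

```python
import random
from math import sqrt


def sample_variance(values):
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def simulate_trait(isoform_expression, effects, h2_t, rng):
    """Trait = sum of isoform effects + error scaled to give heritability h2_t."""
    n = len(isoform_expression[0])
    genetic = [sum(a * iso[i] for a, iso in zip(effects, isoform_expression))
               for i in range(n)]
    # Error variance implied by h2_t = var_g / (var_g + var_e).
    target_var_e = sample_variance(genetic) * (1.0 - h2_t) / h2_t
    raw = [rng.gauss(0.0, 1.0) for _ in range(n)]
    mean_raw = sum(raw) / n
    factor = sqrt(target_var_e / sample_variance(raw))
    noise = [(r - mean_raw) * factor for r in raw]
    trait = [g + e for g, e in zip(genetic, noise)]
    return trait, genetic, noise


rng = random.Random(7)
n = 1000
iso1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
iso2 = [rng.gauss(0.0, 1.0) for _ in range(n)]
alpha, p_e = 0.5, -0.5  # scenario 3: effects alpha and p_e * alpha
trait, genetic, noise = simulate_trait([iso1, iso2], [alpha, p_e * alpha], 0.05, rng)
realized_h2 = sample_variance(genetic) / (sample_variance(genetic) + sample_variance(noise))
```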

To estimate the approximate FPR, we followed the same simulation framework to generate eQTL data and GWAS data. In the GWAS data, we set the effect of gene- and isoform-level imputed expression to 0 to generate a simulated trait under the null. We then estimated the FPR by calculating the proportion of gene-trait associations at P < 0.05 under this null across 20 sets of 1,000 simulated GWAS panels. We also assessed isoform-level fine mapping using FOCUS in a scenario with a gene with 5 or 10 isoforms and a single effect isoform. We computed the sensitivity of 90% credible sets of isoforms (proportion of credible sets that contain the effect isoform) and the number of isoforms in the 90% credible set.
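The summary metrics in this paragraph can be sketched as below. The sketch assumes that posterior inclusion probabilities (PIPs) for the isoforms at a locus are already available and that a credible set is built by adding isoforms in decreasing PIP order until the cumulative PIP reaches the target level. It does not reproduce FOCUS itself, and the function names are hypothetical.

```python
def false_positive_rate(null_p_values, alpha=0.05):
    """Proportion of associations called at P < alpha when the true effect is zero."""
    return sum(p < alpha for p in null_p_values) / len(null_p_values)


def credible_set(pips, level=0.9):
    """Indices of isoforms, by decreasing PIP, until cumulative PIP >= level."""
    order = sorted(range(len(pips)), key=lambda i: pips[i], reverse=True)
    chosen, total = [], 0.0
    for i in order:
        chosen.append(i)
        total += pips[i]
        if total >= level:
            break
    return chosen


def sensitivity(credible_sets, effect_isoforms):
    """Proportion of credible sets that contain the simulated effect isoform."""
    hits = sum(effect in cs for cs, effect in zip(credible_sets, effect_isoforms))
    return hits / len(credible_sets)


# Toy example: two simulated loci with known effect isoforms (indices 1 and 0).
sets = [credible_set([0.70, 0.25, 0.05]), credible_set([0.02, 0.03, 0.95])]
sens = sensitivity(sets, [1, 0])
mean_size = sum(len(cs) for cs in sets) / len(sets)
```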

GTEx processing and model training

We quantified GTEx v8 (ref. ³⁵) RNA-seq samples for 48 tissues using Salmon v1.5.2 (ref. ³⁰) in mapping-based mode. We first built a Salmon index for a decoy-aware transcriptome consisting of GENCODE v38 transcript sequences and the full GRCh38 reference genome as decoy sequences²⁴. Salmon was then run on FASTQ files with mapping validation and corrections for sequencing and GC bias. We computed 50 inferential bootstraps for isoform expression. We then imported Salmon isoform-level quantifications and aggregated to the gene-level using tximeta v1.16.1 (ref. ³⁷). Using edgeR, gene and isoform-level quantifications underwent TMM-normalization, followed by transformation into log-space with the variance-stabilizing transformation from DESeq2 v1.38.3 (refs. ^97,98). We then residualized isoform-level and gene-level expression (as log-transformed CPM) by all tissue-specific covariates (clinical, demographic, genotype principal components (PCs), and expression PEER factors) used in the original QTL analyses in GTEx. We calculated the quantification variance across inferential replicates using the computeInfRV() function from fishpond v2.4.1 (ref. ⁹⁹). We computed the isoform fraction using the isoformToIsoformFraction() function from IsoformSwitchAnalyzeR v1.20.0 (ref. ¹⁰⁰).

SNP genotype calls were derived from Whole Genome Sequencing data for samples from individuals of European ancestry, filtering out SNPs with minor allele frequency (MAF) less than 5% or that deviated from HWE at P < 10⁻⁵. We further filtered out SNPs with MAF less than 1% frequency among the European ancestry samples in 1000 Genomes Project⁴⁸.

Details of the model training pipeline for GTEx are similar to those in Extended Data Fig. 8a. Gene-level univariate models were trained using elastic net regression⁴¹, BLUP in a linear mixed model⁴⁴, and SuSiE⁴³, using all SNPs within 1 Mb of the gene body^4,41,43,44. For each gene, the best performing model was chosen based on McNemar’s adjusted 5-fold CV R². We selected only genes with CV R² ≥ 0.01. We applied multivariate modeling outlined in isoTWAS to train isoform-level predictive models, selecting only those isoform models with CV R² ≥ 0.01. All isoTWAS models generated are publicly available (see Data availability).

Developmental brain reference panel processing and model training

We quantified developmental frontal cortex²² (N = 205) RNA-seq samples using Salmon v1.8.0³⁰ in mapping-based mode. We used the same indexed transcriptome as in the GTEx analysis and ran Salmon with mapping validation and corrections for sequencing and GC bias. We computed 50 inferential bootstraps for isoform expression using Salmon’s Expectation-Maximization algorithm. We then imported Salmon isoform-level quantifications and aggregated to the gene-level using tximeta³⁷. Using edgeR v3.40.2, gene and isoform-level quantifications underwent TMM-normalization, followed by transformation into a log-space using the variance-stabilizing transformation using DESeq2 v1.38.3^97,98. We then residualized isoform-level and gene-level expression (as log-transformed CPM) by covariates (age, sex, 10 genotype PCs, 90 and 70 hidden covariates with prior (HCP), respectively). Typed SNPs with non-zero alternative alleles, MAF >1%, genotyping rate >95%, Hardy Weinberg equilibrium (HWE) P < 10⁻⁶ were first imputed to TOPMed Freeze 5 using minimac4 and eagle v2.4 (refs. ^101,102). We then retained biallelic SNPs with imputation accuracy R² > 0.8, with rsIDs. Finally, we filtered out SNPs with MAF < 0.05 or that deviated from Hardy-Weinberg equilibrium at P < 10⁻⁶.

Adult brain reference panel processing and model training

Matched genotype and RNA-seq data from adult brain cortex tissue from 2,365 individuals were compiled and processed from the PsychENCODE Consortium²⁰ and the Accelerating Medicines Partnership Program for Alzheimer’s Disease (AMP-AD)⁵¹, consisting of the individual studies BipSeq, BrainGVEX, CommonMind Consortium (CMC), CommonMind Consortium’s National Institute of Mental Health Human Brain Collection Core (CMC HBCC), Lieber Institute for Brain Development-szControl (LIBD_szControl), UCLA-ASD, Religious Orders Study and the Memory and Aging Project (ROSMAP), Mount Sinai Brain Bank (MSBB) and MayoRNAseq.

Typed genotypes were lifted over to the GRCh38 build using CrossMap v.0.6.3 (ref. ¹⁰³) and then filtered to remove variants where the reference allele matched any of the alternate alleles. Genotype data from whole genome sequencing (BrainGVEX, UCLA-ASD, ROSMAP, MSBB and MayoRNAseq) were further filtered to variants present on the Infinium Omni5-4 v1.2 array in order to satisfy the imputation server’s maximum limit of 20,000 typed variants per 20 Mb. All genotype data were further processed with PLINK v1.90b6.21 (ref. ¹⁰⁴), removing variants with HWE P < 10⁻⁶, MAF < 0.01 or missingness rate > 0.05, and removing samples with missingness rate > 0.1 across typed variants or missingness rate > 0.5 on any individual chromosome. Genotype data was prepared for imputation using the McCarthy Group’s HRC-1000G-check-bim-v4.3.0 tool against freeze 8 of the Trans-Omics for Precision Medicine (TOPMed) reference panel¹⁰⁵. The tool removes A/T and G/C SNPs with MAF > 0.4, variants with alleles that differ from the reference panel, variants with an allele frequency difference > 0.2 from the reference panel and variants not in the reference panel. Additionally, the tool updates strand, position and reference/alternate allele assignment to match the reference panel.

Genotypes were then passed into the TOPMed Imputation Server by individual array batch¹⁰⁶. The genotypes were phased with Eagle v2.4 and imputed with Minimac4 using the TOPMed reference panel^101,102. Further QC was performed on the imputed genotypes using bcftools v1.11 and PLINK. The imputed genotypes were filtered to well-imputed variants with R² > 0.8. The arrays were merged after filtering to variants that were well imputed in all arrays to be merged. Only arrays with at least 400,000 variants after pre-imputation QC were merged in order to prevent too many variants from dropping out. The merged genotype data were then converted to PLINK 1 binary format and further processed with PLINK, removing variants with duplicates, HWE P < 10⁻⁶, MAF < 0.01 or missingness rate > 0.05 and removing samples with missingness rate > 0.1. Samples from the same individual were identified by calculating the genetic relatedness matrix using SnpArrays.jl and finding sets of samples with relatedness > 0.75. From each set of replicates, only the genotyped sample from the array with the most variants after pre-imputation QC was kept. For model training, only SNPs annotated in HapMap3 were retained¹⁰⁷.

RNA-seq paired reads from each study were sorted by name and then converted to FASTQ format using samtools v1.14 (ref. ¹⁰⁸). The reads were then quantified using salmon v1.8.0 in mapping-based mode using a full decoy indexed from GENCODE v38 transcriptome and GRCh38 patch 13 assembly³⁰. Quantification was run using a standard EM algorithm with library type automatically inferred and estimates adjusted for sequence-specific and fragment-level GC biases. Bootstrapped abundance estimates were calculated using 50 bootstrap samples. Isoform-level expression was summarized to the gene-level using tximeta³⁷. Only isoforms with 0.1 TPM for more than 75% of samples were retained. The resulting expression was normalized using the variance-stabilizing transformation from DESeq2 (ref. ⁹⁸). Samples with WGCNA network connectivity scores of less than -3 were removed as outliers, resulting in a total of 2,115 samples¹⁰⁹. Isoform- and gene-level expression was then batch-corrected using ComBat (sva v3.46.0), using study site as the batch¹¹⁰. Lastly, age, age², sex, 10 genotype PCs and hidden covariates (200 for gene expression and 175 for isoform expression) were removed from the expression matrix^111,112. The number of HCP were selected by optimizing the number of nominal cis-eQTLs and cis-isoQTLs at Bonferroni-corrected P < 0.05, respectively, on a grid from 100 to 300 HCPs, as detected by QTLtools v1.3.1 (ref. ⁹⁰).

Details of the model training pipeline are equivalent to those used to train models in GTEx data.

Gene- and isoform-level trait mapping

We conducted gene- and isoform-level trait mapping for 15 neuropsychiatric traits: attention-deficit hyperactivity disorder (ADHD, N_cases = 20,183/N_controls = 35,191)⁵³, ALZ (90,338/1,036,225)⁵⁴, anorexia nervosa (AN, 16,992/55,525)⁶⁶, ASD (18,381/27,969)⁵², bipolar disorder (BP, 41,917/371,549)⁵⁵, BV (N = 47,316)⁵⁶, CDG (232,964/494,162)⁵⁷, cortical thickness (CortTH, N = 51,665)⁵⁸, intracranial volume (ICV, N = 32,438)⁵⁹, major depressive disorder (MDD, 246,363/561,190)⁶⁰, NTSM (N = 449,484)⁶¹, obsessive compulsive disorder (OCD, 2,688/7,037)⁶², panic and anxiety disorders (PANIC, 2,248/7,992)⁶³, post-traumatic stress disorder (PTSD, 32,428/174,227)⁶⁴ and SCZ (69,369/236,642)⁶⁵. For gene-level trait mapping, we used the weighted burden test, followed by the permutation test, as outlined by Gusev et al⁴. For isoform-level trait mapping, we used the stage-wise testing procedure outlined in the isoTWAS method. In-sample LD from the QTL reference panels was used to calculate the standard error in the weighted burden test. For isoforms, irrespective of their corresponding genes, passing both stage-wise tests and the permutation test, we employed isoform-level probabilistic fine mapping using FOCUS with default parameters⁵⁰. These methods are summarized in Extended Data Fig. 8a.

We estimated the percent increase in effective sample size by employing the following heuristic. We convert gene-level association P values into χ² test statistics with 1 degree of freedom. For χ² > 1, we then calculate the percent increase for isoTWAS-based associations versus TWAS-based associations. As the mean of the χ² distribution is linearly related to power and sample size¹¹³, we can use this percent increase in test statistic as a measure of power or effective sample size. We defined independent genome-wide significant SNPs in GWAS by LD clumping, requiring a lead GWAS SNP with P < 5 × 10⁻⁸, with the P value used for ranking and an R² threshold of 0.2.
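A minimal sketch of this heuristic follows. The conversion from a two-sided P value to a 1-degree-of-freedom χ² statistic is standard. The aggregation is an assumption: the sketch compares the mean χ² (restricted to χ² > 1) between the two sets of gene-level P values, which may differ from the exact calculation used in the analysis. Function names and P values are illustrative.

```python
from statistics import NormalDist, mean


def p_to_chi2(p):
    """Two-sided P value -> chi-square statistic with 1 degree of freedom."""
    z = NormalDist().inv_cdf(p / 2.0)  # lower-tail quantile; its square is the chi-square
    return z * z


def percent_increase(isotwas_p, twas_p, threshold=1.0):
    """Percent increase in mean chi-square (chi-square > threshold), isoTWAS vs TWAS."""
    iso = [c for c in map(p_to_chi2, isotwas_p) if c > threshold]
    ref = [c for c in map(p_to_chi2, twas_p) if c > threshold]
    return 100.0 * (mean(iso) / mean(ref) - 1.0)


# Toy gene-level P values for the two approaches.
gain = percent_increase([0.001, 0.01], [0.01, 0.05])
```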

Statistics and reproducibility

For analysis of GTEx, PsychENCODE and AMP-AD data, no statistical method was used to predetermine sample size; the maximal sample size was determined by the number of individuals with both RNA-seq and genotype data. Exclusion criteria for these three datasets are included above, in detail. Briefly, as predetermined, GTEx data were restricted to individuals of European genetic ancestry to ensure portability of genetic predictions. PsychENCODE and AMP-AD individuals were removed if their WGCNA network connectivity scores based on isoform-level expression were less than −3; these low scores indicate that these samples may be plagued by technical biases that may affect the estimation of genetic effects on gene- and isoform-level expression. No data were collected directly in this work, and, as such, the investigators were blinded to allocation. Statistical analyses are summarized above and scripts to reproduce the analysis are listed in the code availability statement.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

GTEx genetic, transcriptomic and covariate data were obtained through dbGAP approval at accession number phs000424.v8.p2 (ref. ³⁵). LD reference data from the 1000 Genomes Project were obtained at https://www.internationalgenome.org/data-portal/sample (ref. ⁴⁸). GENCODE reference transcriptome and assembly was downloaded from https://www.gencodegenes.org/human/release_38.html with GenBank assembly accession GCA_000001405.28 (ref. ²⁴). GWAS summary statistics were obtained at the following links: ADHD (https://www.med.unc.edu/pgc/download-results/)⁵³, ALZ (https://ctg.cncr.nl/software/summary_statistics/)⁵⁴, AN (http://www.med.unc.edu/pgc/results-and-downloads)⁶⁶, ASD (https://www.med.unc.edu/pgc/download-results/)⁵², BP (https://www.med.unc.edu/pgc/download-results/)⁵⁵, BV (https://ctg.cncr.nl/software/summary_statistics)⁵⁶, CDG (https://www.med.unc.edu/pgc/results-and-downloads)⁵⁷, CortTH (https://enigma.ini.usc.edu/research/download-enigma-gwas-results/)⁵⁸, ICV (https://enigma.ini.usc.edu/research/download-enigma-gwas-results/)⁵⁹, MDD (https://doi.org/10.7488/ds/2458)⁶⁷, NTSM (https://ctg.cncr.nl/software/summary_statistics/neuroticism_summary_statistics)⁶¹, OCD (https://www.med.unc.edu/pgc/download-results/)⁶², PANIC (https://www.med.unc.edu/pgc/download-results/)⁶³, PTSD (https://www.med.unc.edu/pgc/results-and-downloads/)⁶⁴ and SCZ (https://www.med.unc.edu/pgc/download-results/)⁶⁵. The Developmental Brain RNA-seq and genotype dataset from Walker et al. is available at dbGAP with accession number phs001900 (ref. ²², accessible at https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001900.v1.p1). The subset of Adult Brain RNA-seq and genotype data from the PsychENCODE Consortium is available at https://psychencode.synapse.org/DataAccess and from AMP-AD is available at https://adknowledgeportal.synapse.org/Data%20Access (refs. ^20,51).
GWAS summary statistics and accession numbers to genotype and RNA-seq data are provided in Supplementary Table 10. isoTWAS models for 48 tissues from GTEx are available at https://zenodo.org/record/8047940 (ref. ⁸⁶), adult brain cortex from PsychENCODE and AMP-AD are available at https://zenodo.org/record/8048198 (ref. ⁸⁷), and developmental brain cortex from Walker et al. are available at https://zenodo.org/record/8048137 (ref. ⁸⁸). All datasets used in this paper are listed here with no omissions.

Code availability

isoTWAS is available as an R package at https://github.com/bhattacharya-a-bt/isotwas (ref. ⁴⁷). Sample scripts for analyses are available at https://github.com/bhattacharya-a-bt/isotwas_manu_scripts (ref. ¹¹⁴). All relevant codes used in this paper are listed here and deposited online with no omissions or restrictions to access.

References

Sullivan, P. F. et al. Psychiatric genomics: an update and an agenda. Am. J. Psychiatry 175, 15–15 (2018).
Tam, V. et al. Benefits and limitations of genome-wide association studies. Nat. Rev. Genet. 20, 467–484 (2019).
Gamazon, E. R. et al. A gene-based association method for mapping traits using reference transcriptome data. Nat. Genet. 47, 1091–1098 (2015).
Gusev, A. et al. Integrative approaches for large-scale transcriptome-wide association studies. Nat. Genet. 48, 245–252 (2016).
Barbeira, A. N. et al. Exploring the phenotypic consequences of tissue specific gene expression variation inferred from GWAS summary statistics. Nat. Commun. 9, 1825 (2018).
Hu, Y. et al. A statistical framework for cross-tissue transcriptome-wide association analysis. Nat. Genet. 51, 568–576 (2019).
Zhou, D. et al. A unified framework for joint-tissue transcriptome-wide association and Mendelian randomization analysis. Nat. Genet. 52, 1239–1246 (2020).
Barbeira, A. N. et al. Integrating predicted transcriptome from multiple tissues improves association detection. PLoS Genet. 15, e1007889 (2019).
Bhattacharya, A., Li, Y. & Love, M. I. MOSTWAS: multi-omic strategies for transcriptome-wide association studies. PLoS Genet. 17, e1009398 (2021).
Gusev, A. et al. Transcriptome-wide association study of schizophrenia and chromatin activity yields mechanistic disease insights. Nat. Genet. 50, 538–548 (2018).
Wu, L. et al. A transcriptome-wide association study of 229,000 women identifies new candidate susceptibility genes for breast cancer. Nat. Genet. 50, 968–978 (2018).
Giambartolomei, C. et al. Bayesian test for colocalisation between pairs of genetic association studies using summary statistics. PLoS Genet. 10, e1004383 (2014).
Gleason, K. J., Yang, F., Pierce, B. L., He, X. & Chen, L. S. Primo: integration of multiple GWAS and omics QTL summary statistics for elucidation of molecular mechanisms of trait-associated SNPs and detection of pleiotropy in complex traits. Genome Biol. 21, 236–236 (2020).
He, X. et al. Sherlock: detecting gene-disease associations by matching patterns of expression QTL and GWAS. Am. J. Hum. Genet. 92, 667–680 (2013).
Hormozdiari, F. et al. Colocalization of GWAS and eQTL signals detects target genes. Am. J. Hum. Genet. 99, 1245–1260 (2016).
Barrera, L. O. et al. Genome-wide mapping and analysis of active promoters in mouse embryonic stem cells and adult organs. Genome Res. 18, 46–59 (2008).
Wang, E. T. et al. Alternative isoform regulation in human tissue transcriptomes. Nature 456, 470–476 (2008).
Melé, M. et al. The human transcriptome across tissues and individuals. Science 348, 660–665 (2015).
Article PubMed PubMed Central Google Scholar
Merkin, J., Russell, C., Chen, P. & Burge, C. B. Evolutionary dynamics of gene and isoform regulation in mammalian tissues. Science 338, 1593–1599 (2012).
Article CAS PubMed PubMed Central Google Scholar
Gandal, M. J. et al. Transcriptome-wide isoform-level dysregulation in ASD, schizophrenia, and bipolar disorder. Science 362, eaat8127 (2018).
Article CAS PubMed PubMed Central Google Scholar
Wang, D. et al. Comprehensive functional genomic resource and integrative model for the human brain. Science 362, eaat8464 (2018).
Article CAS PubMed PubMed Central Google Scholar
RL, W. et al. Genetic control of expression and splicing in developing human brain informs disease mechanisms. Cell 179, 750–771 (2019).
Article Google Scholar
Leung, S. K. et al. Full-length transcript sequencing of human and mouse cerebral cortex identifies widespread isoform diversity and alternative splicing. Cell Rep. 37, 110022 (2021).
Article CAS PubMed PubMed Central Google Scholar
Frankish, A. et al. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 47, D766–D773 (2019).
Article CAS PubMed Google Scholar
Treutlein, B., Gokce, O., Quake, S. R. & Südhof, T. C. Cartography of neurexin alternative splicing mapped by single-molecule long-read mRNA sequencing. Proc. Natl Acad. Sci. 111, E1291–E1299 (2014).
Article CAS PubMed PubMed Central Google Scholar
Li, Y. I. et al. RNA splicing is a primary link between genetic variation and disease. Science 352, 600–604 (2016).
Article CAS PubMed PubMed Central Google Scholar
Barbeira, A. N. et al. Exploiting the GTEx resources to decipher the mechanisms at GWAS loci. Genome Biol. 22, 49 (2021).
Article PubMed PubMed Central Google Scholar
MM, S. & MS, S. RNA mis-splicing in disease. Nat. Rev. Genet. 17, 19–32 (2016).
Article Google Scholar
Akula, N. et al. Deep transcriptome sequencing of subgenual anterior cingulate cortex reveals cross-diagnostic and diagnosis-specific RNA expression changes in major psychiatric disorders. Neuropsychopharmacol. 46, 1364–1372 (2021).
Article CAS Google Scholar
Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods 14, 417–419 (2017).
Article CAS PubMed PubMed Central Google Scholar
Bray, N. L., Pimentel, H., Melsted, P. & Pachter, L. Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34, 525–527 (2016).
Article CAS PubMed Google Scholar
Jaffe, A. E. et al. Developmental and genetic regulation of the human cortex transcriptome illuminate schizophrenia pathogenesis. Nat. Neurosci. 21, 1117–1125 (2018).
Article CAS PubMed PubMed Central Google Scholar
Collado-Torres, L. et al. Regional heterogeneity in gene expression, regulation, and coherence in the frontal cortex and hippocampus across development and schizophrenia. Neuron 103, 203–216 (2019).
Article CAS PubMed PubMed Central Google Scholar
Jaffe, A. E. et al. Profiling gene expression in the human dentate gyrus granule cell layer reveals insights into schizophrenia and its genetic risk. Nat. Neurosci. 23, 510–519 (2020).
Article CAS PubMed Google Scholar
Aguet, F. et al. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330 (2020).
Article CAS Google Scholar
Breiman, L. & Friedman, J. H. Predicting multivariate responses in multiple linear regression. J. R. Stat. Soc. B 59, 3–54 (1997).
Article Google Scholar
Love, M. I. et al. Tximeta: Reference sequence checksums for provenance identification in RNA-seq. PLoS Comput. Biol. 16, e1007664 (2020).
Article CAS PubMed PubMed Central Google Scholar
Soneson, C., Love, M. I. & Robinson, M. D. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences.F1000Res. 2015, 1521 (2016).
Article Google Scholar
Chun, H. & Keleş, S. Sparse partial least squares regression for simultaneous dimension reduction and variable selection. J. R. Stat. Soc. B 72, 3–25 (2010).
Article Google Scholar
Rothman, A. J., Levina, E. & Zhu, J. Sparse multivariate regression with covariance estimation. J. Comput. Graph. Stat. 19, 947–962 (2010).
Article PubMed PubMed Central Google Scholar
Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 1–22 (2010).
Article PubMed PubMed Central Google Scholar
Rauschenberger, A. & Glaab, E. Predicting correlated outcomes from molecular data. Bioinformatics 37, 3889–3895 (2021).
Article CAS PubMed PubMed Central Google Scholar
Wang, G., Sarkar, A., Carbonetto, P. & Stephens, M. A simple new approach to variable selection in regression, with application to genetic fine mapping. J. R. Stat. Soc. B 82, 1273–1300 (2020).
Article Google Scholar
Endelman, J. B. Ridge regression and other kernels for genomic selection with R Package rrBLUP. Plant Genome 4, 250–255 (2011).
Article Google Scholar
Liu, Y. et al. ACAT: A fast and powerful p value combination method for rare-variant analysis in sequencing studies. Am. J. Hum. Genet 104, 410–421 (2019).
Article CAS PubMed PubMed Central Google Scholar
Van den Berge, K., Soneson, C., Robinson, M. D. & Clement, L. stageR: a general stage-wise method for controlling the gene-level false discovery rate in differential expression and differential transcript usage. Genome Biol. 18, 151 (2017).
Article PubMed PubMed Central Google Scholar
Bhattacharya, A. bhattacharya-a-bt/isotwas: isotwas v1.0.0. Zenodo https://doi.org/10.5281/ZENODO.8322993 (2023).
Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Article PubMed Google Scholar
Cao, C. et al. Power analysis of transcriptome-wide association study: Implications for practical protocol choice. PLoS Genet. 17, e1009405–e1009405 (2021).
Article CAS PubMed PubMed Central Google Scholar
Mancuso, N. et al. Probabilistic fine-mapping of transcriptome-wide association studies. Nat. Genet. 51, 675–682 (2019).
Article CAS PubMed PubMed Central Google Scholar
Vialle, R. A., de Paiva Lopes, K., Bennett, D. A., Crary, J. F. & Raj, T. Integrating whole-genome sequencing with multi-omic data reveals the impact of structural variants on gene regulation in the human brain. Nat. Neurosci. 25, 504–514 (2022).
Article CAS PubMed PubMed Central Google Scholar
Grove, J. et al. Identification of common genetic risk variants for autism spectrum disorder. Nat. Genet. 51, 431–444 (2019).
Article CAS PubMed PubMed Central Google Scholar
Demontis, D. et al. Discovery of the first genome-wide significant risk loci for attention deficit/hyperactivity disorder. Nat. Genet. 51, 63–75 (2019).
Article CAS PubMed Google Scholar
Jansen, I. E. et al. Genome-wide meta-analysis identifies new loci and functional pathways influencing Alzheimer’s disease risk. Nat. Genet. 51, 404–413 (2019).
Article CAS PubMed PubMed Central Google Scholar
Mullins, N. et al. Genome-wide association study of more than 40,000 bipolar disorder cases provides new insights into the underlying biology. Nat. Genet. 53, 817–829 (2021).
Article CAS PubMed PubMed Central Google Scholar
Jansen, P. R. et al. Genome-wide meta-analysis of brain volume identifies genomic loci and genes shared with intelligence. Nat. Commun. 11, 5606 (2020).
Article CAS PubMed PubMed Central Google Scholar
Cross-Disorder Group of the Psychiatric Genomics Consortium. Genomic relationships, novel loci, and pleiotropic mechanisms across eight psychiatric disorders. Cell 179, 1469–1482 (2019).
Grasby, K. L. et al. The genetic architecture of the human cerebral cortex. Science 367, eaay6690 (2020).
Article CAS PubMed PubMed Central Google Scholar
Adams, H. H. et al. Novel genetic loci underlying human intracranial volume identified through genome-wide association. Nat. Neurosci. 19, 1569–1582 (2016).
Article CAS PubMed PubMed Central Google Scholar
Howard, D. M. et al. Genome-wide meta-analysis of depression identifies 102 independent variants and highlights the importance of the prefrontal brain regions. Nat. Neurosci. 22, 343–352 (2019).
Article CAS PubMed PubMed Central Google Scholar
Nagel, M. et al. Meta-analysis of genome-wide association studies for neuroticism in 449,484 individuals identifies novel genetic loci and pathways. Nat. Genet. 50, 920–927 (2018).
Article CAS PubMed Google Scholar
Arnold, P. D. et al. Revealing the complex genetic architecture of obsessive-compulsive disorder using meta-analysis. Mol. Psychiatry 23, 1181–1188 (2018).
Article CAS Google Scholar
Forstner, A. J. et al. Genome-wide association study of panic disorder reveals genetic overlap with neuroticism and depression. Mol. Psychiatry 26, 4179–4190 (2021).
Article PubMed Google Scholar
Nievergelt, C. M. et al. International meta-analysis of PTSD genome-wide association studies identifies sex- and ancestry-specific genetic risk loci. Nat. Commun. 10, 4558 (2019).
Article PubMed PubMed Central Google Scholar
Trubetskoy, V. et al. Mapping genomic loci implicates genes and synaptic biology in schizophrenia. Nature 604, 502–508 (2022).
Article CAS PubMed PubMed Central Google Scholar
Watson, H. J. et al. Genome-wide association study identifies eight risk loci and implicates metabo-psychiatric origins for anorexia nervosa. Nat. Genet. 51, 1207–1214 (2019).
Article CAS PubMed PubMed Central Google Scholar
Prive, F., Aschard, H., Ziyatdinov, A. & Blum, M. G. B. Efficient analysis of large-scale genome-wide data with two R packages: Bigstatsr and bigsnpr. Bioinformatics 34, 2781–2787 (2018).
Article CAS PubMed PubMed Central Google Scholar
Trubetskoy, V. et al. Mapping genomic loci prioritises genes and implicates synaptic biology in schizophrenia. Nature 604, 502–508 (2022).
Article CAS PubMed PubMed Central Google Scholar
Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
Article CAS PubMed PubMed Central Google Scholar
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Article CAS PubMed PubMed Central Google Scholar
van Iterson, M., van Zwet, E. W., Heijmans, B. T. & Heijmans, B. T. Controlling bias and inflation in epigenome- and transcriptome-wide association studies using the empirical null distribution. Genome Biol. 18, 19 (2017).
Article PubMed PubMed Central Google Scholar
Li, Y. I. et al. Annotation-free quantification of RNA splicing using LeafCutter. Nat. Genet. 50, 151–158 (2018).
Article CAS PubMed Google Scholar
Du, P. et al. Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis. BMC Bioinf. 11, 587–587 (2010).
Article CAS Google Scholar
Schrode, N. et al. Synergistic effects of common schizophrenia risk variants. Nat. Genet. 51, 1475–1485 (2019).
Article CAS PubMed PubMed Central Google Scholar
Bhattacharjee, S. et al. A subset-based approach improves power and interpretation for the combined analysis of genetic association studies of heterogeneous traits. Am. J. Hum. Genet 90, 821–835 (2012).
Article CAS PubMed PubMed Central Google Scholar
O’Donnell-Luria, A. H. et al. Heterozygous Variants in KMT2E cause a spectrum of neurodevelopmental disorders and epilepsy. Am. J. Hum. Genet 104, 1210–1222 (2019).
Article PubMed PubMed Central Google Scholar
Reay, W. R. & Cairns, M. J. Pairwise common variant meta-analyses of schizophrenia with other psychiatric disorders reveals shared and distinct gene and gene-set associations. Transl. Psychiatry 10, 134 (2020).
Article CAS PubMed PubMed Central Google Scholar
Nishioka, K. et al. PR-Set7 is a nucleosome-specific methyltransferase that modifies lysine 20 of histone H4 and is associated with silent chromatin. Mol. Cell 9, 1201–1213 (2002).
Article CAS PubMed Google Scholar
Schmidt-Kastner, R., Guloksuz, S., Kietzmann, T., van Os, J. & Rutten, B. P. F. Analysis of GWAS-derived schizophrenia genes for links to ischemia-hypoxia response of the brain. Front Psychiatry 11, 393 (2020).
Article PubMed PubMed Central Google Scholar
Wong, H. et al. Isoform-specific roles for AKT in affective behavior, spatial memory, and extinction related to psychiatric disorders. eLife 9, e56630 (2020).
Article CAS PubMed PubMed Central Google Scholar
Howell, K. R., Floyd, K. & Law, A. J. PKBγ/AKT3 loss-of-function causes learning and memory deficits and deregulation of AKT/mTORC2 signaling: relevance for schizophrenia. PLoS ONE 12, e0175993 (2017).
Article PubMed PubMed Central Google Scholar
Chen, H.-Y. & Maher, B. J. Lost in translation: Cul3-cependent pathological mechanisms in psychiatric disorders. Neuron 105, 398–399 (2020).
Article CAS PubMed Google Scholar
Pouget, J. G. The emerging immunogenetic architecture of schizophrenia. Schizophr. Bull. 44, 993–1004 (2018).
Article PubMed PubMed Central Google Scholar
Liu, D. et al. Schizophrenia risk conferred by rare protein-truncating variants is conserved across diverse human populations. Nat. Genet. 55, 369–376 (2023).
Article CAS PubMed PubMed Central Google Scholar
Kim, S. et al. The GIT family of proteins forms multimers and associates with the presynaptic cytomatrix protein Piccolo. J. Biol. Chem. 278, 6291–6300 (2003).
Article CAS PubMed Google Scholar
Bhattacharya, A. isoTWAS models for 48 GTEx models (06/2023). Zenodo https://doi.org/10.5281/zenodo.8047940 (2023).
Bhattacharya, A. isoTWAS models for adult brain cortex (06/2023). Zenodo https://doi.org/10.5281/zenodo.8048198 (2023).
Bhattacharya, A. isoTWAS models for developmental brain cortex (06/2023). Zenodo https://doi.org/10.5281/zenodo.8048137 (2023).
Qi, T. et al. Genetic control of RNA splicing and its distinct role in complex trait variation. Nat. Genet. https://doi.org/10.1038/s41588-022-01154-4 (2022).
Article PubMed PubMed Central Google Scholar
Delaneau, O. et al. A complete tool set for molecular QTL discovery and analysis. Nat. Commun. 8, 15452 (2017).
Article CAS PubMed PubMed Central Google Scholar
Doose, G., Bernhart, S. H., Wagener, R. & Hoffmann, S. DIEGO: detection of differential alternative splicing using Aitchison’s geometry. Bioinformatics 34, 1066–1068 (2018).
Article CAS PubMed Google Scholar
Veturi, Y. & Ritchie, M. D. How powerful are summary-based methods for identifying expression-trait associations under different genetic architectures? Pac. Symp. Biocomput. 23, 228–239 (2018).
PubMed PubMed Central Google Scholar
Bhattacharya, A. et al. Best practices for multi-ancestry, meta-analytic transcriptome-wide association studies: lessons from the Global Biobank Meta-analysis Initiative. Cell Genom. 2, 100180 (2022).
Article CAS PubMed PubMed Central Google Scholar
Zhu, A. et al. MRLocus: Identifying causal genes mediating a trait through Bayesian estimation of allelic heterogeneity. PLoS Genet. 17, e1009455 (2021).
Article CAS PubMed PubMed Central Google Scholar
Whalen, S., Schreiber, J., Noble, W. S. & Pollard, K. S. Navigating the pitfalls of applying machine learning in genomics. Nat. Rev. Genet. 23, 169–181 (2022).
Article CAS PubMed Google Scholar
Wang, X., Lu, Z., Bhattacharya, A., Pasaniuc, B. & Mancuso, N. twas_sim, a Python-based tool for simulation and power analysis of transcriptome-wide association analysis. Bioinformatics 39, btad288 (2023).
Article CAS PubMed PubMed Central Google Scholar
Ritchie, M. E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47 (2015).
Article PubMed PubMed Central Google Scholar
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
Article PubMed PubMed Central Google Scholar
Zhu, A., Srivastava, A., Ibrahim, J. G., Patro, R. & Love, M. I. Nonparametric expression analysis using inferential replicate counts. Nucleic Acids Res. 47, e105 (2019).
Article CAS PubMed PubMed Central Google Scholar
Vitting-Seerup, K. & Sandelin, A. IsoformSwitchAnalyzeR: analysis of changes in genome-wide patterns of alternative splicing and its functional consequences. Bioinformatics 35, 4469–4471 (2019).
Article CAS PubMed Google Scholar
Kowalski, M. H. et al. Use of >100,000 NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium whole genome sequences improves imputation quality and detection of rare variant associations in admixed African and Hispanic/Latino populations. PLoS Genet. 15, e1008500 (2019).
Article PubMed PubMed Central Google Scholar
Loh, P. R. et al. Reference-based phasing using the haplotype reference consortium panel. Nat. Genet. 48, 1443–1448 (2016).
Article CAS PubMed PubMed Central Google Scholar
Zhao, H. et al. CrossMap: a versatile tool for coordinate conversion between genome assemblies. Bioinformatics 30, 1006–1007 (2014).
Article PubMed Google Scholar
Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet 81, 559–575 (2007).
Article CAS PubMed PubMed Central Google Scholar
Taliun, D. et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 590, 290–299 (2021).
Article CAS PubMed PubMed Central Google Scholar
Das, S. et al. Next-generation genotype imputation service and methods. Nat. Genet. 48, 1284–1287 (2016).
Article CAS PubMed PubMed Central Google Scholar
Belmont, J. W. et al. The international HapMap project. Nature 426, 789–796 (2003).
Article Google Scholar
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Article PubMed PubMed Central Google Scholar
Langfelder, P. & Horvath, S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 9, 559 (2008).
Article PubMed PubMed Central Google Scholar
Leek, J. T. & Storey, J. D. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 3, e161–e161 (2007).
Article PubMed PubMed Central Google Scholar
Mostafavi, S. et al. Normalizing RNA-sequencing data by modeling hidden covariates with prior knowledge. PLoS ONE 8, e68141 (2013).
Article CAS PubMed PubMed Central Google Scholar
Picard Toolkit. Broad Institute, GitHub Repository https://broadinstitute.github.io/picard/ (2019).
Zhang, W. et al. Integrative transcriptome imputation reveals tissue-specific and shared biological mechanisms mediating susceptibility to complex traits. Nat. Commun. 10, 3834–3834 (2019).
Article PubMed PubMed Central Google Scholar
Bhattacharya, A. bhattacharya-a-bt/isotwas_manu_scripts: isoTWAS manuscript code and scripts. Zenodo https://doi.org/10.5281/ZENODO.8323001 (2023).


Acknowledgements

We thank Kangcheng Hou, Tommer Schwarz, Vidhya Venkateswaran, Pan Zhang, Leanna Hernandez, Nathan LaPierre, Harold Pimentel, Mike Love and Achal Patel for engaging discussion during the research process. We thank Kanishka Patel for her aesthetic advice for figures. We thank the Psychiatric Genomics Consortium and Complex Trait Genomics Lab for their publicly available GWAS summary statistics. This work was supported by National Institutes of Health awards R01 HG009120, R01 MH115676, R01 CA251555, R01 AI153827, R01 HG006399, R01 CA244670 and U01 HG011715 (B.P.), as well as SFARI Bridge to Independence Award, NIMH R01-MH121521, NIMH R01-MH123922 and NICHD-P50-HD103557 (M.J.G.). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author information

These authors contributed equally: Bogdan Pasaniuc, Michael J. Gandal.

Authors and Affiliations

Department of Epidemiology, University of Texas MD Anderson Cancer Center, Houston, TX, USA
Arjun Bhattacharya
Institute for Data Science in Oncology, University of Texas MD Anderson Cancer Center, Houston, TX, USA
Arjun Bhattacharya
Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
Arjun Bhattacharya & Bogdan Pasaniuc
Department of Psychiatry, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
Daniel D. Vo, Connor Jops & Michael J. Gandal
Lifespan Brain Institute at Penn Med and the Children’s Hospital of Philadelphia, Philadelphia, PA, USA
Daniel D. Vo, Connor Jops & Michael J. Gandal
Department of Psychiatry and Biobehavioral Sciences, Semel Institute, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
Minsoo Kim, Cindy Wen & Michael J. Gandal
Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
Minsoo Kim, Cindy Wen & Michael J. Gandal
Bioinformatics Interdepartmental Program, University of California, Los Angeles, CA, USA
Cindy Wen, Jonatan L. Hervoso & Bogdan Pasaniuc
Department of Computational Medicine, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
Bogdan Pasaniuc
Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
Michael J. Gandal

Contributions

A.B., B.P. and M.J.G. conceptualized the project. A.B., J.H., B.P. and M.J.G. developed the methodology. A.B. and D.D.V. wrote the software. A.B., C.J. and D.D.V. validated results. A.B., M.K., C.W., C.J., D.D.V. and J.H. contributed to formal analysis. A.B. and M.J.G. contributed to investigation. B.P. and M.J.G. provided resources. A.B., M.K., C.W., C.J., M.J.G. and D.D.V. curated data. A.B. and M.J.G. wrote the original draft. All authors reviewed and edited the paper. A.B. visualized results. A.B., B.P. and M.J.G. supervised the project. B.P. and M.J.G. administered the project. B.P. and M.J.G. acquired funds for the project.

Corresponding authors

Correspondence to Arjun Bhattacharya or Michael J. Gandal.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Genetics thanks Eric Gamazon and Pejman Mohammadi for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 isoTWAS framework assumptions and testing framework.

(a) Directed acyclic graph (DAG) illustrating causal assumptions in isoTWAS: the local genetic variants within 1 Megabase of a gene have direct effects on the expression of a gene G and its isoforms; these genetic effects need not be shared across isoforms and the gene. Further, the abundance of a gene is the sum of abundances of its isoforms. Lastly, the isoform and gene need not affect the complex trait through the same path. Genetic variants may have effects on the trait through pathways independent of gene and isoform expression. (b) Step-wise hypothesis testing in isoTWAS. First, isoform-trait associations are estimated. Then, associations for isoforms are aggregated to the gene level using the Aggregated Cauchy Association Test (ACAT). These aggregated gene-level associations are adjusted for multiple testing burden to control the false discovery rate (FDR). Lastly, for isoforms of genes that pass gene-level testing, we control the family-wise error rate (FWER) using Shaffer’s modified sequentially rejective procedure.
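
The step-wise testing in panel (b) can be sketched roughly as follows. This is a minimal illustration, not the isoTWAS implementation: the function names (`acat`, `benjamini_hochberg`, `holm`, `stepwise_test`) and the example P values are hypothetical, equal ACAT weights are assumed, Benjamini-Hochberg is assumed as the FDR procedure, and Holm's procedure is swapped in as a simpler stand-in for Shaffer's modified sequentially rejective procedure.

```python
import math

def acat(pvalues, weights=None):
    """Cauchy combination (ACAT) of P values; equal weights by default."""
    if weights is None:
        weights = [1.0 / len(pvalues)] * len(pvalues)
    stat = sum(w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, pvalues))
    return 0.5 - math.atan(stat / sum(weights)) / math.pi

def benjamini_hochberg(pvalues):
    """BH-adjusted P values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from largest to smallest P
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def holm(pvalues):
    """Holm-adjusted P values (stand-in for Shaffer's procedure)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):      # walk from smallest to largest P
        running_max = max(running_max, min(1.0, (m - rank) * pvalues[i]))
        adjusted[i] = running_max
    return adjusted

def stepwise_test(isoform_p, alpha=0.05):
    """Screen genes on ACAT + FDR, then test isoforms within passing genes."""
    genes = list(isoform_p)
    gene_p = [acat(list(isoform_p[g].values())) for g in genes]
    gene_fdr = benjamini_hochberg(gene_p)
    hits = {}
    for gene, q in zip(genes, gene_fdr):
        if q < alpha:                     # gene passes the screening stage
            names = list(isoform_p[gene])
            adj = holm([isoform_p[gene][n] for n in names])
            hits[gene] = [n for n, a in zip(names, adj) if a < alpha]
    return hits

# Hypothetical isoform-level P values for two genes.
example = {
    "GENE_A": {"iso1": 1e-6, "iso2": 0.4},
    "GENE_B": {"iso1": 0.6, "iso2": 0.7},
}
result = stepwise_test(example)
```

In this toy input only `GENE_A` passes the gene-level screen, and only its first isoform survives the within-gene correction. P values of exactly 0 or 1 would need special handling before the tangent transform.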

Extended Data Fig. 2 Prediction comparison in simulation.

(a) Boxplots of adjusted R² of prediction of isoform expression (Y axis) across shared isoQTL proportion (X axis), for 5 isoforms with isoform heritability (${h}_{i}^{2}$) set to 0.05 or 0.10 (n = 1,000 independent simulations). (b) Boxplots of percent difference in adjusted R² in predicting gene expression between isoTWAS and TWAS models from simulations with sample size 200 (compared with sample size 500 in Fig. 2), where isoform and gene expression heritability are set to (top) 0.05 and (bottom) 0.10. For (a-b), all boxplots represent the median, 25% and 75% quantiles, and whiskers correspond to the 10% and 90% quantiles.
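
As a rough illustration of the summaries plotted in these panels, the sketch below computes a percent difference between two R² values and the five boxplot statistics named in the caption (median, 25% and 75% quantiles, 10% and 90% whiskers). The function names are hypothetical, and the percent difference is assumed to be taken relative to the baseline (TWAS or univariate) model; the caption does not spell out the denominator.

```python
from statistics import quantiles

def percent_difference(r2_new, r2_baseline):
    """Percent change of r2_new relative to r2_baseline (assumed definition)."""
    return 100.0 * (r2_new - r2_baseline) / r2_baseline

def boxplot_summary(values):
    """Median, quartiles and 10%/90% whisker positions of a sample."""
    # n=20 yields 19 cut points at 5%, 10%, ..., 95%.
    cuts = quantiles(values, n=20, method="inclusive")
    return {
        "whisker_low": cuts[1],    # 10% quantile
        "q25": cuts[4],            # 25% quantile
        "median": cuts[9],         # 50% quantile
        "q75": cuts[14],           # 75% quantile
        "whisker_high": cuts[17],  # 90% quantile
    }

summary = boxplot_summary(list(range(101)))
```

Note that whiskers at the 10% and 90% quantiles differ from the 1.5 × IQR whiskers that many plotting libraries draw by default.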

Extended Data Fig. 3 Isoform prediction comparison across 48 GTEx tissues.

(a) Number of multivariate (cream) and univariate (blue) models predicting isoform expression at CV R² > 0.01 (X axis). (b) Percent difference in CV R² (X axis) of prediction of isoform expression models using multivariate models versus univariate models. The label shows the proportion of isoforms with improved performance using multivariate models (n = 139–803 biologically independent samples, see Supplementary Table 1). (c) Number of isoforms with CV R² > 0.01 (Y axis) using the baseline univariate model (teal, best univariate) and 4 multivariate models. (d) Median percent difference in R² of predicting original isoform expression using multivariate versus univariate models (left) and gene expression using isoTWAS versus TWAS models (right) across increasing number of isoforms per gene, colored by models trained in the original dataset and the leave-one-out dataset (n = 139–255, see Supplementary Table 1). For (b,d), all boxplots represent the median, 25% and 75% quantiles, and whiskers correspond to the 10% and 90% quantiles.

Extended Data Fig. 4 isoTWAS inclusion criterion and performance gains across 48 GTEx tissues.

(a) Number of genes that pass TWAS (blue) and isoTWAS (red) CV R² cutoffs to be available for testing in the trait-mapping step (X axis). (b) Percent difference in CV R² (X axis) of prediction of isoform expression models using multivariate models versus univariate models. The label shows the proportion of isoforms with improved performance using multivariate models. All boxplots represent the median, 25% and 75% quantiles, and whiskers correspond to the 10% and 90% quantiles.

Extended Data Fig. 5 Gene prediction comparison across 48 GTEx tissues.

(a) Number of genes predicted at CV R² > 0.01 using TWAS (blue) and isoTWAS (red). (b) Number of genes (left) and isoforms (right) predicted at CV R² > 0.01 using isoTWAS across 48 GTEx tissues.

Extended Data Fig. 6 isoTWAS performance across multiple factors.

(a) (top left) Ratio of number of isoforms predicted at R² > 0.01 using multivariate versus univariate prediction. (top right) Ratio of number of genes passing CV threshold using isoTWAS versus TWAS. (bottom left) Median number of isoforms predicted at CV R² > 0.01 in isoTWAS models across increasing number of isoforms per gene. The red line shows the line Y = X + 1. (bottom right) Ratio of number of genes with CV R² > 0.01 using isoTWAS versus TWAS. (b-f) Across bins for maximum isoform fraction (b), gene length (c), SNP density (d), sample size (e), proportion of shared isoTWAS model effect SNPs (f), (left) ratio of number of isoforms predicted at R² > 0.01 using multivariate versus univariate prediction, (middle) ratio of number of genes passing CV threshold using isoTWAS versus TWAS, and (right) ratio of number of genes with CV R² > 0.01 using isoTWAS versus TWAS. (g-h) Across bins for mean counts (g) and quantification variance (h) of isoforms and genes, (left) ratio of number of isoforms predicted at R² > 0.01 using multivariate versus univariate prediction and (right) Ratio of number of genes with CV R² > 0.01 using isoTWAS versus TWAS. For (a-h), all boxplots represent the median, 25% and 75% quantiles, and whiskers correspond to the 10% and 90% quantiles.

Extended Data Fig. 7 Power comparison in simulation.

(a) Across 20 iterations of 1,000 simulations, boxplots of false positive rate to detect a gene-trait association using Cauchy-aggregated P values of isoform-trait associations (red) and gene-level TWAS (blue) from weighted burden tests. We calculate the false positive rate as the proportion of the 1,000 tests that give P < 0.05. All boxplots represent the median, 25% and 75% quantiles, and whiskers correspond to the 10% and 90% quantiles. (b) (Scenario 1) Power to detect gene-trait association (proportion of tests with P < 2.5 × 10⁻⁶ using weighted burden test, Y axis) across number of total isoforms per gene (X axis). Points are shaped by causal isoQTL proportion and colored by method. (Scenario 2) Power to detect gene-trait association across proportion of gene expression explained by effect isoform (X axis). (Scenario 3) Power to detect gene-trait association across ratio of effect sizes of 2 effect isoforms (X axis). All plots for (1–3) are facetted by proportion of shared isoQTLs (top margin) and proportion of expression heritability attributed to shared non-genetic effects across isoforms (right margin). For (2–3), points are shaped by number of isoforms per gene and colored by method. Here, expression heritability is set to 0.05, trait heritability is set to 0.1, and causal proportion of Scenarios 2–3 is set to 0.01. (c) Sensitivity and mean set size of 90% credible set using FOCUS to finemap isoform-trait associations for a single gene, across causal isoQTL proportion (X axis). Points are colored by trait heritability and shaped by the number of isoforms per gene. Plots are facetted by proportion of shared isoQTLs (top margin) and proportion of expression heritability attributed to shared non-genetic effects across isoforms (right margin). Line ranges show a 95% jackknife confidence interval (n = 1,000 independent simulations).
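
The false positive rate and power in these panels are both rejection proportions over simulated tests, differing only in the threshold (P < 0.05 under the null, P < 2.5 × 10⁻⁶ under an alternative). A minimal sketch of that calculation with a leave-one-out jackknife interval is given below; the function names, the normal-approximation multiplier `z = 1.96` and the toy P values are assumptions for illustration, not taken from the authors' code.

```python
import math

def rejection_rate(pvalues, threshold):
    """Proportion of simulated tests with P below the threshold."""
    return sum(p < threshold for p in pvalues) / len(pvalues)

def jackknife_ci(pvalues, threshold, z=1.96):
    """Rejection rate with a leave-one-out jackknife confidence interval."""
    n = len(pvalues)
    hits = sum(p < threshold for p in pvalues)
    estimate = hits / n
    # Leave-one-out estimates: drop each simulation in turn.
    loo = [(hits - (p < threshold)) / (n - 1) for p in pvalues]
    mean_loo = sum(loo) / n
    variance = (n - 1) / n * sum((x - mean_loo) ** 2 for x in loo)
    se = math.sqrt(variance)
    return estimate, max(0.0, estimate - z * se), min(1.0, estimate + z * se)

# Toy null simulation: 50 of 1,000 tests fall below P = 0.05.
null_p = [0.01] * 50 + [0.5] * 950
fpr, lower, upper = jackknife_ci(null_p, threshold=0.05)
```

For a proportion, the jackknife standard error reduces to sqrt(p(1 − p)/(n − 1)), so the interval is close to the usual binomial normal approximation.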

Extended Data Fig. 8 Discovery comparison in public data.

(a) Data sources for eQTL reference data, GWAS cohorts, and reference LD data are provided on the left (black). The full gene-level TWAS (red) and isoTWAS (blue) are summarized on the right. (b) Number of gene-trait associations (Y axis) using TWAS (red) and isoTWAS (blue) across trait (X axis), faceting by tissue (top margin) and threshold (right margin: adjusted weighted burden test P < 0.05 and permutation test P < 0.05, top; in 90% credible set using FOCUS fine-mapping, bottom). (c) Number of isoform-trait associations (Y axis) using isoTWAS across trait (X axis), faceting by tissue (top margin) and threshold (right margin: adjusted weighted burden test P < 0.05 and permutation test P < 0.05, top; in 90% credible set using FOCUS fine-mapping, bottom). (d) Distribution of number of genes (left) and isoforms (right) in risk region and in 90% credible set using TWAS and isoTWAS.
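
The 90% credible sets referred to in panels (b–d) can be illustrated with a simplified sketch: rank features by posterior inclusion probability (PIP) and add them until the cumulative normalized PIP reaches the target level. This is an assumption-laden simplification and not the FOCUS algorithm itself, which among other things also considers a null model; the function name and PIP values are hypothetical.

```python
def credible_set(pips, level=0.9):
    """Smallest set of top-ranked features whose normalized PIPs sum to >= level."""
    total = sum(pips.values())
    ranked = sorted(pips.items(), key=lambda item: item[1], reverse=True)
    chosen, cumulative = [], 0.0
    for name, pip in ranked:
        chosen.append(name)
        cumulative += pip / total
        if cumulative >= level:
            break
    return chosen

# Hypothetical PIPs for three isoforms in one risk region.
example_pips = {"iso1": 0.70, "iso2": 0.25, "iso3": 0.05}
selected = credible_set(example_pips)
```

Smaller credible sets at the same coverage indicate sharper prioritization within a risk region, which is the quantity compared in panel (d).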

Extended Data Fig. 9 Test statistic inflation comparison in public data.

QQ-plots of gene-level P values using TWAS (red) and isoTWAS (blue) across 15 traits.
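
For readers who want to reproduce this kind of plot, the sketch below computes QQ-plot coordinates and the genomic inflation factor λ, a commonly used one-number summary of test statistic inflation. The caption itself does not report λ, so its use here is an assumption, as are the function names and the plotting positions (i − 0.5)/n.

```python
import math
from statistics import NormalDist, median

def qq_points(pvalues):
    """(expected, observed) -log10 P pairs for a QQ plot, in ascending order."""
    n = len(pvalues)
    observed = sorted(-math.log10(p) for p in pvalues)
    expected = sorted(-math.log10((i - 0.5) / n) for i in range(1, n + 1))
    return list(zip(expected, observed))

def inflation_factor(pvalues):
    """Median observed 1-df chi-square statistic over its null median."""
    norm = NormalDist()
    # Two-sided P -> chi-square(1) statistic via the squared normal quantile.
    chisq = [norm.inv_cdf(p / 2.0) ** 2 for p in pvalues]
    null_median = norm.inv_cdf(0.25) ** 2   # about 0.455
    return median(chisq) / null_median

# Perfectly uniform P values should give lambda close to 1.
uniform_p = [(i - 0.5) / 1000 for i in range(1, 1001)]
lam = inflation_factor(uniform_p)
```

Values of λ well above 1 suggest inflated test statistics, although for highly polygenic traits some inflation is expected from true signal.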

Extended Data Fig. 10 Biological relevance of gene-trait associations detected by isoTWAS.

(a) Lollipop plot of enrichment ratio (X axis) of ontologies (Y axis) for isoTWAS-prioritized genes associated at adjusted weighted burden test P < 0.05 and permutation test P < 0.05. Points are shaped by tissue type (adult or developmental) and colored by ontology type (biological process, cell component, molecular function). (b-c) For the associations of ENST00000681794 (b) and ENST00000492957 (c) with BV, Manhattan plots of GWAS, eQTL, and isoQTL signal colored by LD (top), boxplots of gene (red) and isoform (blue) expression (Y axis) by genotype (X axis) (bottom left), and forest plot of lead isoQTL effect size using two-sided Wald-type t-test from linear regression and 95% confidence interval with isoform (blue), gene (red), and trait (black) (bottom right, n = 2,115 biologically independent samples). Vertical gray lines indicate the transcription start and end sites for each gene, and the horizontal gray line indicates P = 5 × 10⁻⁸ for GWAS and 10⁻⁶ for QTLs. All boxplots represent the median, 25% and 75% quantiles, and whiskers correspond to the 10% and 90% quantiles. (d) Comparison of exon and intron structure of ENST00000681794 and ENST00000492957, based on Gencode v38 reference.

Supplementary information

Supplementary Information

Supplementary Note, Methods, Figures 1–4, Table and Data Legends and References.

Reporting Summary

Peer Review File

Supplementary Table 1

Legends are included in Supplementary Information.

Supplementary Data 1

Predictive performance comparison of isoTWAS multivariate methods in simulated data across a variety of genetic architecture settings. Data here underlies Extended Data Fig. 2.

Supplementary Data 2

Predictive performance comparison of isoTWAS and TWAS gene expression prediction in simulated data across a variety of genetic architecture settings. Data here underlies Fig. 2 and Extended Data Fig. 2.

Supplementary Data 3

Isoform expression prediction metrics across a variety of factors, using 48 GTEx datasets. Data here underlies Extended Data Fig. 6.

Supplementary Data 4

Gene expression prediction metrics across a variety of factors, using 48 GTEx datasets. Data here underlies Extended Data Fig. 6.

Supplementary Data 5

False positive rates using isoTWAS and TWAS to detect a gene-trait association at P < 0.05 across a variety of genetic architecture parameters. Data here underlies Extended Data Fig. 7.

Supplementary Data 6

Power to detect trait association at P < 2.5 × 10⁻⁶ across 1,000 simulations each for 22 genes using TWAS and isoTWAS across various simulations. These simulations are under Scenario 1 in Fig. 4a (gene has a true effect on the trait, but none of the isoforms have a true effect on the trait). Data here underlies Fig. 4 and Extended Data Fig. 7.
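For readers working with tables of this kind, empirical power is the fraction of simulation replicates whose P value falls below the significance threshold. The sketch below is a minimal illustration under that definition, not the authors' code; the function name and the binomial standard error are assumptions added here.

```python
import math

def rejection_rate(p_values, alpha):
    """Fraction of simulated P values below alpha, with a binomial standard error."""
    n = len(p_values)
    rate = sum(p < alpha for p in p_values) / n
    # Standard error of a proportion estimated from n independent replicates.
    se = math.sqrt(rate * (1.0 - rate) / n)
    return rate, se
```

With alpha = 2.5e-6 and P values simulated under a true effect, the rate estimates power; with alpha = 0.05 and P values simulated under the null, the same proportion estimates the false positive rate.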

Supplementary Data 7

Power to detect trait association at P < 2.5 × 10⁻⁶ across 1,000 simulations each for 22 genes using TWAS and isoTWAS (ACAT) across various simulations. These simulations are under Scenario 2 in Fig. 4b (a gene has multiple isoforms, only one has an effect on the trait, and we vary the usage of this effect isoform). Data here underlies Fig. 4 and Extended Data Fig. 7.
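ACAT, the aggregated Cauchy association test, combines several P values into one by transforming each to a Cauchy-distributed quantity and averaging. The following is a minimal sketch of the standard ACAT formula with equal weights by default; it is an illustration only, the function name and the small-P cutoff are assumptions, and it does not reproduce the isoTWAS software.

```python
import math

def acat(p_values, weights=None):
    """Combine P values with the aggregated Cauchy association test."""
    n = len(p_values)
    if weights is None:
        weights = [1.0] * n
    total = sum(weights)
    t = 0.0
    for p, w in zip(p_values, weights):
        w = w / total  # normalize weights to sum to 1
        if p < 1e-15:
            # For tiny p, tan((0.5 - p) * pi) is approximately 1 / (p * pi).
            t += w / (p * math.pi)
        else:
            t += w * math.tan((0.5 - p) * math.pi)
    if t > 1e15:
        # Tail approximation of the Cauchy survival function for very large t.
        return 1.0 / (t * math.pi)
    return 0.5 - math.atan(t) / math.pi
```

Because the Cauchy transform is dominated by the smallest P values, a single strongly associated isoform can drive the combined result even when the other isoforms of the gene show no signal.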

Supplementary Data 8

Power to detect trait association at P < 2.5 × 10⁻⁶ across 1,000 simulations each for 22 genes using TWAS and isoTWAS (ACAT) across various simulations. These simulations are under Scenario 3 in Fig. 4c (a gene has two isoforms with differing effects on the trait, and we vary the effect size of one of the isoforms). Data here underlies Fig. 4 and Extended Data Fig. 7.

Supplementary Data 9

Sensitivity and mean set size of 90% credible sets determined by FOCUS in simulated data across a variety of genetic architecture parameters. Data here underlies Extended Data Fig. 7.
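The two summaries named here can be sketched as follows, assuming each simulation yields posterior inclusion probabilities (PIPs) for the features in a region plus the identity of the truly causal feature. This is a simplified illustration: the function names are hypothetical, the PIPs are normalized to sum to 1 as an assumption, and details of the FOCUS procedure itself are not reproduced.

```python
def credible_set(pips, level=0.9):
    """Smallest top-ranked set whose normalized PIPs sum to at least `level`."""
    total = sum(pips.values())
    ranked = sorted(pips.items(), key=lambda kv: kv[1], reverse=True)
    chosen, cumulative = [], 0.0
    for name, pip in ranked:
        chosen.append(name)
        cumulative += pip / total
        if cumulative >= level:
            break
    return chosen

def summarize(simulations, level=0.9):
    """simulations: list of (pips dict, causal feature name) pairs."""
    sets = [credible_set(pips, level) for pips, _ in simulations]
    # Sensitivity: share of simulations whose credible set contains the causal feature.
    hits = sum(causal in s for s, (_, causal) in zip(sets, simulations))
    sensitivity = hits / len(sets)
    mean_size = sum(len(s) for s in sets) / len(sets)
    return sensitivity, mean_size
```

Higher sensitivity with a smaller mean set size indicates better-calibrated and more informative fine-mapping.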

Supplementary Data 10

Raw TWAS results across 15 neuropsychiatric traits using adult brain cortex expression models. Data here underlies Extended Data Figs. 8–9.

Supplementary Data 11

Raw isoTWAS results across 15 neuropsychiatric traits using adult brain cortex expression models. Data here underlies Extended Data Figs. 8–9.

Supplementary Data 12

Raw TWAS results across 15 neuropsychiatric traits using developmental brain cortex expression models. Data here underlies Extended Data Figs. 8–9.

Supplementary Data 13

Raw isoTWAS results across 15 neuropsychiatric traits using developmental brain cortex expression models. Data here underlies Extended Data Figs. 8–9.

Supplementary Data 14

GWAS and nominal eQTL and isoQTL summary statistics corresponding to isoTWAS isoform-trait association examples shown in Fig. 6 and Extended Data Fig. 10b. Data here underlies Fig. 6 and Extended Data Fig. 10.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Bhattacharya, A., Vo, D.D., Jops, C. et al. Isoform-level transcriptome-wide association uncovers genetic risk mechanisms for neuropsychiatric disorders in the human brain. Nat Genet 55, 2117–2128 (2023). https://doi.org/10.1038/s41588-023-01560-2

Received: 09 September 2022
Accepted: 05 October 2023
Published: 30 November 2023
Issue Date: December 2023
DOI: https://doi.org/10.1038/s41588-023-01560-2
