Shared associations identify causal relationships between gene expression and immune cell phenotypes

Genetic mapping studies have identified thousands of associations between common variants and hundreds of human traits. Translating these associations into mechanisms is complicated by two factors: they fall into gene regulatory regions; and they are rarely mapped to one causal variant. One way around these limitations is to find groups of traits that share associations, using this genetic link to infer a biological connection. Here, we assess how many trait associations in the same locus are due to the same genetic variant, and thus shared; and if these shared associations are due to causal relationships between traits. We find that only a subset of traits share associations, with many due to causal relationships rather than pleiotropy. We therefore suggest that simply observing overlapping associations at a genetic locus is insufficient to infer causality; direct evidence of shared associations is required to support mechanistic hypotheses in genetic studies of complex traits.


Introduction 1
Genetic mapping studies have identified thousands of associations between common variants 2 and hundreds of human traits. Uncovering the mechanisms that underlie these traits requires 3 understanding the molecular, cellular and physiological events altered by causal genetic 4 variants. Incomplete fine mapping due to linkage disequilibrium and the possible action of 5 causal variants across diverse cell types, contexts and genes currently limits our ability to infer 6 the mode of action of causal variants, and hence the biology underlying traits. Experimentally 7 testing multiple such mechanistic hypotheses across thousands of associations rapidly 8 becomes a problem of scale; we thus need principled approaches to generating and testing 9 such mechanistic hypotheses. 10 11 We and others have suggested such an approach, building on the concept of pleiotropy. The 12 molecular and cellular events altered by causal genetic variants are, by definition, also genetic 13 traits, and they must be associated with the same variant. To link traits together and thus form 14 mechanistic hypotheses, one can thus look for shared genetic associations between traits [1][2][3][4]

. 15
Such sharing is often defined as two traits associated with variation in the same general 16 genome region, often within an arbitrary window of physical distance. A more robust alternative 17 is to identify pairs of traits that share an underlying causal effect, rather than a shared genomic 18 segment, and several methods have been developed to this end [5][6][7][8][9][10][11][12] . 19 20 We have reported a relative paucity of overlaps between expression quantitative trait loci 21 (eQTLs) and disease risk associations 12 . This result appears paradoxical given the strong 22 enrichment of risk heritability in gene regulatory regions [13][14][15] , which suggests that the majority of 23 risk effects should alter gene regulation, and therefore expression. This paucity may be 24 because we are not interrogating the right cell types, or the right physiological or stimulation 25 conditions for those cells; or it may be that we lack power to detect such overlaps, though our 26 simulations suggest the latter is not a major factor 3, 12 . 27 28 It is tempting, therefore, to assume that requiring a demonstration of shared association 29 between traits is overly stringent, and simply identifying associations to the same region is a 30 more productive approach. A further temptation is then to assume causality, especially when 31 the two traits are drawn from different levels of physiology: it is natural to assume that a gene 32 conditional association analyses for all markers within 200 kilobases of each lead SNP. At each 23 step, we identified the most associated SNP not in LD with any other lead SNP (r 2 < 0.2); if this 24 SNP had p < 1×10 -3 , we added it to the model and repeated the analysis until no independent 25 SNP satisfied the p-value threshold. For each conditionally independent signal, we then 26 calculated residual association statistics, where we condition on all other independent effects in 27 a locus. All association signals with a lead SNP with an association p < 1×10 -5 were carried 28 forward to subsequent analyses. These represent strong independent associations, with any 29 residual weak effects removed (identified by the more lenient p < 1×10 -3 threshold). 30 We next identified cis-eQTLs overlapping immune phenotype associations. We adopted the 1 GTEx definition of a cis-eQTL being within 1 megabase of the transcription start site of the 2 gene. We looked for conditionally independent immune phenotype associations within 200 3 kilobases of each lead SNP above; we therefore identified all genes with a transcription start 4 site (TSS) within 1 megabase of each lead SNP (R/BiomaRt v2.34.3, Ensembl build 37). We 5 then looked for cis-eQTL associations for each such gene in T cells, monocytes and 6 neutrophils, independently. For each identified eQTL (p < 1×10 -3 ), we then performed stepwise 7 conditional association analyses as described above, for all SNPs within 1.2 megabases of the 8 TSS ( Figure S4). We chose this distance so any effects overlapping the lead SNP window are 9 conditionally independent. 10 11 Due to the smaller sizes of the gene expression traits we limited iterations to a maximum of 12 three independent signals per locus. As above, we then calculated residual association 13 statistics for each independent eQTL effect in each of the three cell types. For each of the 14 genetic loci associated with immune phenotypes we then selected all gene expression 15 association statistics with a lead SNP with an association p < 1×10 -3 at the respective genetic 16

locus. 17
Identifying shared associations between immune and gene expression We tested for shared effects between immune and gene expression trait pairs with JLIM v2 12 . 20 Given genotype-phenotype associations for two phenotypes in different cohorts in the same 21 locus, JLIM assesses the likelihood of the joint model that variant i is causal in one trait and 22 variant j in another trait, over some number of variants observed in two distinct cohorts. If this 23 joint likelihood is maximal when i = j we can infer the presence of a single, shared effect driving 24 both associations. Conversely, when the likelihood is maximal when i ≠ j we can infer that the 25 observed associations are due to different underlying effects. JLIM assumes that only one 26 causal variant for each of the tested traits is present in the analyzed window. 27 For each trait pair, we used the immune phenotype as the primary trait and the gene expression 28 trait as secondary. We used the 404 non-Finnish European samples from the 1000 Genomes 29 Project (phase 3, release 2013/05/02) as an external LD reference panel. We permuted the secondary trait for each pairwise comparison 100,000 times to obtain empirical significance 1 levels, and used a false discovery rate (FDR) < 0.05 as a significance threshold. 2 We compared 16,652 unique combinations of immune phenotype and gene expression trait in 3 one of three cell types across 1,199 genetic loci. These pairs encompassed all 164 immune 4 traits and 7,060 genes with eQTLs in at least one of the three BLUEPRINT cell types. As we 5 considered up to three conditionally independent associations per gene and locus, we made a 6 total of 22,379 comparisons. 7 Calculating polygenic risk scores within trait pairs the ith individual over m independent SNPs is defined as: 14 where ! is the number of minor alleles carried at the jth SNP, and ! is the eQTL effect size for 15 the jth SNP 27 . To account for shared effects between some trait pairs but not others, we 16 condition eQTL traits on the variant with the strongest JLIM p-value (even if not significant), and 17 use the now conditionally independent eQTL data for the PRS calculation. We used ten 18 different significance thresholds to select these SNPs: 1×10 -2 , 1×10 -3 , 1×10 -4 , 5×10 -5 , 1×10 -5 , 19 5×10 -6 , 1×10 -6 , 5×10 -7 , 1×10 -7 , and 5×10 -8 . We then calculated the proportion of immune 20 phenotype variance (R 2 ) explained by these PRS and their empirical significance, also using 21

PRSice. 22 23
We then compared PRS results between trait pairs with a shared effect and trait pairs with no 24 sharing, using two approaches. We compared the proportion of immune phenotype variance 25 explained (i.e. the R 2 values) with the Mann-Whitney-Wilcoxon test, and the correlation between 26 JLIM p-values and PRS R 2 with univariate linear regression. 27 1 We used two Mendelian randomization (MR) approaches to assess evidence that gene 2 expression traits are causal for the immune phenotype traits for which they share an 3 association. First, we used two-sample Mendelian randomization (TSMR) 28,29 , as implemented 4 in the TwoSampleMR v0.4.25 R package, using inverse variance weighting of effect sizes. As 5 instruments, we selected all independent SNPs associated with the gene expression trait with 6 an association p-value < 1×10 -5 . 7 8 We also used transcriptome-wide summary statistics-based Mendelian Randomization 9 (TWMR), an extension of TSMR 30 . As described by Porcu et al, we first identified all variants 10 associated with the gene expression trait in each trait pair in each cell type (conditional 11 association p < 1×10 -3 ). We then identified all genes within 1 megabase of the gene's TSS, and 12 selected all SNPs associated with the expression levels of any of these genes. We then 13 removed genes with highly correlated expression values to the original gene (r 2 > 0.2), and 14 selected pairwise-independent SNPs from the remaining list (pairwise LD r 2 < 0.1). We used the 15 resulting set of variants as instruments in a multivariate MR model to estimate the causal effect 16 on the immune phenotype. independent association to at least one immune phenotype (p < 1×10 -5 ; 32 loci at p < 5×10 -8 ). In 10 83/1,379 loci, we found more than one independent effect for the same immune phenotype 11 (conditional p < 1×10 -5 , 6% of loci). 12 13 We next looked for genes whose expression could plausibly be influenced by the same genetic 14 effect in these loci. We found 14,634 genes with a transcription start site within 1 megabase of 15 the lead variants, with at least one gene in 1,318/1,379 (95.6%) of the loci. We found that though we test the vast majority of cases where immune phenotypes and eQTLs overlap, we 3 find limited statistical evidence for shared effects between them. This is consistent with our 4 previous observations of limited sharing between autoimmune disease associations and cis-5 eQTLs 12 . Also similar to our previous findings, we observe that the power to detect shared 6 association thus depends in part on statistical power in the secondary trait cohorts ( Figure S6). 7 8 Some of our results highlight clear relationships between traits: for example, we see a shared 9 effect between expression level of SELL and the level of its protein product L-selectin (CD62L; 10 Figure 1). We find the eQTL in all three BLUEPRINT cell types, and the immunological trait 11 association in both neutrophils and eosinophils. As expected, the allele associated with 12 increased SELL transcript abundance in neutrophils and monocytes is also associated with 13 CD62L intensity in both neutrophils and eosinophils (Figure 1 and Figure S7). However, the 14 same effect decreases SELL transcript abundance in T cells. We find strong evidence that the T We found that many shared effect cis-eQTLs are not in the same immune cell subpopulation as 21 their cognate immune phenotype. In some cases we did not have expression data for the 22 subpopulation in which an immunophenotype was measured. We found, for instance, a shared 23 effect between the expression level of CR2 in T cells and the level of its protein product, 24 complement receptor type 2 or CD21 in multiple B cell populations ( Figure S8). CD21 is the 25 route through which Epstein-Barr virus infects B cells, and there is more recent evidence that 26 this is also the mechanism of T cell infection 34 . This may therefore be a constitutive CR2 eQTL, 27 and may have a bearing on susceptibility to EBV infection. effect on the outcome trait [35][36][37] . Therefore, to establish if traits with shared associations are 10 more likely to be causally related, we first assess evidence for shared heritability between them, 11 and then directly assess evidence for mediation. that PRS (as !"# ! ). We reasoned that if traits with a shared association are more likely to share 20 heritability more broadly, we should see more variance explained in the 190 trait pairs than in 21 the 16,462 that do not pass our JLIM analysis. We knew, however, that the presence of a 22 shared association would bias this analysis in favor of our expected outcome, because we 23 would be including a known positive association to the immune traits only in the 190 pairs. We 24 accounted for this bias by performing conditionally independent association testing for the 25 main cis-eQTL effect. 26

27
We found that the variance of an immune trait explained by gene expression PRS was higher 28 when the two traits shared an association than when they did not share one, even though we 29 removed the effect of the shared association (Mann-Whitney-Wilcoxon p < 0.05; Table 1 and 30 Figure 3). We found that this was generally true across a range of thresholds for selecting 31 variants to include in the PRS calculation 27 . We also found that in cases where an immune trait 1 shared an association with more than one eQTL, including PRS for all shared eQTLs explained 2 more variance than any single expression trait alone ( Figure S9). Our analysis is conservative, 3 as we condition on the variant with the strongest evidence of shared association; as expected, 4 including lead cis-eQTL effects in the PRS calculations shows an even more extreme difference 5 (Table S2). 6 7 To ensure our results were not due to selection artefacts induced by p-value thresholds, we 8 also examined the correlation between all JLIM p-values and variance of immune phenotypes 9 explained by the gene expression PRS, and found significant correlation (Table 1, Figure 3). 10 Together, the results of the PRS analyses provide evidence for a stronger genetic correlation 11 between colocalized gene expression and immune phenotypes if they share the same 12 underlying genetic effect. 13 Trait pairs sharing associations are more likely to be causally related immune phenotype) where we found evidence for at least 2 independent variants associated 23 with the gene expression trait (conditional p < 1x10 -5 ). Using these variants as TSMR 24 instruments, we found that the estimated causal effects of gene expression traits on immune 25 phenotypes were higher in the 190 pairs with a shared association compared to the 16,441 26 non-sharing pairs (Figure 4a), but this difference was not significant. However, we observed 27 overall correlation between JLIM p-values and the magnitude of the causal effect of gene 28 expression traits on immune phenotypes (Figure 4b). These results suggest that trait pairs 29 sharing an association are more likely to be causally related, and that this likelihood increases 30 with increasing evidence that a shared association exists. 31 To address the limited number of instruments available for gene expression traits, we 1 broadened our analysis to include other transcripts in the locus influenced by the same 2 variants 30 . For each trait pair, we first identify variants independently associated with the gene 3 expression trait, as above. We then ask if these variants are associated with any other 4 transcript levels in the locus (within 1 megabase of the lead variant in the shared association 5 test), and if so, identify all variants independently associated with those transcripts too. We thus 6 gather a larger set of instruments for our MR analysis. We remove transcripts whose overall 7 expression is highly correlated to the initial gene expression trait, to avoid over-estimating the 8 effects of variants, as previously suggested 30 . We then perform the same inverse variance-9 weighted MR analysis as above, using the expanded instrument sets. In this work, we report that immune cell traits and gene expression traits share a small but 2 significant set of associations, and that these point to interesting biological events with 3 mechanistic implications. We then show that traits sharing a pleiotropic association tend to be 4 causally related, rather than subject to horizontal pleiotropy. Thus, even though individual 5 analyses may be under-powered, we are able to show in bulk that shared associations occur 6 between causally related traits. We were struck by the number of shared effects between monocyte and neutrophil gene 20 expression traits and T cell immune parameters. In these cases, the genes are either not 21 detected at all, or show no evidence of an eQTL, in our T cell data (Figure 2). These results 22 suggest that changes to gene expression in one cell type can have a direct effect on another 23 population. Widespread cross-talk between immune cell subsets is a well-attested 24 phenomenon, but to our knowledge this is the first time genetic mapping uncovers mechanisms 25 of cellular coordination across cell subsets. 26 27 A major challenge in human genetics is translating genotype-phenotype associations into 28 testable hypotheses of the underlying molecular, cellular and physiological events. Directly 29 predicting the effect of a trait-associated variant remains challenging, especially for non-coding 30 polymorphisms. This is further hampered by the limited resolution of fine mapping, so that in 31 most cases we can only narrow an association signal to a group of variants over a genomic 32 interval, all of which must be investigated, rather than pinpoint the exact causal variant. As 1 variant function prediction ability is limited, direct experimentation is necessary, gradually 2 uncovering molecular, cellular and ultimately physiological effects. Without prior information, a 3 variety of outcomes across different cell types and conditions need to be assessed at each 4 stage to uncover trait-relevant events. This approach is not scalable, so most translation efforts 5 are necessarily conducted piecemeal. 6 7 One alternative approach to this bottleneck is to exploit pleiotropy across traits to generate 8 molecular, cellular and physiological mechanism hypotheses, which can then be tested 9 experimentally in a more focused way. This approach uses the fact that a variant associated 10 with a physiological trait must act on molecular and cellular events; these are, by definition, also 11 genetic traits as they are altered by a genetic variant, and the variant must have a pleiotropic 12 effect on all these traits. We can thus compare, in unbiased fashion, many molecular traits 13 (gene expression levels, in the present work) with many cellular traits (here, immunological 14 parameters) to identify pleiotropic effects. Two broad approaches to this cross-trait pleiotropy 15 have been articulated: the first is to look for shared heritability between two traits genome-16 wide 47,48 ; and the second to look for shared effects at specific loci where we see genotype-17 phenotype associations, as we do in the present work. The latter is particularly suited to traits 18 such as gene expression, where one variant often explains a large proportion of trait variance. 19 20 After identifying pairs of traits that share genetic effects (either locus-specific or genome-wide), 21 it is tempting to immediately conclude that one trait directly causes the other. This conclusion is 22 particularly appealing when considering traits across different levels of physiology, as we do 23 here with gene expression and cellular measurements. Our results, however, suggest that 24 simply observing association of different traits at the same genetic locus is not sufficient. 25 Rather, we must look for direct evidence for causality between traits, and shared association 26 mapping is a useful approach to do so. Ultimately, only with explicit proof of causality can we 27 construct mechanistic hypotheses about trait physiology. and wrote the manuscript, which was approved by SC and SRS. 12 13 The authors declare that no competing interests exist.   meeting a threshold of association in the gene expression trait. We used these variants to calculate PRS for each individual in the Milieu Intérieur Project immune phenotype collection, and calculated the proportion of immune phenotype variance explained by these PRS ( !"# ! ). We found that gene expression PRS could explain significantly more immune trait variance for gene expression/immune trait pairs with a shared underlying genetic effect than for those that did not share an association. We saw this effect at different thresholds for selecting PRS instruments: (a) expression quantitative trait locus (eQTL) p < 5×10 -8 ; and (c) eQTL p < 5×10 -5 . We also found that the proportion of variance explained was correlated to the overall strength of evidence for a shared (JLIM p-value) at these two selection thresholds (b, d). MWW = Mann-Whitney-Wilcoxon test.