A compendium of uniformly processed human gene expression and splicing quantitative trait loci

Many gene expression quantitative trait locus (eQTL) studies have published their summary statistics, which can be used to gain insight into complex human traits by downstream analyses, such as fine mapping and co-localization. However, technical differences between these datasets are a barrier to their widespread use. Consequently, target genes for most genome-wide association study (GWAS) signals have still not been identified. In the present study, we present the eQTL Catalogue (https://www.ebi.ac.uk/eqtl), a resource of quality-controlled, uniformly re-computed gene expression and splicing QTLs from 21 studies. We find that, for matching cell types and tissues, the eQTL effect sizes are highly reproducible between studies. Although most QTLs were shared between most bulk tissues, we identified a greater diversity of cell-type-specific QTLs from purified cell types, a subset of which also manifested as new disease co-localizations. Our summary statistics are freely available to enable the systematic interpretation of human GWAS associations across many cell types and tissues.

This manuscript describes eQTL Catalog, and presents analyses of the properties and use cases of the data and demonstrate its value especially in terms of uniform effect size estimation and better finemapping. The biological discoveries are rather modest, but the analyses are diverse, well done, and follow all the latest standards of QTL mapping and downstream analyses. The manuscript is well written and careful work, and the data are presented very clearly. I have only a few technical comments: -Effect size sharing analysis: "We limited our analysis to 54,733 fine mapped eQTLs (see Methods) and defined two eQTLs to be shared between a pair datasets if they had the same sign and their effect sizes did not differ more than two-fold." Wouldn't these results bias the results? -The notion that gene expression and eQTL effect size sharing clusters very similarly could have a reference to GTEx v8 main paper (2020) that showed and quantified this pattern as well.
-It appears that the matrix factorization model was applies to only a subset of the data sets ( Fig 3D). Why? -The unit of Figure 3D color scale is not explained. -It could be helpful to include in Fig 5 legend the definition of "novel colocalization" (i.e. not detected in GTEx) to make it clearer for a casual reader.
Reviewer #2: Remarks to the Author: The authors describe the identification of cis-eQTLs for a wide variety of tissues and cell-types. They use a highly consistent approach to harmonise these datasets, and therefore are able to uncover sets of robust cis-eQTLs that affect whole gene and transcript expression levels.
I am impressed by the rigour that has been applied to arrive at these results, and applaud the accompanying website. I carefully studied the materials and methods and have no comments on the procedure that has been employed to identify these cis-eQTLs. I also believe these cis-eQTLs are highly relevant for the field, particularly since in the past sometimes incorrect procedures have been employed on these dataset: these methods sometimes did not property accounted for multiple testing, resulting in a large number of reported, but false-positive eQTLs.
I do have one comment: The manuscript is a little brief on the biological interpretation or application: I would imagine that its purpose is predominantly to link associated loci for complex diseases to individual genes inside these loci. The authors use colocalization and Mendelian randomisation to do this, but I would like to see more insights in when this works, and when it doesn't: My group has been applying coloc and MR, and we have observed that such procedures do result in the identification of positional candidate genes, but also that, based on a comparison with different gene prioritisation approaches, these candidate genes are very often not the most likely candidates. This is also something that others have recently observed (https://pubmed.ncbi.nlm.nih.gov/30239796/), and I believe the authors are uniquely positioned to report on this in the manuscript: I would love to see an analysis where the authors attempt to quantify to what extent this resource enables prioritization of the most likely positional candidate gene. Ideally, the authors might want to give recommendations on how to apply this resource, and discuss potential reasons why it remains a possibility that if for a given locus a GWAS signal and a tissue specific cis-eQTL signal are showing both colocalisation and encouraging Mendelian randomisation result that it still remains possible that this gene is in fact not the causal gene inside the locus.
It would additionally be helpful if the authors can speculate on the utility of single-cell eQTL datasets: the authors have studied several cell-type specific, but bulk RNA-seq datasets. I wonder whether single-cell RNA-seq eQTL studies might miss out on transcript eQTLs, that with bulk RNA-seq can be detected? And vice versa: what kind of biological principles can only be detected when using singlecell eQTL data instead of using bulk data instead? It might therefore be worthwhile to hear more about the pros and cons of bulk and single-cell eQTL studies.
Reviewer #3: Remarks to the Author: In this study the authors uniformly re-computed QTLs from 21 QTL studies, and demonstrated that cis-eQTLs are highly reproducible between studies from the same cell or tissue type. Much of the analysis resembles that of the GTEx eQTL papers, but the added value of this paper is that the observation that cell type and tissue type, not study identity or technical factors, are the dominant axis of variation in eQTL effects within this multi-study data set. I applaud the authors for their large effort of uniformly processing 21 eQTL studies, which allowed them to address this question effectively. Because this is one of the major claims of this paper, we do think that a bit more thorough analysis is needed. In particular, the overall sharing of eQTLs across tissues should be modeled explicitly and excess enrichment of overlap across studies within tissues should be clearly demonstrated. Such enrichment can be seen in the figure, but it is not explicitly modeled and tested.
Additional comments: The authors show eQTLs are highly reproducible between studies from the same cell or tissue type ( Figure 2B, 2C). Does this hold true for the three other molecular trait (transcript usage, exon expression, txevise events) QTLs?
It was undoubtedly a large effort to uniformly process genotype, expression, and covariate data for each of the 21 studies. The impact of this study could be broadened if this intermediate data, not just summary statistics and links to the raw data from the original studies, could be made accessible.
Much of the analysis presented in the manuscript relies on the selection of the 54,733 fine-mapped eQTLs to be a representative eQTL signal. The methods section entitled "Identifying independent signals based on fine mapping" did not adequately describe how those 54,733 variants were selected. I know that SuSIE is designed to identify fine-mapped credible sets in each study, independently. How were these credible-sets merged across studies? This needs to be clarified. Furthermore, once the credible sets are merged, it is necessary to select a representative, top variant. The author's choice of selecting the variant with smallest p-value across all eQTL datasets will certainly bias variant selection in favor of higher-powered data sets. It would be useful to see that the key results (figure 2B for example) are robust to this choice.
The legend for Figure 4D does not specify what study is being colocalized with each of the eQTL Catalogue datasets. I assume it is colocalization with the GWAS hit for lymphocyte count, but this should be clarified for the reader. In line 153, the mash model is introduced as "multiple adaptive shrinkage". It is actually called "multivariate adaptive shrinkage".

Author Rebuttal to Initial comments
We have now made the following major changes to the manuscript: 1. We now use our own re-processed version of GTEx v8 summary statistics across the manuscript. As a result, the number of distinct RNA-seq samples used in the analysis increased from 17,210 to 23,839 and the number of unique donors increased from 5,714 to 6,045. Consequently, the term 'GTEx' now consistently refers to GTEx version 8 throughout the manuscript. For consistency, we have also repeated the colocalisation analysis using our re-processed GTEx summary statistics rather than the official GTEx v8 summary statistics used in the previous version of the manuscript.
2. A small bug in the release 3 of the eQTL Catalogue summary statistics caused us to underestimate the number of novel colocalisation detected in the eQTL Catalogue by approximately 50% (https://www.ebi.ac.uk/eqtl/Release_notes/). This bug was fixed in release 3.1 and the analysis in the revised version of the manuscript uses the fixed summary statistics. We now find that ~20% of the colocalisation that we detect in the eQTL Catalogue are missed when relying on GTEx data alone.
3. We have now performed several additional analyses to demonstrate that our eQTL sharing results are robust to the way in which the eQTL lead variants are selected. We also demonstrate that the main QTL sharing results generalise to the other three quantification methods.
Responses to the specific concerns raised by the reviewers can be found below.

Reviewer #1 :
Remarks to the Author: Kerimov et al. introduce eQTL catalog that provides multiple types of genetic regulatory associations for the research community, based on uniform processing of a large number of eQTL data sets. This is a very valuable resource that is already being used widely, and I believe it will become The Repository for eQTL (and sQTL etc) data and an essential resource for the genetics and genomics community.
This manuscript describes eQTL Catalog, and presents analyses of the properties and use cases of the data and demonstrate its value especially in terms of uniform effect size estimation and better fine-mapping. The biological discoveries are rather modest, but the analyses are diverse, well done, and follow all the latest standards of QTL mapping and downstream analyses. The manuscript is well written and careful work, and the data are presented very clearly. I have only a few technical comments: sign and their effect sizes did not differ more than two-fold." Wouldn't these results bias the results?
We agree that how the variants are chosen for eQTL sharing analysis with mash could, in principle, bias the results. However, choosing those variants is necessary to avoid including hundreds of thousands of correlated variants into the model. To test how sensitive our results are to the choice of the variants, we have now tested three different strategies: (1) selecting the lead variant based on minimal p-value, (2) selecting the lead variants based on maximal effect size and (3) Figure 14). We also tested an approach where the lead variant within each connected component was selected based on the smallest p-value and found that this approach tended to favour datasets with larger sample size (Supplementary Figure 14). Reassuringly, the exact strategy for choosing lead variants did not have a strong effect on downstream eQTL sharing analysis results (Supplementary Figure 15).  Figure 14).
We found that while Strategy 1 favoured datasets with larger sample size, both strategies 2 and 3 slightly favoured datasets with smaller sample size (Supplementary Figure 14). However, the eQTL sharing patterns remained broadly similar regardless of how the lead variants were chosen (Supplementary Figure 15). In the end, we decided to use Strategy 2 (maximal effect size) for quantifying eQTL sharing and Strategy 3 for quantifying cross-dataset QTL sharing for the other three quantification methods.
2. The notion that gene expression and eQTL effect size sharing clusters very similarly could have a reference to GTEx v8 main paper (2020) that showed and quantified this pattern as well.
Thank you for the suggestion. This section of the manuscript (lines 160-163) now reads: Consistent with previous reports by GTEx, we found that the eQTL similarity between datasets closely matched their gene expression similarity ( Figure 3A-B) (1). This suggests that high gene expression similarity and a high degree of eQTL sharing both reflect similarity in the underlying regulatory state of cells.
3. It appears that the matrix factorization model was applies to only a subset of the data sets (Fig 3D). Why?
Matrix factorisation was always applied to all datasets, but to improve legibility, in the previous version of the manuscript we only included those cell types and tissues in Figure 3D that had non-zero factor loadings in at least one other factor than the universal factor. We have now repeated the matrix factorisation analysis using summary statistics from GTEx v8 and updated Figure 3D (and added Supplementary Figure 7) to contain almost all of the datasets (except nine stimulated monocyte and macrophage datasets for which we did not detect a separate factor). Fitting these additional 9 datasets on the figure would have made the factor labels too small to read. Figure 3D color scale is not explained.

The unit of
We have now modified the legend of Figure 3D to clairy that the colour scale represents factor loadings (in arbitrary units). Fig 5 legend the definition of "novel colocalization" (i.e. not detected in GTEx) to make it clearer for a casual reader.

It could be helpful to include in
The first sentence of the caption for Figure 5 now reads:

Overview of the novel GWAS colocalisations detected in the eQTL Catalogue but not in any of the GTEx tissues.
Reviewer #2: Remarks to the Author: The authors describe the identification of cis-eQTLs for a wide variety of tissues and cell-types. They use a highly consistent approach to harmonise these datasets, and therefore are able to uncover sets of robust cis-eQTLs that affect whole gene and transcript expression levels.
I am impressed by the rigour that has been applied to arrive at these results, and applaud the accompanying website. I carefully studied the materials and methods and have no comments on the procedure that has been employed to identify these cis-eQTLs. I also believe these cis-eQTLs are highly relevant for the field, particularly since in the past sometimes incorrect procedures have been employed on these dataset: these methods sometimes did not property accounted for multiple testing, resulting in a large number of reported, but false-positive eQTLs.

I do have one comment: The manuscript is a little brief on the biological interpretation or application: I would imagine that its purpose is predominantly to link associated loci for complex diseases to individual genes inside these loci. The authors use colocalization and Mendelian randomisation to do this, but I would like to see more insights in when this works, and when it doesn't:
My group has been applying coloc and MR, and we have observed that such procedures do result in the identification of positional candidate genes, but also that, based on a comparison with different gene prioritisation approaches, these candidate genes are very often not the most likely candidates. This is also something that others have recently observed (https://pubmed.ncbi.nlm.nih.gov/30239796/), and I believe the authors are uniquely positioned to report on this in the manuscript: I would love to see an analysis where the authors attempt to quantify to what extent this resource enables prioritization of the most likely positional candidate gene.

Ideally, the authors might want to give recommendations on how to apply this resource, and discuss potential reasons why it remains a possibility that if for a given locus a GWAS signal and a tissue specific cis-eQTL signal are showing both colocalisation and encouraging Mendelian randomisation result that it still remains possible that this gene is in fact not the causal gene inside the locus.
First, to clarify, we have only relied on colocalisation in our analysis and did not use Mendelian randomisation. Nevertheless, we completely agree that these approaches can often point to multiple genes, many of which are probably not causally involved in the GWAS trait. We now discuss this extensively in the manuscript (lines 386-408):

As the number of eQTL studies and their sample sizes increases, it is becoming increasingly clear that eQTL analysis is not the silver bullet for identifying causal genes underlying GWAS associations that it was once hoped to be. A number of carefully conducted studies have demonstrated that eQTL colocalisation analysis often identifies multiple candidate genes, many
of which are unlikely to be truly causal (41,42). Similarly, we found that in our analysis 56% of LD blocks colocalised with the expression of more than one gene. The two main reasons for this are: 1) multiple independent causal variants affecting the two traits that current colocalisation methods fail to properly distinguish, and 2) truly pleiotropic variants that affect multiple neighbouring genes. While improved fine mapping and colocalisation methods can overcome the first limitation, true molecular pleiotropy will remain. For example, we found that 18.4% of confidently fine mapped eQTLs (credible set size < 30) were associated with the expression of two or more genes (Supplementary Figure 11)

It would additionally be helpful if the authors can speculate on the utility of single-cell eQTL datasets: the authors have studied several cell-type specific, but bulk RNA-seq datasets. I wonder whether single-cell RNA-seq eQTL studies might miss out on transcript eQTLs, that with bulk RNA-seq can be detected? And vice versa: what kind of biological principles can only be detected when using single-cell eQTL data instead of using bulk data instead? It might therefore be worthwhile to hear more about the pros and cons of bulk and single-cell eQTL studies.
We agree that single-cell eQTL datasets are going to be highly relevant in the future and we have already started to integrate the first ones to the eQTL Catalogue. To reflect their importance, we have now added the following paragraph to the discussion (lines 421-431): A number of single-cell RNA-seq eQTL datasets have been published from differentiating iPSCs and peripheral blood cells (10,(49)(50)(51)

Reviewer #3:
Remarks to the Author: In this study the authors uniformly re-computed QTLs from 21 QTL studies, and demonstrated that cis-eQTLs are highly reproducible between studies from the same cell or tissue type. Much of the analysis resembles that of the GTEx eQTL papers, but the added value of this paper is that the observation that cell type and tissue type, not study identity or technical factors, are the dominant axis of variation in eQTL effects within this multi-study data set. I applaud the authors for their large effort of uniformly processing 21 eQTL studies, which allowed them to address this question effectively. Because this is one of the major claims of this paper, we do think that a bit more thorough analysis is needed.

In particular, the overall sharing of eQTLs across tissues should be modeled explicitly and excess enrichment of overlap across studies within tissues should be clearly demonstrated. Such enrichment can be seen in the figure, but it is not explicitly modeled and tested.
We decided to approach this by focussing on the seven cell types and tissues that had sufficiently large sample size (n > 200) and were profiled in two or more studies: For all these cell types and tissues, we obtained their pairwise eQTL sharing estimates from mash and compared their distributions in two groups: (1) same tissue, different study vs (2) different tissue, same study. We found that the eQTL sharing estimates were significantly higher in the first group compared to the second group (Wilcoxon rank sum test p-value < 10 -8 ) (Supplementary Figure 5A). Importantly the results did not change if we used a different strategy to define lead eQTLs variants for the mash analysis (Supplementary Figure 5B). We also repeated the same analysis for the three other quantification methods and observed the same general trend, although the differences were slightly smaller (Supplementary Figure 5C-E). Figure 5. Distribution of pairwise mash QTL sharing estimates for seven cell types and tissues (skin, adipose, LCL, blood, fibroblast, muscle, brain (DLPFC) profiled in two or more studies (GTEx, TwinsUK, GENCORD, GEUVADIS, Lepik_2017, ROSMAP, FUSION). Each panel contrasts the QTL sharing estimates for the same cell type or tissue profiled in different studies against different cell types and tissues profiled in the same study. Analysis was performed separately for gene expression (A-B), transcript usage (C), exon expression (D) and txrevise (E) QTLs. Note that adipose, skin and muscle tissues have high eQTL sharing also within the same study. The p-values were calculated using the two-sample Wilcoxon rank sum test.

Supplementary Figure 5. Supplementary
We now refer to this analysis also in the main text (lines 178-181): To assess this formally, we focussed on a subset of tissues profiled in at least two studies. We found that the average eQTL sharing for the same tissue profiled in two different studies was significantly higher than for two different cell types or tissues profiled within the same study (Supplementary Figure 5) Additional comments: 2. The authors show eQTLs are highly reproducible between studies from the same cell or tissue type ( Figure 2B, 2C). Does this hold true for the three other molecular trait (transcript usage, exon expression, txevise events) QTLs?
We have now used mash to estimate QTL sharing between datasets also for the other three molecular traits. These results are presented in Supplementary Figure 4, which is also copied below for convenience. Overall, we see broadly similar sharing patterns as we observed for eQTLs, but the results are noisier and there is some evidence for slightly larger batch effects (e.g. BLUEPRINT CD4+ T-cells no longer cluster as well together with other T-cell datasets and the distinction between whole blood, monocytes and LCLs is weaker).
We have added the following sentence to the main text (lines 176-178): We also observed broadly similar patterns of sharing among the QTLs detected with the other three quantification methods (Supplementary Figure 4).

It was undoubtedly a large effort to uniformly process genotype, expression, and covariate data for each of the 21 studies. The impact of this study could be broadened if this intermediate data, not just summary statistics and links to the raw data from the original studies, could be made accessible.
We have now published the processed RNA-seq count matrices and microarray gene expression data together with minimal sample metadata on Zenodo and added the following clarification to the the Data availability section: Unfortunately, we are unable to share the processed genotype files due to identifiable personal information contained in those and the restrictions set by the original data owners. Users who need access to the individual-level genotype data should directly contact each individual cohort. For convenience, the database accessions of the individual studies are provided on our website (https://www.ebi.ac.uk/eqtl/Studies/).

Much of the analysis presented in the manuscript relies on the selection of the 54,733
fine-mapped eQTLs to be a representative eQTL signal. The methods section entitled "Identifying independent signals based on fine mapping" did not adequately describe how those 54,733 variants were selected. I know that SuSIE is designed to identify fine-mapped credible sets in each study, independently. How were these credible-sets merged across studies? This needs to be clarified. Furthermore, once the credible sets are merged, it is necessary to select a representative, top variant. The author's choice of selecting the variant with smallest p-value across all eQTL datasets will certainly bias variant selection in favor of higher-powered data sets. It would be useful to see that the key results (figure 2B for example) are robust to this choice.
We agree that this part of the Methods section was unclear. We have now significantly rewritten it, adding much more detail ( Figure 14). We also tested an approach where the lead variant within each connected component was selected based on the smallest p-value and found that this approach tended to favour datasets with larger sample size (Supplementary Figure 14). Reassuringly, the exact strategy for choosing lead variants did not have a strong effect on downstream eQTL sharing analysis results (Supplementary Figure 15).
Identifying independent signals for the other three quantification methods (exon expression, transcript usage and txrevise) was more challenging, because each gene often has multiple highly correlated molecular traits (exons, transcripts, transcriptional Figure 14).
Regarding robustness, we now tried three different options: (1) selecting the lead variant based on minimal p-value, (2) selecting the lead variants based on maximal effect size and (3) selecting only one lead variant per gene from a randomly selected credible set. We found that while Strategy 1 indeed favoured datasets with larger sample size, both Strategies 2 and 3 slightly favoured datasets with smaller sample size (Supplementary Figure 14). However, the eQTL sharing patterns remained broadly similar regardless of how the lead variants were chosen (Supplementary Figure 15).
The Supplementary Figures 14 and 15 are shown below for convenience: