Abstract
The large majority of variants identified by GWAS are non-coding, motivating detailed characterization of the function of non-coding variants. Experimental methods to assess variants’ effect on gene expressions in native chromatin context via direct perturbation are low-throughput. Existing high-throughput computational predictors thus have lacked large gold standard sets of regulatory variants for training and validation. Here, we leverage a set of 14,807 putative causal eQTLs in humans obtained through statistical fine-mapping, and we use 6121 features to directly train a predictor of whether a variant modifies nearby gene expression. We call the resulting prediction the expression modifier score (EMS). We validate EMS by comparing its ability to prioritize functional variants with other major scores. We then use EMS as a prior for statistical fine-mapping of eQTLs to identify an additional 20,913 putatively causal eQTLs, and we incorporate EMS into co-localization analysis to identify 310 additional candidate genes across UK Biobank phenotypes.
Similar content being viewed by others
Introduction
Although genome-wide association studies (GWAS) have identified large numbers of loci associated with complex traits1,2, identifying the underlying biological mechanisms is often difficult. Two particular challenges are that (1) the majority of the associated variants are in noncoding regions1, and (2) the association signals from GWAS studies typically contain a large number of variants in linkage disequilibrium (LD)3. Interpreting associations in GWAS to identify the underlying causal mechanisms requires an understanding of the function of noncoding variants at single-variant resolution.
Many approaches to characterize noncoding variants exist. Large-scale consortium studies4,5 have provided a map of functional and regulatory elements across the genome in different cell types that are enriched in various trait heritability6,7,8,9,10. Reporter assays have been powerful tools to test variant effects in cellular contexts, but typical high-throughput massive parallel reporter assays (MPRAs)11,12 do not represent the native chromatin context in the human genome. Direct introduction of single base pair variants in the native genome are still low throughput13. RNA-seq studies combined with genotyping or whole-genome sequencing have highlighted loci that are associated with gene expression in humans (eQTLs)14,15,16. However, as with GWAS, eQTL studies associate loci, rather than individual causal variants, to gene expression.
Statistical fine-mapping3,17,18 is used to disentangle tightly correlated structures of the nearby genetic variants in LD to elucidate causal variant(s) in a locus identified by a genetic association study, such as a GWAS on an eQTL study. For example, Benner et al.19 uses stochastic search to enumerate and evaluate possible causal configurations, and Wang et al.20 performs iterative Bayesian stepwise selection to prioritize causal variants. Such fine-mapping methods have been applied to identify putative causal eQTLs (i.e., variants that modify gene expression in native chromatin context) that are valuable both for understanding gene regulation and for interpreting GWAS signals at a locus15,16,21,22,23,24. However, fine-mapped eQTLs fall short of genome-wide characterization of noncoding function, as many variants fail to be identified because of LD or small effect size.
While not providing the same level of confidence as genome editing or fine-mapped eQTLs, computational predictions are informative about variant function in native chromatin in human cells, and can be applied to every variant in the genome. For example, state-of-the-art computational methods predict the effects of noncoding genetic variants on the epigenetic landscape and on gene expression as a function of sequence context, using deep neural networks25,26,27,28,29,30. These methods, rather than directly training on gold standard expression-modifying variants, instead predict expression level or other outcomes as a function of sequence, and then score variants based on the difference in predicted expression between the two alleles.
Here, we combine such computational predictions with the large-scale, though not comprehensive, gold standard data provided by statistical fine-mapping of eQTLs, with two goals: to improve on existing computational predictors, and to expand the set of confidently identified eQTLs. Toward the former goal, we combine an existing sequence-based predictor28 with epigenetic data and other gene features into a single predictor, leveraging fine-mapped eQTLs (https://www.finucanelab.org/data) as training data. Specifically, we directly train a predictor of whether a variant modifies expression using 14,807 putative expression-modifying variant–gene pairs in humans as training data and utilizing 6121 features; we call the resulting prediction the expression modifier score (EMS). Toward the second goal, we use EMS as a prior for statistical fine-mapping of eQTLs (analogous to recently performed functionally informed fine-mapping of complex traits31,32,33), increasing fine-mapping resolution and identifying an additional 20,913 variants across 49 tissues. Finally, using UK Biobank (UKBB)34 phenotypes as an example, we show that EMS can be incorporated into colocalization analysis at scale, and we identify 310 additional candidate genes for UKBB phenotypes.
Results
Functional enrichment of fine-mapped eQTLs
To define the set of putative expression-modifying variant–gene pairs, we analyzed results of recent fine-mapping of cis-eQTLs (±1 Mb window) from GTEx v8 (ref. 16; https://www.finucanelab.org/data), including the 14,807 variant–gene pairs with posterior inclusion probability (PIP) > 0.9 according to two methods19,20 across 49 tissues (Supplementary Figs. 1 and 2). The size of our dataset allowed us to quantify the enrichment of putative causal variant–gene pairs for several functional annotations, including deep learning-derived variant effect scores from Basenji28,29 and distance to canonical transcription starting site (TSS), with high precision (Fig. 1, and Supplementary Figs. 3 and 4). Our results are consistent with previous studies24,35: putative causal variant–gene pairs are enriched for a number of functional annotations, such as 5′UTR, H3K4me3 (>10× enrichment compared to random variant–gene pairs) or distance to TSS (>500× enrichment for variant–gene pairs with distance to TSS < 100), but are not strongly enriched for introns (0.966×), and are depleted for a histone mark related to heterochromatin state (H3K9me3; 0.510× enrichment).
Building a predictor for putative causal eQTLs [EMS]
Next, we built a random forest classifier of whether a given variant is a putative causal eQTL for a given gene using 807 binary functional annotations, including cell-type-specific histone modifications, as well as non-cell-type-specific annotations from the baseline model4,5,6, 5313 Basenji features corresponding to functional activity predictors28,29, and distance to TSS. We then scaled the output score of the random forest classifier to reflect the probability of observing a positively labeled sample in a random draw from all the variant–gene pairs (Fig. 2a and “Methods”), and named this scaled score the EMS. We performed the above process for 49 tissues in GTEx v8 individually, to obtain the EMS for variant–gene pairs in each tissue. In other words, EMS is an estimated probability of a variant–gene pair being a putative causal eQTL in a specific tissue, given the >6000 functional annotations of the variant–gene pair. For whole blood, the Basenji scores together had 55.0% of the feature importance for EMS, and distance to TSS had feature importance of 43.1%. The binary functional annotations together had <2% of importance (Fig. 2b, c). Analyses of other tissues also showed that (1) distance to TSS is by far the most important single feature, (2) Basenji scores individually explain a small fraction of predictor performance, but are collectively equally or more important than the distance to TSS, and (3) compared to the distance to TSS and Basenji scores, the feature importances of both cell-type-specific and nonspecific binary functional annotations are much smaller (Supplementary Data 1).
Performance evaluation of EMS
To evaluate the performance of EMS, we focused on whole blood and compared EMS (calculated by leaving one chromosome out at a time to avoid overfitting) to other genomic scores26,36,37,38,39. EMS achieved higher prediction accuracy than other genomic scores for putative causal eQTLs (top bin enrichment for held-out putative causal eQTLs 18.3× vs 15.1× for distance to TSS, the second best, Fisher’s exact test p = 3.33 × 10−4, Fig. 3a; AUPRC = 0.884 vs 0.856 when using distance to TSS, the second best, Supplementary Fig. 5 and “Methods”). EMS was among the top-performing methods in prioritizing experimentally suggested regulatory variants from reporter assay experiments12,40, despite not varying distance to TSS, the most informative feature (Fig. 3b, c, Supplementary Fig. 6, and “Methods”). Finally, EMS was also among the top-performing methods in prioritizing putative causal noncoding variants for hematopoietic traits in the UKBB dataset (17.6× for EMS, best, vs 17.1× for DeepSEA, the second best; Fig. 3d), although there are known differences between the genetic architectures of cis-gene expression and complex traits41. These results were consistent when we performed the same set of analyses in different datasets: hematopoietic traits in BioBank Japan42 and lymphoblastoid cell line (LCL) eQTL in Geuvadis14,22 (Supplementary Fig. 7).
Functionally informed fine-mapping using EMS
Since EMS is in units of estimated probability, one natural way to utilize EMS for better prioritization of putative causal eQTLs is to use it as a prior for statistical fine-mapping. We developed a simple algorithm for approximate functionally informed fine-mapping and applied it with EMS as a prior to obtain a functionally informed posterior, denoted PIPEMS, in whole blood (“Methods”). As expected, we found that PIPEMS identified more putative causal eQTLs than the original PIP calculated with a uniform prior, denoted PIPunif. Specifically, 95.4% of variants with PIPunif > 0.9 also had PIPEMS > 0.9 (2152 out of 2255), while only 33.8% of variants with PIPEMS > 0.9 had PIPunif > 0.9 (1125 out of 3277; Fig. 4a). Similarly, credible sets mostly decreased in size (Fig. 4b and Supplementary Data 2). Previous work in functionally informed fine-mapping33 adjusted the prior so that the maximum prior value did not exceed 100 times the minimum prior value. We conducted a second round of functionally informed fine-mapping with a similar adjustment of the prior, identifying fewer additional putative causal eQTLs, as expected (1125 with EMS as a prior vs 269 with EMS adjusted to a max/min ratio of 100 as a prior; Supplementary Fig. 8).
We evaluated the quality of PIPEMS by comparing it with PIPunif and a publicly available eQTL fine-mapping result that uses distance to TSS as a prior16,23 (denoted PIPDAP-G) in two ways (other methods for functionally informed fine-mapping based on expectation maximization31,32,35 would be computationally intensive for a dataset this size, while the recently introduced PolyFun33 is designed for complex traits). First, PIPEMS had the highest enrichment level of reporter assay QTLs40 (raQTLs) in the PIP > 0.9 bin (16.8× vs 12.9× in PIPunif and 11.4x in PIPDAP-G, Fisher’s exact test p = 1.65 × 10−2 between PIPEMS and PIPDAP-G; Fig. 4c). Second, complex trait causal noncoding variants were comparably enriched in PIP > 0.9 bins (Supplementary Fig. 9). These results suggest that PIPEMS is a valid measure for identifying putative causal cis-regulatory variants.
Applying functionally informed PIP (PIPEMS) in gene prioritization across 95 traits
We next compared the utility of PIPEMS to PIPunif for complex trait gene prioritization, as in Weeks et al.43. To do this, we first calculated PIPEMS for 49 GTEx tissues using EMS of matched tissues as priors (Supplementary Figs. 10 and 11), resulting in a total of 20,913 additional eQTLs with PIPEMS > 0.9 (Fig. 5a, Supplementary Fig. 12, and Supplementary Data 3). Tissue-specificity of putative causal eQTLs were characterized by enrichments of corresponding tissue-specific transcription factor (TF) activity scores in the Basenji model (Fig. 5b–d, Supplementary Figs. 13 and 14, and “Methods”). We then colocalized the eQTL signals with 95 UKBB phenotypes. Using the evaluation gene set described in ref. 43, PIPEMS achieved higher precision and higher recall than PIPunif (Table 1 and “Methods”). Overall, PIPEMS elucidated 310 candidate genes for UKBB phenotypes that were not identified with PIPunif (Supplementary Data 4). On the other hand, PIPDAP-G showed lower precision than PIPEMS and PIPunif but higher recall (Table 1), suggesting the value of future studies in investigating different priors in eQTL fine-mapping and the trade-off between precision and recall for gene prioritization.
An example of PIPEMS resolving a credible set that is ambiguous with PIPunif is shown in Fig. 6. Here, four variants upstream of CITED4 are in perfect LD in GTEx, giving PIPunif = 0.25 for all four (Supplementary Fig. 15). In UKBB, the four variants are also in high LD, with PIP for neutrophil count between 0.133 and 0.181 for all four. Thus, standard colocalization analysis does not identify CITED4 as a neutrophil count-related gene (CLPP < 4.53 × 10−2 for all variants; “Methods”). However, one of the four variants, rs35893233, creates a binding motif of SPI1, a TF known to be involved in myeloid differentiation44,45, and presents epigenetic activity in myeloid-related cell types, such as showing the highest basenji score for cap analysis gene expression (CAGE)46 activity in acute myeloid leukemia. This variant has >25× greater EMS than the other three variants (1.73 × 10−3 vs 6.11 × 10−5, 1.00 × 10−5 and 8.62 × 10−6, respectively), enabling PIPEMS to narrow down the credible set to the single variant (PIPEMS = 0.956 for rs35893233). Integrating EMS into the colocalization analysis thus allows identification of CITED4 as a neutrophil count-related gene (CLPP = 0.173). Additional examples are described in Supplementary Fig. 16.
Discussion
In this study, we introduced EMS, a prediction of the probability that a variant has a cis-regulatory effect on gene expression in a tissue. To derive EMS, we trained a random forest model that takes >6000 features. By analyzing the importance of each feature in the model, we showed that the importance of direct epigenetic measurements, such as binary histone mark peak annotation is relatively limited once distance to TSS and deep learning-derived variant effect scores (Basenji) were incorporated. Taking whole blood as an example, we showed that EMS accurately prioritizes putative causal eQTLs, reporter assay active variants, and putative complex trait causal noncoding variants. We provided a broader set of putative causal variants (n = 20,913 across 49 tissues) by using EMS as a prior to perform approximate functionally informed eQTL fine-mapping, and utilized EMS for colocalization analysis to identify 310 additional candidate genes for complex traits.
Evaluating predictors of noncoding variant function is complicated by the absence of gold standard data. While EMS outperformed other scores for prioritizing putative causal eQTLs, which we believe to be the closest to gold standard of existing large-scale base-pair resolution datasets, it did not outperform existing scores in prioritizing reporter assay active variants or putative complex trait causal noncoding variants. These latter two datasets, while valuable for independent validation, do not fully recapitulate the challenge of prioritizing causal expression-modifying variants in native context41,47. On the other hand, we recognize that putative causal eQTLs on a held-out chromosome do not constitute a fully independent validation set. As genome editing technologies continue to improve, we look forward to future large-scale datasets that will enable independent, gold standard evaluation and comparison of scores of noncoding functions at base-pair resolution.
Although our work refines our understanding of cis-gene regulatory mechanisms at single-variant resolution, it also presents limitations. First, there are biases in the way the training variants are ascertained: the power to call a putative causal variant is affected by the recombination rate and the allele frequency of the variant48,49, and the GTEx cohort is highly biased towards adult samples with European ancestry background. Second, although we utilize over 6000 features in EMS, larger sets of variant and gene annotations, such as 3D configuration of genome50,51, constraint52,53,54, or pathway enrichment43 of genes could allow us to further improve prediction accuracy. Third, we simplified the prediction task by thresholding PIP. We formed a binary classification problem rather than a regression problem to build a predictor due to a highly skewed distribution of PIP, and because of LD-induced biases in variants with intermediate PIPs, but with larger sample size and a more principled hierarchical model, we could potentially take advantage of variants with intermediate PIP as well.
In this work, we focused on the task of predicting putative causal eQTLs. Future work could use a similar framework to predict putative causal splicing QTLs or other molecular QTLs for which statistical fine-mapping has identified a large number of high-PIP variants. In addition, although noisy effect size estimates from eQTL studies present a challenge, future work could explore leveraging features correlated with the sign and magnitude of effect (Supplementary Fig. 17) to estimate these values. As recent studies have suggested, such approaches would also be valuable in understanding the gene expression and complex trait regulation landscape in light of natural selection55. Our approach of utilizing statistical fine-mapping of eQTLs to define training data, assembling large number of features to train a predictor, and using the predictor output to expand the set of putative causal eQTLs is highly generalizable. EMS for all variant–gene pairs in GTEx v8 are publicly available for 49 tissues. Our study provides a powerful resource for deciphering the mechanisms of noncoding variation.
Methods
The expression modifier score
Fine-mapping of GTEx v8 data is described in https://www.finucanelab.org/data and is summarized in the Supplementary Methods. We constructed a binary classification task by labeling the variant–gene pairs with PIP > 0.9 for both of the two fine-mapping methods (FINEMAP19 and Sum of Single Effects, SuSiE20) as positive, and the ones with PIP < 0.0001 for both methods as negative. Each variant–gene pair was annotated with 6121 features (distance to TSS annotated in the GTEx v8 dataset, 12 non-cell-type-specific binary features from the LDSC baseline model6, 795 cell-type-specific binary features from the Roadmap Epigenomics Consortium5, where variants falling in narrow peak are annotated as 1, and others are 0, and 5313 deep learning-derived cell-type-specific features generated by the Basenji model28,29; Supplementary Data 5). The 152 most predictive features were selected based on different prediction accuracy metrics, such as F1 measure and mean decrease of impurity for each feature (Supplementary Methods). A combination of random search followed by grid search was performed to tune the hyperparameter for a random forest classifier that maximizes the AUROC of the binary prediction in the held-out dataset (Supplementary Data 6). Finally, for each prediction score bin, we calculated the fraction of positively labeled samples and scaled the output score, to derive the EMS. Further details are described in the Supplementary Methods.
Performance evaluation of EMS
To evaluate the performance of EMS, for each chromosome, we trained EMS using all the other chromosomes to avoid overfitting. CADD36 v1.4 and GERP38 scores were annotated using the hail56 annotation database (https://hail.is), and ncER39 scores were downloaded from https://github.com/TelentiLab/ncER_datasets. In order to annotate the DeepSEA26 v1.0 and Fathmm37 v2.3 noncoding scores, we mapped hg38 coordinates to hg19 using the hail liftover function, removed variants that do not satisfy 1-to-1 matching, and followed their web instructions (https://humanbase.readthedocs.io/en/latest/deepsea.html, and http://fathmm.biocompute.org.uk) to score the variants. Insertion and deletions were not included in the Fathmm scores. For DeepSEA, we calculated the e-values from the individual features, following ref. 4. We computed the area under the receiver operating characteristic curve and the precision recall curve (Supplementary Fig. 5), as well as enrichments of different variant–gene pairs or variants, as described in the next sections (Fig. 3).
Computation of enrichment
Enrichment of a specific set of variant–gene pairs (e.g., putative causal variants in GTEx whole blood) in a score bin is defined as the probability of drawing a variant–gene pair in the set given that the variant–gene is in the score bin, divided by the overall probability of drawing a variant–gene pair in the set. The error bar of enrichment denotes the standard error of the numerator, divided by the denominator (we assumed the standard error of the denominator is small enough, since the total number of variant–gene pairs is typically large; >100,000,000 for all the variant–gene pairs in GTEx v8). When testing binary functional features as in Fig. 1, the score is the individual functional feature, and the set is defined by the specific PIP bin.
enrichment analysis of eQTL, complex trait, and reporter assay data
Saturation mutagenesis data12 was downloaded from the MPRA data access portal (http://mpra.gs.washington.edu). An MPRA hit was defined as having a bonferroni-significant association p value (<0.05 divided by the total number of variant–cell type pairs) for at least one cell type, regardless of the effect size and direction. The raQTL data40 was downloaded from https://osf.io/w5bzq/wiki/home/. EMS was rescaled to have a constant distance to TSS (200 bp, roughly representing the scale of typical distance to TSS in plasmids12), which is expected to significantly decrease the performance of EMS compared to in native genome. Similarly, when comparing EMS with other scores for enrichments of MPRA hits or raQTLs, distance to TSS was not used for the comparison.
Fine-mapping of UKBB traits is described in https://www.finucanelab.org/data. To focus on noncoding regulatory effects, we annotated the variants in VEP57 v85 and filtered out coding and splice variants for the UKBB dataset. For each (noncoding) variant, we calculated the maximum PIP over all the hematopoietic traits, as well as the maximum whole-blood EMS over all the genes in the cis-window of the variant, since a variant can have different regulatory effect on different genes, for different phenotypes. A variant was defined as putative hematopoietic-trait causal if it has SuSiE PIP > 0.9 in any of the hematopoietic traits. In UKBB, we focused on the variants that exist in the GTEx v8 dataset to reduce the calculation complexity.
For all four datasets, the variants (or variant–gene pairs in GTEx) other than putative causal ones were randomly downsampled to achieve a total number of variants to be exactly 100,000, to reduce the computational burden, while keeping enough number of variants to observe statistical significance. GTEx enrichment, MPRA hits enrichment, raQTL enrichment, and UKBB enrichment are thus defined as the enrichment of putative causal eQTLs, MPRA hits, raQTLs, and putative hematopoietic-trait causal variants in the downsampled dataset, respectively.
Approximate functionally informed fine-mapping using EMS
In the SuSiE model, for a given gene, the vector \(b\) of true SNP effects on that gene is modeled as a sum of vectors with only one non-zero element each:
where \(b\) and \({b}_{l}\) are vectors of length \(m\) and \(m\) is the number of variants in the locus. Intuitively, each \({b}_{l}\) corresponds to the contribution of one causal variant. One output of SuSiE is a set of \(m\)-vectors \({\alpha }_{1},...,{\alpha }_{L}\), with \({\alpha }_{L}(v)\) equal to the posterior probability that \({b}_{l}(v)\ne 0\); i.e., that the \(l\)th causal variant is the variant \(v\). Credible sets are computed for each \(l\) from \({\alpha }_{l}\), and credible sets that are not pure—i.e., that contain a pair of variants with absolute correlation < 0.5—are pruned out. The \({\alpha }_{l}\) are also used to compute PIPs.
Our algorithm for approximate functionally informed fine-mapping takes the approach of re-weighting the posterior probability calculated using the uniform prior, analogous to ref. 32, and proceeds as follows. For each gene and each tissue, we start with \({\alpha }_{1},...,{\alpha }_{L}\) computed by SuSiE using the uniform prior. For each \(l\), if \({\alpha }_{l}\) corresponds to a pure credible set, we re-weight each element of \({\alpha }_{l}\) by the EMS of the corresponding variant, and we normalize so that the sum is equal to 1, obtaining \({\hat{\alpha }}_{l}\). In other words, letting \({w}_{1}\)…\({w}_{m}\) denote the EMSs for the \(m\) variants, we define \({\hat{\alpha }}_{l}(v)\) for the variant \(v\) to be
if \({\alpha }_{l}\) corresponds to a pure credible set; otherwise, we set \({{\hat{\alpha }}_{l}=\alpha }_{l}\). We then use the updated \({\hat{\alpha }}_{1},...,{\hat{\alpha }}_{L}\) to compute updated PIPs and credible sets, as in the original SuSiE method. See Supplementary Methods for further details.
Performance evaluation of PIPEMS and application to gene prioritization
PIP using distance to TSS as a prior (PIPDAP-G) was downloaded from the GTEx portal (https://gtexportal.org/). The raQTL data was downloaded from https://osf.io/w5bzq/wiki/home/, and the negative variants were randomly downsampled to a total of 100,000 variants. For complex trait causal noncoding variant prioritization, a threshold of PIP > 0.1 was chosen to account for low sample size. We defined a gene prioritization task using 49 tissues in GTEx v8 and 95 complex traits in UKBB, using the following steps (further details are described in Weeks et al.43):
Across all traits, we identified 1 Mb regions centered at unresolved credible sets (no coding variant with PIP > 0.1) that additionally contained at least one “evaluation gene” (protein-coding variant with PIP > 0.5) for the same trait. There were 2897 such regions and 1161 evaluation genes. Our intuition is that the gene with the fine-mapped protein-coding variant is most likely to be the primary causal signal, and that a nearby noncoding signal is more likely to act through this gene (i.e., via regulation) than through a different gene.
For each gene–region pair, we defined the colocalization posterior probability (CLPP) for the gene to be the maximum of the product of the eQTL PIP and trait PIP, across all tissues and all variants in the unresolved credible set. A gene is prioritized if it has CLPP > 0.1 and it has the maximum CLPP in its region. We compute the precision as the number of correctly prioritized genes (where the prioritized gene is also the gene with the primary, protein-coding signal) divided by the total number of prioritized genes. We compute recall as the number of correctly prioritized genes divided by the total number of evaluation genes. The total number of candidate genes is defined as the number of gene–trait pairs, presenting CLPP > 0.1 in at least one tissue and variant.
Tissue-specific putative causal eQTL analysis
Tissue-specific putative causal eQTL in a tissue was defined as a variant–gene pair with PIPEMS > 0.9 in the tissue and PIPEMS < 0.1 in all the other tissues (including cases where a variant is missing in a tissue; Supplementary Data 7). A tissue-specific putative causal eQTL pair was defined as a pair of tissue-specific putative causal eQTL on a same gene in two different tissues, existing within 10 kb distance (Supplementary Fig. 14 and Supplementary Data 8). Basenji features were classified as TF related if the feature name contains the gene symbol classified as a human TF in an external database58 (http://humantfs.ccbr.utoronto.ca/download.php).
Then for each TF, we defined it as specific for tissue T if the expression level (TPM) of the TF was higher in T than in all other tissues and was >2 standard deviations away from the mean expression level across tissues. All the tissues for which the TF had expression level ten times lower than that of tissue T were defined as control tissues. TF-related Basenji features with no specific tissue, or lacking control tissues were filtered out. We also filtered out the features where the TF specificity and the assay cell type did not clearly match (Supplementary Data 9). This resulted in 42 TF-related Basenji features corresponding to 30 unique TFs. Enrichment of each TF-related Basenji feature was examined by comparing the average score in the tissue-specific putative causal eQTLs for the corresponding tissue with the average in the control tissues, using a t test (Supplementary Data 9).
Statistical analysis
All the statistical tests were two-sided. No adjustment was made in the p value we report.
Error bar in Fig. 5b–d and Supplementary Fig. 13 is defined as the standard error of the mean.
Error bar in the enrichment analyses (all the other figures, where error bars are present) are explained in the “Computation of enrichment“ section in the “Methods” . The set of software used for data generation, statistical analysis, and plotting in the study are listed below:
SuSiE v0.8.1.0521 (https://github.com/stephenslab/susie-paper)
FINEMAP v1.3.1 (http://www.christianbenner.com)
ggseqlogo (https://cran.r-project.org/web/packages/ggseqlogo/index.html)
basenji v0.0.1 (https://github.com/calico/basenji)
brokenaxis v0.3.1 (https://pypi.org/project/brokenaxes/)
joblib v0.11 (https://joblib.readthedocs.io)
hail v0.2.26 (https://hail.is)
matplotlib v3.2.0 (https://matplotlib.org)
numpy v1.18.1 (https://numpy.org)
pandas v1.0.1 (https://pandas.pydata.org)
scikit-learn v0.21.3 and v0.23.2 (https://scikit-learn.github.io/stable)
scipy v1.2.1 (http://scikit-learn.github.io/stable)
seaborn v0.9.0 (https://seaborn.pydata.org).
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
EMS for 49 tissues are available at https://www.finucanelab.org/data. CADD v1.4 and GERP scores were annotated using the hail annotation database (https://hail.is). ncER scores were downloaded from https://github.com/TelentiLab/ncER_datasets. DeepSEA v1.0 scores were downloaded from https://humanbase.readthedocs.io/en/latest/deepsea.html. Fathmm v2.3 noncoding scores were downloaded from http://fathmm.biocompute.org.uk. Saturation mutagenesis data was downloaded from the MPRA data access portal (http://mpra.gs.washington.edu). The raQTL data was downloaded from https://osf.io/w5bzq/wiki/home/. Human transcription factor (TF) data was downloaded from http://humantfs.ccbr.utoronto.ca/download.php. The UKBB fine-mapping results are deposited at https://www.finucanelab.org/data.
Code availability
Code used in this manuscript is available at https://github.com/FinucaneLab/Expression_Modifier_Score/.
References
Maurano, M. T. et al. Systematic localization of common disease-associated variation in regulatory DNA. Science 337, 1190–1195 (2012).
Paul, D. S., Soranzo, N. & Beck, S. Functional interpretation of non-coding sequence variation: concepts and challenges. Bioessays 36, 191–199 (2014).
Maller, J. B. et al. Bayesian refinement of association signals for 14 loci in 3 common diseases. Nat. Genet. 44, 1294–1301 (2012).
The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
Roadmap Epigenomics Consortium. et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).
Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228–1235 (2015).
Pickrell, J. K. Joint analysis of functional genomic data and genome-wide association studies of 18 human traits. Am. J. Hum. Genet. 94, 559–573 (2014).
Andersson, R. et al. An atlas of active enhancers across human cell types and tissues. Nature 507, 455–461 (2014).
Trynka, G. et al. Chromatin marks identify critical cell types for fine mapping complex trait variants. Nat. Genet. 45, 124–130 (2013).
Trynka, G. & Raychaudhuri, S. Using chromatin marks to interpret and localize genetic associations to complex human traits and diseases. Curr. Opin. Genet. Dev. 23, 635–641 (2013).
Tewhey, R. et al. Direct identification of hundreds of expression-modulating variants using a multiplexed reporter assay. Cell 165, 1519–1529 (2016).
Kircher, M. et al. Saturation mutagenesis of twenty disease-associated regulatory elements at single base-pair resolution. Nat. Commun. 10, 3583 (2019).
Tian, R. et al. Pitfalls in single clone CRISPR-Cas9 mutagenesis to fine-map regulatory intervals. Genes 11, 504 (2020).
Lappalainen, T. et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501, 506–511 (2013).
Aguet, F. et al. Genetic effects on gene expression across human tissues. Nature 550, 204–213 (2017).
The GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330 (2020).
Chen, W. et al. Fine mapping causal variants with an approximate Bayesian method using marginal test statistics. Genetics 200, 719–736 (2015).
Schaid, D. J., Chen, W. & Larson, N. B. From genome-wide associations to candidate causal variants by statistical fine-mapping. Nat. Rev. Genet. 19, 491–504 (2018).
Benner, C. et al. FINEMAP: efficient variable selection using summary data from genome-wide association studies. Bioinformatics 32, 1493–1501 (2016).
Wang, G., Sarkar, A., Carbonetto, P. & Stephens, M. A simple new approach to variable selection in regression, with application to genetic fine mapping. J. R. Stat. Soc. B 82, 1273–1300 (2020).
Hormozdiari, F., Kostem, E., Kang, E. Y., Pasaniuc, B. & Eskin, E. Identifying causal variants at loci with multiple signals of association. Genetics 198, 497–508 (2014).
Brown, A. A. et al. Predicting causal variants affecting expression by using whole-genome sequencing and RNA-seq from multiple human tissues. Nat. Genet. 49, 1747–1751 (2017).
Wen, X., Lee, Y., Luca, F. & Pique-Regi, R. Efficient integrative multi-SNP association analysis via deterministic approximation of posteriors. Am. J. Hum. Genet. 98, 1114–1129 (2016).
Wen, X., Luca, F. & Pique-Regi, R. Cross-population joint analysis of eQTLs: fine mapping and functional annotation. PLoS Genet. 11, e1005176 (2015).
Agarwal, V. & Shendure, J. Predicting mRNA abundance directly from genomic sequence using deep convolutional neural networks. Cell Rep. 31, 107663 (2020).
Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning–based sequence model. Nat. Methods 12, 931–934 (2015).
Zhou, J. et al. Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nat. Genet. 50, 1171–1179 (2018).
Kelley, D. R. et al. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 28, 739–750 (2018).
Kelley, D. R. Cross-species regulatory sequence activity prediction. PLoS Comput. Biol. 16, e1008050 (2020).
Kopp, W., Monti, R., Tamburrini, A., Ohler, U. & Akalin, A. Deep learning for genomics using Janggu. Nat. Commun. 11, 3488 (2020).
Kichaev, G. et al. Integrating functional data to prioritize causal variants in statistical fine-mapping studies. PLoS Genet. 10, e1004722 (2014).
Jiang, J. et al. Functional annotation and Bayesian fine-mapping reveals candidate genes for important agronomic traits in Holstein bulls. Commun. Biol. 2, 1–12 (2019).
Weissbrod, O. et al. Functionally informed fine-mapping and polygenic localization of complex trait heritability. Nat. Genet. 52, 1355–1363 (2020).
Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203 (2018).
Chen, W., McDonnell, S. K., Thibodeau, S. N., Tillmans, L. S. & Schaid, D. J. Incorporating functional annotations for fine-mapping causal variants in a Bayesian framework using summary statistics. Genetics 204, 933–958 (2016).
Rentzsch, P., Witten, D., Cooper, G. M., Shendure, J. & Kircher, M. CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res. 47, D886–D894 (2019).
Shihab, H. A. et al. Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions using hidden Markov models. Hum. Mutat. 34, 57–65 (2013).
Cooper, G. M. et al. Distribution and intensity of constraint in mammalian genomic sequence. Genome Res. 15, 901–913 (2005).
Wells, A. et al. Ranking of non-coding pathogenic variants and putative essential regions of the human genome. Nat. Commun. 10, 5241 (2019).
van Arensbergen, J. et al. High-throughput identification of human SNPs affecting regulatory element activity. Nat. Genet. 51, 1160–1169 (2019).
Yao, D. W., O’Connor, L. J., Price, A. L. & Gusev, A. Quantifying genetic effects on disease mediated by assayed gene expression levels. Nat. Genet. 52, 626–633 (2020).
Kanai, M. et al. Genetic analysis of quantitative traits in the Japanese population links cell types to complex human diseases. Nat. Genet. 50, 390–400 (2018).
Weeks, E. M. et al. Leveraging polygenic enrichments of gene features to predict genes underlying complex traits and diseases. Preprint at medRxiv https://doi.org/10.1101/2020.09.08.20190561 (2020).
Chen, H. et al. PU.1 (Spi-1) autoregulates its expression in myeloid cells. Oncogene 11, 1549–1560 (1995).
Burda, P., Laslo, P. & Stopka, T. The role of PU.1 and GATA-1 transcription factors during normal and leukemogenic hematopoiesis. Leukemia 24, 1249–1257 (2010).
Takahashi, H., Kato, S., Murata, M. & Carninci, P. CAGE- cap analysis gene expression: a protocol for the detection of promoter and transcriptional networks. Methods Mol. Biol. 786, 181–200 (2012).
Inoue, F. et al. A systematic comparison reveals substantial differences in chromosomal versus episomal encoding of enhancer activity. Genome Res. 27, 38–52 (2017).
LaPierre, N. et al. Identifying causal variants by fine mapping across multiple studies. Preprint at bioRxiv https://doi.org/10.1101/2020.01.15.908517 (2020).
Hutchinson, A., Watson, H. & Wallace, C. Improving the coverage of credible sets in Bayesian genetic fine-mapping. PLOS Computational Biol. 16, e1007829 (2020).
Kempfer, R. & Pombo, A. Methods for mapping 3D chromosome architecture. Nat. Rev. Genet. 21, 207–226 (2020).
Fudenberg, G., Kelley, D. R. & Pollard, K. S. Predicting 3D genome folding from DNA sequence with Akita. Nat. Methods 17, 1111–1117 (2020).
Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Iulio, Jdi et al. The human noncoding genome defined by genetic diversity. Nat. Genet. 50, 333 (2018).
Schoech, A. P. et al. Negative short-range genomic autocorrelation of causal effects on human complex traits. Preprint at bioRxiv https://doi.org/10.1101/2020.09.23.310748 (2020).
Hail Team. Hail 0.2. https://github.com/hail-is/hail(2020).
McLaren, W. et al. The ensembl variant effect predictor. Genome Biol. 17, 122 (2016).
Lambert, S. A. et al. The human transcription factors. Cell 172, 650–665 (2018).
Louppe, G. Understanding random forests: from theory to practice. Preprint at https://arxiv.org/abs/1407.7502 (2015).
Crooks, G. E., Hon, G., Chandonia, J.-M. & Brenner, S. E. WebLogo: a sequence logo generator. Genome Res. 14, 1188–1190 (2004).
Fornes, O. et al. JASPAR 2020: update of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 48, D87–D92 (2020).
Acknowledgements
We thank Yakir Reshef, Jesse Engreitz, Elle Weeks, and all the members of Finucane lab for useful conversations. H.K.F. was funded by NIH grant DP5 OD024582 and by Eric and Wendy Schmidt. Q.S.W. and M.K. were supported by the Nakajima Foundation Scholarship.
Author information
Authors and Affiliations
Consortia
Contributions
Q.S.W., D.G.M., and H.K.F. designed the study. Q.S.W., D.R.K., J.U., and S.S. analyzed the data. Q.S.W. and H.K.F. wrote the manuscript with input from all authors (D.R.K., J.U., M.K., S.S., R.C., C.A., N.C., Y.O., B.B.J., F.A., K.G.A., and D.G.M.).
Corresponding authors
Ethics declarations
Competing interests
D.G.M. is a founder with equity in Goldfinch Bio, and has received research support from AbbVie, Astellas, Biogen, BioMarin, Eisai, Merck, Pfizer, and Sanofi-Genzyme.
Additional information
Peer review information Nature Communications thanks Anshul Kundaje and the other, anonymous, reviewer for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Wang, Q.S., Kelley, D.R., Ulirsch, J. et al. Leveraging supervised learning for functionally informed fine-mapping of cis-eQTLs identifies an additional 20,913 putative causal eQTLs. Nat Commun 12, 3394 (2021). https://doi.org/10.1038/s41467-021-23134-8
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41467-021-23134-8
This article is cited by
-
Current genomic deep learning models display decreased performance in cell type-specific accessible regions
Genome Biology (2024)
-
Identifying genetic variants that influence the abundance of cell states in single-cell data
Nature Genetics (2024)
-
MESuSiE enables scalable and powerful multi-ancestry fine-mapping of causal variants in genome-wide association studies
Nature Genetics (2024)
-
Determinants of gastric cancer immune escape identified from non-coding immune-landscape quantitative trait loci
Nature Communications (2024)
-
Multimodal Omics Approaches to Aging and Age-Related Diseases
Phenomics (2024)
Comments
By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.