Abstract
Genome-wide association studies (GWASs) are a valuable tool for understanding the biology of complex human traits and diseases, but associated variants rarely point directly to causal genes. In the present study, we introduce a new method, polygenic priority score (PoPS), that learns trait-relevant gene features, such as cell-type-specific expression, to prioritize genes at GWAS loci. Using a large evaluation set of genes with fine-mapped coding variants, we show that PoPS and the closest gene individually outperform other gene prioritization methods, but observe the best overall performance by combining PoPS with orthogonal methods. Using this combined approach, we prioritize 10,642 unique gene–trait pairs across 113 complex traits and diseases with high precision, not only finding well-established gene–trait relationships but also nominating new genes at unresolved loci, such as LGR4 for estimated glomerular filtration rate and CCR7 for deep vein thrombosis. Overall, we demonstrate that PoPS provides a powerful addition to the gene prioritization toolbox.
Data availability
A repository of processed gene features, visualizations of top derived features and code to reproduce these analyses are available on GitHub at https://github.com/FinucaneLab/gene_features. Complete PoPS results for 95 complex traits in the UK Biobank and 18 additional disease traits, as well as results for PoPS and locus-based methods in genome-wide significant loci, are available at https://www.finucanelab.org/data.
Code availability
PoPS is available as an open-source Python package at https://github.com/FinucaneLab/pops. A static version of the PoPS method used in the present study is available at https://doi.org/10.5281/zenodo.8002379.
Acknowledgements
We thank K. Aragam, A. Butterworth, M. Daly, N. Artomov, Y. Reshef and all members of the Finucane lab for helpful discussions. This research was conducted using the UK Biobank Resource under project 31063. H.K.F. was funded by a National Institutes of Health (NIH) grant (no. DP5 OD024582) and by Eric and Wendy Schmidt. J.M.E. was supported by a Pathway to Independence Award (grant nos. K99HG00917 and R00HG009917), the Harvard Society of Fellows and the Base Research Initiative at Stanford University. J.M. and J.N.H. were supported by an NIH grant (no. R01DK075787). R.S.F. was supported by National Human Genome Research Institute, NIH (grant no. F31HG009850). J.O.-M. was supported by the Richard and Susan Smith Family Foundation, the HHMI Damon Runyon Cancer Research Foundation Fellowship (no. DRG-2274-16), the AGA Research Foundation’s AGA-Takeda Pharmaceuticals Research Scholar Award in Inflammatory Bowel Disease (grant no. AGA2020-13-01), the HDDC Pilot and Feasibility (grant no. P30 DK034854) and the Food Allergy Science Initiative.
Author information
Contributions
E.M.W. and H.K.F. conceived of the study. E.M.W., J.C.U., N.Y.C. and H.K.F. designed the research, performed the experiments, analyzed the data and interpreted the results. B.L.T. and R.S.F. designed and performed the enrichment-based validations. J.M., T.A.P., M.K., J.N., C.P.F., K.C.T., F.A., T.L., J.O.-M., C.S.S., M.B., A.K.S., A.N.A., R.J.X., A.R., R.M.G., K.L., K.G.A., J.N.H. and J.M.E. provided data or analysis tools used by PoPS or other gene prioritization methods. E.S.L. helped advise the project. E.M.W., J.C.U. and H.K.F. wrote the manuscript with input from all authors. H.K.F. supervised the project.
Ethics declarations
Competing interests
J.C.U. reports compensation from consulting services with Goldfinch Bio and is an employee of Illumina. R.S.F. is an employee of Vertex Pharmaceuticals Incorporated. C.P.F. is an employee of Bristol Myers Squibb. J.O.-M. reports compensation for consulting services with Cellarity. A.R. is a cofounder and equity holder of Celsius Therapeutics and an equity holder in Immunitas, and was an SAB member of Thermo Fisher Scientific, Syros Pharmaceuticals, Neogene Therapeutics and Asimov until 31 July 2020. From 1 August 2020, A.R. is an employee of Genentech. J.N.H. served on the Scientific Advisory Board of and consults for Camp4 Therapeutics. E.S.L. serves on the Board of Directors for Codiak BioSciences and Neon Therapeutics, and serves on the Scientific Advisory Board of F-Prime Capital Partners and Third Rock Ventures; he is also affiliated with several nonprofit organizations including serving on the Board of Directors of the Innocence Project, Count Me In and Biden Cancer Initiative, and the Board of Trustees for the Parker Institute for Cancer Immunotherapy. He has served and continues to serve on various federal advisory committees. The remaining authors declare no competing interests.
Peer review
Peer review information
Nature Genetics thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 PoPS model parameter choices and feature selection.
a-c, Results using Benchmarker to compare different parameter choices for fitting the PoPS model, meta-analyzed across independent traits (n = 46). Error bars represent 95% confidence intervals around the meta-analyzed point estimate. a, Feature selection: GLS with an L1 penalty on the full set of features performs less well than GLS after marginal selection using a P value < 0.05 threshold from the two-sided Wald test. b, Error model: ordinary least squares (OLS) performs less well than generalized least squares (GLS) using marginal selection from a. c, Joint model regularization: GLS after marginal feature selection with an L2 penalty performs better than similar models with an L1 penalty or no penalty. d, Number of features selected (marginal P value < 0.05 from the two-sided Wald test) and included in the joint predictive model for PoPS for each trait. A legend for trait domain colors is provided in Fig. 2.
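The marginal selection step described in this caption can be illustrated with a minimal sketch. It is not the PoPS implementation: it assumes ordinary least squares in place of GLS, uses a normal approximation for the two-sided Wald test, and omits the joint L2-penalized fit entirely. The function names `marginal_wald_p` and `select_features` and the toy data are hypothetical.

```python
import math

def marginal_wald_p(x, y):
    """Two-sided Wald P value for the slope in y ~ intercept + x.

    Assumption: OLS with a normal approximation, not the GLS error
    model described in the caption.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    # Residual sum of squares of the one-feature fit (clamped at zero).
    rss = max(syy - slope * sxy, 0.0)
    se = math.sqrt(rss / (n - 2) / sxx)
    if se == 0.0:
        return 0.0 if slope != 0.0 else 1.0
    z = slope / se
    return math.erfc(abs(z) / math.sqrt(2))

def select_features(feature_columns, y, alpha=0.05):
    """Indices of features whose marginal P value is below alpha."""
    return [j for j, col in enumerate(feature_columns)
            if marginal_wald_p(col, y) < alpha]

# Toy data: feature 0 drives the outcome, feature 1 is unrelated.
x0 = [float(i) for i in range(10)]
x1 = [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0]
noise = [0.1 if i % 2 == 0 else -0.1 for i in range(10)]
y = [2.0 * a + e for a, e in zip(x0, noise)]

selected = select_features([x0, x1], y)
print(selected)
```

In this toy setting only the first feature passes the P < 0.05 filter; the selected features would then enter a joint regularized model, which is not shown here.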
Extended Data Fig. 2 Additional comparisons using closest gene metric.
a, Results using closest gene enrichment to compare similarity-based gene prioritization methods, meta-analyzed within each trait domain across independent traits (n = 46). Error bars represent 95% confidence intervals around the meta-analyzed point estimate. b, Results using closest gene enrichment to compare PoPS results using different feature sets, meta-analyzed within each trait domain across independent traits (n = 46). Error bars represent 95% confidence intervals around the meta-analyzed point estimate.
Extended Data Fig. 3 Comparison of gene expression features derived from bulk and single-cell RNA seq datasets.
a, Results using Benchmarker to compare PoPS results using different feature sets, meta-analyzed within each trait domain across independent traits (n = 46). Error bars represent 95% confidence intervals around the meta-analyzed point estimate. b, Results using closest gene enrichment to compare PoPS results using different feature sets, meta-analyzed within each trait domain across independent traits (n = 46). Error bars represent 95% confidence intervals around the meta-analyzed point estimate.
Extended Data Fig. 4 Comparison of similarity-based methods using precision and recall.
Precision-recall plot showing performance of similarity-based methods.
Extended Data Fig. 5 Comparing prioritization criteria.
Precision-recall plots for each method with varying prioritization criteria. Each point shows the precision and recall for a set of prioritized genes selected using prioritization criteria based on absolute thresholds and/or relative rank in a locus. For all methods, the star represents the final chosen criteria. a, Circles: PoPS scores ranked ≤ 2–5 in the locus. Star: highest PoPS score in the locus. b, Plus: significant TWAS P value after Bonferroni correction (P < 0.05/235,584). Circles: TWAS P values ranked ≤ 2–5 in the locus. Star: significant TWAS P value after Bonferroni correction (P < 0.05/235,584) and the most significant in the locus. c, Pluses: CLPP > 0.01, 0.1, 0.5, 0.9, and 0.99. Circles: CLPP > 0.01, 0.1, 0.5, 0.9, and 0.99 and also the highest CLPP in the locus. Star: CLPP > 0.1 and also the highest CLPP in the locus. d, Plus: any predicted connection from ABC. Circles: ABC connection strength ranked ≤ 2–5 in the locus. Star: highest ABC connection strength in the locus. e, Pluses: any predicted connection from PCHi-C for individual datasets. Triangle: any predicted connection from PCHi-C in any dataset. Circles: highest connection strength in the locus for individual datasets. Star: highest connection strength in the locus in any dataset. f, Pluses: any predicted connection from E-P correlation for individual datasets. Triangle: any predicted connection from E-P correlation in any dataset. Circles: highest connection strength in the locus for individual datasets. Star: highest connection strength in the locus in any dataset. g, Circle: closest gene by distance to the transcription start site. Star: closest gene by distance to the gene body. h, Circles: MAGMA z-scores ranked ≤ 2–5 in the locus. Star: highest MAGMA score in the locus. i, Plus: significant SMR P value after Bonferroni correction (P < 0.05/18,383). Circles: SMR P values ranked ≤ 2–5 in the locus. Star: significant SMR P value after Bonferroni correction (P < 0.05/18,383) and the most significant in the locus.
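The two kinds of criteria in this caption, an absolute threshold and the top rank within a locus, can be sketched as follows. This is an illustrative sketch only: the function names, locus identifiers, gene names and scores are hypothetical, and only the thresholds (CLPP > 0.1, 0.05/235,584 and 0.05/18,383) come from the caption.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests

def top_gene_per_locus(scores_by_locus, min_score=None):
    """Pick the highest-scoring gene in each locus.

    If min_score is given, the top gene is kept only when its score
    exceeds that absolute threshold (for example CLPP > 0.1).
    """
    prioritized = {}
    for locus, gene_scores in scores_by_locus.items():
        if not gene_scores:
            continue
        gene, score = max(gene_scores.items(), key=lambda kv: kv[1])
        if min_score is None or score > min_score:
            prioritized[locus] = gene
    return prioritized

# Hypothetical scores for two loci.
pops_scores = {
    "locus1": {"GENE_A": 1.2, "GENE_B": 0.3},
    "locus2": {"GENE_C": -0.1, "GENE_D": 0.05},
}
clpp_scores = {
    "locus1": {"GENE_A": 0.6, "GENE_B": 0.2},
    "locus2": {"GENE_C": 0.05},
}

pops_top = top_gene_per_locus(pops_scores)                 # rank only
clpp_top = top_gene_per_locus(clpp_scores, min_score=0.1)  # threshold and rank
twas_threshold = bonferroni_threshold(0.05, 235584)
smr_threshold = bonferroni_threshold(0.05, 18383)
```

With a rank-only criterion every non-empty locus yields a gene, whereas adding an absolute threshold leaves some loci without a prioritized gene, which is the usual trade of recall for precision.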
Extended Data Fig. 6 Performance of PoPS and locus-based gene prioritization methods by trait.
Precision-recall plots for each method. Each point represents a single trait colored by trait domain. Only traits for which the method prioritized at least five genes in the validation loci were included. A legend for trait domain colors is provided in Fig. 2.
Extended Data Fig. 7 Additional performance metrics using evaluation gene set in 1,348 non-coding loci containing genes that harbor fine-mapped protein coding variants.
a, Sensitivity-specificity plot showing performance of locus-based methods, PoPS, intersections of pairs of locus-based methods, and intersections of PoPS with locus-based methods on the evaluation gene set of 589 genes with fine-mapped protein coding variants. b, Heatmap showing performance using the F-score of locus-based methods, PoPS, intersections of pairs of locus-based methods, and intersections of PoPS with locus-based methods.
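As a rough illustration of how an intersection of two methods can be scored, the sketch below computes precision, recall and an F-score over sets of (locus, gene) pairs. It assumes the F-score is the balanced F1 (the caption does not say which variant is used), and the gene sets are invented for the example.

```python
def precision_recall_f1(prioritized, truth):
    """Precision, recall and F1 for sets of (locus, gene) pairs."""
    true_positives = len(prioritized & truth)
    precision = true_positives / len(prioritized) if prioritized else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical prioritizations from two methods and an evaluation set.
method_1 = {("l1", "A"), ("l2", "B"), ("l3", "C"), ("l4", "D")}
method_2 = {("l1", "A"), ("l2", "X"), ("l3", "C"), ("l5", "E")}
truth = {("l1", "A"), ("l2", "B"), ("l3", "C"), ("l4", "Z"), ("l5", "E")}

single = precision_recall_f1(method_1, truth)
# Intersection: keep only pairs that both methods prioritize.
combined = precision_recall_f1(method_1 & method_2, truth)
```

In this toy case the intersection raises precision from 0.75 to 1.0 while recall drops from 0.6 to 0.4, the same direction of trade-off that motivates comparing intersections with individual methods.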
Extended Data Fig. 8 Number of prioritized genes for non-UK Biobank traits.
Number of unique gene-trait pairs prioritized by PoPS, locus-based gene prioritization methods, and their intersections, sorted by estimated precision. The full height of each bar represents the total number of genes prioritized. The opaque portion of each bar represents the expected number of true causal genes prioritized. Methods to the left of the dashed line achieve precision greater than 75%.
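The relationship between the full and opaque bar heights can be written as a one-line calculation: the expected number of true causal genes is the number prioritized multiplied by the estimated precision. The sketch below uses invented method names and numbers purely for illustration; only the 75% cutoff comes from the caption.

```python
def expected_true_genes(n_prioritized, precision):
    """Expected count of true causal genes among those prioritized."""
    return n_prioritized * precision

# Hypothetical (number prioritized, estimated precision) per method.
methods = {
    "method_A": (400, 0.80),
    "method_B": (1000, 0.50),
    "method_C": (150, 0.90),
}

# Sort by estimated precision, highest first, as in the figure.
ranked = sorted(methods.items(), key=lambda kv: kv[1][1], reverse=True)
high_precision = [name for name, (_, prec) in ranked if prec > 0.75]
expected = {name: expected_true_genes(n, prec)
            for name, (n, prec) in methods.items()}
```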
Extended Data Fig. 9 Known example RBM38.
Top: summary statistics colored by LD to the lead variant and fine-mapping results for variants in the locus colored by credible set. Bottom: results from PoPS and locus-based methods for all genes in the locus. Genes are colored by strength of prediction for each method with a star denoting the prioritized gene. Variant rs737092, RBM38 for mean corpuscular hemoglobin (MCH).
Extended Data Fig. 10 Sensitivity of precision and recall estimates to locus definition.
a, Loci defined as +/− 100 kb on either side of the lead variant. b, Loci defined as +/− 1 Mb on either side of the lead variant. c, Results restricted to loci in fine-mapped regions with three or fewer independent credible sets. d, Results restricted to loci in fine-mapped regions with five or fewer independent credible sets.
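A locus window of this kind can be sketched as a simple interval overlap test. The example assumes all genes lie on the same chromosome as the lead variant and treats a gene as belonging to the locus if any part of the gene body overlaps the window; the source does not state the exact overlap rule, and the gene names and coordinates are hypothetical.

```python
def genes_in_window(lead_pos, genes, half_width):
    """Names of genes whose body overlaps lead_pos +/- half_width.

    genes: list of (name, start, end) tuples on the lead variant's
    chromosome (assumption: base-pair coordinates, start <= end).
    """
    window_start = lead_pos - half_width
    window_end = lead_pos + half_width
    return [name for name, start, end in genes
            if start <= window_end and end >= window_start]

# Hypothetical lead variant and genes.
lead_variant_pos = 1_000_000
genes = [
    ("GENE_A", 950_000, 960_000),
    ("GENE_B", 1_300_000, 1_350_000),
    ("GENE_C", 2_500_000, 2_600_000),
]

narrow = genes_in_window(lead_variant_pos, genes, 100_000)   # +/- 100 kb
wide = genes_in_window(lead_variant_pos, genes, 1_000_000)   # +/- 1 Mb
```

Widening the window adds candidate genes to each locus, which is why precision and recall estimates are checked against more than one locus definition.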
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Weeks, E.M., Ulirsch, J.C., Cheng, N.Y. et al. Leveraging polygenic enrichments of gene features to predict genes underlying complex traits and diseases. Nat Genet 55, 1267–1276 (2023). https://doi.org/10.1038/s41588-023-01443-6
DOI: https://doi.org/10.1038/s41588-023-01443-6