Development of a human genetics-guided priority score for 19,365 genes and 399 drug indications

Abstract

Studies have shown that drug targets with human genetic support are more likely to succeed in clinical trials. Hence, a tool integrating genetic evidence to prioritize drug target genes is beneficial for drug discovery. We built a genetic priority score (GPS) by integrating eight genetic features with drug indications from the Open Targets and SIDER databases. The top 0.83%, 0.28% and 0.19% of the GPS conferred a 5.3-, 9.9- and 11.0-fold increased effect of having an indication, respectively. In addition, we observed that targets in the top 0.28% of the score were 1.7-, 3.7- and 8.8-fold more likely to advance from phase I to phases II, III and IV, respectively. Complementary to the GPS, we incorporated the direction of genetic effect and drug mechanism into a directional version of the score called the GPS with direction of effect. We applied our method to 19,365 protein-coding genes and 399 drug indications and made all results available through a web portal.
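As a rough illustration of how several genetic features can be combined into one score, the sketch below computes a weighted sum of binary feature indicators for a single gene-phenotype pair. The feature names and weights are hypothetical placeholders, not values from the paper; Extended Data Table 1 lists the Firth regression beta coefficients that the authors used.

```python
from typing import Dict

# Hypothetical weights for illustration only; these are NOT the paper's coefficients.
EXAMPLE_WEIGHTS: Dict[str, float] = {
    "feature_a": 0.9,
    "feature_b": 0.6,
    "feature_c": 0.3,
}


def priority_score(features: Dict[str, int], weights: Dict[str, float]) -> float:
    """Sum the weights of the features that are present (non-zero) for one gene-phenotype pair."""
    unknown = set(features) - set(weights)
    if unknown:
        raise KeyError(f"no weight defined for: {sorted(unknown)}")
    # Each feature is treated as a 0/1 indicator, so the score is the sum
    # of the weights of the features that are present.
    return sum(weights[name] for name, present in features.items() if present)


if __name__ == "__main__":
    pair = {"feature_a": 1, "feature_b": 0, "feature_c": 1}
    print(round(priority_score(pair, EXAMPLE_WEIGHTS), 2))
```

With weights of this kind, a pair supported by more features, or by more heavily weighted ones, receives a higher score.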

Fig. 1: Workflow to build the GPS for drug indications.
Fig. 2: Association of genetic features with drug indications in the Open Target dataset.
Fig. 3: Association of the GPS at increments of 0.3 with drug indication in the SIDER dataset.
Fig. 4: Simulation of a drug prioritization framework.
Fig. 5: Association of the GPS with drug indication by clinical trial phase.
Fig. 6: Association of the GPS-DOE with drug indication in the SIDER dataset.

Data availability

The GPS and GPS-DOE for 19,365 genes and 399 drug indications are publicly available at https://rstudio-connect.hpc.mssm.edu/geneticpriorityscore/ and https://doi.org/10.5281/zenodo.10044666. Public data used in this study mentioned in Methods are available via the listed URLs:

Open Target genetic evidence and clinical trial data (v22.06), https://platform.opentargets.org/downloads

SIDER 4.1, http://sideeffects.embl.de/download/

Drugbank (5.1.9), https://go.drugbank.com/releases/latest

ChEMBL (release 29), https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_29/

Supplementary Table 2 from ref. 34

Ensembl (release 105), http://ftp.ensembl.org/pub/release-105/gtf/homo_sapiens/

OMIM (accessed February 8, 2022), https://www.omim.org/downloads

HGMD Professional (accessed March 12, 2022), https://www.hgmd.cf.ac.uk/ac/index.php

Gene burden results from Genebass, gs://ukbb-exome-public/500k/results/results.mt

Single-variant association results from Genebass, gs://ukbb-exome-public/500k/results/variant_results.mt

Genebass (500K), gs://ukbb-exome-public/300k/results/variant_results.mt

GTEx Analysis V8, https://www.gtexportal.org/home/datasets

Pan-UK Biobank, https://pan.ukbb.broadinstitute.org/downloads/index.html

UCSC liftover chain file, https://hgdownload.cse.ucsc.edu/goldenpath/hg19/liftOver/

OE, https://storage.googleapis.com/gnomad-public/release/2.1.1/constraint/gnomad.v2.1.1.lof_metrics.by_gene.txt.bgz

Benjamin Neale’s lab GWAS summary statistics in the UK Biobank, http://www.nealelab.is/uk-biobank

SAIGE GWAS summary statistics in the UK Biobank, https://www.leelabsg.org/resources

ATC classification (version 2022AA, uploaded 9 August 2022), https://bioportal.bioontology.org/ontologies/ATC

UMLS (accessed 18 January 2022), https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html

HPO annotations (accessed 8 February 2022), http://purl.obolibrary.org/obo/hp/hpoa/phenotype.hpoa

HPO to phecode map, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5959723/ and Supplementary Table 12

Phecode Map 1.2b1 to ICD-10 (β), https://phewascatalog.org/phecodes_icd10

Phecode Map 1.2b1 to ICD-10-CM (β), https://phewascatalog.org/phecodes_icd10cm

Phecode definitions, https://phewascatalog.org/files/phecode_definitions1.2.csv.zip

EBISPOT OLS, https://github.com/EBISPOT/EFO-UKB-mappings. Source data are provided with this paper.

Code availability

All analysis code is available on Zenodo46 (https://doi.org/10.5281/zenodo.10044666).

References

  1. Plenge, R. M., Scolnick, E. M. & Altshuler, D. Validating therapeutic targets through human genetics. Nat. Rev. Drug Discov. 12, 581–594 (2013).

  2. Cook, D. et al. Lessons learned from the fate of AstraZeneca’s drug pipeline: a five-dimensional framework. Nat. Rev. Drug Discov. 13, 419–431 (2014).

  3. Dowden, H. & Munro, J. Trends in clinical success rates and therapeutic focus. Nat. Rev. Drug Discov. 18, 495–496 (2019).

  4. Nelson, M. R. et al. The support of human genetic evidence for approved drug indications. Nat. Genet. 47, 856–860 (2015).

  5. Ochoa, D. et al. Human genetics evidence supports two-thirds of the 2021 FDA-approved drugs. Nat. Rev. Drug Discov. 21, 551 (2022).

  6. King, E. A., Davis, J. W. & Degner, J. F. Are drug targets with genetic support twice as likely to be approved? Revised estimates of the impact of genetic support for drug mechanisms on the probability of drug approval. PLoS Genet. 15, e1008489 (2019).

  7. Ghoussaini, M., Nelson, M. R. & Dunham, I. Future prospects for human genetics and genomics in drug discovery. Curr. Opin. Struct. Biol. 80, 102568 (2023).

  8. Fang, H. et al. A genetics-led approach defines the drug target landscape of 30 immune-related traits. Nat. Genet. 51, 1082–1091 (2019).

  9. Duffy, A. et al. Tissue-specific genetic features inform prediction of drug side effects in clinical trials. Sci. Adv. 6, eabb6242 (2020).

  10. Nguyen, P. A., Born, D. A., Deaton, A. M., Nioi, P. & Ward, L. D. Phenotypes associated with genes encoding drug targets are predictive of clinical trial side effects. Nat. Commun. 10, 1579 (2019).

  11. Yao, J., Hurle, M. R., Nelson, M. R. & Agarwal, P. Predicting clinically promising therapeutic hypotheses using tensor factorization. BMC Bioinformatics 20, 69 (2019).

  12. Han, Y. et al. Empowering the discovery of novel target-disease associations via machine learning approaches in the open targets platform. BMC Bioinformatics 23, 232 (2022).

  13. Ochoa, D. et al. Open Targets Platform: supporting systematic drug-target identification and prioritisation. Nucleic Acids Res. 49, D1302–D1310 (2021).

  14. Landrum, M. J. et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 46, D1062–D1067 (2017).

  15. Cook, C. E. et al. The European Bioinformatics Institute in 2016: data growth and integration. Nucleic Acids Res. 44, D20–D26 (2015).

  16. Stenson, P. D. et al. The Human Gene Mutation Database (HGMD®): optimizing its use in a clinical diagnostic or research setting. Hum. Genet. 139, 1197–1207 (2020).

  17. Hamosh, A. et al. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 30, 52–55 (2002).

  18. Sveinbjornsson, G. et al. Weighting sequence variants based on their annotation increases power of whole-genome association studies. Nat. Genet. 48, 314–317 (2016).

  19. Karczewski, K. J. et al. Systematic single-variant and gene-based association testing of thousands of phenotypes in 394,841 UK Biobank exomes. Cell Genom. 2, 100168 (2022).

  20. Pan-UKB team. https://pan.ukbb.broadinstitute.org (2020).

  21. Mountjoy, E. et al. An open approach to systematically prioritize causal variants and genes at all published human GWAS trait-associated loci. Nat. Genet. 53, 1527–1533 (2021).

  22. Ferkingstad, E. et al. Large-scale integration of the plasma proteome with genetics and disease. Nat. Genet. 53, 1712–1721 (2021).

  23. Denny, J. C. et al. Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nat. Biotechnol. 31, 1102–1111 (2013).

  24. Kuhn, M. et al. The SIDER database of drugs and side effects. Nucleic Acids Res. 44, D1075–D1079 (2016).

  25. Hingorani, A. D. et al. Improving the odds of drug development success through human genomics: modelling study. Sci. Rep. 9, 18911 (2019).

  26. Stein, D. et al. Genome-wide prediction of pathogenic gain- and loss-of-function variants from ensemble learning of a diverse feature set. Preprint at bioRxiv https://doi.org/10.1101/2022.06.08.495288 (2022).

  27. Estrada, K. et al. Identifying therapeutic drug targets using bidirectional effect genes. Nat. Commun. 12, 2224 (2021).

  28. Chen, B. & Altman, R. B. Opportunities for developing therapies for rare genetic diseases: focus on gain-of-function and allostery. Orphanet J. Rare Dis. 12, 61 (2017).

  29. Kok, B. P. et al. Discovery of small-molecule enzyme activators by activity-based protein profiling. Nat. Chem. Biol. 16, 997–1005 (2020).

  30. Kobayashi, K. et al. Class B1 GPCR activation by an intracellular agonist. Nature 618, 1085–1093 (2023).

  31. Okuyama, R. Chronological analysis of first-in-class drugs approved from 2011 to 2022: their technological trend and origin. Pharmaceutics 15, 1794 (2023).

  32. Bodenreider, O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 32, D267–D270 (2004).

  33. Pendlington, Z. M. et al. EBISPOT/EFO-UKB-mappings. GitHub. https://github.com/EBISPOT/EFO-UKB-mappings (2019).

  34. Bento, A. P. et al. The ChEMBL bioactivity database: an update. Nucleic Acids Res. 42, D1083–D1090 (2014).

  35. Wishart, D. S. et al. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res. 46, D1074–D1082 (2018).

  36. Davies, M. et al. ChEMBL web services: streamlining access to drug discovery data and utilities. Nucleic Acids Res. 43, W612–W620 (2015).

  37. Gaulton, A. et al. The ChEMBL database in 2017. Nucleic Acids Res. 45, D945–D954 (2016).

  38. Santos, R. et al. A comprehensive map of molecular drug targets. Nat. Rev. Drug Discov. 16, 19–34 (2017).

  39. Cunningham, F. et al. Ensembl 2022. Nucleic Acids Res. 50, D988–D995 (2021).

  40. Köhler, S. et al. The Human Phenotype Ontology in 2021. Nucleic Acids Res. 49, D1207–D1217 (2021).

  41. Kuhn, R. M., Haussler, D. & Kent, W. J. The UCSC genome browser and associated tools. Brief. Bioinform. 14, 144–161 (2012).

  42. Aguet, F. et al. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330 (2020).

  43. Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).

  44. Stekhoven, D. J. & Bühlmann, P. MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics 28, 112–118 (2011).

  45. R Core Team. R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, 2022).

  46. Duffy, A. & Do, R. Development of a human genetics-guided priority score for 19,365 genes and 399 drug indications. Zenodo https://doi.org/10.5281/zenodo.10095684 (2023).

Acknowledgements

C.M.L. is supported by the National Institutes of Health (NIH) T32 Postdoctoral Research Award (5T32HL00782424). R.D. is supported by the National Institute of General Medical Sciences of the NIH (R35-GM124836) and the National Heart, Lung, and Blood Institute of the NIH (R01-HL139865 and R01-HL155915). M.V. is supported by the French National Research Agency (ANR; ANR-21-CE45-0023-01). Y.I. is funded by the Leducq Foundation (21CVD01) and by the Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai. D.S. is funded by the Helmsley Foundation Award (2209-05535). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author information

Contributions

A.D. and R.D. conceived and designed the study. A.D. and B.O.P. performed statistical analyses. A.D., B.O.P., D.S., M.V., J.K.P., I.S.F., K.G., H.M.V., R.C., C.M.L., A.S., Y.I., M.M., D.N.C., G.R., D.M.J. and R.D. provided administrative, technical and material support. A.D. and R.D. drafted the paper. R.D. supervised the study. These authors contributed equally in the acquisition and interpretation of data and/or critical revision of the paper. A.D. and R.D. had access to and verified all of the data in the study.

Corresponding author

Correspondence to Ron Do.

Ethics declarations

Competing interests

R.D. reported receiving grants from AstraZeneca, grants and nonfinancial support from Goldfinch Bio, being a scientific cofounder, consultant and equity holder for Pensieve Health and being a consultant for Variant Bio, all not related to this work. All other authors have declared no competing interest.

Peer review

Peer review information

Nature Genetics thanks Maya Ghoussaini, Robert Plenge, Yakov Tsepilov, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Association of genetic features with drug indications using Firth logistic regression in the Open Target dataset in all drugs and drugs with one gene target.

The Open Targets dataset was split into 80% training and 20% test sets in five-fold cross-validation. Firth logistic regression was run on the cross-validation training sets (n = 735,847 independent drug-gene-phenotype combinations) with drug indication as the outcome variable. The predictor variables were the eight human genetic features, 14 phecode categories, genetic constraint and the number of gene targets per drug, binarized into drugs with a single gene target and drugs with multiple gene targets. Shown is a forest plot of beta coefficients with 95% CIs from the eight human genetic features included in the models in five-fold cross-validation. Each cross-validated sample is color labeled; filled circles indicate a beta coefficient with a significant P-value, and error bars show the 95% CIs.
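The 80%/20% split in five-fold cross-validation can be sketched as follows. This is a minimal illustration of the general technique, not the authors' code: the function name, the shuffling and the seed are assumptions. Each fold serves once as the test set while the remaining four folds form the training set, so for the 919,809 combinations reported for the Open Target dataset a training set holds about four-fifths of the rows, in line with the n = 735,847 quoted above.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n: int, k: int = 5, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; each of the k folds is the test set exactly once."""
    indices = list(range(n))
    # Shuffle with a fixed seed (an assumption here) so the split is reproducible.
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    for train, test in k_fold_indices(10):
        print(len(train), len(test))
```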

Extended Data Fig. 2 Contribution of each genetic feature on the GPS.

Shown is a violin plot of the contribution of each genetic feature to the 919,809 genetic priority scores in the Open Target dataset for n = 231,066 gene-phenotype combinations. The plot compares the contribution of each feature to the GPSs across all features and all scores, binned according to the percentile of the score. The violin width represents the density of the genetic feature at each percentile, and the mean percentile for each feature is shown as a point. The y-axis records the sample size (n) and mean weight of each genetic feature from the five cross-validated samples and is ordered by increasing value of these weights. Many different genetic features contribute to the highest percentile-ranked GPSs.

Extended Data Fig. 3 Contribution of genetic features to the GPS at each 0.3 increment bin.

Each bar plot shows the contribution of the genetic features to the GPSs at 0.3 increment bins in the Open Target dataset. As the GPS increases, the number of features that contribute to each score increases. The x-axis of each bar plot shows the number of genetic features that contribute to each score, colored by each feature present. The y-axis shows the count for each feature.

Extended Data Fig. 4 Association of the GPS at increments of 0.3 with drug indication in the Open Target dataset.

The association of increasing GPSs with drug indications was investigated by binning the Open Target drug dataset (n = 919,809 independent drug-gene-phenotype combinations) into 0.3 increments of the GPS and comparing GPSs greater than or equal to each increment with GPSs equal to zero. For each bin, a logistic regression model was fitted with drug indication as the outcome variable and the GPS bin as the predictor variable, adjusting for phecode categories and the number of gene targets per drug, binarized into drugs with a single gene target and drugs with multiple gene targets, as covariates. ORs with 95% CIs are shown in the forest plot as circles and error bars.
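The threshold comparison described in this caption can be illustrated with a simplified calculation. The sketch below is an assumption-laden stand-in, not the authors' model: instead of a covariate-adjusted logistic regression, it computes an unadjusted odds ratio from a 2x2 table with a Wald (Woolf) 95% CI, comparing rows with a score at or above a threshold against rows with a score of zero. The function name and the toy data are hypothetical.

```python
import math
from typing import Optional, Sequence, Tuple


def odds_ratio_at_threshold(
    scores: Sequence[float],
    indicated: Sequence[bool],
    threshold: float,
    z: float = 1.96,
) -> Optional[Tuple[float, float, float]]:
    """Unadjusted OR (with Wald CI) for score >= threshold versus score == 0."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    a = b = c = d = 0  # a, b: exposed with/without indication; c, d: reference group
    for score, has_indication in zip(scores, indicated):
        if score >= threshold:
            if has_indication:
                a += 1
            else:
                b += 1
        elif score == 0:
            if has_indication:
                c += 1
            else:
                d += 1
        # Rows with 0 < score < threshold are excluded from this comparison.
    if 0 in (a, b, c, d):
        return None  # the odds ratio is undefined with an empty cell
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    # Toy data, not from the paper.
    scores = [0, 0, 0, 0, 0, 0, 0.3, 0.3, 0.6, 0.6, 0.6, 0.6]
    indicated = [True, True, False, False, False, False,
                 True, False, True, True, True, False]
    for threshold in (0.3, 0.6):
        print(threshold, odds_ratio_at_threshold(scores, indicated, threshold))
```

An adjusted analysis like the one in the caption would additionally need the phecode categories and the single-versus-multiple target indicator as covariates in a regression model.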

Extended Data Fig. 5 Association of the GPS at increments of 0.3 with drug indication in the Open Target and SIDER dataset in drugs with one gene target.

The association of increasing GPSs with drug indications was investigated by binning the Open Target drug dataset (n = 215,028 independent drug-gene-phenotype combinations) and the SIDER dataset (n = 66,792 independent drug-gene-phenotype combinations) into 0.3 increments of the GPS and comparing GPSs greater than or equal to each increment with GPSs equal to zero. For each bin, a logistic regression model was fitted with drug indication as the outcome variable and the GPS bin as the predictor variable, adjusting for phecode categories as covariates. We show the ORs with 95% CIs from the logistic regression model assessed in the Open Targets and SIDER datasets, restricted to drugs with one gene target. ORs with 95% CIs are shown in the forest plots as circles and error bars.

Extended Data Fig. 6 Schematic used to derive the desired direction of therapeutic modulation using direction of genetic effect.

An idealized framework that uses the direction of genetic effect to propose the direction of therapeutic modulation. Mutations that decrease gene function and thereby increase disease risk point to activator drugs, and mutations that increase gene function and increase disease risk point to inhibitor drugs.
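The two cases in this schematic can be written as a small lookup. The sketch below covers only the combinations stated in the caption and returns None for anything else; the function name and the string labels are assumptions made for illustration, not part of the published method.

```python
from typing import Optional


def desired_modulation(gene_function_effect: str, disease_risk_effect: str) -> Optional[str]:
    """Map a variant's effect on gene function and disease risk to a drug mechanism.

    Only the two cases described in the caption are covered:
      - decreased function + increased risk -> "activator"
      - increased function + increased risk -> "inhibitor"
    Any other combination returns None (not addressed in this sketch).
    """
    if disease_risk_effect != "increase":
        return None
    if gene_function_effect == "decrease":
        return "activator"
    if gene_function_effect == "increase":
        return "inhibitor"
    return None


if __name__ == "__main__":
    print(desired_modulation("decrease", "increase"))
    print(desired_modulation("increase", "increase"))
```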

Extended Data Fig. 7 Association of the GPS-DOE with drug indication in the Open Target dataset.

For GPS-DOE, the association of the increasing absolute values of the GPS-DOE with drug indications was investigated by binning the Open Target drug dataset (n = 839,752 independent drug-gene-phenotype combinations) into 0.3 increments of the GPS-DOE and comparing scores greater than or equal to each increment with scores equal to zero. For each bin, a logistic regression model was fitted with drug indication as the outcome variable and the GPS-DOE bin as the predictor variable, adjusting for phecode categories and the number of gene targets per drug, binarized into drugs with a single gene target and drugs with multiple gene targets, as covariates. ORs with 95% CIs are shown in the forest plot as circles and error bars. We repeated the associations for GPS-DOE restricted to LOF predictions and GOF predictions only.

Extended Data Fig. 8 Association of the GPS-DOE with drug indication by clinical trial phase.

a) Association results of the absolute values of the GPS-DOE in s.d. units with drug indication in the Open Target dataset (n = 839,752 independent drug-gene-phenotype combinations) by clinical trial phase are shown. The plot is a forest plot of ORs with 95% CIs, represented as circles and error bars, for each logistic regression model with the 14 phecode categories and the number of gene targets per drug, binarized into drugs with a single gene target and drugs with multiple gene targets, as covariates. On the y-axis, the number of unique drug indications for each clinical phase is recorded in red and the number of unique drugs in blue. We demonstrate that the GPS-DOE has a strong association with drug indications as the clinical trial phase advances from phase I to phase IV. b) Shown is the fold enrichment of drug indications with support from a high GPS-DOE using score thresholds of 0.9, 1.5 and 2.1 compared to those without genetic evidence in each targeted phase (for example, phase II, III or IV) divided by the total sum observed in phase I.

Extended Data Table 1 Firth regression beta coefficients used to create the GPS and GPS-DOE in SIDER and across all genes and phenotypes

Supplementary information

Supplementary Information

Supplementary Figs. 1–8 and Supplementary Tables 1–17.

Reporting Summary

Source data

Source Data Fig. 3

Statistical source data.

Source Data Fig. 6

Statistical source data.

Source Data Extended Data Fig. 4

Statistical source data.

Source Data Extended Data Fig. 5

Statistical source data.

Source Data Extended Data Fig. 7

Statistical source data.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Duffy, Á., Petrazzini, B.O., Stein, D. et al. Development of a human genetics-guided priority score for 19,365 genes and 399 drug indications. Nat Genet 56, 51–59 (2024). https://doi.org/10.1038/s41588-023-01609-2

  • DOI: https://doi.org/10.1038/s41588-023-01609-2
