Abstract
Studies have shown that drug targets with human genetic support are more likely to succeed in clinical trials. Hence, a tool integrating genetic evidence to prioritize drug target genes is beneficial for drug discovery. We built a genetic priority score (GPS) by integrating eight genetic features with drug indications from the Open Targets and SIDER databases. The top 0.83%, 0.28% and 0.19% of the GPS conferred a 5.3-, 9.9- and 11.0-fold increased effect of having an indication, respectively. In addition, we observed that targets in the top 0.28% of the score were 1.7-, 3.7- and 8.8-fold more likely to advance from phase I to phases II, III and IV, respectively. Complementary to the GPS, we incorporated the direction of genetic effect and drug mechanism into a directional version of the score called the GPS with direction of effect. We applied our method to 19,365 protein-coding genes and 399 drug indications and made all results available through a web portal.
Data availability
The GPS and GPS-DOE for 19,365 genes and 399 drug indications are publicly available at https://rstudio-connect.hpc.mssm.edu/geneticpriorityscore/ and https://doi.org/10.5281/zenodo.10044666. Public data used in this study and mentioned in Methods are available via the listed URLs:
Open Targets genetic evidence and clinical trial data (v22.06), https://platform.opentargets.org/downloads
SIDER 4.1, http://sideeffects.embl.de/download/
Drugbank (5.1.9), https://go.drugbank.com/releases/latest
ChEMBL (release 29), https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_29/
Supplementary Table 2 from ref. 34
Ensembl (release 105), http://ftp.ensembl.org/pub/release-105/gtf/homo_sapiens/
OMIM (accessed February 8, 2022), https://www.omim.org/downloads
HGMD Professional (accessed March 12, 2022), https://www.hgmd.cf.ac.uk/ac/index.php
Gene burden results from Genebass, gs://ukbb-exome-public/500k/results/results.mt
Single-variant association results from Genebass, gs://ukbb-exome-public/500k/results/variant_results.mt
Genebass (300K), gs://ukbb-exome-public/300k/results/variant_results.mt
GTEx Analysis V8, https://www.gtexportal.org/home/datasets
Pan-UK Biobank, https://pan.ukbb.broadinstitute.org/downloads/index.html
UCSC liftover chain file, https://hgdownload.cse.ucsc.edu/goldenpath/hg19/liftOver/
Benjamin Neale’s lab GWAS summary statistics in the UK Biobank, http://www.nealelab.is/uk-biobank
SAIGE GWAS summary statistics in the UK Biobank, https://www.leelabsg.org/resources
ATC classification (version 2022AA, uploaded 9 August 2022), https://bioportal.bioontology.org/ontologies/ATC
UMLS (accessed 18 January 2022), https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html
HPO annotations (accessed 8 February 2022), http://purl.obolibrary.org/obo/hp/hpoa/phenotype.hpoa
HPO to phecode map, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5959723/ and Supplementary Table 12
Phecode Map 1.2b1 to ICD-10 (β), https://phewascatalog.org/phecodes_icd10
Phecode Map 1.2b1 to ICD-10-CM (β), https://phewascatalog.org/phecodes_icd10cm
Phecode definitions, https://phewascatalog.org/files/phecode_definitions1.2.csv.zip
EBISPOT OLS, https://github.com/EBISPOT/EFO-UKB-mappings. Source data are provided with this paper.
Code availability
All analysis code is available on Zenodo (ref. 46; https://doi.org/10.5281/zenodo.10095684).
References
Plenge, R. M., Scolnick, E. M. & Altshuler, D. Validating therapeutic targets through human genetics. Nat. Rev. Drug Discov. 12, 581–594 (2013).
Cook, D. et al. Lessons learned from the fate of AstraZeneca’s drug pipeline: a five-dimensional framework. Nat. Rev. Drug Discov. 13, 419–431 (2014).
Dowden, H. & Munro, J. Trends in clinical success rates and therapeutic focus. Nat. Rev. Drug Discov. 18, 495–496 (2019).
Nelson, M. R. et al. The support of human genetic evidence for approved drug indications. Nat. Genet. 47, 856–860 (2015).
Ochoa, D. et al. Human genetics evidence supports two-thirds of the 2021 FDA-approved drugs. Nat. Rev. Drug Discov. 21, 551 (2022).
King, E. A., Davis, J. W. & Degner, J. F. Are drug targets with genetic support twice as likely to be approved? Revised estimates of the impact of genetic support for drug mechanisms on the probability of drug approval. PLoS Genet. 15, e1008489 (2019).
Ghoussaini, M., Nelson, M. R. & Dunham, I. Future prospects for human genetics and genomics in drug discovery. Curr. Opin. Struct. Biol. 80, 102568 (2023).
Fang, H. et al. A genetics-led approach defines the drug target landscape of 30 immune-related traits. Nat. Genet. 51, 1082–1091 (2019).
Duffy, A. et al. Tissue-specific genetic features inform prediction of drug side effects in clinical trials. Sci. Adv. 6, eabb6242 (2020).
Nguyen, P. A., Born, D. A., Deaton, A. M., Nioi, P. & Ward, L. D. Phenotypes associated with genes encoding drug targets are predictive of clinical trial side effects. Nat. Commun. 10, 1579 (2019).
Yao, J., Hurle, M. R., Nelson, M. R. & Agarwal, P. Predicting clinically promising therapeutic hypotheses using tensor factorization. BMC Bioinformatics 20, 69 (2019).
Han, Y. et al. Empowering the discovery of novel target-disease associations via machine learning approaches in the open targets platform. BMC Bioinformatics 23, 232 (2022).
Ochoa, D. et al. Open Targets Platform: supporting systematic drug-target identification and prioritisation. Nucleic Acids Res. 49, D1302–D1310 (2021).
Landrum, M. J. et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 46, D1062–D1067 (2017).
Cook, C. E. et al. The European Bioinformatics Institute in 2016: data growth and integration. Nucleic Acids Res. 44, D20–D26 (2015).
Stenson, P. D. et al. The Human Gene Mutation Database (HGMD®): optimizing its use in a clinical diagnostic or research setting. Hum. Genet. 139, 1197–1207 (2020).
Hamosh, A. et al. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 30, 52–55 (2002).
Sveinbjornsson, G. et al. Weighting sequence variants based on their annotation increases power of whole-genome association studies. Nat. Genet. 48, 314–317 (2016).
Karczewski, K. J. et al. Systematic single-variant and gene-based association testing of thousands of phenotypes in 394,841 UK Biobank exomes. Cell Genom. 2, 100168 (2022).
Pan-UKB team. https://pan.ukbb.broadinstitute.org (2020).
Mountjoy, E. et al. An open approach to systematically prioritize causal variants and genes at all published human GWAS trait-associated loci. Nat. Genet. 53, 1527–1533 (2021).
Ferkingstad, E. et al. Large-scale integration of the plasma proteome with genetics and disease. Nat. Genet. 53, 1712–1721 (2021).
Denny, J. C. et al. Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nat. Biotechnol. 31, 1102–1111 (2013).
Kuhn, M. et al. The SIDER database of drugs and side effects. Nucleic Acids Res. 44, D1075–D1079 (2016).
Hingorani, A. D. et al. Improving the odds of drug development success through human genomics: modelling study. Sci. Rep. 9, 18911 (2019).
Stein, D. et al. Genome-wide prediction of pathogenic gain- and loss-of-function variants from ensemble learning of a diverse feature set. Preprint at bioRxiv https://doi.org/10.1101/2022.06.08.495288 (2022).
Estrada, K. et al. Identifying therapeutic drug targets using bidirectional effect genes. Nat. Commun. 12, 2224 (2021).
Chen, B. & Altman, R. B. Opportunities for developing therapies for rare genetic diseases: focus on gain-of-function and allostery. Orphanet J. Rare Dis. 12, 61 (2017).
Kok, B. P. et al. Discovery of small-molecule enzyme activators by activity-based protein profiling. Nat. Chem. Biol. 16, 997–1005 (2020).
Kobayashi, K. et al. Class B1 GPCR activation by an intracellular agonist. Nature 618, 1085–1093 (2023).
Okuyama, R. Chronological analysis of first-in-class drugs approved from 2011 to 2022: their technological trend and origin. Pharmaceutics 15, 1794 (2023).
Bodenreider, O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 32, D267–D270 (2004).
Pendlington, Z. M. et al. EBISPOT/EFO-UKB-mappings. GitHub. https://github.com/EBISPOT/EFO-UKB-mappings (2019).
Bento, A. P. et al. The ChEMBL bioactivity database: an update. Nucleic Acids Res. 42, D1083–D1090 (2014).
Wishart, D. S. et al. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res. 46, D1074–D1082 (2018).
Davies, M. et al. ChEMBL web services: streamlining access to drug discovery data and utilities. Nucleic Acids Res. 43, W612–W620 (2015).
Gaulton, A. et al. The ChEMBL database in 2017. Nucleic Acids Res. 45, D945–D954 (2016).
Santos, R. et al. A comprehensive map of molecular drug targets. Nat. Rev. Drug Discov. 16, 19–34 (2017).
Cunningham, F. et al. Ensembl 2022. Nucleic Acids Res. 50, D988–D995 (2021).
Köhler, S. et al. The Human Phenotype Ontology in 2021. Nucleic Acids Res. 49, D1207–D1217 (2021).
Kuhn, R. M., Haussler, D. & Kent, W. J. The UCSC genome browser and associated tools. Brief. Bioinform. 14, 144–161 (2012).
Aguet, F. et al. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330 (2020).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Stekhoven, D. J. & Bühlmann, P. MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics 28, 112–118 (2011).
R Core Team. R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, 2022).
Duffy, A. & Do, R. Development of a human genetics-guided priority score for 19,365 genes and 399 drug indications. Zenodo https://doi.org/10.5281/zenodo.10095684 (2023).
Acknowledgements
C.M.L. is supported by the National Institutes of Health (NIH) T32 Postdoctoral Research Award (5T32HL00782424). R.D. is supported by the National Institute of General Medical Sciences of the NIH (R35-GM124836) and the National Heart, Lung, and Blood Institute of the NIH (R01-HL139865 and R01-HL155915). M.V. is supported by the French National Research Agency (ANR; ANR-21-CE45-0023-01). Y.I. is funded by the Leducq Foundation (21CVD01) and by the Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai. D.S. is funded by the Helmsley Foundation Award (2209-05535). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.
Author information
Contributions
A.D. and R.D. conceived and designed the study. A.D. and B.O.P. performed statistical analyses. A.D., B.O.P., D.S., M.V., J.K.P., I.S.F., K.G., H.M.V., R.C., C.M.L., A.S., Y.I., M.M., D.N.C., G.R., D.M.J. and R.D. provided administrative, technical and material support. A.D. and R.D. drafted the paper. R.D. supervised the study. These authors contributed equally in the acquisition and interpretation of data and/or critical revision of the paper. A.D. and R.D. had access to and verified all of the data in the study.
Ethics declarations
Competing interests
R.D. reported receiving grants from AstraZeneca, grants and nonfinancial support from Goldfinch Bio, being a scientific cofounder, consultant and equity holder for Pensieve Health and being a consultant for Variant Bio, all not related to this work. All other authors have declared no competing interest.
Peer review
Peer review information
Nature Genetics thanks Maya Ghoussaini, Robert Plenge, Yakov Tsepilov, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Association of genetic features with drug indications using Firth logistic regression in the Open Targets dataset for all drugs and for drugs with one gene target.
The Open Targets dataset was split into 80% training and 20% test sets in five-fold cross-validation. Firth logistic regression was run on the cross-validation training sets (n = 735,847 independent drug-gene-phenotype combinations) with drug indication as the outcome variable. The predictor variables were the eight human genetic features, 14 phecode categories, genetic constraint and the number of gene targets per drug, binarized into drugs with a single gene target and drugs with multiple gene targets. Shown is a forest plot of beta coefficients with 95% CIs for the eight human genetic features included in the models in five-fold cross-validation. Each cross-validated sample is color labeled, filled circles indicate a beta coefficient with a significant P-value and error bars show the 95% CIs.
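The 80%/20% split described in this caption can be illustrated with a minimal sketch. It assumes rows are assigned to folds by a simple random shuffle, which the caption does not state; the function name `five_fold_splits` and the toy row count are hypothetical, and the Firth regression itself is not shown.

```python
import random

def five_fold_splits(n_rows, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each test fold holds about 1/k of the rows.

    Assumption: rows are shuffled at random before being dealt into folds.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

# Toy example with 100 rows: every split is 80 training rows and 20 test rows.
splits = list(five_fold_splits(100))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

With k = 5, each training set holds four of the five folds, which is consistent with the reported training size of 735,847 being about 80% of the 919,809 combinations.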
Extended Data Fig. 2 Contribution of each genetic feature to the GPS.
Shown is a violin plot of the contribution of each genetic feature to the 919,809 genetic priority scores in the Open Targets dataset for n = 231,066 gene-phenotype combinations. The plot shows the contribution of each feature to the GPSs compared across all features and all scores, binned according to the percentile of the score. The violin width represents the density of the genetic feature at each percentile and the mean percentile for each feature is shown as a point. On the y-axis, the sample size (n) and mean weight of each genetic feature from the five cross-validated samples are recorded, and the features are ordered by increasing value of these weights. Many different genetic features contribute to the highest percentile-ranked GPSs.
Extended Data Fig. 3 Contribution of genetic features to the GPS at each 0.3 increment bin.
Each bar plot shows the contribution of the genetic features to the GPSs at 0.3 increment bins in the Open Targets dataset. As the GPS increases, the number of features that contribute to each score increases. The x-axis of each bar plot shows the number of genetic features that contribute to each score, colored by each feature present. The y-axis shows the count for each feature.
Extended Data Fig. 4 Association of the GPS at increments of 0.3 with drug indication in the Open Targets dataset.
The association of increasing GPSs with drug indications was investigated by binning the Open Targets drug dataset (n = 919,809 independent drug-gene-phenotype combinations) into 0.3 increments of the GPS and comparing GPSs greater than or equal to each increment with GPSs equal to zero. For each bin, a logistic regression model was fitted with drug indication as the outcome variable and the GPS bin as the predictor variable, adjusting for phecode categories and for the number of gene targets per drug, binarized into drugs with a single gene target and drugs with multiple gene targets. ORs with 95% CIs are shown in the forest plot as circles and error bars.
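The thresholded comparison in this caption (scores at or above each 0.3 increment versus scores equal to zero) can be sketched as follows. This is a simplified illustration, not the authors' model: it computes an unadjusted odds ratio from a 2x2 table with a Wald 95% CI, whereas the caption describes logistic regression adjusted for phecode categories and number of gene targets. The function name and the toy records are hypothetical.

```python
import math

def odds_ratio_at_threshold(records, threshold):
    """Unadjusted OR of having an indication for gps >= threshold versus gps == 0.

    records: iterable of (gps, has_indication) pairs; threshold must be > 0.
    Rows with 0 < gps < threshold are excluded from the comparison.
    Returns (odds_ratio, ci_low, ci_high) using a Wald 95% CI.
    """
    a = b = c = d = 0  # a, b: at/above threshold; c, d: score of zero
    for gps, indicated in records:
        if gps >= threshold:
            if indicated:
                a += 1
            else:
                b += 1
        elif gps == 0:
            if indicated:
                c += 1
            else:
                d += 1
    if 0 in (a, b, c, d):
        raise ValueError("empty cell in 2x2 table; OR is undefined")
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    ci_low = math.exp(math.log(odds_ratio) - 1.96 * se)
    ci_high = math.exp(math.log(odds_ratio) + 1.96 * se)
    return odds_ratio, ci_low, ci_high

# Toy data (invented for illustration): 100 rows with GPS 0.6, 500 rows with GPS 0.
toy_records = (
    [(0.6, True)] * 10 + [(0.6, False)] * 90
    + [(0.0, True)] * 10 + [(0.0, False)] * 490
)
toy_or, toy_low, toy_high = odds_ratio_at_threshold(toy_records, 0.3)
print(round(toy_or, 2), round(toy_low, 2), round(toy_high, 2))
```

Repeating the call for thresholds 0.3, 0.6, 0.9 and so on gives one estimate per bin, which is the layout of the forest plot.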
Extended Data Fig. 5 Association of the GPS at increments of 0.3 with drug indication in the Open Targets and SIDER datasets in drugs with one gene target.
The association of increasing GPSs with drug indications was investigated by binning the Open Targets drug dataset (n = 215,028 independent drug-gene-phenotype combinations) and the SIDER dataset (n = 66,792 independent drug-gene-phenotype combinations) into 0.3 increments of the GPS and comparing GPSs greater than or equal to each increment with GPSs equal to zero. For each bin, a logistic regression model was fitted with drug indication as the outcome variable and the GPS bin as the predictor variable, adjusting for phecode categories as covariates. We show the ORs with 95% CIs from the logistic regression model assessed in the Open Targets and SIDER datasets, restricted to drugs with one gene target. ORs with 95% CIs are shown in the forest plots as circles and error bars.
Extended Data Fig. 6 Schematic used to derive the desired direction of therapeutic modulation using direction of genetic effect.
An idealized framework using the direction of genetic effect to propose the direction of therapeutic modulation. Mutations that decrease gene function and increase disease risk point to activator drugs, whereas mutations that increase gene function and increase disease risk point to inhibitor drugs.
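The two rules in this caption can be written as a small lookup, shown here as a minimal sketch. The labels `"LOF"`, `"GOF"`, `"activator"` and `"inhibitor"` and the function name are hypothetical choices for illustration. The caption only covers risk-increasing mutations, so the sketch returns `None` for anything else rather than guessing.

```python
# Mapping taken from the caption: risk-increasing mutations only.
RISK_INCREASING_RULES = {
    "LOF": "activator",  # decreased gene function raises risk, so restore function
    "GOF": "inhibitor",  # increased gene function raises risk, so reduce function
}

def desired_modulation(functional_effect, increases_risk):
    """Return the proposed drug mechanism, or None when the caption gives no rule.

    functional_effect: "LOF" (loss of function) or "GOF" (gain of function).
    increases_risk: True if the mutation increases disease risk.
    """
    if not increases_risk:
        return None  # protective mutations are not described in the caption
    return RISK_INCREASING_RULES.get(functional_effect)

print(desired_modulation("LOF", True))
print(desired_modulation("GOF", True))
```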
Extended Data Fig. 7 Association of the GPS-DOE with drug indication in the Open Targets dataset.
The association of increasing absolute values of the GPS-DOE with drug indications was investigated by binning the Open Targets drug dataset (n = 839,752 independent drug-gene-phenotype combinations) into 0.3 increments of the GPS-DOE and comparing scores greater than or equal to each increment with scores equal to zero. For each bin, a logistic regression model was fitted with drug indication as the outcome variable and the GPS-DOE bin as the predictor variable, adjusting for phecode categories and for the number of gene targets per drug, binarized into drugs with a single gene target and drugs with multiple gene targets. ORs with 95% CIs are shown in the forest plot as circles and error bars. We repeated the associations for the GPS-DOE restricted to LOF predictions only and to GOF predictions only.
Extended Data Fig. 8 Association of the GPS-DOE with drug indication by clinical trial phase.
a) Association results of the absolute values of the GPS-DOE in s.d. units with drug indication in the Open Targets dataset (n = 839,752 independent drug-gene-phenotype combinations) by clinical trial phase are shown. The forest plot shows ORs with 95% CIs as circles and error bars for each logistic regression model, with the 14 phecode categories and the number of gene targets per drug (binarized into drugs with a single gene target and drugs with multiple gene targets) as covariates. On the y-axis, the number of unique drug indications for each clinical phase is recorded in red and the number of unique drugs in blue. The association of the GPS-DOE with drug indications strengthens as the clinical trial phase advances from phase I to phase IV. b) Shown is the fold enrichment of drug indications with support from a high GPS-DOE, using score thresholds of 0.9, 1.5 and 2.1, compared to those without genetic evidence in each targeted phase (for example, phase II, III or IV) divided by the total sum observed in phase I.
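One way to read the fold enrichment in panel b is as a ratio of two progression proportions, each computed as the count in the targeted phase divided by the count in phase I. The sketch below follows that reading; it is an interpretation of the caption, not a confirmed reproduction of the authors' calculation, and the function name and all counts are invented for illustration.

```python
def phase_fold_enrichment(supported, unsupported, phase):
    """Ratio of progression proportions: with high-score support versus without.

    supported / unsupported: dicts mapping phase label ("I".."IV") to a count
    of drug indications. Assumes the phase I counts are non-zero and the
    unsupported count in the targeted phase is non-zero.
    """
    p_supported = supported[phase] / supported["I"]
    p_unsupported = unsupported[phase] / unsupported["I"]
    return p_supported / p_unsupported

# Invented toy counts, not data from the study.
toy_supported = {"I": 100, "II": 60, "III": 30, "IV": 20}
toy_unsupported = {"I": 1000, "II": 400, "III": 100, "IV": 50}

for toy_phase in ("II", "III", "IV"):
    print(toy_phase, round(phase_fold_enrichment(toy_supported, toy_unsupported, toy_phase), 2))
```

Repeating the calculation with the supported set defined at each score threshold (0.9, 1.5 and 2.1) would give one enrichment value per threshold and phase.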
Supplementary information
Supplementary Information
Supplementary Figs. 1–8 and Supplementary Tables 1–17.
Source data
Source Data Fig. 3
Statistical source data.
Source Data Fig. 6
Statistical source data.
Source Data Extended Data Fig. 4
Statistical source data.
Source Data Extended Data Fig. 5
Statistical source data.
Source Data Extended Data Fig. 7
Statistical source data.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Duffy, Á., Petrazzini, B.O., Stein, D. et al. Development of a human genetics-guided priority score for 19,365 genes and 399 drug indications. Nat Genet 56, 51–59 (2024). https://doi.org/10.1038/s41588-023-01609-2
DOI: https://doi.org/10.1038/s41588-023-01609-2