Main

Human genetics is one of the only forms of scientific evidence that can demonstrate the causal role of genes in human disease. It provides a crucial tool for identifying and prioritizing potential drug targets, providing insights into the expected effect (or lack thereof6) of pharmacological engagement, dose–response relationships7,8,9,10 and safety risks6,11,12,13. Nonetheless, many questions remain about the application of human genetics in drug discovery. Genome-wide association studies (GWASs) of common, complex traits, including many diseases, generally identify variants of small effect. This contributed to early scepticism of the value of GWASs14. Anecdotally, such variants can point to highly successful drug targets7,8,9, and yet, genetic support from GWASs is somewhat less predictive of drug target advancement than support from Mendelian diseases5,15.

In this paper we investigate several open questions regarding the use of genetic evidence for prioritizing drug discovery. We explore the characteristics of genetic associations that are more likely to differentiate successful from unsuccessful drug mechanisms, and examine how they differ across therapy areas and among discovery and development phases. We also investigate how close we may be to saturating the insights we can gain from genetic studies for drug discovery and how much of the genetically supported drug discovery space remains clinically unexplored.

To characterize the drug development pipeline, we filtered Citeline Pharmaprojects for monotherapy programmes added since 2000 annotated with a highest phase reached and assigned both a human gene target (usually the gene encoding the drug target protein) and an indication defined in Medical Subject Headings (MeSH) ontology. This resulted in 29,476 target–indication (T–I) pairs for analysis (Extended Data Fig. 1a). Multiple sources of human genetic associations totalled 81,939 unique gene–trait (G–T) pairs, with traits also mapped to MeSH terms. Intersection of these datasets yielded an overlap of 2,166 T–I and G–T pairs (7.3%) for which the indication and the trait MeSH terms had a similarity ≥0.8; we defined these T–I pairs as possessing genetic support (Extended Data Figs. 1b and 2a and Methods). The probability of having genetic support, or P(G), was higher for launched T–I pairs than those in historical or active clinical development (Fig. 1a). In each phase, P(G) was higher than previously reported5,15, owing, as expected15,16, more to new G–T discoveries than to changes in drug pipeline composition (Extended Data Fig. 3a–f). For ensuing analyses, we considered both historical and active programmes. We defined success at each phase as a T–I pair transitioning to the next development phase (for example, from phase I to II), and we also considered overall success—advancing from phase I to a launched drug. We defined relative success (RS) as the ratio of the probability of success, P(S), with genetic support to the probability of success without genetic support (Methods). We tested the sensitivity of RS to various characteristics of genetic evidence. RS was sensitive to the indication–trait similarity threshold (Extended Data Fig. 2a), which we set to 0.8 for all analyses herein. RS was >2 for all sources of human genetic evidence examined (Fig. 1b). 
RS was highest for Online Mendelian Inheritance in Man (OMIM) (RS = 3.7), in agreement with previous reports5,15; this was not the result of a higher success rate for orphan drug programmes (Extended Data Fig. 2b), a designation commonly acquired for rare diseases. Rather, it may owe partly to the difference in confidence in causal gene assignment between Mendelian conditions and GWASs, supported by the observation that the RS for Open Targets Genetics (OTG) associations was sensitive to the confidence in variant-to-gene mapping as reflected in the minimum share of locus-to-gene (L2G) score (Fig. 1c). The differences common and rare disease programmes face in regulatory and reimbursement environments4 and differing proportions of drug modalities9 probably contribute as well. OMIM and GWAS support were synergistic with one another (Supplementary Fig. 2b). Somatic evidence from IntOGen had an RS of 2.3 in oncology (Extended Data Fig. 2c), similar to GWASs, but analyses below are limited to germline genetic evidence unless otherwise noted.
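
The definition of genetic support used above (a T–I pair whose target gene has a G–T association with indication–trait similarity ≥0.8) can be illustrated with a minimal sketch. The function name, the input layout and the toy gene, indication and trait identifiers are all hypothetical and are not taken from the analysis code; only the 0.8 threshold comes from the text.

```python
from collections import defaultdict


def genetically_supported(ti_pairs, gt_pairs, similarity, threshold=0.8):
    """Return the subset of T-I pairs that have genetic support.

    ti_pairs:   iterable of (target_gene, indication_id)
    gt_pairs:   iterable of (gene, trait_id)
    similarity: dict mapping (indication_id, trait_id) -> score in [0, 1]
    """
    # Index the associated traits by gene so each target is looked up once.
    traits_by_gene = defaultdict(set)
    for gene, trait in gt_pairs:
        traits_by_gene[gene].add(trait)

    supported = set()
    for target, indication in ti_pairs:
        for trait in traits_by_gene.get(target, ()):
            # Missing similarity entries are treated as 0 (an assumption).
            if similarity.get((indication, trait), 0.0) >= threshold:
                supported.add((target, indication))
                break
    return supported


# Toy, made-up inputs (not real data).
toy_ti = [("GENE_A", "IND_1"), ("GENE_A", "IND_2"), ("GENE_B", "IND_1")]
toy_gt = [("GENE_A", "TRAIT_X"), ("GENE_C", "TRAIT_X")]
toy_sim = {("IND_1", "TRAIT_X"): 0.85, ("IND_2", "TRAIT_X"): 0.40}

toy_supported = genetically_supported(toy_ti, toy_gt, toy_sim)
toy_p_g = len(toy_supported) / len(toy_ti)  # share of T-I pairs with support
```

In this toy input only one of the three T–I pairs is supported, because the second pair falls below the similarity threshold and the third has no association for its target gene.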

Fig. 1: Impact of genetic evidence characteristics on RS.

a, Proportion of T–I pairs with genetic support, P(G), as a function of highest phase reached. n at right: denominator, number of T–I pairs per phase; numerator, number that are genetically supported. b, Sensitivity of phase I–launch RS to source of human genetic association. GWAS Catalog, Neale UKBB and FinnGen are subsets of OTG. n at right: denominator, number of T–I pairs with genetic support from each source; numerator, number of those launched. Note that RS is calculated from a 2 × 2 contingency table (Methods). Total n = 13,022 T–I pairs. c, Sensitivity of RS to L2G share threshold among OTG associations. Minimum L2G share threshold is varied from 0.1 to 1.0 in increments of 0.05 (labels); RS (y axis) is plotted against the number of clinical (phase I+) programmes with genetic support from OTG (x axis). d, Sensitivity of RS for OTG GWAS-supported T–I pairs to binned variables: (1) year that T–I pair first acquired human genetic support from GWASs, excluding replications and excluding T–I pairs otherwise supported by OMIM; (2) number of genes exhibiting genetic association to the same trait; (3) quartile of effect size (beta) for quantitative traits; (4) quartile of effect size (odds ratio, OR) for case/control traits standardized to be >1 (that is, 1/OR if <1); (5) order of magnitude of minor allele frequency bins. n at right as in b. Total n = 13,022 T–I pairs. e, Count of indications ever developed in Pharmaprojects (y axis) by the number of genes associated with traits similar to those indications (x axis). Throughout, error bars or shaded areas represent 95% CIs (Wilson for P(G) and Katz for RS) whereas centres represent point estimates. See Supplementary Fig. 1 for the same analyses restricted to drugs with a single known target.

As sample sizes grow ever larger with a corresponding increase in the number of unique G–T associations, some expect17 GWAS findings to become less useful for the purpose of drug target selection. We explored this in several ways. We investigated the year that genetic support for a T–I pair was first discovered, under the expectation that more common and larger effects are discovered earlier. Although there was a slightly higher RS for discoveries from 2007–2010 that was largely driven by early lipid and cardiovascular-related associations, the effect of year was overall non-significant (P = 0.46; Fig. 1d). Results were similar when replicate associations or OMIM discoveries were included (Extended Data Fig. 2d–f). We next divided up GWAS-supported drug programmes by the number of unique genes associated with each trait. RS nominally increased with the number of associated genes, by 0.048 per gene (P = 0.024; Fig. 1d). The reason is probably not that successful genetically supported programmes inspire other programmes, because most genetic support was discovered retrospectively (Extended Data Fig. 2g); the few examples of drug programmes prospectively motivated by genetic evidence were primarily for Mendelian diseases9. There were no statistically significant associations with estimated effect sizes (P = 0.90 and 0.57, for quantitative and binary traits, respectively; Fig. 1d and Extended Data Fig. 2h) or minor allele frequency (P = 0.26; Fig. 1d). That ever larger GWASs can continue to uncover support for successful targets is also illustrated by two recent large GWASs in type 2 diabetes (T2D)18,19 (Extended Data Fig. 4).

Previously5, we observed significant heterogeneity among therapy areas in the fraction of approved drug mechanisms with genetic support, but did not investigate the impact on probability of success5. Here, our estimates of RS from phase I to launch showed significant heterogeneity (P < 1.0 × 10−15), with nearly all therapy areas having estimates greater than 1; 11 of 17 were >2, and haematology, metabolic, respiratory and endocrine >3 (Fig. 2a–e). In most therapy areas, the impact of genetic evidence was most pronounced in phases II and III and least impactful in phase I, corresponding to capacity to demonstrate clinical efficacy in later development phases. Accordingly, therapy areas differed in P(G) and in whether P(G) increased throughout clinical development or only at launch (Extended Data Fig. 5); data source and other properties of genetic evidence including year of discovery and effect size also differed (Extended Data Fig. 6). We also found that genetic evidence differentiated likelihood to progress from preclinical to clinical development for metabolic diseases (RS = 1.38; 95% confidence interval (95% CI), 1.25 to 1.54), which may reflect preclinical models that are more predictive of clinical outcomes. P(G) by therapy area was correlated with P(S) (ρ = 0.59, P = 0.013) and with RS (ρ = 0.72, P = 0.0011; Extended Data Fig. 7), which led us to explore how the sheer quantity of genetic evidence available within therapy areas (Fig. 2f and Extended Data Fig. 8a) may influence this. We found that therapy areas with more possible gene–indication (G–I) pairs supported by genetic evidence had significantly higher RS (ρ = 0.71, P = 0.0010; Fig. 2g), although respiratory and endocrine were notable outliers with high RS despite fewer associations.

Fig. 2: Differences in RS between therapy areas and the number and diversity of indications per target.

a–e, RS by therapy area and phase transitions: preclinical to phase I (a), phase I to II (b), phase II to III (c), phase III to launch (d) and phase I to launch (e). n at right: denominator, T–I pairs with genetic support; numerator, number of those that succeeded in the phase transition indicated at the top of the panel. For ‘all’, total n = 22,638 preclinical, 13,022 reaching at least phase I, 7,223 reaching at least phase II and 2,184 reaching at least phase III. Total n for each therapy area is provided in Supplementary Table 27. f, Cumulative number of possible genetically supported G–I pairs in each therapy area (y axis) as genetic discoveries have accrued over time (x axis). g, RS (y axis) by number of possible supported G–I pairs (x axis) across therapy areas, with dots coloured as in panels a–e and sized according to number of genetically supported T–I pairs in at least phase I. h, Number of launched indications versus similarity of those indications, by approved drug target. i, Proportion of launched T–I pairs with genetic support, P(G), binned by quintile of the number of launched indications per target (top panel) or by mean similarity among launched indications (bottom panel). Targets with exactly 1 launched indication (6.2% of launched T–I pairs) are considered to have mean similarity of 1.0. n at right: denominator, total number of launched T–I pairs in each bin; numerator, number of those with genetic support. j, RS (y axis) versus mean similarity among launched indications per target (x axis) by therapy area. k, RS (y axis) versus mean count of launched indications per target (x axis). Throughout, error bars or shaded areas represent 95% CIs (Wilson for P(G) and Katz for RS) whereas centres represent point estimates. See Supplementary Fig. 2 for the same analyses restricted to drugs with a single known target.

We hypothesized that genetic support might be most pronounced for drug mechanisms with disease-modifying effects, as opposed to those that manage symptoms, and that the proportions of such drugs differ by therapy area20,21. We were unable to find data with these descriptions available for a sufficient number of drug mechanisms to analyse, but we reasoned that targets of disease-modifying drugs are more likely to be specific to a disease, whereas targets of symptom-managing drugs are more likely to be applied across many indications. We therefore examined the number and diversity of all-time launched indications per target. Launched T–I pairs are heavily skewed towards a few targets (Fig. 2h). Of 450 launched targets, the 42 with ≥10 launched indications comprise 713 (39%) of 1,806 launched T–I pairs (Fig. 2h). Many of these are used across diverse indications for management of symptoms such as inflammatory and immune responses (NR3C1, IFNAR2), pain (PTGS2, OPRM1), mood (SLC6A4) or parasympathetic response (CHRM3). The count of launched indications was inversely correlated with the mean similarity of those indications (ρ = −0.72, P = 4.4 × 10−84; Fig. 2h). Among T–I pairs, the probability of having genetic support increased as the number of launched indications decreased (P = 6.3 × 10−7) and as the similarity of a target’s launched indications increased (P = 1.8 × 10−5; Fig. 2i). We observed a corresponding impact on RS, increasing in therapy areas for which the similarity among launched indications increased, and decreasing with increasing indications per target (ρ = 0.74, P = 0.0010, and ρ = −0.62, P = 0.0080, respectively; Fig. 2j,k).

Only 4.8% (284 of 5,968) of T–I pairs active in phases I–III possess human germline genetic support (Fig. 1a), similar to T–I pairs no longer in development (4.2%, 560 of 13,355), a difference that was not statistically significant (P = 0.080). We estimated (Methods) that only 1.1% of all genetically supported G–I relationships have been explored clinically (Fig. 3a), or 2.1% when restricting to the most similar indication. Given that the vast majority of proteins are classically ‘undruggable’, we explored the proportion of genetically supported G–I pairs that had been developed to at least phase I, as a function of therapy area across several classes of tractability and relevant protein families22 (Fig. 3a). Within therapy areas, oncology kinases with germline evidence were the most saturated: 109 of 250 (44%) of all genetically supported G–I pairs had reached at least phase I; GPCRs for psychiatric indications were also notable (14 of 53, 26%). Grouping by target rather than G–I pair, 3.6% of genetically supported targets have been pursued for any genetically supported indication (Extended Data Fig. 8). Of possible genetically supported G–I pairs, most (68%) arose from OTG associations, mostly in the past 5 years (Fig. 2f). Such low use is partly due to recent emergence of most genetic evidence (Extended Data Figs. 2f,g and 7a), as drug programmes prospectively supported by human genetics have had a mean lag time from genetic association of 13 years to first trial21 and 21 years to approval9. Because some types of targets may be more readily tractable by antagonists than agonists, we also grouped by target and examined human genetic evidence by direction of effect for tumour suppressors versus oncogenes (Fig. 3b), identifying a few substrata for which a majority of genetically supported targets had been pursued to at least phase I for at least one genetically supported indication. Oncogene kinases received the most attention, with 19 of 25 (76%) reaching phase I.

Fig. 3: Clinical investigation of drug mechanisms with genetic evidence.

a, Heatmap of proportion of genetically supported T–I pairs that have been developed to at least phase I, by therapy area (y axis) and gene list (x axis). b, As panel a, but for genetic support from IntOGen rather than germline sources and grouped by the direction of effect of the gene according to IntOGen (y axis), and also grouped by target rather than T–I pair. Thus, the denominator for each cell is the number of targets with at least one genetically supported indication, and each target counts towards the numerator if at least one genetically supported indication has reached phase I. c, Of targets that have reached phase I for any indication, and have at least one genetically supported indication, the mean count (x axis) of genetically supported (left) and unsupported (right) indications pursued, binned by the number of possible genetically supported indications (y axis). The centre is the mean and bars are Wilson 95% CIs. n = 1,147 targets. d, Proportion of D–I pairs with genetic support, P(G) (x axis), as a function of each D–I pair’s phase reached (inner y-axis grouping) and the drug’s highest phase reached for any indication (outer y-axis grouping). The centre is the exact proportion and bars are Wilson 95% CIs. The n is indicated at the right, for which the denominator is the total number of D–I pairs in each bin, and the numerator is the number of those that are genetically supported. See Supplementary Fig. 3 for the same analyses restricted to drugs with a single known target. Ab, antibody; SM, small molecule.

To focus on demonstrably druggable proteins, we further restricted the analysis to targets with both (1) any programme reaching phase I, and (2) ≥1 genetically supported indications. Of 1,147 qualifying targets, only 373 (33%) had been pursued for one or more supported indications (Fig. 3c), and most (307, 27%) of these targets were pursued for indications both with and without genetic support. Overall, an overwhelming majority of development effort has been for unsupported indications, at a 17:1 ratio. Within this subset of targets, we asked whether genetic support was predictive of which indications would advance the furthest. Grouping active and historical programmes by drug–indication (D–I) pair, we found that the odds of advancing to a later stage in the pipeline are 82% higher for indications with genetic support (P = 8.6 × 10−73; Fig. 3d).

Although there has been anecdotal support, such as the HMGCR example, to argue that genetic effect size may not matter in prioritizing drug targets, here we provide systematic evidence that small effect size, recent year of discovery, increasing number of genes identified or higher associated allele frequency do not diminish the value of GWAS evidence to differentiate clinical success rates. One probable reason is that genetic effect size on a phenotype rarely accounts for the magnitude of genetic effect on gene expression, protein function or some other molecular intermediate. In some circumstances, genetic effect sizes can yield insights into anticipated drug effects. This is best illustrated for cardiovascular disease therapies, for which genetic effects on cholesterol and disease risk and treatment outcomes are correlated23. A limitation is that, other than Genebass, we did not include whole exome or whole genome sequencing association studies, which may be more likely to pinpoint causal variants. Moreover, all of our analyses are naive to direction of genetic effect (gain versus loss of gene function) as this is unknown or unannotated in most datasets used here.

Our results argue for continuing investment to expand GWAS-like evidence, particularly for many complex diseases with treatment options that fail to modify disease. Although genetic evidence has value across most therapy areas, its benefit is more pronounced in some areas than others. Furthermore, it is possible that the therapy areas for which genetic evidence had a lower impact have seen more focus on symptom management. If so, we would predict that for drugs aimed at disease modification, human genetics should ultimately prove highly valuable across therapy areas.

The focus of this work has been on the RS of drug programmes with and without genetic evidence, limited to drug mechanisms that have entered clinical development. This metric does not address the probability that a gene associated with a disease, if targeted, will yield a successful drug. At the early stage of target selection, is evidence of a large loss-of-function effect in one gene usually a better choice than a small non-coding single nucleotide polymorphism (SNP) effect on the same phenotype in another? We explored this question for the T2D studies referenced above. When these GWASs quadrupled the number of T2D-associated genes from 217 to 862, new genetic support was identified for 7 of 95 mechanisms in clinical development whereas the number supported increased from 5 to 7 of 12 launched drug mechanisms. Thus, RS has remained high in light of new GWAS data. One can also, however, consider the proportion of genetic associations that are successful drug targets. Of the 7 targets of launched drugs with genetic evidence, 4 had Mendelian evidence (in addition to pre-2020 GWAS evidence), out of a total of 19 Mendelian genes related to T2D (21%). One launched T2D target had only GWAS (and no Mendelian) evidence among 217 GWAS-associated genes before 2020 (0.46%), whereas 2 launched targets were among 645 new GWAS associations since 2020 (0.31%). At least in this example, the ‘yield’ of genetic evidence for successful drug mechanisms was greatest for genes with Mendelian effects, but similar between earlier and later GWASs. Clearly, the fact that genetic associations differentiate clinical stage drug targets from launched ones does not mean that a large fraction of associations will be fruitful. Moreover, genetically supported targets may be more likely to require upregulation, to be druggable only by more challenging modalities4,9 or to enjoy narrower use across indications.
More work is required to better understand the challenges of target identification and prioritization given the genetic evidence precondition.
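
The ‘yield’ percentages quoted for the T2D example follow directly from the counts given in the text, as the short arithmetic sketch below shows. The function and variable names are illustrative only.

```python
def yield_percent(launched_targets, associated_genes):
    """Percentage of associated genes that are targets of launched drugs."""
    return 100 * launched_targets / associated_genes


# Counts as stated in the text for the T2D example.
genes_before_2020 = 217
genes_after_new_gwas = 862
new_gwas_genes = genes_after_new_gwas - genes_before_2020  # 645

mendelian_yield = yield_percent(4, 19)                   # Mendelian genes
early_gwas_yield = yield_percent(1, genes_before_2020)   # GWAS-only, pre-2020
late_gwas_yield = yield_percent(2, new_gwas_genes)       # new since 2020

for label, value in [
    ("Mendelian", mendelian_yield),
    ("GWAS before 2020", early_gwas_yield),
    ("GWAS since 2020", late_gwas_yield),
]:
    print(f"{label}: {value:.2f}%")
```

The two GWAS yields differ by a factor of about 1.5, whereas the Mendelian yield is roughly 45 to 70 times higher than either, which is the contrast the paragraph draws.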

The utility of human genetic evidence in drug discovery has had firm theoretical and empirical footing for several years5,7,15. If the benefit of this evidence were cancelled out by competitive crowding24, then currently active clinical phases should have higher rates of genetic support than their corresponding historical phases, and might look similar to, or even higher than, launched pairs. Instead, we find that active programmes possess genetic support only slightly more often than historical programmes and remain less enriched for genetic support than launched drugs. Meanwhile, only a tiny fraction of classically druggable genetically supported G–I pairs have been pursued even among targets with clinical development reported. Human genetics thus represents a growing opportunity for novel target selection and improving indication selection for existing drugs and drug candidates. Increasing emphasis on drug mechanisms with supporting genetic evidence is expected to increase success rates and lower the cost of drug discovery and development.

Methods

Definition of metrics

Except where otherwise noted, we define genetic support of a drug mechanism (that is, a T–I pair) as a genetic association mapped to the corresponding target gene for a trait that is ≥0.8 similar to the indication (see MeSH term similarity below). We define P(G) as the proportion of drug mechanisms satisfying the above definition of genetic support. P(S) is the proportion of programmes in one phase that advance to a subsequent phase (for instance, phase I to phase II). Overall P(S) from phase I to launched is the product of P(S) at each individual phase. RS is the ratio of P(S) for programmes with genetic support to P(S) for programmes lacking genetic support, which is equivalent to a relative risk or risk ratio. Thus, if N denotes the total number of programmes that have reached the reference phase, and X denotes the number of those that advance to a later phase of interest, and the subscripts G and !G indicate the presence or absence of genetic support, then $P(G) = N_G/(N_G + N_{!G})$; $P(S) = (X_G + X_{!G})/(N_G + N_{!G})$; $RS = (X_G/N_G)/(X_{!G}/N_{!G})$. RS from phase I to launched is the product of RS at each individual phase. The count of ‘programmes’ for X and N is T–I pairs throughout, except for Fig. 3d, which uses D–I pairs to specifically interrogate P(G) for which the same drug has been developed for different indications. For clarity, we note that whereas other recent studies22,25 have examined the fold enrichment and overlap between genes with human genetic support and genes encoding a drug target, without regard to similarity, herein all of our analyses are conditioned on the similarity between the drug’s indication and the genetically associated trait.
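
As an illustration of these definitions, the sketch below computes P(G), P(S) and RS for a single phase transition from the four counts, together with the standard Katz log-method interval for a risk ratio and the Wilson score interval for a proportion (the two interval types named in the figure legends). The class and function names and the toy counts are assumptions made for this example, and the sketch does not cover how intervals were obtained for the multi-phase product of RS.

```python
import math
from dataclasses import dataclass

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


@dataclass
class Counts:
    n_g: int      # N with genetic support that reached the reference phase
    x_g: int      # X with genetic support that advanced
    n_not_g: int  # N without genetic support that reached the reference phase
    x_not_g: int  # X without genetic support that advanced


def p_g(c: Counts) -> float:
    return c.n_g / (c.n_g + c.n_not_g)


def p_s(c: Counts) -> float:
    return (c.x_g + c.x_not_g) / (c.n_g + c.n_not_g)


def relative_success(c: Counts) -> float:
    return (c.x_g / c.n_g) / (c.x_not_g / c.n_not_g)


def katz_ci(c: Counts, z: float = Z_95) -> tuple:
    """Katz log-method interval for a risk ratio; all counts must be > 0."""
    se_log = math.sqrt(1 / c.x_g - 1 / c.n_g + 1 / c.x_not_g - 1 / c.n_not_g)
    rs = relative_success(c)
    return rs * math.exp(-z * se_log), rs * math.exp(z * se_log)


def wilson_ci(successes: int, n: int, z: float = Z_95) -> tuple:
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Toy, made-up counts: 20 of 100 supported and 100 of 1,000 unsupported advance.
toy = Counts(n_g=100, x_g=20, n_not_g=1000, x_not_g=100)
toy_rs = relative_success(toy)          # (20/100) / (100/1000)
toy_rs_ci = katz_ci(toy)
toy_pg_ci = wilson_ci(toy.n_g, toy.n_g + toy.n_not_g)
```

With these toy counts the success rate is 20% with support and 10% without, so RS is 2. Per-phase RS values computed this way could be multiplied (for example with `math.prod`) to give the phase I to launch RS described above.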

Drug development pipeline

Citeline Pharmaprojects26 is a curated database of drug development programmes including preclinical, all clinical phases and launched (approved and marketed) drugs. It was queried via API (22 December 2022) to obtain information on drugs, targets, indications, phases reached and current development status. The T–I pair was the unit of analysis throughout, except where otherwise indicated in the text (D–I pairs were examined in Fig. 3d). Current development status was defined as ‘active’ if the T–I pair had at least one drug still in active development, and ‘historical’ if development of all drugs for the T–I pair had ceased. Targets were defined as genes; as most drugs do not directly target DNA, this usually refers to the gene encoding the protein target that is bound or modulated by the drug. We removed combination therapies, diagnostic indications and programmes with no human target or no indication assigned. For most analyses, only programmes added to the database since 2000 were included, whereas for the count and similarity of launched indications per target, we used all launches for all time. Indications were considered to possess ‘genetic insight’ (meaning that the human genetics of this trait or similar traits have been successfully studied) if they had ≥0.8 similarity to (1) an OMIM or IntOGen disease, or (2) a GWAS trait with at least 3 independently associated loci, on the basis of lead SNP positions rounded to the nearest 1 megabase. For calculating RS, we used the number of T–I pairs with genetic insight as the denominator. The rationale for this choice is to focus on indications for which there exists the opportunity for human genetic evidence, consistent with the filter applied previously5. However, we observe that our findings are not especially sensitive to the presence of this filter, with RS decreasing by just 0.17 when the filter is removed (Extended Data Fig. 3g,h).
Note that the criteria for determining genetic insight are distinct from, and much looser than, the criteria for mapping GWAS hits to genes (see L2G scores under OTG below). Many drugs had more than one target assigned, in which case all targets were retained for T–I pair analyses. As a sensitivity test, running our analyses restricted to only drugs with exactly one target assigned yielded very similar results (Supplementary Figures).
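
The locus-counting rule behind the GWAS arm of the ‘genetic insight’ filter (at least 3 independently associated loci, with lead SNP positions rounded to the nearest 1 megabase) might be sketched as follows. The function names and input layout are hypothetical, and treating each distinct pair of chromosome and rounded position as one locus is an assumption about how the rounding was applied.

```python
def count_independent_loci(lead_snps):
    """Count distinct loci among lead SNPs.

    lead_snps: iterable of (chromosome, base_pair_position).
    Each position is rounded to the nearest megabase; lead SNPs sharing a
    chromosome and rounded position count as one locus. Python's round()
    resolves exact .5 ties to the nearest even integer.
    """
    return len({(chrom, round(pos / 1_000_000)) for chrom, pos in lead_snps})


def has_gwas_insight(lead_snps, min_loci=3):
    """True if a trait has at least `min_loci` distinct loci."""
    return count_independent_loci(lead_snps) >= min_loci


# Toy, made-up lead SNP positions for one trait.
toy_lead_snps = [
    ("1", 1_200_000),   # rounds to 1 Mb
    ("1", 1_400_000),   # also rounds to 1 Mb, so the same locus
    ("2", 5_600_000),   # rounds to 6 Mb
    ("3", 10_000_000),  # rounds to 10 Mb
]
toy_n_loci = count_independent_loci(toy_lead_snps)
```

In the toy input the first two lead SNPs collapse into a single locus, leaving three loci in total, which is just enough to meet the threshold.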

OMIM

OMIM is a curated database of Mendelian gene–disease associations. The OMIM Gene Map (downloaded 21 September 2023) contained 8,671 unique gene–phenotype links. We restricted to entries with phenotype mapping code 3 (‘the molecular basis for the disorder is known; a mutation has been found in the gene’), removed phenotypes with no MIM number or no gene symbol assigned, and removed duplicate combinations of gene MIM and phenotype MIM. We used regular expression matching to further filter out phenotypes containing the terms ‘somatic’, ‘susceptibility’ or ‘response’ (drug response associations) and those flagged as questionable (‘?’), or representing non-disease phenotypes (‘[’). A set of OMIM phenotypes are flagged as denoting susceptibility rather than causation (‘{’); this category includes low-penetrance or high allele frequency association assertions that we wished to exclude, but also germline heterozygous loss-of-function mutations in tumour suppressor genes, for which the underlying mechanism of disease initiation is loss of heterozygosity, which we wished to include. We therefore also filtered out phenotypes containing ‘{’ except for those that did contain the terms ‘cancer’, ‘neoplasm’, ‘tumor’ or ‘malignant’ and did not contain the term ‘somatic’. Remaining entries present in OMIM as of 2021 were further evaluated for validity by two curators, and gene–disease combinations for which a disease association was deemed not to have been established were excluded from all analyses. All of the above filters left 5,670 unique G–T links. MeSH terms for OMIM phenotypes were then mapped using the EFO OWL database using an approach previously described27, with further mappings from Orphanet, full text matches to the full MeSH vocabulary and, finally, manual curation, for a cumulative mapping rate of 93% (5,297 of 5,670). Because sometimes distinct phenotype MIM numbers mapped to the same MeSH term, this yielded 4,510 unique gene–MeSH links.
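
The phenotype-string filters described above could be expressed roughly as in the sketch below. The exact regular expressions, the case-insensitive matching and the example phenotype strings are assumptions made for illustration; only the listed terms and flag characters come from the text.

```python
import re

# Terms named in the text; case-insensitive matching is an assumption.
EXCLUDE_TERMS = re.compile(r"somatic|susceptibility|response", re.IGNORECASE)
CANCER_TERMS = re.compile(r"cancer|neoplasm|tumor|malignant", re.IGNORECASE)
SOMATIC = re.compile(r"somatic", re.IGNORECASE)


def keep_phenotype(phenotype: str) -> bool:
    """Apply the described string filters to one OMIM phenotype label."""
    if EXCLUDE_TERMS.search(phenotype):
        return False
    # '?' marks questionable entries and '[' marks non-disease phenotypes.
    if "?" in phenotype or "[" in phenotype:
        return False
    # '{' marks susceptibility; keep only cancer-related, non-somatic entries.
    if "{" in phenotype:
        return bool(CANCER_TERMS.search(phenotype)) and not SOMATIC.search(phenotype)
    return True


# Made-up labels written in the style of OMIM phenotype strings.
toy_labels = [
    "Cystic fibrosis",
    "?Example syndrome",
    "[Example blood group]",
    "{Breast-ovarian cancer, familial}",
    "{Asthma, protection against}",
    "{Malaria, susceptibility to}",
    "Colorectal cancer, somatic",
]
toy_kept = [label for label in toy_labels if keep_phenotype(label)]
```

Here the filters are applied conjunctively, so a brace-flagged cancer phenotype that also contains ‘susceptibility’ is still dropped by the first rule. That is one reading of the description, not necessarily how the original filtering resolved such cases.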

OTG

OTG is a database of GWAS hits from published studies and biobanks. OTG version 8 (12 October 2022) variant-to-disease, L2G, variant index and study index data were downloaded from EBI. Traits with multiple EFO IDs were excluded as these generally represent conditional, epistasis or other complex phenotypes that would lack mappings in the MeSH vocabulary. Of the top 100 traits with the greatest number of genes mapped, we excluded 76 as having no clear disease relevance (for example, ‘red cell distribution width’) or no obvious marginal value (for example, ‘trunk predicted mass’ was excluded because ‘body mass index’ was already included). Remaining traits were mapped to MeSH using the EFO OWL database, full text queries to the MeSH API, mappings already manually curated in PICCOLO (see below) or new manual curation. In total, 25,124 of 49,599 unique traits (51%) were successfully mapped to a MeSH ID. We included associations with P < 5 × 10−8. OTG L2G scores used for gene mapping are based on a machine learning model trained on gold standard causal genes28; inputs to that model include distance, functional annotations, expression quantitative trait loci (eQTLs) and chromatin interactions. Note that we do not use Mendelian randomization29 to map causal genes, and even gene mappings with high L2G scores are necessarily imperfect. OTG provides an L2G score for the triplet of each study or trait with each hit and each possible causal gene. We defined L2G share as the proportion of the total L2G score assigned to each gene among all potentially causal genes for that trait–hit combination. In sensitivity analyses we considered L2G share thresholds from 10% to 100% (Fig. 1c and Extended Data Fig. 3a), but main analyses used only genes with ≥50% L2G share (which are also the top-ranked genes for their respective associations).
OTG links were parsed to determine the source of each OTG data point: the EBI GWAS catalog30 (n = 136,503 hits with L2G share ≥0.5), Neale UK Biobank (http://www.nealelab.is/uk-biobank; n = 19,139), FinnGen R6 (ref. 31) (n = 2,338) or SAIGE (n = 1,229).
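
The L2G share defined above is each gene's L2G score divided by the summed L2G score of all candidate genes for the same trait–hit combination. A minimal sketch follows; the row layout, function names and toy scores are hypothetical and do not reflect the OTG file schema.

```python
from collections import defaultdict


def l2g_shares(rows):
    """Compute L2G share per (study, hit, gene).

    rows: iterable of (study_id, hit_id, gene, l2g_score).
    """
    rows = list(rows)  # the rows are traversed twice
    totals = defaultdict(float)
    for study, hit, _gene, score in rows:
        totals[(study, hit)] += score

    shares = {}
    for study, hit, gene, score in rows:
        total = totals[(study, hit)]
        shares[(study, hit, gene)] = score / total if total > 0 else 0.0
    return shares


def genes_passing(shares, min_share=0.5):
    """Keep (study, hit, gene) triplets whose share meets the threshold."""
    return {key for key, share in shares.items() if share >= min_share}


# Toy, made-up L2G scores for two trait-hit combinations.
toy_rows = [
    ("STUDY_1", "hit_1", "GENE_A", 0.6),
    ("STUDY_1", "hit_1", "GENE_B", 0.2),
    ("STUDY_1", "hit_1", "GENE_C", 0.2),
    ("STUDY_2", "hit_2", "GENE_D", 0.1),
    ("STUDY_2", "hit_2", "GENE_E", 0.3),
]
toy_shares = l2g_shares(toy_rows)
toy_mapped = genes_passing(toy_shares)
```

In the second toy locus GENE_E has a low absolute score (0.3) but a share of about 0.75, so it passes the ≥50% threshold. This shows that the share measures confidence relative to the other candidate genes at a locus and not absolute confidence.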

PICCOLO

PICCOLO32 is a database of GWAS hits with gene mapping based on tests for colocalization without full summary statistics by using Probabilistic Identification of Causal SNPs (PICS) and a reference dataset of SNP linkage disequilibrium values. As described32, gene mapping uses quantitative trait locus (QTL) data from GTEx (n = 7,162) and a variety of other published sources (n = 6,552). We included hits with GWAS P < 5 × 10−8, and with eQTL P < 1 × 10−5, and posterior probability H4 ≥ 0.9, as these thresholds were determined empirically32 to strongly predict colocalization results.

Genebass

Genebass33 is a database of genetic associations based on exome sequencing. Genebass data from 394,841 UK Biobank participants (the ‘500K’ release) were queried using Hail (19 October 2023). We used hits from four models: pLoF (predicted loss-of-function) or missense|LC (missense and low confidence LoF), each with sequence kernel association test (SKAT) or burden tests, filtering for P < 1 × 10−5. Because the traits in Genebass are from UK Biobank, which is included in OTG, we used the OTG MeSH mappings established above.

IntOGen

IntOGen is a database of enrichments of somatic genetic mutations within cancer types. We used the driver genes and cohort information tables (31 May 2023). IntOGen assigns each gene a mechanism in each tumour type; occasionally, a gene will be classified as a tumour suppressor in one type and an oncogene in another. We grouped by gene and assigned each gene its modal classification across cancers. MeSH mappings were curated manually.
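
A minimal Python sketch of the modal classification step follows. The record layout, gene names and mechanism labels are placeholders, and the tie-breaking rule (first mechanism encountered) is an assumption, because the text does not state how ties were resolved.

```python
from collections import Counter, defaultdict

def modal_mechanism(records):
    """Assign each gene its most frequent mechanism across tumour types.

    records: iterable of (gene, tumour_type, mechanism) tuples
    (a hypothetical layout, not the IntOGen table schema).
    """
    by_gene = defaultdict(Counter)
    for gene, _tumour_type, mechanism in records:
        by_gene[gene][mechanism] += 1
    # most_common(1) returns the first-encountered mechanism on ties (assumed rule)
    return {gene: counts.most_common(1)[0][0] for gene, counts in by_gene.items()}

# Invented records with placeholder names
records = [
    ("GENE_A", "tumour1", "oncogene"),
    ("GENE_A", "tumour2", "tumour suppressor"),
    ("GENE_A", "tumour3", "oncogene"),
    ("GENE_B", "tumour1", "tumour suppressor"),
]
modal = modal_mechanism(records)
```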

MeSH term similarity

MeSH terms in either Pharmaprojects or the genetic associations datasets that were Supplementary Concept Records (IDs beginning with ‘C’) were mapped to their respective preferred main headings (IDs beginning with ‘D’). A matrix of all possible combinations of drug indication MeSH IDs and genetic association MeSH IDs was constructed. MeSH term Lin and Resnik similarities were computed for each pair as described34,35. Similarities of −1, indicating infinite distance between two concepts, were assigned as 0. The two scores were regressed against each other across all term pairs, and the Resnik scores were adjusted by a multiplier such that both scores had a range from 0 to 1 and their regression had a slope of 1. The two scores were then averaged to obtain a combined similarity score. Similarity scores were successfully calculated for 1,006 of 1,013 (99.3%) unique MeSH terms for Pharmaprojects indications, corresponding to 99.67% of Pharmaprojects T–I pairs, and for 2,260 of 2,262 (99.9%) unique MeSH terms for genetic associations, corresponding to >99.9% of associations.
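
One possible reading of the score combination is sketched below in Python. Several details are assumptions rather than the published procedure: Lin is regressed on Resnik by ordinary least squares with an intercept, the fitted slope is used as the multiplier, and rescaled Resnik values are capped at 1 to keep the range within 0 to 1. All function names and example values are invented.

```python
def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x (with intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

def combined_similarity(lin, resnik):
    """Average Lin and rescaled Resnik similarities for paired term comparisons."""
    # A similarity of -1 marks infinite distance and is set to 0
    lin = [0.0 if v == -1 else v for v in lin]
    resnik = [0.0 if v == -1 else v for v in resnik]
    # Multiplying Resnik by the slope of Lin-on-Resnik makes the refitted slope 1
    multiplier = ols_slope(resnik, lin)
    # Capping at 1 is an assumption about how the 0-1 range was enforced
    scaled = [min(1.0, r * multiplier) for r in resnik]
    return [(l + s) / 2 for l, s in zip(lin, scaled)]

# Invented scores for four term pairs; the first pair has no shared ancestor
lin_scores = [-1, 0.2, 0.5, 1.0]
resnik_scores = [-1, 1.0, 2.5, 5.0]
combined = combined_similarity(lin_scores, resnik_scores)
```

In this toy input the Lin scores are exactly 0.2 times the Resnik scores, so the rescaled Resnik values coincide with Lin and the combined score equals both.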

Therapeutic areas

MeSH terms for Pharmaprojects indications were mapped onto 16 top-level headings under the Diseases [C] and Psychiatry and Psychology [F] branches of the MeSH tree (https://meshb.nlm.nih.gov/treeView), plus an ‘other’ category. The signs/symptoms area corresponds to C23 Pathological Conditions, Signs and Symptoms and contains entries such as inflammation and pain. Many MeSH terms map to more than one tree position; these multiple mappings were retained and counted towards each therapy area, with two exceptions: for terms mapped to oncology, we deleted their mappings to all other areas; and ‘other’ was used only for terms that mapped to no other areas.
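
The two exceptions to multiple counting can be expressed as a short Python sketch. The area labels and the function name are hypothetical; the real mapping works on MeSH tree numbers, which are not reproduced here.

```python
def assign_therapy_areas(mapped_areas, oncology="oncology", other="other"):
    """Resolve the therapy areas counted for one MeSH term.

    mapped_areas: therapy areas reached through the term's tree positions
    (labels are placeholders for illustration).
    """
    areas = set(mapped_areas)
    if oncology in areas:
        # Oncology is exclusive: mappings to all other areas are dropped
        return {oncology}
    if not areas:
        # 'other' is used only when the term maps to no other area
        return {other}
    # Otherwise every mapped area is retained and counted
    return areas

multi = assign_therapy_areas(["endocrine", "cardiovascular"])
onc = assign_therapy_areas(["oncology", "respiratory"])
none_mapped = assign_therapy_areas([])
```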

Analysis of T2D GWASs

We included 19 genes from OMIM linked to Mendelian forms of diabetes or syndromes with diabetic features. For Vujkovic et al.18, we considered as novel any genes with a novel nearest gene, novel coding variant or a novel lead SNP colocalized with an eQTL with H4 ≥ 0.9. Non-novel nearest genes, coding variants and colocalized lead SNPs were considered established variants. For Suzuki et al.19, we used the available L2G scores that OTG had assigned for the same lead SNPs in previously reported GWASs for other phenotypes, yielding mapped genes with L2G share >0.5 for 27% of loci. Genes were considered novel if absent from the Vujkovic analysis. Together, these approaches identified 217 established GWAS genes and 645 novel ones (469 from Vujkovic and 176 from Suzuki). We identified 347 unique drug targets in Pharmaprojects reported with a T2D or diabetes mellitus indication, including 25 approved. We reviewed the list of approved drugs and eliminated those for which there were questions around the relevance of the drug or target to T2D (AKR1B1, AR, DRD1, HMGCR, IGF1R, LPL, SLC5A1). Because Pharmaprojects ordinarily specifies the receptor as target for protein or peptide replacement therapies, we also remapped the minority of programmes for which the ligand, rather than receptor, had been listed as target (changing INS to INSR, GCG to GCGR). To assess the proportion of programmes with genetic support, we first grouped by drug and selected just one target, preferring the target with the earliest genetic support (OMIM, then established GWASs, then novel GWASs, then none). Next we grouped by target and selected its highest phase reached. Finally, we grouped by highest phase reached and counted the number of unique targets.
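
The three grouping steps used to count genetically supported T2D targets can be sketched in Python as follows. The phase labels, the input layout, the tie-breaking rule within a drug (first listed target) and all example names are assumptions for illustration.

```python
from collections import Counter

# Assumed phase labels, ordered from earliest to latest
PHASES = ["Preclinical", "Phase I", "Phase II", "Phase III", "Launched"]
# Support tiers in order of preference, as described in the text
TIERS = ["OMIM", "established GWAS", "novel GWAS", "none"]

def count_targets_by_phase(programmes, support):
    """programmes: list of (drug, target, highest_phase) tuples;
    support: dict mapping target -> support tier (missing means 'none')."""
    tier_rank = {tier: i for i, tier in enumerate(TIERS)}
    phase_rank = {phase: i for i, phase in enumerate(PHASES)}
    # Step 1: per drug, keep the one target with the earliest support tier
    best = {}
    for drug, target, phase in programmes:
        rank = tier_rank[support.get(target, "none")]
        if drug not in best or rank < best[drug][0]:
            best[drug] = (rank, target, phase)
    # Step 2: per target, keep the highest phase reached by any of its drugs
    top = {}
    for rank, target, phase in best.values():
        if target not in top or phase_rank[phase] > phase_rank[top[target][0]]:
            top[target] = (phase, rank)
    # Step 3: count unique targets per highest phase and support tier
    return Counter((phase, TIERS[rank]) for phase, rank in top.values())

# Invented programmes with placeholder names
programmes = [
    ("drugA", "T1", "Phase II"),
    ("drugA", "T2", "Phase II"),
    ("drugB", "T2", "Launched"),
    ("drugC", "T3", "Phase I"),
]
support = {"T1": "novel GWAS", "T2": "OMIM"}
counts = count_targets_by_phase(programmes, support)
```

Here drugA is assigned to T2 because OMIM support takes precedence over novel GWAS support, and T2 is then counted once at its highest phase.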

Universe of possible genetically supported G–I pairs

In all of our analyses, targets are defined as human gene symbols, but we use the term G–I pair to refer to possible genes that one might attempt to target with a drug, and T–I pair to refer to genes that are the targets of actual drug candidates in development. To enumerate the space of possible G–I pairs, we multiplied the n = 769 Pharmaprojects indications considered here by the ‘universe’ of n = 19,338 protein-coding genes, yielding a space of n = 14,870,922 possible G–I pairs. Of these, n = 101,954 (0.69%) qualify as having genetic support per our criteria. A total of 16,808 T–I pairs have reached at least phase I in an active or historical programme, of which 1,155 (6.9%) are genetically supported. This represents an enrichment compared with random chance (OR = 11.0, P < 1.0 × 10⁻¹⁵, Fisher’s exact test), but in absolute terms, only 1.1% of genetically supported G–I pairs have been pursued. A genetically supported G–I pair may be less likely to attract drug development interest if the indication already has many other potential targets, and/or if the indication is but the second-most similar to the gene’s associated trait. Removing associations with many GWAS hits and restricting to the single most similar indication left a space of 34,190 possible genetically supported G–I pairs, 719 (2.1%) of which had been pursued. This small percentage might yet be perceived to reflect competitive saturation, if the vast majority of indications are undevelopable and/or the vast majority of targets are undruggable. We therefore asked what proportion of genetically supported G–I pairs had been developed to at least phase I, as a function of therapy area cross-tabulated against Open Targets predicted tractability status or membership in canonically ‘druggable’ protein families, using families from ref. 22 as well as UniProt pkinfam for kinases36. We also grouped at the level of gene, rather than G–I pair (Extended Data Fig. 8).
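
The proportions quoted above follow directly from the stated counts, as the short Python check below shows. Variable names are illustrative. The odds ratio and P value are not recomputed, because the exact contingency table behind the Fisher's exact test is not given in this section.

```python
# Counts taken from the text
n_indications = 769
n_genes = 19_338
supported_pairs = 101_954          # genetically supported G-I pairs
pursued_pairs = 16_808             # T-I pairs reaching at least phase I
pursued_supported = 1_155          # pursued pairs with genetic support
restricted_space = 34_190          # supported pairs after the two restrictions
restricted_pursued = 719

# Size of the space of possible gene-indication pairs
space = n_indications * n_genes

pct_space_supported = 100 * supported_pairs / space
pct_pursued_supported = 100 * pursued_supported / pursued_pairs
pct_supported_pursued = 100 * pursued_supported / supported_pairs
pct_restricted_pursued = 100 * restricted_pursued / restricted_space

print(f"space = {space:,}")
print(f"supported share of space: {pct_space_supported:.2f}%")
print(f"supported share of pursued pairs: {pct_pursued_supported:.1f}%")
print(f"pursued share of supported pairs: {pct_supported_pursued:.1f}%")
print(f"pursued share after restriction: {pct_restricted_pursued:.1f}%")
```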

Druggability and protein families

Antibody and small molecule druggability status was taken from Open Targets37. For antibody tractability, Clinical Precedence, Predicted Tractable–High Confidence and Predicted Tractable–Medium to Low Confidence were included. For small molecules, Clinical Precedence, Discovery Precedence and Predicted Tractable were included. Protein families were from sources described previously22, plus the pkinfam kinase list from UniProt36. To make these lists non-overlapping, genes that were both kinases and also enzymes, ion channels or nuclear receptors were considered to be kinases only.

Statistics

Analyses were conducted in R 4.2.0. For binomial proportions P(G) and P(S), error bars are Wilson 95% CIs, except for P(S) for phase I–launch, for which the Wald method is used to compute the confidence intervals on the product of the individual probabilities of success at each phase. RS uses Katz 95% CIs, with the phase I–launch RS based on the number of programmes entering phase I and succeeding in phase III. Effects of continuous variables on probability of launch were assessed using logistic regression. Differences in RS between therapy areas were tested using the Cochran–Mantel–Haenszel chi-squared test (cmh.test from the R lawstat package, v.3.4). Pipeline progression of D–I pairs conditioned on the highest phase reached by a drug was modelled using an ordinal logit model (polr with Hess = TRUE from the R MASS package, v.7.3-56). Correlations across therapy areas were tested by weighted Pearson’s correlation (wtd.cor from the R weights package, v.1.0.4); to control for the amount of data available in each therapy area, the number of genetically supported T–I pairs having reached at least phase I was used as the weight. Enrichments of T–I pairs in the utilization analysis were tested using Fisher’s exact test. All statistical tests were two-sided.
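
For readers unfamiliar with the interval methods named above, a minimal Python sketch of the Wilson interval for a proportion and the Katz log-method interval for a ratio of proportions is given below. The original analyses were run in R; this is not the authors' code, the function names are invented, and the counts in the example are made up for illustration. The Wald interval for the phase I–launch product is not shown.

```python
import math

Z95 = 1.959963984540054  # two-sided 95% normal quantile

def wilson_ci(successes, n, z=Z95):
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

def relative_success(s_sup, n_sup, s_unsup, n_unsup, z=Z95):
    """RS = P(S | genetic support) / P(S | no support), with Katz log CI.

    s_*: programmes that succeeded; n_*: programmes that entered the phase.
    """
    rs = (s_sup / n_sup) / (s_unsup / n_unsup)
    # Standard error of log(RS) under the Katz log method
    se = math.sqrt(1 / s_sup - 1 / n_sup + 1 / s_unsup - 1 / n_unsup)
    return rs, rs * math.exp(-z * se), rs * math.exp(z * se)

# Invented counts: 5 of 10 for the proportion; 20/100 versus 10/100 for RS
p_lo, p_hi = wilson_ci(5, 10)
rs, rs_lo, rs_hi = relative_success(20, 100, 10, 100)
```

The Katz interval is symmetric on the log scale, so the product of its bounds equals the square of the point estimate.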

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.