Introduction

Numerous machine learning strategies have demonstrated the potential to accelerate drug discovery across various stages, such as drug repurposing, de novo drug design, clinical trial optimisation, and patient selection1,2. Target identification and prioritisation represent critical initial steps in the drug discovery process. Historically, only 10% of drug targets successfully progress through clinical trials3,4, and the success rate appears even lower for novel targets5,6. Traditionally, individual researchers have relied on their expertise to analyse scientific literature and prioritise potential opportunities. Recently, numerous efforts have employed machine learning techniques to automate the prioritisation of gene-disease associations7,8,9,10,11,12,13. However, most of these studies have focused on training models and validating them on pre-existing therapeutic targets, rather than adopting a time-sliced approach that trains on past data and evaluates the models' prospective predictions. Additionally, these approaches primarily rely on structured property databases as their main source of machine-interpretable data, which encompass only a limited portion of the knowledge found in the research literature. Paliwal et al.12 utilised a knowledge graph approach combined with link prediction methods to historically prioritise targets, claiming that their method successfully identified 1 in 4 therapeutic relationships that were eventually proven true.

Scientific publications contain valuable relationships and inferences that the research community can interpret, with PubMed housing over 35 million publications as of 2023. The volume of scientific literature will only continue to grow, creating a vast repository of undiscovered public knowledge. The biomedical domain has seen a surge of interest in pre-trained language models, owing to their remarkable success in the general natural language domain. The pre-trained models can be classified into two main categories: BERT and GPT, along with their respective variants. BERT covers models like BioBERT14 and PubMedBERT15, whereas GPT includes BioGPT16. These transformer language models have been pre-trained on biomedical literature. Typically, these models are evaluated and benchmarked on tasks such as entity recognition, relationship extraction, and question answering for known biomedical applications, following their training on the entirety of the biomedical literature. However, assessing their performance on genome-wide prioritisation of therapeutic targets presents a challenge, as these models often divide disease and gene synonyms into multiple tokens, complicating the process of consolidating them into a cohesive ranking.

In this study, we introduce a Publication-Wide Association Study (PWAS) to assess the ability of models trained independently on historical data to prioritise hypotheses that had not yet been stated, predictions that could be dismissed as false positives or ‘hallucinations’17 but are eventually validated. Additionally, we examine the potential of historical Word2Vec models to prioritise yet-unknown hypotheses by using future test sets, rather than test datasets of established associations that the language models have already seen. By training unsupervised language models on historical subsets of biomedical literature, we explore whether these models could have effectively prioritised or suggested novel drug discovery scientific hypotheses in the past. To this end, we evaluate the language models on a range of drug discovery-related tasks, including target classification, gene-disease prioritisation, and protein–protein interaction prediction, within the context of a historical, retrospective study.

Protein–protein interactions are also crucial in drug discovery, yet the incompleteness of the human interactome hinders our understanding of the molecular roots of human disease18. Drug targets with genetic evidence are twice as likely to succeed in clinical trials19,20. However, most immune-related targets lie in the protein–protein interaction vicinity of genetically associated genes rather than being genetically associated themselves21. Therefore, an understanding of systems biology is essential for proposing novel targets. Moreover, validating protein–protein interactions can be time-consuming and costly, but computational tools offer a cost-effective alternative to prioritise novel protein–protein interactions18 and suggest new therapeutic targets. Numerous methods have been proposed to predict protein–protein interactions, ranging from traditional machine learning22 to deep learning models23,24,25 that incorporate AlphaFold2 protein structure information. However, this work focuses on whether language models could have prioritised novel interactions with the limited data available in the past scientific literature.

In this work, we introduce a Publication-Wide Association Study (PWAS) that examines whether language models can retrospectively prioritise gene-disease associations before their initial statements, target-disease indications before the conception or synthesis of drug-like molecules, and protein–protein interactions before their discovery.

Results

Target-disease prioritisation

We trained multiple Word2Vec26 language models on an English corpus of 19.5 million biomedical titles and abstracts to generate embeddings for a vocabulary of approximately 2 million phrases (see “Methods” for details). Language models were saved at the end of each year to enable retrospective analysis (see “Methods” for details). Language models leverage co-occurrence information from a text corpus to estimate the likelihood of two words or phrases co-occurring, even if they have never appeared together in the training set. These likelihood scores can be employed to rank gene-disease hypotheses that have not been explicitly stated before. We trained language models on literature up to various points in the past and subsequently evaluated their ranking on prospective gene-disease associations published in the literature. Specifically, we generated 28 distinct historical text corpora consisting of abstracts published between 1995 and 2022, with each incrementing the cut-off date by one year. Independent models were trained on these historical datasets and used to prioritise associations between genes and diseases that were likely to be reported in future (test) years (Fig. 1A,D,G).
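As a minimal sketch of this scoring step (assuming a saved historical Gensim model whose vocabulary contains normalised gene and disease identifiers; the file name and disease token below are hypothetical), the likelihood of an unstated gene-disease pair can be approximated by the cosine similarity of the two embeddings and used to produce a genome-wide ranking:

```python
from gensim.models import Word2Vec

# Load a historical model, e.g. the one trained on abstracts up to 2012 (file name hypothetical).
model = Word2Vec.load("word2vec_upto_2012.model")

disease = "mesh_nafld"                                                # hypothetical normalised MeSH token
genes = [t for t in model.wv.index_to_key if t.startswith("ENSG")]    # normalised Ensembl gene tokens

# Cosine similarity between the gene and disease embeddings acts as the likelihood proxy.
scores = {g: float(model.wv.similarity(g, disease)) for g in genes}

# Genome-wide ranking, highest likelihood first; the top entries are the candidate hypotheses.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking[:50])
```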

Figure 1
figure 1

(A,D,G) Prospective gene-disease association predictions based on historical datasets. Grey lines represent the cumulative percentage of predicted genes that were subsequently reported in the years following their predictions using abstracts published before that year; earlier predictions span longer testing periods. The results are averaged (blue) and compared to a random bootstrap sampling of 50 genes in 10,000 iterations. (A) for non-alcoholic fatty liver disease (NAFLD), (D) for systemic lupus erythematosus (SLE), and (G) for non-small cell lung carcinoma (NSCLC). (B,E,H) Historical gene ranking scores. Each subplot displays genome-wide ranking scores (n = 19,229) from 1999 to 2017. The first author's name and publication year are presented in the coloured, grey-filled square adjacent to the ranking plot. The drug name and the year of first reported evidence are displayed in the coloured, white-filled square adjacent to the ranking plot. (B) for NAFLD, (E) for SLE, and (H) for NSCLC. (C,F,I) Likelihood plots for interpretability. (I) illustrates the likelihood of some oncogene targets of NSCLC co-occurring with NSCLC (x-axis) and the word ‘oncogene’ (y-axis), coloured by mean gene expression levels from The Cancer Genome Atlas (TCGA). Rankings are enriched (t-test with p-value = 1e−189) for genes expressed in lung adenocarcinoma cells (threshold 0.1 transcripts per million, TPM). (J) Subsequent target-disease links entering trials. Grey lines indicate the cumulative percentage of top-scoring target-indication pairs reported in clinical trials following their 2016 predictions. Each line represents a unique disease in the test set (n = 200). The results are averaged (blue with standard deviation). Metrics for the 5-year window prediction from 2015, as described by Paliwal et al.12 for the Rosalind method, are plotted in red. (K) Prospective recall at 200. As more target-indication hypotheses are tested, recall decreases over the subsequent years (x-axis) as more target-indication pairs enter clinical trials. The recall for 25 of the 200 diseases in the test set is plotted with their names in grey, and the mean for the 200 diseases is plotted in blue. (L) Target z-scoring for SLE. Four rankings are compared: PWAS (ours), MAGENTA (MGNT), Pi (Priority Index), and Open Targets (OT). Outer-shaded histograms display the normalised z-scores for 19,229 human protein-coding genes. Inner histograms present normalised z-scores for 60 SLE targets with FDA-approved drugs or in clinical trials.

Figure 2
figure 2

Uniform Manifold Approximation and Projection (UMAP) visualisation of word2vec embeddings for the published human protein-coding genome (n = 19,229 data points). (A) The first two UMAP components depict a lower-dimensional representation that visualises several therapeutic drug target classes, including enzymes, G-protein-coupled receptors (GPCRs), ion channels, and kinases. Transcription factors, previously deemed undruggable, are also shown, with their druggability enabled by the advent of PROTACs. (B) The UMAP lower-dimensional representation is coloured according to the word co-occurrence likelihoods of gene tokens and several tokens related to the therapeutic target classes (enzyme, GPCRs, voltage-gated, kinases, transcription_factors), emphasising the same areas as subplot (A). (C) The same UMAP representation is now coloured based on the maximum clinical trial phase achieved, as documented in PharmaProjects. (D) Predicted small molecule (SM) and monoclonal antibody (AB) tractability estimates are obtained using a multi-task logistic regression classifier, which uses human gene embeddings as input. These gene embeddings can be considered fingerprints for downstream estimation tasks. (E) A subset of genes encoding human kinases is displayed, with colours denoting different kinase families: AGC for kinase A, G, and C families (PKA, PKC, PKG); CAMK for calmodulin-dependent protein kinase; CK1 for casein kinases; CMGC for cyclin-dependent kinases, mitogen-activated protein kinases, glycogen synthase kinases, and CDK-like kinases; RGC for receptor guanylate cyclases; STE for serine/threonine kinases; TK for tyrosine kinase; and TKL for tyrosine kinase-like. (F) The same plot as in (E) is presented, but with colouring based on the predicted clinical tractability of small molecules.

For each year, the corresponding language model was used to make predictions, and we tested which percentage of the top-ranked genes without prior publications were subsequently reported in the literature (Fig. 1A,D,G for non-alcoholic fatty liver disease (NAFLD), systemic lupus erythematosus (SLE) and non-small cell lung carcinoma (NSCLC)). Each grey line corresponds to the predictions made by a different language model trained on one of the 28 historical datasets. For example, the grey line labelled ‘2012’ in Fig. 1A depicts the percentage of the top 50 novel predictions from the 2012 model that were found to be associated with the disease in the following years. The cumulative precisions are averaged (blue lines in Fig. 1A,D,G) and compared to random predictions (red; bootstrap sampling of 50 genes without replacement in 10,000 iterations). Approximately 50% of the top 50 novel genes were reported to be associated with diseases in different therapeutic areas (metabolic disorders, immune disease and oncology) after several years (Fig. 1A,D,G). Overall, the language model predictions (blue lines in Fig. 1A,D,G) were 6 to 10 times more likely to be studied than a random sample of genes (red in Fig. 1A,D,G).
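The evaluation can be sketched as follows, assuming a ranked gene list from one historical model, the set of genes already linked to the disease at the cut-off year, and a mapping from each gene to the year of its first reported association (all variable names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def cumulative_precision(ranking, known_at_cutoff, first_report_year, cutoff, horizon=10, k=50):
    """Fraction of the top-k novel predictions reported in each year after the cut-off."""
    novel = [g for g in ranking if g not in known_at_cutoff][:k]
    return [np.mean([g in first_report_year and cutoff < first_report_year[g] <= year
                     for g in novel])
            for year in range(cutoff + 1, cutoff + 1 + horizon)]

def random_baseline(all_genes, known_at_cutoff, first_report_year, cutoff,
                    horizon=10, k=50, n_iter=10_000):
    """Bootstrap sampling of k unpublished genes without replacement, averaged over n_iter draws."""
    pool = [g for g in all_genes if g not in known_at_cutoff]
    curves = [cumulative_precision(rng.choice(pool, size=k, replace=False), set(),
                                   first_report_year, cutoff, horizon, k)
              for _ in range(n_iter)]
    return np.mean(curves, axis=0)
```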

The genome-wide rankings from the language models were then evaluated on their ability to prioritise therapeutic drug targets before their first disease-linking publication and before their first clinical trial (Fig. 1B,E,H for NAFLD, SLE and NSCLC, respectively). The genome-wide rankings for the human protein-coding genes PNPLA3, FGF21 and ANGPTL3 (Fig. 1B) rose rapidly before the first publications relating them to NAFLD: PNPLA3 by Romeo et al.27 in 2008, and FGF21 by Inagaki et al.28 and ANGPTL3 by Yilmaz et al.29 in 2007 (Fig. 1B). Furthermore, the genome-wide rankings prioritised these targets before the first clinical trials with compounds modulating their activity: ION-839 to inhibit the production of PNPLA3 in 2018, pegbelfermin as a human FGF21 analogue, first reported in 200730, and vupanorsen, an N-acetyl galactosamine-conjugated antisense molecule targeting ANGPTL3 mRNA, in 201531 (Fig. 1B). Language models are sometimes black boxes, but we gained insight into what drove the predictions by finding ‘contextual linking words’ that had already been published and then prioritising articles containing them. Figure 1C shows the intermediate ‘contextual linking words’ for PNPLA3 and NAFLD. These words include hyperinsulinemia, glucose intolerance and hypertriglyceridemia, terms related to research performed between 2004 and 2007 that associated PNPLA3 with obesity and insulin resistance in adipose tissues32,33,34, phenotypes of NAFLD.
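The procedure behind the ‘contextual linking words’ is described only qualitatively above; one plausible sketch, assuming the gene and disease token names are hypothetical, is to rank vocabulary terms that are simultaneously close to both embeddings:

```python
import numpy as np

def linking_words(model, gene, disease, topn=20):
    """Vocabulary terms that are close to both the gene and the disease vectors."""
    words = model.wv.index_to_key
    sims_gene = model.wv.cosine_similarities(model.wv[gene], model.wv[words])
    sims_disease = model.wv.cosine_similarities(model.wv[disease], model.wv[words])
    combined = np.minimum(sims_gene, sims_disease)     # a term must be similar to both
    ranked = sorted(zip(words, combined), key=lambda pair: -pair[1])
    return [(w, float(s)) for w, s in ranked if w not in (gene, disease)][:topn]

# e.g. linking_words(model, pnpla3_token, nafld_token)   # token names hypothetical
```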

The same applies to the immune-related trait SLE in Fig. 1E. Bruton tyrosine kinase (BTK) ranked in the top-100 genes in 1999, well before the trial of ibrutinib in 200735. CD19 ranked in the top-300 genes before the first trial of obexelimab in 200736. Both interleukin-1 receptor-associated kinase 4 (IRAK4) and Toll-like receptor 7 (TLR7) climbed from the top-10,000 into the top-1000 genes for SLE before the first publications implicating them in the pathology of lupus, in 2009 by Cohen37 and in 2005 by Barrat et al.38, respectively. IRAK4 was consistently ranked in the top-1000 genes prior to the first IRAK4 inhibitor, PF-06650833 from Pfizer, which dates back to 201439. Similarly, TLR7 was highly ranked for several years before IMO3100, the first TLR7 inhibitor to enter the clinic, which was later discontinued. Enpatoran is another TLR7 inhibitor currently in Phase 2 trials for lupus40. The contextual linking words indicate why the model suggested IRAK4 for lupus: its involvement in innate immunity through TLR7 signalling, which subsequently activates MyD8841,42 (Fig. 1F).

Figure 1G–I relates to NSCLC. Figure 1H shows that anaplastic lymphoma kinase (ALK) and AXL receptor tyrosine kinase ranked in the top 1000 suggested targets for NSCLC for several years before their first evidence by Soda et al.43 in 2007 and Wimmel et al.44 in 2001, and before the first trials for the ALK inhibitor crizotinib (originally conceived as a MET-specific tyrosine-protein kinase inhibitor) in 200745 and the AXL inhibitor bemcentinib in 201046. The 2006 language model was aware of their role as receptor tyrosine kinase oncogenes (high similarity to ‘kinase’ and ‘target’, Fig. 1I) in other cancer indications47,48 and of their potential implication in NSCLC. The immunomodulatory role of CD274 in tumours was demonstrated in 2002 by the group of Honjo, who later won the Nobel Prize49. The language model did not prioritise this gene in the top 1000 at that time, but it did so before the first trial of the monoclonal antibody nivolumab in 200650. There was prior evidence of an association between neurotrophic receptor tyrosine kinase 1 (NTRK1) and types of NSCLC in 199551, but its oncogenic role in lung adenocarcinoma was not confirmed until 2008 by Ding et al.51. NTRK1 was consistently ranked in the top 100 genes (Fig. 1H) before the Ding et al. results and in the top 20 (Fig. 1H) before the first evidence of an NTRK inhibitor, entrectinib, in 2009. Furthermore, back in 2002, language models were able to prioritise several oncogene targets of NSCLC, genes expressed in lung adenocarcinoma (LUAD, Fig. 1I) cell lines (see “Methods”, t-test with p-value = 1e−198), but not CD274. This suggests that the top-ranking targets are enriched in genes expressed in the tissue of origin of the disease.

The embeddings from 2016 were used as covariate features to estimate clinical trial success with a multilayer perceptron model (see “Methods”). The output generated another score that was tested on prioritising incipient target-disease indications that ended up in clinical trials. On a testing set of 200 diseases described by Paliwal et al.12, 5.81% of the top 200 novel target suggestions per disease ended up in a clinical trial after 5 years (Fig. 1J). These results outperform the Paliwal et al.12 method, which reached 3.54% precision for 200 targets per disease (Fig. 1J). In a test comparing the performance of PWAS to other genome-wide ranking methods, namely MAGENTA52 (MGNT in red, with Wang et al. 2021 data53, in Fig. 1L), Priority index21 (Pi in green in Fig. 1L) and Open Targets (OT in purple in Fig. 1L), we found that PWAS better prioritises (mean z-score of 1.86 in Fig. 1L) the 60 SLE targets with approved drugs or drugs in active clinical trials in October 2022 according to Pharmaprojects54 (see “Methods”). Furthermore, PWAS gave the highest z-scores to the therapeutic targets TLR7, BTK and IRAK4, of which only TLR7 has been reported to have a genetic association with lupus55, through a single gain-of-function mutation56. These results suggest that language models can prioritise gene-disease associations before their discovery and validation in clinical trials.

Therapeutic target representation

An in-depth exploratory analysis was conducted using the 2022 model to investigate how human genes, as a biomedical entity, were represented in a low-dimensional space. Employing Uniform Manifold Approximation and Projection (UMAP), we uncovered hidden structures within the embeddings in this lower-dimensional space (Fig. 2A–F). Distinct therapeutic drug target families were found to cluster in separate regions of this space (Fig. 2A). The mean Silhouette Coefficient, which measures the consistency of gene families in the lower-dimensional space, was calculated to be 0.11 (with a maximum possible value of 1).

The mean cosine distance, a metric used to measure the similarity between two vectors or embeddings, was calculated for different gene families: 0.14 for all human genes, 0.19 for enzymes, 0.28 for G-protein coupled receptors (GPCRs), 0.27 for kinases, 0.23 for transcription factors, and 0.38 for voltage-gated ion channels (see “Methods” for details). The gene family clusters observed in Fig. 2A coincide with the regions highlighted when human genes are coloured by the likelihood of co-occurring with phrases describing the respective gene families (Fig. 2B).
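A minimal sketch of how such a within-family mean pairwise cosine score could be computed from the trained embeddings, assuming a list of normalised gene identifiers for one family (variable names hypothetical):

```python
import numpy as np

def mean_pairwise_cosine(model, gene_tokens):
    """Mean cosine similarity over all unordered pairs of gene embeddings."""
    vectors = model.wv[gene_tokens]                                    # (n_genes, embedding_size)
    vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = vectors @ vectors.T                                         # pairwise cosine matrix
    upper = np.triu_indices(len(gene_tokens), k=1)                     # keep each pair once
    return float(sims[upper].mean())

# e.g. mean_pairwise_cosine(model, gpcr_genes) versus mean_pairwise_cosine(model, all_genes)
```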

At the level of individual genes, the embeddings clustered voltage-gated ion channel genes, such as ABCC8, ABCC9, and KCNJ11, which are targets for type 2 diabetes mellitus; the JAK1, JAK2, and JAK3 kinases, which are therapeutic targets in numerous immune-related disorders; and the receptor tyrosine kinases ERBB2 and ERBB3. Moreover, therapeutic drug targets with programs in clinical trials or with approved drugs were predominantly located in the periphery of the UMAP plot (Fig. 2C). These vector embeddings encapsulate information from the literature as dense fingerprints, which can be utilised in subsequent classification tasks.

To assess whether the embeddings contained information regarding successful clinical trial targets, the classification experiment conducted by Ferrero et al.9 in 2017 was replicated. Ferrero et al.9 utilised several manually engineered features from the Open Targets13 platform to predict whether a gene was a therapeutic target or not. In this study, the features employed were the word vector embeddings trained until December 2017 (see “Methods”). A multilayer perceptron with default parameters and a balanced loss function was used as the model. Genes were labelled as targets if Pharmaprojects54 tagged them in any active clinical trial or as registered or launched in 2017, following the approach employed by Ferrero et al.9. For targets with programs spanning multiple phases, the most advanced stage was considered. The choice of 2017 allowed for a direct comparison with the state of literature and clinical trial phases at the time when Ferrero et al.9 conducted their research.
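A sketch of this classification setup is shown below, assuming X is the matrix of 2017 gene embeddings and y the binary Pharmaprojects labels (both hypothetical variables). Scikit-learn's MLPClassifier does not expose a class-weighted loss, so the 'balanced' setting is approximated here by undersampling the majority class:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report

# X: (n_genes, 256) embeddings from the December 2017 model; y: 1 if the gene had an active,
# registered or launched Pharmaprojects programme in 2017 (both assumed to exist).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

rng = np.random.default_rng(0)
pos, neg = np.where(y_train == 1)[0], np.where(y_train == 0)[0]
keep = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])   # balance the classes

clf = MLPClassifier(random_state=0)                 # otherwise default parameters
clf.fit(X_train[keep], y_train[keep])
print(classification_report(y_test, clf.predict(X_test)))   # Table 1 reports a balanced test subset
```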

Table 1 presents the classification metrics on a test set of unseen genes, compared against a random forest classifier. These metrics were calculated using a comparable number of negatives and positives. Precision represents the fraction of actual targets among all predicted targets, while recall is the fraction of test targets that were correctly predicted. The f1 score, the harmonic mean of precision and recall, was found to be 20% higher than that reported by Ferrero et al.9 in the Approved or Clinical Trial column. Open Targets integrates manually engineered variables from multiple sources13 to define a gene as a target. These findings suggest that word vectors capture more information about therapeutic targets than expert-crafted variables. Classifiers could potentially be employed to generate a learned clinical tractability estimate.

Table 1 Evaluation of classification metrics on a balanced test set. The classification metrics for different classes are derived from Pharmaprojects54 and are assessed on a balanced test set containing an equal number of positive and negative samples. The values enclosed in brackets represent the 5% and 95% confidence intervals, which were determined by conducting 100 bootstrap sampling experiments, each maintaining an equal distribution of positive and negative samples. This approach provides a more robust estimation of the model's performance and ensures that the observed results are not influenced by any potential sampling bias.

The classification experiment was repeated separately for small molecule (SM) and monoclonal antibody (AB) targets (Fig. 2G). Figure 2D illustrates the predicted SM clinical tractability probabilities for the entire genome, while Fig. 2F displays probabilities for all human kinase genes. Despite the absence of monoclonal antibodies against the known type-2 diabetes small molecule targets ABCC8, ABCC9, DPP4, and KCNJ11, these genes demonstrated AB tractability estimates greater than 60%, along with the interleukin receptors IL4R and IL17RA (Fig. 2D). As transmembrane proteins, these genes are amenable to monoclonal antibody targeting, and the language model accurately captured this information despite the lack of evidence in the training set. Similar results were observed for ERBB2 and ERBB3, which received 90% SM and AB scores and have FDA-approved small molecules and antibodies against them. Figure 2E shows only the 502 kinases (shown in green in Fig. 2A, which represents all 16,533 genes) in the lower-dimensional space. In Fig. 2E,F, genes encoding protein kinases cluster based on their biological pathways and functions. The receptor-interacting serine/threonine-protein kinases RIPK1, RIPK2 and RIPK3, together with MLKL, all involved in necroptosis, as well as the Janus tyrosine kinases JAK1, JAK2, and JAK3, cluster at the bottom centre of Fig. 2E. Additionally, kinases involved in DNA damage repair cluster together in Fig. 2E. Regarding small molecule tractability estimates, MLKL, TLK1, and TLK2 receive very low scores, as they are kinase-like proteins that have lost their kinase activity. The kinase family acronyms in the legend are those used by the Guide to Pharmacology database57. These results suggest that language embeddings contain features valuable for defining therapeutic drug targets. However, the critical step in drug discovery is identifying the appropriate therapeutic drug target for a specific disease.

Protein interactions

The limitations of the current human interactome pose challenges to comprehending the molecular underpinnings of human diseases58,59,60. For identifying potential therapeutic targets, it is essential to understand the pathways impacted by the disease and pinpoint nodes within the signalling network whose modulation can alleviate the disease state58,59,60. We investigated if language models prioritise protein–protein interactions (PPIs) before they are published. The historical data of PPIs were obtained from yeast-two-hybrid assays of the human interactome58,59,60, which consists of four datasets: H-I-05 from 2005, HI-II-14 from 2014, HuRI from 2020, and HI-union from 2020. A bona fide negative set was absent. Combinations of 'sticky' proteins, representing the top 10% of proteins with the highest number of interactions, served as negative examples, except when they overlapped with the positive set (refer to “Methods”).
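A sketch of the negative-set construction, assuming a list of positive (protein A, protein B) pairs from one interactome release (variable names hypothetical):

```python
from collections import Counter
from itertools import combinations

def build_negatives(positive_pairs, top_fraction=0.10):
    """Pairs of 'sticky' proteins (top 10% by interaction count) absent from the positive set."""
    degree = Counter(protein for pair in positive_pairs for protein in pair)
    n_sticky = max(1, int(len(degree) * top_fraction))
    sticky = [protein for protein, _ in degree.most_common(n_sticky)]
    positives = {frozenset(pair) for pair in positive_pairs}
    return [(a, b) for a, b in combinations(sticky, 2) if frozenset((a, b)) not in positives]
```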

To assess the models' capacity to accurately predict PPIs that were eventually discovered, new data was treated as negative observations for older models. Predictions were evaluated both contemporaneously and prospectively (refer to “Methods”). The prospective analysis demonstrated a consistent decline in false positives across each prospective dataset (Fig. 3A), accompanied by an increase in precision (Fig. 3B). The disparity in false positives between consecutive datasets signifies the number of PPIs that could have been prioritised 6–9 years ahead of time (Fig. 3A), with the correctly predicted PPIs ranging between 500 and 2000 in some instances.

Figure 3
figure 3

(A,B) Subplots display the historical prediction accuracies for the Human Interactome58,59,60. Subplot (A) presents the number of false positives, while subplot (B) illustrates the precision over time for different historical datasets (H-I-05 in blue, HI-II-14 in red, HuRI in green, and HI-union in purple), focusing exclusively on the testing set. Earlier models are assessed on newer datasets; for instance, the H-I-05 model was trained on H-I-05 positive examples and evaluated on subsequent datasets. (C) A histogram of predicted scores for positive, not-yet-positive, and negative sets of HuRI and HI-union is shown. The histogram plot displays the scores from the model trained on HuRI positive examples, with the x-axis representing the HuRI examples and the y-axis representing the number of protein–protein interactions. HuRI negatives are depicted in blue, positives in red, and HuRI negatives with HI-union positives in green. The HuRI model achieved an area-under-the-curve (AUC) score of 86.6% on the HuRI dataset and 89.7% on the HI-union dataset. (D) PWAS consistently outperforms the L3 method from Kovacs et al.18 (mean cross-validation in blue). 25% of the genes and all their interactions were used in the test set, while the remaining interactions served as input networks to predict the remaining protein–protein interactions. (E) A contour ternary plot for three hyperparameters and the Matthews correlation coefficient (MCC) on the HI-union dataset is presented. Three hyperparameters were tuned: hidden layer size 'S', number of hidden layers 'H', and dropout probability 'D'. Higher MCC values are indicated by lighter, golden colours, signifying better model performance on the unseen 25% of genes and all their pairs in the testing set. The best model, with a dropout of 40%, 2 layers, and 60 units, is highlighted in black with an MCC score of 89%. (F) UMAP coordinates (identical to those in Fig. 2A–D) are coloured based on pathways from the Kyoto Encyclopedia of Genes and Genomes61 (KEGG): 'calcium pathway' (n = 130), 'Gonadotropin-releasing hormone (GnRH) pathway' (n = 90), 'insulin pathway' (n = 75), 'peroxisome proliferator-activated receptor (PPAR) pathway' (n = 67), 'transforming growth factor-beta (TGFB) pathway' (n = 85), and 'Toll-like receptor pathway' (n = 49).

True positives that were ultimately discovered in later years, such as 2014 and 2020, were considered as negative samples in the 2005 training set. However, they were still prioritised owing to the positive-unlabeled learning strategy (refer to “Methods”) (Fig. 3A,B). For instance, the majority of HI-union positive but HuRI negative protein interaction pairs were assigned high probability scores (HuRI(−) HI-union(+) red histogram in Fig. 3C). These elevated scores correspond to the reduction in false positives for the HuRI model in Fig. 3A and the increased precision for the same model in Fig. 3B.

The PWAS approach outperformed the L3 method from Kovacs et al.18 in terms of the precision-recall curve at various thresholds for both training and testing sets (Fig. 3D). Despite slight overfitting to the training set (Fig. 3D), the models generalised to previously unseen proteins in the test set. Sixty multilayer perceptron (MLP) models were trained across a grid of three hyperparameters: the number of hidden layers, the hidden layer size, and the dropout percentage (refer to “Methods”; one dot per model in the contour plot of Fig. 3E). To accurately classify PPIs (Matthews correlation coefficient, MCC > 0.8) on the test set, the models required a relatively large hidden layer size (> 10) (Fig. 3E). The best-performing model was D40-N2-S60 (black text in Fig. 3E), with 40% dropout, two hidden layers, and 60 units, on the HI-union dataset.
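The sweep can be sketched with TensorFlow/Keras (scikit-learn's MLP does not expose dropout), assuming each protein pair is represented by the concatenation of its two gene embeddings and that X_train/X_test and y_train/y_test already exist; the grid values below are an illustrative subset of the 60 configurations rather than the exact grid:

```python
import itertools
import tensorflow as tf
from sklearn.metrics import matthews_corrcoef

def build_mlp(input_dim, hidden_layers, hidden_size, dropout):
    """MLP over concatenated protein-pair embeddings with a sigmoid interaction output."""
    model = tf.keras.Sequential([tf.keras.layers.Input(shape=(input_dim,))])
    for _ in range(hidden_layers):
        model.add(tf.keras.layers.Dense(hidden_size, activation="relu"))
        model.add(tf.keras.layers.Dropout(dropout))
    model.add(tf.keras.layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

results = {}
for layers, size, dropout in itertools.product([1, 2], [10, 20, 60], [0.0, 0.2, 0.4]):
    model = build_mlp(X_train.shape[1], layers, size, dropout)
    model.fit(X_train, y_train, epochs=20, batch_size=128, verbose=0)
    preds = (model.predict(X_test, verbose=0).ravel() > 0.5).astype(int)
    results[(layers, size, dropout)] = matthews_corrcoef(y_test, preds)
print(max(results, key=results.get))   # best (layers, size, dropout) combination by MCC
```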

The MLP models using embedding features were able to accurately predict PPIs because the embeddings encode information about the biochemical pathways involving the respective genes (Fig. 3F). For example, members of the peroxisome proliferator-activated receptor (PPAR) pathway, which includes 67 genes (such as PPARA, PPARD, PPARG, RXRA and RXRG in Fig. 3F) from the Kyoto Encyclopedia of Genes and Genomes61 (KEGG), clustered together with a mean cosine distance of 0.32, compared to 0.14 for any-to-any gene (Fig. 3F). Similarly, members of the JAK-signal transducer and activator of transcription (STAT) pathway, comprising 148 genes (such as JAK1, JAK2, STAT6, IL4R and IL17RA in Fig. 3F) from KEGG61, also clustered together, exhibiting a mean cosine distance of 0.31 compared to 0.14 for any-to-any gene (Fig. 3F).

The MLP model trained with embeddings not only outperformed the L3 method from Kovacs et al.18 (Fig. 3D) but also demonstrated the potential to prioritise PPIs for low-throughput experimental validation. Overall, our findings reveal that language models can be harnessed to effectively predict and prioritise PPIs before they are experimentally validated and published, thereby accelerating the discovery of new molecular targets for human diseases.

Prioritising novel mechanisms of action of drugs

Just as they uncover latent gene-disease associations (Fig. 4A, kinase to cancer indication), language models can also uncover latent associations between proteins and the drug-like compounds that modulate their activity (Fig. 4B, kinase to kinase inhibitor). A principal component analysis demonstrates a consistent vector operation: 'kinase inhibitor' + 'kinase inhibitor of' ≈ 'kinase', and 'kinase' + 'gain of function in' ≈ 'cancer' (Fig. 4A).

Figure 4
figure 4

(A) Latent relationships learnt between kinase inhibitors, oncogene drivers and cancers. The embeddings for cancer indications (red), oncogenic tyrosine kinases (blue) and kinase inhibitors (green) were projected onto two dimensions using Principal Component Analysis (PCA). There are consistent vector operations between words representing ‘kinase inhibitor of’ and ‘oncogenic gain of function in cancer’. (B,C) Predicted links between kinases and kinase inhibitors. The predictions were obtained from the best model, H100-N2-D0.2. Network of kinases (blue), kinase inhibitors (green), their true-positive links, and link predictions: true positives in black, false negatives in purple (where present, in (C)) and false positives in red (where present). The model did not yield any false positives or false negatives for the kinase inhibitors in (B). All DrugBank positive links are plotted, together with the predicted links with probabilities higher than 0.65. Node size is proportional to the out-degree of each node. The width of the links is proportional to the predicted probability. (B) This subplot shows specific drugs with a maximum of two targets; there are no false positives or false negatives. (C) This subplot shows promiscuous drugs; there are 102 false positives among the drugs and kinases plotted in the graph. (D) Target prioritisation plot for metformin. Scatter plot of the likelihood of being a metformin target (x-axis) against the − log10 p-values (y-axis) from the meta-GWAS for non-synonymous variants. The putative metformin targets SIRT1, the AMPK subunit and the mitochondrial glycerophosphate dehydrogenase (GPD2) are in green. Non-synonymous variants with p-values < 10−6 and two studies with the same direction of the allele effect are coloured in red. Non-synonymous variants with high likelihoods and genome-wide significance (p-value < 10e−8) are in purple. Two lines were calculated with a moving average over a range interval, from left to right. The 99% confidence interval for p-value significance from the GWAS meta-analysis is plotted in grey. The moving average for druggability, as defined by Berenstein et al.68, is plotted in red.

We trained multiple Scikit Learn and TensorFlow models (Table 5) using gene and compound word vectors as input, with positive samples from DrugBank and random negative pairs (see “Methods”). The gene and compound vectors were concatenated to generate a single fingerprint for each target-compound pair. These models were employed to weight features and produce a single score for each gene-compound pair, estimating the likelihood of compound modulation of the protein encoded by the gene. The H100-N2-D0.2 model, a multilayer perceptron classifier with 100 hidden neurons (H100), 2 hidden layers (N2), and 0.2 dropout rate (D0.2), delivered the best performance.
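A sketch of the pair-fingerprint construction and classifier, assuming the gene and compound tokens are in the Word2Vec vocabulary and that the DrugBank positive pairs and sampled negative pairs already exist (all variable names hypothetical); scikit-learn's MLPClassifier has no dropout, so the D0.2 component of the best model is omitted here:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def pair_fingerprint(model, gene, compound):
    """Concatenate the gene and compound embeddings into one feature vector for the pair."""
    return np.concatenate([model.wv[gene], model.wv[compound]])

# positive_pairs: (gene, compound) tuples from DrugBank; negative_pairs: random combinations.
X = np.array([pair_fingerprint(model, g, c) for g, c in positive_pairs + negative_pairs])
y = np.array([1] * len(positive_pairs) + [0] * len(negative_pairs))

# Two hidden layers of 100 units, loosely mirroring the H100-N2 part of H100-N2-D0.2.
clf = MLPClassifier(hidden_layer_sizes=(100, 100), random_state=0).fit(X, y)
```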

The model's precision was relatively low, with only 17% of true positives correctly predicted with high probability across all positives (Table 5). However, when grouped by drug-like compound specificity (Table 2), precision increased from 17% (Table 5) to 86.1% for compounds with 2 targets reported in DrugBank and to 95% for compounds that act on one human protein target (Table 2, “Pharmacologically active (1 Target)”), while remaining low (12.7%) for unspecific compounds with more than 2 targets (Table 2).

Table 2 Classification accuracies grouped by promiscuous drugs. Classification statistics for predicting links from compounds whose pharmacological mechanism of action is known or at least reported in DrugBank and their protein targets. The results presented in this table are only for the testing set unseen during training. The table is grouped by the number of reported protein targets in DrugBank (more than 2, 2 or 1 target).

The H100-N2-D0.2 model demonstrated low precision (12.7%) for promiscuous compounds acting on three or more targets (Table 2), such as unspecific kinase inhibitors targeting multiple kinases (Fig. 4C). In contrast, the classifier accurately assigned high scores to selective kinase inhibitor targets, including vemurafenib and encorafenib for BRAF, osimertinib, erlotinib, gefitinib, and neratinib for EGFR, and ibrutinib and acalabrutinib for BTK (Fig. 4B).

As demonstrated in Table 2, the language models could prioritise known mechanisms of action for drug-like compounds. Therefore, we tested these models on drugs with unresolved targets, such as metformin, a widely-used medication for improving glucose metabolism and alleviating diabetes-related complications. Although several metformin targets have been identified, including AMPK, the mitochondrial respiratory chain complex 1, and mitochondrial GPD2, its precise mechanism of action remains elusive62.

We examined the genome-wide target ranking for metformin in comparison to genome-wide association data (refer to “Methods”) for glycemic response (Fig. 4D), focusing on non-synonymous variants that result in amino acid changes. The putative metformin targets, such as PRKAB1 (a component of the AMPK complex), SIRT1, and GPD2, ranked highly (green dots in Fig. 4D). The language model assigned higher scores to small molecule tractable targets (CI50% tractability red line in Fig. 4D, see “Methods”). However, there was a depletion of p-values from the GWAS meta-analysis (CI99% p-value grey line in Fig. 4D), and none of the putative targets exhibited significant non-synonymous variants.

Five variants in four genes (Table 3) met several criteria: non-synonymous, p-value lower than 10e−9, consistency in the beta coefficient across two studies, and a high likelihood (> 0.3) of being a metformin target. The highest-ranked variant, rs8179206 (Glu77Gly) in the GCKR gene, has been associated with elevated triglycerides in the blood63 and obesity64. However, no publications have linked the genetic variants rs7273698 and rs7273698 in CD40, rs61737461 in AEBP1, and rs147387411 in ACLY to metformin (Table 3, Fig. 4D).

Table 3 Top-ranking genes. Gene symbol, allele change location in the Genome Reference Consortium build GRCh37, amino acid replacement, reference single nucleotide polymorphism (SNP) identifier, beta coefficient, − log10 of the p-value, the number of independent studies from the meta-GWAS, the number of samples as the sum across studies, and the direction of the effects, i.e. the sign of the beta coefficient (a minus sign for negative coefficients, a plus sign for positive coefficients, and a question mark for variants not measured and coefficients not calculated). Variants in this table were non-synonymous, had a p-value lower than 10e−9, had two studies with the same direction in the beta coefficient, and had a high likelihood of being the target of metformin according to the language model.

ATP citrate lyase (ACLY) is involved in converting citrate and coenzyme A into acetyl-Coenzyme A and oxaloacetate65 across various high-fat-producing tissues, including the liver, adipocytes, and pancreatic beta cells68. Acetyl-CoA contributes to lipogenesis and cholesterogenesis pathways65. ACLY is a promising therapeutic target for cholesterol reduction and protection against atherosclerosis66. The ubiquitination and subsequent degradation of ACLY shifts cell metabolism from synthesis to fatty acid oxidation66, correlating with metformin's activation of AMPK, stimulation of fatty acid oxidation, inhibition of cholesterol and triglyceride synthesis67, and increased glucose uptake and insulin sensitivity in skeletal muscle67.

The Arg592Gln (rs147387411) mutation is located near the coenzyme A binding domain but not directly within it. Acetylation at three lysine residues (K540, K546, and K554) by lysine acetyltransferase 2B (KAT2B) has been shown to increase ACLY stability by blocking its ubiquitylation and promoting de-novo lipid synthesis67. ACLY is deacetylated and inhibited by SIRT2, but not SIRT167, a putative metformin target. The Arg592Gln ACLY mutant may exhibit altered acetylation, ubiquitination, or phosphorylation upon metformin treatment. However, additional experimental work is necessary to confirm its connection with metformin.

Discussion

The Publication-Wide Association Study (PWAS) leverages language models to analyse 19.5 million publications, prioritising previously unpublished hypotheses across various contexts: gene-disease associations, target-disease clinical trials, protein–protein interactions, and unexplored yet plausible drug mechanisms of action. The study demonstrates that word embeddings, or vector representations of words, can effectively encode complex biomedical knowledge from published literature without requiring a priori biomedical knowledge. These embeddings capture intricate biomedical concepts, such as drug target classes, tractability, disease phenotype involvement, and protein–protein interactions. Furthermore, language models have successfully ranked novel hypotheses years before their publication in top-tier journals, suggesting that the foundations of future discoveries are already embedded within existing publications to some extent.

To the best of our knowledge, this is the first attempt to utilise language models for prioritising drug discovery hypotheses prior to their publication. Language-based inference methods may serve as a promising new research field at the intersection of natural language processing and target identification, propelling language models beyond their traditional use in named entity recognition and normalisation tasks to harness the wealth of associations present in biomedical literature. The various tasks for language models will be discussed in the following sections.

Target prioritisation

The vector representations of genes contain information about small molecule and antibody druggability, disease association, therapeutic target family, pathways and clinical precedence that can be generalised to unseen genes (Fig. 2). These embeddings contained more information than features manually engineered by biology experts13 (Table 1). Around 50% of the top-ranked genes ended up being published as associated with the disease within 10 years, a percentage similar to the results of Tshitoyan et al.69. Furthermore, some relevant gene-disease associations that were later validated could have been prioritised one to several years before their publication. The PWAS method outperformed the Paliwal et al.12 method at prioritising clinical trial targets for the 200 diseases in the test set. However, language models cannot make predictions for genes or diseases that are not in the training data. Genome-wide association studies have suggested disease associations for novel genes that had never previously been published (and were therefore absent from the training data); such associations cannot be prioritised with language models unless scientists first write paragraphs describing the gene function based on experimental work. Furthermore, PWAS is itself an association study and remains far from prioritising causal genes. Paradigm shifts in molecular biology and spurious associations between genes and diseases will bias the models. Furthermore, the scientific literature is noisy and often contradictory. Despite these problems, PWAS could prioritise some targets several years before their first publication and before their first clinical trial. Future work will require training more complex transformer models for hypothesis generation and question answering.

Protein–protein interactions

In this study, the PWAS exhibited enhanced performance in comparison to L318, as depicted in Fig. 3D. The ability of language models to prioritise protein–protein interactions (PPIs) in advance offers a promising opportunity to supplement experimental approaches for completing the human interactome. However, there are limitations to consider. Unlike other published methods23,24,25, the predictions in this study do not identify the interface residues essential for the interaction. Additionally, the current work depends heavily on high-throughput yeast-two-hybrid (Y2H) experiments, which are known for limited sensitivity and specificity58,59,60. Y2H experiments face a significant drawback, as they cannot capture PPIs requiring post-translational processing, a common occurrence in humans, and they only consider a single isoform per gene, excluding alternative splicing variants that may have crucial roles in protein function and interactions.

Moreover, this study did not incorporate several independent and competing PPI databases, such as IntAct70 or STRING71. These resources offer valuable information on experimentally validated and predicted PPIs, and their inclusion could potentially enhance the comprehensiveness and accuracy of human interactome mapping.

Drug target interactions

Language models demonstrated the ability to learn known target-drug relationships for specific compounds effectively. Notably, the putative metformin targets prioritised by these models, and the genome-wide ranking generated, correlated strongly with independent small molecule tractability scores. It is hypothesised that the Arg592Gln ACLY mutant might undergo differential acetylation, ubiquitination, or phosphorylation in response to metformin treatment. However, this hypothesis necessitates further experimental validation to establish its veracity conclusively.

Limitations

Despite the vast potential of language models trained on biomedical literature, it is crucial to acknowledge their inherent limitations. One primary concern is the presence of biased, erroneous, irrelevant, and weak studies within the training data. These suboptimal sources may introduce noise and inaccuracies, thereby corrupting the model's knowledge base and impairing its ability to make accurate predictions or generate reliable insights. Consequently, the conclusions drawn from such models could be compromised, leading to misguided hypotheses and, ultimately, impeding scientific progress. Therefore, it is essential to exercise caution when interpreting the outputs of language models and to validate their findings through independent experimental approaches to ensure the veracity and robustness of the generated results.

Named entity recognition (NER) errors can greatly affect the quality and dependability of a publication-wide association study (PWAS) in target-disease prioritisation. These inaccuracies may result in false or missed associations, misrepresented relationships, and diminished trust in the model outputs. While advanced NLP techniques and transformer models, such as BioBERT or GPT-based architectures, may offer improvements, they still face challenges in ranking extensive lists, such as the roughly 20,000 human genes. Because these models tokenise input sequences into smaller units, gene names or symbols can be divided into multiple tokens, complicating the generation of a precise genome-wide ranking. Future research will be needed to replicate the PWAS work with transformer models and to evaluate them in a retrospective analysis, proving that they can prioritise novel preclinical targets that eventually succeed in the clinic.

A further limitation concerning entity recognition lies in the representation of diseases as singular entities with a unified vector representation. However, diseases are heterogeneous, and gene alterations contribute variably to the onset and progression of the condition. Certain genes may serve as potential targets for the initial manifestation, while others play a role in exacerbating and aggravating the disease's progression. This study generates a single genome-wide ranking for each disease, regardless of its state or trajectory. Nonetheless, some disease stages are represented by distinct ontology terms within the MeSH ontology, such as the progression from NAFLD to non-alcoholic steatohepatitis and ultimately to hepatocellular carcinoma, for which the rankings will differ slightly. Moreover, target prioritisation encompasses a multi-objective ranking approach in which the target needs to be safe, tractable and associated with the disease. Incorporating additional objectives or keywords could help prioritise targets for particular phenotypes of the disease. For example, associations with DNA-damage repair keywords could be combined with safety and tractability keywords and with cancer types to bias the initial ranking, similar to posing a more detailed question to a transformer model.

Conclusion

The findings of this study indicate that to some extent, latent information regarding future discoveries is embedded within past scientific publications. Furthermore, language models present a compelling method for prioritising unverified hypotheses derived from the continuously growing corpus of biomedical literature. To our knowledge, this work represents the first literature-wide association study that ranks therapeutic drug hypotheses for subsequent validation utilising language models trained on historical corpora.

Efficient assimilation of existing knowledge, identification of promising research directions, and avoidance of redundant efforts are all vital components for scientific progress and hypothesis generation in target analysis. However, with the ever-increasing volume of biomedical literature, this task becomes progressively more challenging, if not unfeasible, for individual researchers.

Our research has the potential to contribute to a novel paradigm wherein the vast amount of information contained in scientific publications becomes readily accessible to individual target analysts, thereby fostering machine-assisted breakthroughs in the identification of innovative targets. The Publication-Wide Association Study (PWAS) represents a scalable system that facilitates the prioritisation of under-explored targets, expediting early-stage drug target selection irrespective of the specific disease of interest. This method holds the potential to radically transform the way researchers navigate and leverage the extensive knowledge base within the biomedical field.

Methods

Training corpus, vocabularies and preprocessing

The input data for this study comprised roughly 19.5 million English abstracts sourced from PubMed, which included mentions of any human gene or disease synonyms in their titles or abstracts. Abstracts classified as 'Commentary', 'Correction', and 'Corrigendum' in PubMed were excluded. In this study, we utilised a comprehensive set of 4,819 disease terms from the Medical Subject Headings (MeSH) ontology, which encompasses a diverse range of diseases, phenotypes, and symptoms. MeSH, developed by the National Library of Medicine, is a controlled vocabulary used for indexing and cataloguing biomedical literature in the MEDLINE/PubMed database. For the gene vocabulary, we considered 19,229 human protein-coding genes.

The TrendyGenes72 pipeline was employed to identify biomedical entities, specifically genes and diseases, within these abstracts. Standardised identifiers were utilised to normalise genes and diseases, a process that involves canonicalising all concept synonyms. For instance, 'Her2', 'ERBB2', and 'Neu' were normalised into their Ensembl identifiers, while 'non-insulin-dependent diabetes', 'diabetes mellitus, type II', and 'T2DM' were normalised to the corresponding Medical Subject Headings (MeSH) identifier for type 2 diabetes.

Normalisation was implemented for several reasons. Firstly, it reduces the vocabulary size. Secondly, it accounts for different research groups using distinct gene synonyms in various research contexts (e.g., genetics using official gene symbols versus biochemistry articles). Canonicalisation consolidates information into a single token, enabling the generation of a genome-wide ranking of 20,000 human protein-coding genes, rather than n_synonyms multiplied by the number of genes. Normalising biomedical entities led to a fivefold reduction in gene synonym mentions, from 102,719 to 19,229 human protein-coding gene identifiers. Similarly, disease synonym mentions were reduced tenfold, from 53,317 to 4819 human disease identifiers. Proper corpus preprocessing significantly enhances the performance of language models69.

The vocabulary for word2vec models included all words that appeared more than ten times, plus normalised gene and disease identifiers regardless of the number of mentions. As a result, historical models featured different vocabularies. If a gene is never mentioned in the training corpus, it is excluded from the language model's vocabulary. For instance, 13,676 distinct human genes were mentioned in the training corpus up to 2005, while 16,533 different human genes have been mentioned up to the present, accounting for 86% of the total 19,229 human protein-coding genes. Normalised gene and disease mentions refer to the mapping of any textual reference to a unique identifier from HUGO or the Medical Subject Headings (MeSH) ontology.

Leading statements such as 'Background:', 'Abstract:', and 'Introduction:' were removed. Phrases were generated using a minimum phrase count of 10, a normalised mutual information score greater than 0.7, and a phrase depth of up to three. This phrasing process was repeated three times, allowing the generation of phrases of up to 8-grams. For example, the phrase 'g protein coupled receptor' appeared more than ten times in the corpus and had a normalised mutual information score greater than 0.7, and was therefore merged into 'g_protein_coupled_receptor'. The text was converted to lowercase and deaccented. Floating decimals and percentage numbers were replaced with the token '<number>' using regular expressions to decrease the vocabulary size. Stop words were retained as they constituted approximately 100 tokens within the half-million-token vocabulary.
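The iterative phrasing step can be sketched with Gensim's Phrases model, assuming `sentences` is an iterable of token lists from the preprocessed abstracts; the minimum count and normalised (NPMI) threshold follow the values above, while the remaining settings are assumptions:

```python
from gensim.models.phrases import Phrases

corpus = sentences                       # e.g. [["g", "protein", "coupled", "receptor", ...], ...]
for _ in range(3):                       # three bigram passes allow phrases of up to 8 tokens
    phrases = Phrases(corpus, min_count=10, threshold=0.7, scoring="npmi", delimiter="_")
    corpus = [phrases[sentence] for sentence in corpus]
# After phrasing, 'g protein coupled receptor' becomes the single token 'g_protein_coupled_receptor'.
```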

Language models

We employed the Word2Vec algorithm from the Gensim Python library for our analysis. A range of hyperparameter combinations were explored to identify the optimal configuration for capturing gene target to disease indication analogies (Table 4). We selected a window size of 10, which corresponds to the median distance (in words) between diseases and genes in PubMed. Based on the work by Yin and Shen73, an embedding size of 255 minimised the Pairwise Inner Product loss, though we also conducted a grid search with sizes of 128, 256, and 512. The remaining hyperparameters were set to their default values.

Table 4 Hyperparameter grid search for various Word2Vec models. The hyperparameters explored include the Word2Vec model task, either continuous bag-of-words (CBOW) or skip-gram (SG); the size of the embedding vectors, represented by the number of variables in the word vectors as powers of 2 (128, 256, and 512); and the window distance, which is the number of tokens (words or phrases) between the central pivot word and the surrounding words that the algorithm attempts to predict. Table entries indicate the vocabulary ranking of the correct answer for analogies of the form: target A is to disease A as target B is to disease B, as originally demonstrated by Mikolov in their publication26. The best score is highlighted in bold and with an asterisk.

During the 50 training epochs, the learning rate was linearly reduced from 0.01 to 0.0001. We experimented with context or distance window sizes of 5, 10, and 15 tokens. Subsampling was performed with a 10−4 threshold, targeting approximately the 400 most frequent tokens in the vocabulary. Both Word2Vec variants, skip-gram and continuous bag of words (CBOW), were considered. The chosen model, which maximised performance on a curated dataset of target-to-disease analogies (Table 4), was the CBOW model with an embedding size of 256 and a window distance of 10.
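A sketch of the final training configuration in Gensim, with the hyperparameters quoted above; the corpus variable, worker count and file name are assumptions, and the learning-rate schedule is expressed through alpha/min_alpha:

```python
from gensim.models import Word2Vec

model = Word2Vec(
    sentences=corpus,        # phrased, normalised abstracts up to the chosen cut-off year
    vector_size=256,         # embedding size selected by the grid search
    window=10,               # median gene-disease distance in PubMed abstracts
    sg=0,                    # continuous bag of words (CBOW)
    sample=1e-4,             # subsampling threshold for frequent tokens
    min_count=10,            # note: the study kept gene/disease identifiers regardless of count,
                             # which this flag alone does not reproduce
    alpha=0.01,
    min_alpha=0.0001,        # linear decay of the learning rate over the 50 epochs
    epochs=50,
    workers=8,
)
model.save("word2vec_upto_2016.model")   # one model per historical cut-off (file name hypothetical)
```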

Target-disease prioritisation

Literature associations between genes and diseases were obtained from the TrendyGenes pipeline72. The clinical stages for the target-disease indications were gathered from the manual curation of Pharmaprojects54 and Open Targets13 datasets. The ‘chembl’ score from Open Targets13 was recalculated after the integration: a score from 0 to 1 depending on the clinical phase of a target-disease pair, with 0.1 for phase 1, 0.2 for phase 2, 0.7 for phase 3 and 1.0 if there is an approved and launched drug. This step scored each target-disease association depending on the clinical trial stage. Gene expression data in non-small cell lung cancer in Fig. 1I were obtained from The Cancer Genome Atlas74. The threshold was 0.1 transcripts per million (TPM), from the lung adenocarcinoma dataset (TCGA-LUAD).

For Fig. 1J,K, multiple feed-forward neural networks with different amounts of dropout, hidden layer sizes and numbers of layers were used to regress the ‘chembl’ score from 2016. Three hyperparameters were tuned: the number of hidden layers (1 or 2), the dropout (0, 10, 20, 30 or 40%) and the size of the hidden layers (1, 5, 10, 20, 50, 100). The model with the highest validation accuracy on the future (test) set was selected. Embeddings from December 2015 were used as covariate features to regress the ‘chembl’ scores, giving higher scores to approved drugs than to target-disease pairs in phase 1. We used data from 2016 to demonstrate the ability to prioritise prospective associations until 2022. The training and testing data were split according to the Paliwal et al.12 2020 publication: 200 diseases and all their targets were in the testing set, and the remaining diseases and their targets were part of the training set. In their supplementary methodology section, Paliwal et al.12 provided supplementary information for only 40 out of the 200 diseases. The remaining 160 diseases were randomly selected to encompass a variety of disease types, including metabolic, immune, neurological, and oncological indications. To achieve this, diseases were first categorised into four groups (metabolic, immune, neurological, and oncology). Only diseases with at least three targets in Informa were considered for inclusion in the testing set, while over-specific diseases situated at the leaf nodes of the ontology tree were excluded. Using a uniform distribution, 160 diseases were then randomly selected, 40 from each therapeutic area. All remaining diseases with at least one target were allocated to the training and validation sets.

For Fig. 1L, genome-wide summary statistics to run the MAGENTA pipeline52 were obtained from Wang et al. 2021 data53. MAGENTA was run with default parameters on the summary statistics, with all data in genome build Genome Reference Consortium 37 (GRCh37). To run the Priority index21 (Pi, shown in green in Fig. 1L), we used a genome-wide summary statistics pipeline from Open Targets Genetics13. A total of 15,901 genes were prioritised, based on: 6645 single nucleotide polymorphisms (SNPs) scored positively (including 644 'Lead' and 6001 'Linkage Disequilibrium' genes); 714 nearby genes within a 50,000 base pair genomic distance window of 6129 SNPs; 2045 expression quantitative trait loci (eQTL) genes whose expression is modulated by 4677 SNPs; 912 Hi-C genes physically interacting with 4033 SNPs; and 2900 genes defined as seeds from the 6645 SNPs, from which a random walk with restart (restart probability 0.75) was run over the network of 15,901 nodes and 318,866 edges from the STRING database71. The Open Targets dataset only assigns a score (between 0 and 1) to targets with evidence for a disease. To generate a genome-wide ranking for the z-score calculation, the remaining non-associated targets were given a random score drawn from a uniform distribution between zero and the minimum association score for the disease. Z-scores were calculated with the quantile normalisation function in Scikit Learn.
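A minimal sketch of the random-fill and quantile normalisation step (assuming scikit-learn's `quantile_transform` helper as the quantile normalisation function mentioned above):

```python
import numpy as np
from sklearn.preprocessing import quantile_transform

def genome_wide_zscores(scores: np.ndarray, associated: np.ndarray,
                        rng=np.random.default_rng(0)) -> np.ndarray:
    """scores: per-gene Open Targets scores (0 where there is no evidence);
    associated: boolean mask of genes with evidence for the disease."""
    filled = scores.astype(float).copy()
    min_assoc = scores[associated].min()
    # Non-associated genes get a uniform random score between 0 and the minimum
    # observed association score, so a full genome-wide ranking exists.
    filled[~associated] = rng.uniform(0.0, min_assoc, size=(~associated).sum())
    # Map the ranking onto a standard normal distribution to obtain z-scores.
    return quantile_transform(filled.reshape(-1, 1),
                              output_distribution="normal",
                              n_quantiles=min(1000, len(filled))).ravel()
```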

The list of therapeutic targets in this section was downloaded from Pharmaprojects and included all therapeutic targets for systemic lupus erythematosus that had an active program in the following stages: Phase I Clinical Trial, Phase II Clinical Trial, Phase III Clinical Trial, Registered, and Launched. This included the following gene symbols: BTK, BTLA, CD200, CD200R1, CD22, CD28, CD38, CD40, CD40LG, CD6, CD79B, CLEC4C, CNR2, CRBN, CXCR5, FCGR2B, FCGRT, FKBP1A, ICOSLG, IFNA1, IFNAR1, IKBKE, IL12B, IL1RL2, IL2, IL21, IL23R, IL2RA, IL2RB, IL2RG, IRAK1, IRAK4, JAK1, JAK3, LANCL2, LGALS1, LGALS3, LILRA4, MALT1, MASP2, MC2R, MIF, MS4A1, NLRP3, NR3C1, PIK3CB, S1PR1, SIK2, SLC15A4, SNRNP70, SOCS1, STING1, SYK, TBK1, TLR7, TLR8, TNFRSF13B, TNFRSF13C, TNFRSF4, TNFSF13, TNFSF13B, TYK2, XPO1.

Therapeutic target representation

The clinical stages for the targets were gathered from Pharmaprojects54 and Open Targets13 datasets. The therapeutic target classes and the kinase families were downloaded from the Guide to Pharmacology database57. The Uniform Manifold Approximation and Projection (UMAP) low-dimensional representation was obtained using the umap-learn package in Python with the following hyperparameters: 15 neighbours, cosine similarity as the metric, 1000 epochs, a repulsion strength of 12 and a local connectivity of 3 neighbours. To measure the consistency of the clusters in Fig. 1A, intra- and inter-cluster cosine distances were compared. The cosine distance is derived from the angle between two vectors: the more similar the vectors, the closer their cosine similarity is to 1 (and their cosine distance to 0). The difference between intra- and inter-cluster cosine distances was significant for all gene families, with the following P-values using a t-test with unequal variance and a nonparametric Mann–Whitney U test: 3.03e−251 and 7.49e−249 for enzymes, 3.17e−321 and 0.0 for G-protein coupled receptors, 2.45e−304 and 6.96e−166 for kinases, 4.38e−124 and 3.51e−46 for voltage-gated ion channels (see “Methods”).
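The projection hyperparameters map directly onto umap-learn's API (a sketch; `gene_embeddings` is a placeholder for the matrix of gene word vectors):

```python
import umap

# UMAP configuration as described above.
reducer = umap.UMAP(
    n_neighbors=15,
    metric="cosine",
    n_epochs=1000,
    repulsion_strength=12,
    local_connectivity=3,
)
coords_2d = reducer.fit_transform(gene_embeddings)  # (n_genes, 2) layout for Fig. 1A
```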

We adopted the same hyperparameters as Ferrero et al. in 20179 for the multilayer perceptron, a feed-forward neural network with a single hidden layer and a balanced loss function implemented in Scikit Learn. The solver was Adam, and training stopped if performance did not improve for two epochs on a validation set. The input data were the gene embeddings from the December 2016 language model, allowing a direct comparison with the Open Targets features available at the time. The metrics displayed in Table 1 correspond to an independent, balanced test set. The same model architecture was used to classify whether a gene is a tractable target for monoclonal antibodies and for small molecules in Fig. 1G,H. The labels for whether a target is tractable with small molecules or antibodies were gathered from manual curation of Pharmaprojects54 and Open Targets. The datasets in this section were randomly split into training and test sets containing 80 and 20% of the observations, respectively (similar to Ferrero et al.9).
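A minimal sketch of this classifier with scikit-learn (the hidden-layer size is an assumption, and `X_train`/`y_train`/`X_test` are placeholders for the December 2016 gene embeddings and target labels):

```python
from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(
    hidden_layer_sizes=(100,),  # single hidden layer; the size here is illustrative
    solver="adam",
    early_stopping=True,        # hold out an internal validation split
    n_iter_no_change=2,         # stop after 2 epochs without improvement
)
clf.fit(X_train, y_train)
probabilities = clf.predict_proba(X_test)[:, 1]  # e.g. tractability probability per gene
```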

Protein–protein interactions

We used all historical versions of the human interactome58,59,60, which accounts for roughly 64,000 human protein–protein interactions in its latest release. Self-interactions, such as autophosphorylation of receptor tyrosine kinases, were discarded. Protein interaction pairs were augmented by reversing the gene order (e.g. 'A1BG-ZNF44' → 'ZNF44-A1BG'). These interactions were the positive examples. As there was no bona fide negative set, all combinations of the top 10% of proteins with the most interactions were treated as negative examples unless they intersected with the positive set. For example, even though the genes DVL2 and TRAF2 had 36 and 69 interaction pairs, the DVL2-TRAF2 pair was excluded from the negative set because it appeared in the H-I-05 positive set. However, the pair DVL2-NIF3L1, with 36 and 31 interactions respectively in H-I-05, was included in the set of negative pairs. An upset plot of the protein–protein interactions from each of the historical datasets is shown in Fig. 5: 498 unique protein–protein interactions were common across all datasets, and HI-union contained a total of 64,006 protein–protein interactions, of which 46,811 intersected with HuRI.
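The negative-pair construction can be sketched as follows (`positive_pairs` is a placeholder for the set of alphabetically sorted positive pairs):

```python
from collections import Counter
from itertools import combinations

# Count interactions per protein and keep the top 10% most connected proteins.
degree = Counter(p for pair in positive_pairs for p in pair)
n_top = max(1, len(degree) // 10)
hubs = [p for p, _ in degree.most_common(n_top)]

# All hub-hub combinations that are not known positives become negatives
# (e.g. DVL2-NIF3L1 is kept, DVL2-TRAF2 is excluded).
negative_pairs = {
    tuple(sorted(pair))
    for pair in combinations(hubs, 2)
    if tuple(sorted(pair)) not in positive_pairs
}
```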

Figure 5

Upset plot of the Human Interactome. The upset plot shows the overall number of unique protein–protein interactions and the size of the intersections among the datasets: HuRI, published in Nature, 202058; HI-union, which adds literature-curated interactions to HuRI; HI-II-14, published in Cell, 201459; and H-I-05, published in Nature, 200560. Each count corresponds to a unique, alphabetically sorted protein–protein pair based on the approved gene symbol, that is, 'ZNF44-A1BG' → 'A1BG-ZNF44'. Self-interactions, such as autophosphorylation of receptor tyrosine kinases, were discarded. Every possible intersection of protein pairs is represented in the histograms. The total number of protein–protein interactions from each historical dataset is displayed in the histogram on the left.

To test the ability of the models to eventually recover true protein–protein interactions, the interactions newly reported as positive in 2020 were regarded as negative examples for the 2014 and 2005 models. For example, the CCHCR1-ZWINT interaction introduced in HuRI was a positive example for the 2020 experiment but negative for the 2014 and 2005 experiments. The idea was to test whether a positive-unlabelled strategy75 could prioritise future protein interaction pairs even if they were presented as negatives. The features were the concatenation of the two embeddings for the proteins in the pair. Language models were checkpointed each December. Embeddings trained until December 2005 were the features for the model trained on H-I-05 examples, embeddings from December 2014 were paired with HI-II-14 positive examples, and embeddings from December 2020 with positive examples from HuRI and HI-union.
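The pair featurisation reduces to a simple concatenation (a sketch; `embeddings` is a placeholder mapping gene symbols to word vectors from the relevant December checkpoint):

```python
import numpy as np

def pair_features(gene_a: str, gene_b: str, embeddings: dict) -> np.ndarray:
    """Feature vector of a protein pair: the two gene embeddings concatenated."""
    return np.concatenate([embeddings[gene_a], embeddings[gene_b]])
```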

The split into training, validation and test sets was done in two ways. First, a protein-based split was used for testing: proteins were randomly split 1:3 between the test and train sets at the protein level, not the pair level, so that 25% of genes, together with every pair involving them, appeared only in the test set and were never seen during training. Second, a time-based split was used for validation. The predictions were also evaluated prospectively in Fig. 3A–C: language models were trained up to different dates in the past, and neural network models with past embeddings were trained on old protein interaction pairs and evaluated on newer datasets. For example, the H-I-05 multi-layer perceptron was trained with 2005 embeddings and H-I-05 examples, and was then evaluated on H-I-05 interaction pairs as well as on the prospective data: HI-II-14, HuRI and HI-union (Fig. 3A–C).
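A sketch of the protein-level split (`all_pairs` is a placeholder for the list of interaction pairs):

```python
import numpy as np

rng = np.random.default_rng(0)
proteins = sorted({p for pair in all_pairs for p in pair})

# Hold out 25% of proteins: every pair involving a held-out protein is test-only.
test_proteins = set(rng.choice(proteins, size=len(proteins) // 4, replace=False))
test_pairs = [pair for pair in all_pairs if set(pair) & test_proteins]
train_pairs = [pair for pair in all_pairs if not set(pair) & test_proteins]
```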

A grid search was performed for the hyperparameter tuning. Three hyperparameters were tuned: the number of hidden layers (1 or 2), the dropout (0, 10, 20, 30 or 40%) and the size of the hidden layers (1, 5, 10, 20, 50, 100). This tuning generated 60 models with different combinations of hyperparameters, which resulted in the contour ternary plot in Fig. 3E.
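The 60 combinations follow directly from the Cartesian product of the three grids:

```python
from itertools import product

layer_counts = [1, 2]
dropouts = [0.0, 0.1, 0.2, 0.3, 0.4]
hidden_sizes = [1, 5, 10, 20, 50, 100]

grid = list(product(layer_counts, dropouts, hidden_sizes))
assert len(grid) == 60  # 2 x 5 x 6 hyperparameter combinations
```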

Prioritising novel mechanisms of action of drugs

The compound-protein interaction data were sourced from DrugBank76, comprising over two thousand curated, high-quality binary interactions involving human proteins. Positive examples were derived from this dataset, while negative examples were generated by combining the 400 most frequently occurring proteins with the 400 least specific compounds. These negative examples did not overlap with the positive set. This approach was adopted for three reasons: (i) to improve discrimination of false associations by incorporating non-specific drugs and targets in the negative set, so that models learn to differentiate between true and false protein-drug associations rather than simply predicting that staurosporine, sunitinib and sorafenib inhibit all kinases; (ii) to enhance model robustness by introducing difficult negative examples, leading to more robust and generalisable models; and (iii) to ensure rigorous model evaluation by presenting a challenging negative set, characterised by a higher potential for false positives.

Multiple machine learning classifiers were trained to prioritise positive and negative compound-target interactions, including random forest classifiers and logistic regression implemented in Scikit Learn and deep neural networks in TensorFlow. The neural networks had a variable number of layers, dropout rates and hidden-layer sizes. All models used a balanced loss function to account for the class imbalance in the dependent variable. All models used are displayed in Table 5.
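Two of the Scikit Learn baselines can be sketched as follows (a minimal illustration; `class_weight='balanced'` stands in for the balanced loss, and `X_train`/`y_train` are placeholders for the pair features and interaction labels):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Random forest baseline: 100 trees with the Gini criterion (cf. Table 5).
rf = RandomForestClassifier(n_estimators=100, criterion="gini", class_weight="balanced")
# Logistic regression baseline with the same class balancing.
lr = LogisticRegression(class_weight="balanced", max_iter=1000)

rf.fit(X_train, y_train)
lr.fit(X_train, y_train)
```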

Table 5 Compound-protein model evaluation. Test set results are ranked by Matthews correlation coefficient (MCC). The optimal model, H100-N2-D0.2, is a neural network with 100 hidden neurons (H) and 2 hidden layers (N) operating on pairwise interactions of fingerprint and word embeddings, with a 0.2 dropout probability (D). The next best model is a random forest with 100 estimators using the Gini criterion and bootstrap sampling. All models use a balanced loss function to address the imbalance in the dependent variable. The logistic regression model outperforms a basic neural network with a single hidden layer of size 1. The highest score is denoted by bold text and an asterisk.

We used the latest release of the InterPro database77 for protein-motif links. InterPro classifies protein amino acid sequences into families and predicts the presence of functionally essential domains and sites. For the small-molecule tractability, we followed the implementation of Berenstein et al.68, where each protein target has a tractability score. Each protein took the maximum tractability score derived from the p-value of a Fisher’s exact test over its constituent InterPro77 motifs. InterPro motifs were associated with compounds through a motif-protein-compound graph. This approach gives non-trivial scores to proteins with druggable motifs if another protein sharing the same motif has already been drugged.
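A sketch of the motif-level test and the per-protein aggregation (this follows the description above rather than the exact Berenstein et al.68 implementation; function and variable names are illustrative):

```python
from scipy.stats import fisher_exact

def motif_pvalue(n_drugged_with_motif: int, n_with_motif: int,
                 n_drugged_total: int, n_proteins: int) -> float:
    """One-sided Fisher's exact test: is the motif enriched among drugged proteins?"""
    table = [
        [n_drugged_with_motif, n_with_motif - n_drugged_with_motif],
        [n_drugged_total - n_drugged_with_motif,
         n_proteins - n_with_motif - (n_drugged_total - n_drugged_with_motif)],
    ]
    return fisher_exact(table, alternative="greater")[1]

def protein_tractability(motifs: list, motif_pvalues: dict) -> float:
    """Best (smallest) motif p-value; the protein's tractability score derives from it."""
    return min(motif_pvalues[m] for m in motifs)
```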

The genome-wide association study in Fig. 4D used the combined meta-analysis for glycaemic response to metformin78. This dataset contains 1024 Scottish individuals with type 2 diabetes and two replication cohorts: 1113 individuals from the UK Prospective Diabetes Study and 1783 Scottish individuals. MyVariant79 was used as a service to query multiple variables for each single nucleotide polymorphism, including non-synonymous substitutions and their functional consequences.