Introduction

Machine learning (ML) models are invaluable tools for predicting the absorption, distribution, metabolism, and excretion (ADME) properties of small molecules1,2,3,4. For ADME predictions, ML models relate compound structural information to molecular properties, which is also referred to as quantitative structure-property relationship (QSPR) models5. QSPR models and ADME predictions play a pivotal role in drug discovery, assisting in the early identification of lead compounds with favorable pharmacokinetics and reduced potential for toxicity5,6. By accurately predicting ADME profiles, models can accelerate compound characterization and potentially reduce the costs associated with synthesis and experimental testing7,8.

ML-based QSPR models can be created using either all available data for a certain property (global models) or smaller data sets relating to a particular discovery project or chemical series (local models)9,10,11,12. Local models focus on predicting ADME properties within specific chemical series or compound classes, utilizing specialized knowledge and chemical features relevant to those compounds. In contrast, global ML models capture the complex relationships between molecular structures and ADME properties across the chemical space3,13. As it was shown in Di Lascio et al., this broader applicability makes global models more advisable, despite the common intuition that local models might capture series- or project-specific QSPRs more accurately2. Therefore, global ADME models should generally be the ones influencing prioritization and selection of lead compounds early in pharmaceutical research. However, global ML models have been predominantly utilized for the prediction of ADME properties of traditional small molecules and it is still an open question whether they are applicable to more recent drug modalities14. With the emergence of targeted protein degradation (TPD) as a promising therapeutic strategy, the development and evaluation of ML models for predicting molecular properties of TPDs has gained attention to assist in degraders’ design15,16. Recent works, including a pharmaceutical industry perspective paper by Volak et al.17, have highlighted the knowledge gap in how ML models perform for ADME property predictions in the TPD space.

TPDs agents represent an innovative therapeutic strategy to induce the selective degradation of disease-causing proteins through the intracellular ubiquitin-proteasome system18. These agents simultaneously bind the target protein and an E3 ligase, facilitating the recruitment of the protein to the cellular degradation machinery. By modulating previously ‘undruggable’ targets, TPD agents offer new opportunities for therapeutic intervention. Molecular glues and heterobifunctionals are two TPD submodalities. Glues are a class of molecules that directly bind to both the target protein and an E3 ubiquitin ligase, promoting the formation of a ternary complex that facilitates the ubiquitination and target degradation. Heterobifunctionals consist of a ligand to the target, a ligand to the E3 ubiquitin ligase, and a linker19. This complex brings the target protein in proximity to E3 ligase leading to ubiquitination and its degradation18. The structural features, mechanisms of action, and target engagement modes that distinguish TPDs from traditional small molecules challenge ML models’ performance and generalizability in the context of TPD agents, which remain uncertain16. It has been recently reported that computational approaches might not be suited to TPD molecules and data limitations prevent ML-based QSPR modeling assessments16. Therefore, it is not yet known whether reliable predictions are possible or, in contrast, TPDs might be outside the applicability domain of ADME models14,16. Given the promising clinical results of recent TPDs, investigating whether ML-based QSPR models can leverage existing data to effectively predict ADME properties of TPDs is of utmost interest for pharmaceutical research16.

Herein, ML models are generated and evaluated for the prediction of ADME properties of TPD compounds, with special focus on glues and heterobifunctionals. By leveraging on global models’ generalization capability, the potential of ML models to capture the QSPR across diverse compound classes and physicochemical and ADME properties is investigated. Predictive performance of property predictions for glues and heterobifunctionals is compared and put in context to all compound modalities. Moreover, transfer learning techniques are adopted with the aim of refining ML models and improving predictions on TPD compounds.

Results

Assay data and global models

A data set with twenty-five ADME endpoints was utilized for ML modeling. ML-based property predictions were carried out with global QSPR models, which learn from all available data for a given ADME property or assay2. Here, four multi-task (MT) global models were generated to predict related properties or assays2,20,21. This algorithm was selected because MT learning enables the modeling of multiple properties, assays or, more generally, prediction tasks simultaneously22,23,24. The assays or tasks included in the four global MT models were:

  • Permeability model (5-task model): Apparent permeability (Papp) from low-efflux MDCK (LE-MDCK) permeability assay (versions 1 and 2), PAMPA and Caco-2 permeability assay, and efflux ratio from MDCK-MDR1 permeability assay.

  • Clearance model (6-task model): Intrinsic clearance (CLint) from CYP metabolic stability in liver microsomes assays for rat, human, mouse, dog, cynomolgus monkey, and minipig20.

  • Binding/Lipophilicity model (10-task model): Plasma protein binding (PPB) for rat, human, mouse, dog and cynomolgus monkey, human serum albumin (HSA) binding, microsomal binding, brain binding, and octanol-water partition and distribution coefficients (LogP and LogD).

  • Cytochrome P450 (CYP) inhibition model (4-task model): time-dependent inhibition of CYP3A4 and reversible inhibition of CYP3A4, CYP2C9, and CYP2D6.

These models are ensembles of a message-passing neural network (MPNN) coupled with a feed-forward deep neural network (DNN)25,26. More details can be found in the Methods section.

Due to data availability, prospective evaluation was done for a subset of endpoints. Table 1 lists fifteen physicochemical and ADME assays and properties that were considered for models’ evaluation. Experiments for molecules registered until the end of 2021 were used for model generation, whereas performance was evaluated with the most recent ADME experiments, following a temporal validation.

Table 1 Assays, models, and prediction tasks

TPDs belonging to the submodalities of glues and heterobifunctionals were identified in the data set. Global models’ performance was assessed for glue and heterobifunctional TPDs separately. Figure 1 reports the number of training and test compounds across all modalities, for heterobifunctionals, and glues. For all endpoints, TPD compounds constitute less than 6% than the rest of drug modalities. Supplementary Fig. 1 shows the distribution of assay values for each modality.

Fig. 1: Training and test set statistics.
figure 1

The number of compounds per assay is reported both for the (A) training and (B) test sets. Shown are the number of compounds across all modalities (green), heterobifunctionals (orange), and glues (blue). Assays are described in Table 1. Source data are provided as a Source Data file.

Figure 2 characterizes the data set distribution and chemical space per each compound modality. Figure 2A shows the distribution of calculated descriptors used in the Lipinski’s rule of five (Ro5) for glues, heterobifunctionals, and the rest of compounds in the test set. Those calculated descriptors include molecular weight (MW), hydrogen bond acceptors (HBA), hydrogen bond donors (HBD), topological polar surface area (TPSA), calculated LogP (cLogP), and number of rotatable bonds. Heterobifunctional TPDs have a larger molecular weight than the glues and are always beyond the Ro5 (bRo5). The rest of compounds tested on these ADME assays, which can belong to different drug modalities, have a molecular weight distribution more similar to that of glues. The percentage of compounds bRo5 is 19% for glues, and 34% for the rest of modalities. Since ML models have been traditionally applied to compounds with molecular weight lower than 900 or 1000 Da, and mostly for compounds following the Ro5, one could anticipate that heterobifunctional TPDs might be outside the applicability of those standard ML-based QSPR models. Calculated properties’ distributions (MW, HBA, HBD, TPSA, cLogP, and rotatable bonds) are also reported for the training and test sets in Supplementary Fig. 2.

Fig. 2: Distribution of calculated properties and chemical space representation.
figure 2

A The distributions of molecular weight (MW), number of hydrogen bond acceptors (HBA) and donors (HBD), topological polar surface area (TPSA), calculated LogP (cLogP), and number of rotational bonds are reported for glues (nglues = 1851, blue), heterobifunctionals (nheterobifunctionals = 1064, orange), and all the rest of modalities (nother modalities = 28886, green). Boxplots show the median (center line), and 1st and 3rd quartiles (Q1 and Q3, respectively). The error bars correspond to the Q1-(1.5*IQR) and Q3 + (1.5*IQR) range (IQR = Inter-Quartile Range). Datapoints below Q1 – (1.5*IQR) or above Q3 + (1.5*IQR) are considered outliers and not shown in the boxplots. B A Uniform Manifold Approximation and Projection (UMAP) based on Tanimoto distance and MACCS keys is shown per modality (glues, heterobifunctionals, and others) and for the test set (nglues = 1851, nheterobifunctionals = 1064, nother modalities = 28886). Source data are provided as a Source Data file.

Figure 2B reports a chemical space representation based on Uniform Manifold Approximation and Projection (UMAP), which shows the distribution of TPDs and compounds from other modalities utilizing Tanimoto as the distance metric and MACCS (Molecular ACCess System) keys27 as molecular representation. The UMAP illustrates that chemical spaces of TPDs and the rest of the compounds in the test data set only partly overlap, and TPD compounds tend to cluster together. Clusters of heterobifunctional TPDs overlap with glue compounds.

Prediction errors on TPDs and all modalities

First, model performance was assessed for properties with at least five compounds in the test sets. Figure 3A reports the mean absolute error (MAE) for fifteen ADME endpoints, in order of increasing model error (across all modalities). This figure shows the prediction error estimations for glue and heterobifunctional TPDs separately, as well as the average errors for the rest of compounds. As a control, models’ errors were compared to a baseline predictor. For all test compounds, the baseline model gave a constant prediction value which corresponded to the mean property value in the training set. Such baseline prediction was consistently less accurate than ML predictions for any of the modalities. MAE values for the baseline model ranged from 0.28 (for CYP2C9 IC50) to 0.96 (for LogD). In contrast, the largest MAE values for ML models were 0.33 for the test set with all compound modalities (LogD), 0.39 for the heterobifunctionals’ test set (LE-MDCK v2 Papp), and 0.31 for the glues’ test set (rPPB). Therefore, predictions were consistently lower than ~2-fold for glues and all modalities’ compounds. For heterobifunctionals, average prediction errors were smaller than 2.5-fold across all studied ADME properties. The largest differences between average ML errors and baseline predictor errors were observed for lipophilicity (∆MAE = 0.63 for LogD and ∆MAE = 0.57 for LogP). For cynomolgus monkey clearance predictions (CynLM CLint), there was also a large error difference between ML and baseline predictor (∆MAE = 0.33). For CYP3A4 TDI, ML-based predictions were closer to the control baseline (∆MAE = 0.05), which highlights limited predictive ability for kobs values.

Fig. 3: Global models’ performance on targeted protein degraders (TPDs) and other modalities.
figure 3

Global machine learning (ML) model results are shown for fifteen absorption, distribution, metabolism, and excretion (ADME) assays. Reported are the mean absolute error (MAE) distributions for glues (blue), heterobifunctionals (orange) and all the other compounds (green). Results are reported for the complete test set (A) and bootstrap samples (n = 1000) (B). In (A), global models are compared to a baseline prediction (gray), i.e. mean of the training set. Boxplots in (B) show the median (center line), and 1st and 3rd quartiles (Q1 and Q3, respectively). The error bars correspond to the Q1-(1.5*IQR) and Q3 + (1.5*IQR) range (IQR = Inter-Quartile Range). Assays are described in Table 1. Source data are provided as a Source Data file.

To further assess predictive performance, Fig. 3B shows a distribution of average errors for each endpoint  on bootstrap samples (n = 1000). This analysis also helps to incorporate the uncertainty of the errors’ estimation due to the small sample size in some of the data sets. Similar trends were obtained both in Fig. 3A, B. Overall, results show consistency between the ML models’ performance on all modalities and TPDs. Even though some property predictions were less accurate for heterobifunctional TPDs, model errors were not consistently larger than those for other modalities. Perhaps surprisingly, for the majority of the considered ADME assays, glue molecules had predictions with the lowest errors. For the bootstrap results, at least 75% of the glues had predictions with lower errors than the other molecules in the test set (either heterobifunctionals or all modalities) for nine out of fifteen properties (cynoPPB, CYP3A4 IC50, LE-MDCK v2 Papp, CYP3A4 kobs, RLM CLint, LogP, MLM CLint, HLM CLint, LogD). This is illustrated by the third quartiles of glues’ MAE distributions in Fig. 3B. Depending on the property to predict, models had larger or smaller errors in TPDs or other modalities, but overall results suggest that TPDs are inside the domain of applicability of ML-based QSPR models.

Performance evaluation for larger data sets

For the five ADME endpoints with most data points available, a detailed evaluation was carried out. Apart from the regression predictions, performance was estimated for categorical predictions. A compound with an experimental readout in a medium range could likely be assigned to another category if the experiment was repeated. Therefore, three classes are often utilized to categorize experimental results into high and low risk, while incorporating experimental variability. Similarly, focusing predictions on the extremes of the distribution, higher agreement with the experimental readout is ensured (higher precision). Thus, medium-range predictions (between the low and high thresholds) were set to ‘inconclusive (medium)’ to avoid making decisions based on those predictions. This approach helps flagging low-confidence predictions. Property thresholds are defined in Table 1.

Table 2 reports the number of test compounds, MAE values, misclassification errors for the low and high classes, and percentage of inconclusive (medium) predictions. Importantly, LE-MDCK permeability, TDI and reversible inhibition of CYP3A4, and CLint in human and rat liver microsomes were predicted with errors lower than 2-fold (MAE < 0.3). Figure 4 shows the classification predictions for these five selected assays in all modalities, heterobifunctional and glue TPDs. Results show better models’ performance on glues than heterobifunctional TPDs. Moreover, misclassification errors were at most 14.5% and always lower than 4% for glues. These results indicate that when models provide a high or low classification prediction, there is a high confidence that it is correct. The percentage of 'inconclusive (medium)' predictions was generally less than 35% and it was also lower for glues than heterobifunctional TPDs. Specifically, the percentage of inconclusive (medium) predictions ranged from 2% to 20% for glues, and from 19% to 51% for heterobifunctional TPDs. Supplementary Table 1 reports the class distributions according to the assay values. There are also experimental measurements that fall into the medium category and ranged from 7% (CYP2C9 IC50) to 45% (hPPB) of the compounds. As highlighted above, due to experimental variability, such medium-range experiments could also switch category (to low or high property values) if the assay was repeated20.

Table 2 Global models’ regression and classification performance for all modalities and targeted protein degraders (TPDs)
Fig. 4: Classification results for all modalities and targeted protein degraders (TPDs).
figure 4

Reported are the percentage of compounds (y-axes) that had a given prediction (x-axes) by the global machine learning (ML) models and five properties. Prediction outputs are high risk, inconclusive (medium) or low risk categories. Results are shown for all modalities (left panel), glues (middle panel), and heterobifunctionals (right panel). Colors indicate the experimental three-class readout. Classification predictions are shown for passive permeability (A; LE-MDCK Papp), metabolic clearance in rat liver microsomes (B; RLM CLint) and human liver microsomes (C; HLM CLint), CYP3A4 TDI (D; CYP3A4 kobs) and reversible inhibition (E; CYP3A4 IC50). The number of tested compounds were 17960 (A), 18322 (B), 18420 (C), 3270 (D) and 4377 (E) across all modalities; 1395 (A), 1311 (B), 1348 (C), 123 (D), 128 (E) glues; and 863 (A), 602 (B), 598 (C), 388 (D), 293 (E) heterobifunctionals. Assays are described in Table 1. Source data are provided as a Source Data file.

Models’ refinement for TPD compounds

Since ADME properties for heterobifunctional compounds were more challenging to predict, models’ refinements were carried out with the aim of improving ADME predictions. Since data for model refinement and testing was required, investigation focused on the five properties previously evaluated: passive permeability (LE-MDCK Papp), metabolic clearance in rat liver microsomes (RLM CLint) and human liver microsomes (HLM CLint), CYP3A4 TDI (CYP3A4 kobs) and reversible inhibition of CYP3A4 (CYP3A4 IC50),

Fine-tuning strategies were investigated. Deep learning algorithms trained in one or more domains can be adapted to a different but related target domain28,29. This concept of transfer learning for domain adaptation was applied herein to adapt global models to a specific area of the chemical space. Existing global ML models were adapted to the TPD modality using fine-tuning with weights initialization28,30. Two approaches were investigated: (i) fine-tuning ML models with all new compounds registered in 2022, and (ii) fine-tuning ML models on specific chemistry (heterobifunctional TPDs registered before 2023). These two strategies are schematized in Fig. 5.

Fig. 5: Global model and fine-tuning strategies set-up.
figure 5

Reported are the data splitting settings for global model building and fine-tuning. Global models’ training set is constituted by all compounds registered in the database and measured in assays until 2021 (blue). In strategy 1, all compounds (cpds) registered and measured during 2022 were utilized for model fine-tuning (pink). In strategy 2, heterobifunctional targeted protein degraders (TPDs) that were registered and measured before 2023 were utilized for model fine-tuning (pink). In the three cases, the test set was identical and was composed by heterobifunctional TPDs registered and measured in absorption, distribution, metabolism, and excretion (ADME) assays from 1st January 2023 until 13th July 2023.

For the original MT-GNN models and the two fine-tuning approaches, Supplementary Fig. 3 reports classification predictions for permeability (LogPapp from LE-MDCK), rat and human CLint (LogCLint), TDI of CYP3A4 (Logkobs), and reversible inhibition of CYP3A4 (pIC50). Moreover, Fig. 6 shows the average regression errors (MAE values) for the same models and assays. For all ADME endpoints except permeability, model fine-tuning with heterobifunctional compounds yielded equivalent or lower prediction errors than using all new data. In contrast, permeability predictions were more accurate when the MT-GNN model was fine-tuned with all new data (across all modalities) instead of heterobifunctional TPDs only. Interestingly, for all evaluated properties, fine-tuned models consistently led to lower prediction errors compared to the original MT-GNN models.

Fig. 6: Performance of original and refined models on heterobifunctional targeted protein degraders (TPDs).
figure 6

Reported are mean absolute error (MAE) values for two fine-tuning strategies: (i) on new data (yellow) and (ii) only heterobifunctional data (purple), as well as the original (red) global machine learning (ML) models. Shown are bootstrapping results (n = 1000) for heterobifunctional TPD compounds, and five assays. Boxplots show the median (center line), and 1st and 3rd quartiles (Q1 and Q3, respectively). The error bars correspond to the Q1-(1.5*IQR) and Q3 + (1.5*IQR) range (IQR = Inter-Quartile Range). Assays are described in Table 1. Source data are provided as a Source Data file.

Newer experiments can also be utilized for model retraining, where the model is generated again from scratch. Global MT-GNN models were retrained with compounds registered before 2023 and tested on the same heterobifunctional molecules. Table 3 reports regression and classification prediction performance for the fine-tuned model for heterobifunctional TPDs (fine-tuning strategy 2), and the original and retrained MT-GNN global models. Results show that using most recent data for modeling consistently decreases prediction errors both in numerical property predictions and misclassifications. However, fine-tuning with heterobifunctional TPD data yielded the lowest misclassification errors across all assays and risk categories, except for low LE-MDCK permeability values. Moreover, these results indicate that when a prediction is reported by the fine-tuned ML model, it is of high confidence. Errors were lower than 4% for LE-MDCK permeability, TDI of CYP3A4, and rat CLint, lower than 13% for human CLint, and no errors were observed for the reversible inhibition of CYP3A4.

Table 3 Regression and classification prediction performance for refined and original ML models on heterobifunctional targeted protein degraders (TPDs)

Hence, despite the larger efforts of model retraining (more computationally intensive and time-consuming), it did not yield performance improvements compared to fine-tuning. Even though both types of model refinements were successful in improving predictions, using a pre-trained global model and refining predictions with a relevant data set (i.e. TPD modality) can be a more promising strategy.

Public surrogate data and ML model

Due to the recent emergence of this new therapeutic modality, there is a lack of TPD data in the public domain. This limits the possibility of generating and evaluating data-driven ML models for property prediction, especially applicable to TPDs. To accelerate ML-based QSPR for TPDs, a surrogate data set was generated and used for model building31,32. Public compound structures were extracted from ChEMBL33, ZINC34, and PROTAC-DB35, as detailed in the Methods section, and annotated with our in-house MT-GNN models’ predictions. This surrogate data set contains ~274,000 compounds with predicted data for twenty-five properties, which were included as tasks in the original MT-GNN models. The same MT-GNN approaches (equivalent architecture and hyperparameters) were trained with the surrogate data to generate new models. The code to generate the models and get predictions is provided as Supplementary Software.

The quality of the public surrogate ML models was evaluated with the same internal test set and performance was compared to the original MT-GNN models. Figure 7 shows the MAE values for the fifteen assays under evaluation for the public surrogate models. Prediction errors were estimated for glues, heterobifunctionals, and all modalities independently. As observed with the internal MT-GNN models, property predictions for heterobifunctionals were often associated to larger errors. On the other hand, average performance for glue TPDs was generally similar to the one observed across all modalities, and even higher for some assays. A control baseline was also included, where the average in the training set (in this case, predictions from the original models) was predicted for all test compounds. Such baseline often yielded higher prediction errors, but in a few cases ML-based predictions were of equivalent quality. Hence, surrogate models were not always applicable, especially to predict some properties for heterobifunctional TPDs. For instance, predictions of time-dependent inhibition of CYP3A4 (Logkobs values) had MAE values larger than the baseline for heterobifunctional compounds.

Fig. 7: Public global models’ performance on targeted protein degraders (TPDs) and other modalities.
figure 7

Public surrogate global machine learning (ML) model results are shown for fifteen assays. Reported are the mean absolute error (MAE) values for glues (blue), heterobifunctionals (orange) and all the other compounds. Assays are described in Table 1. Source data are provided as a Source Data file.

The applicability of the original global models and surrogate models might not be equivalent due to differences in chemical space coverage and labels’ quality (experiment vs. prediction). However, results suggest that surrogate models’ predictions can be successful for many properties. Across all modalities, original and surrogate models had an average MAE difference of 0.04 log units, ranging from 0.01 (CYP2C6 IC50) to 0.09 (CyLM CLint) log units. For the prediction of glues’ and heterobifunctionals’ properties, MAE differences were 0.05 and 0.07 on average, respectively. Supplementary Fig. 4 reports the comparison of original and surrogate models’ predictions. Interestingly, despite the presence of some outliers, the correlation between predictions of the original and surrogate MT-GNN models was consistently high across the different assays, ranging from 0.95 to 0.98 (Pearson’s coefficient) and from 0.90 to 0.98 (Spearman’s coefficient).

These results suggest the potential of the surrogate data sets for ML model building and applications for TPDs, and highlight which endpoints are more accurately predicted and can be useful in practice. While trained surrogate models establish a proof of principle, additional hyperparameters’ optimizations and algorithms could be tested to improve QSPR models for specific properties and compounds sets. Overall, this large surrogate data set with annotated properties, including TPDs, provides new opportunities for ML-based QSPR model developments in the public domain.

Discussion

Herein, a comprehensive evaluation of ML for the prediction of ADME and physicochemical properties of TPD molecules is firstly presented, including heterobifunctional and molecular glues submodalities. Deep learning models were generated, and prediction results showed that ADME properties such as permeability, metabolic intrinsic clearance, CYP inhibition, and lipophilicity can be successfully predicted for TPDs. Interestingly, lowest prediction errors were obtained for glues, ranging from MAE values of 0.11 (for reversible inhibition of CYP3A4) to 0.28 (for mouse metabolic clearance). Moreover, misclassification errors for high and low risk predictions were between 0 to 3.1%. For permeability, CYP3A4 inhibition, and human and rat microsomal clearance, classification errors ranged from 0 to 14.5% for heterobifunctionals. Our results suggest that predicting ADME properties for heterobifunctionals is more challenging than for glues. Transfer learning strategies were implemented to adapt the domain of ML models and improve TPD predictions. More specifically, fine-tuning of MT-GNN models with heterobifunctionals’ ADME data improved predictive performance across different ADME endpoints. The generation of a surrogate dataset based on >270,000 publicly available chemical structures, including TPDs, has also shown the potential of ML-based QSPR model building and applications for TPDs, especially glues. Predictions were highly correlated with the original in-house model predictions and were accurate for relevant endpoints such as LogD or rat metabolic clearance (RLM CLint).

Taken together, this work indicates that ML-based QSPR models are applicable to the new modality of TPDs even though they represent a small fraction of the training set and can be further refined when additional data becomes available. With increasing TPD data availability, additional modeling strategies could be explored to further refine ADME predictions, and potentially move towards the prediction of other relevant properties from molecular structure. ML-based QSPR models are already influencing the design-make-test-analyze (DMTA) cycle in drug discovery, where only the most promising ideas are synthesized, and informative experiments are carried out. However, the use of ML models for TPDs remained marginal compared to other traditional modalities. Our findings have implications in pharmaceutical research and should increase the use of ML models for property predictions in TPD programs, potentially accelerating degraders’ design with favorable ADME properties.

Methods

Assays’ description

LE-MDCK

For passive permeability determination, 96-well plate permeable inserts were plated with Madin-Darby Canine Kidney (MDCK) cells and cultured for three days. The test article in dimethyl sulfoxide (DMSO) stock solution (10 mM) was added to Hanks’ balanced salt solution (HBSS) to result in a final concentration of 10 μM. The HBSS buffer contained 0.02% bovine serum albumin (BSA) and 10 mM HEPES. The acceptor compartment was HBSS with 5% BSA and 10 mM HEPES. The assay was run for 120 min, determining the donor concentration at time zero and after 120 min the donor and acceptor concentrations. The difference between version 1 (v1) and version 2 (v2) of the assays consisted in the addition of BSA36.

CYP metabolic stability in liver microsomes

Microsomal incubations were performed in 384-well PCR plates at 37 °C. Test articles at a concentration of 10 mM in pure DMSO were dispensed by an acoustic dispenser to 25 µL 100 mM phosphate buffer (pH 7.4) containing 2 mM NADPH. This solution (12.5 μL, equilibrated for 10 min at 37 °C) was added to 12.5 μL liver microsomes (1 mg/mL) suspended in 100 mM phosphate buffer. At 0.5, 5, 15, and 30 min, the reactions were terminated by the addition of 10 µL acetonitrile/formic acid (93:7) containing the analytical internal standards (1 μM alprenolol and 1 µM warfarin) and transferred to a new 384-well plate containing 15 µL acetonitrile/formic acid (93:7). The stopped incubations were centrifuged at 5000 x g for 15 min at 4 °C and the supernatants were analyzed by high-performance liquid chromatography–tandem mass spectrometry (LC-MS) to measure the percentage of test article remaining relative to time zero-minute incubation and determine the in vitro elimination-rate constant (kmic). Intrinsic clearance (CLint) was calculated by dividing kmic by the concentration of microsomal protein.

PPB

Plasma protein binding (PPB) values were mainly determined through equilibrium dialysis. Binding to proteins was measured using rapid equilibrium dialysis (RED device from ThermoFisher). Test articles were dissolved in matrix (plasma, human liver microsomes or brain homogenate). 300 µL of the matrix solutions were dispensed to the red chamber of a RED device and 500 µL 100 mM phosphate buffer to the white chamber. The RED device was sealed with a gas permeable membrane and incubated for 4 h on an orbital shaker (750 rpm) at 37 °C under 5% CO2. 50 µL aliquots from both compartments were transferred to 600 µL acetonitrile containing the analytical internal standard (0.2 µM glyburide) and 50 µL buffer or matrix for a matrix match. The samples were centrifuged at 5000 x g for 15 min at 4 °C and the supernatant was analyzed by LC-MS analysis for measuring test article and internal standard. The free fraction (fu) was calculated by dividing the area ratio of the receiver compartment to the area ratio of the donor compartment. For large molecular weight compounds, PPB was measured using ultracentrifugation (UC). Test articles (5 µM) were added to 1000 µL plasma and incubated for 10 min at 37 °C in a glass vial. For the determination of the total concentration, 3 times 50 µL were added to a 96 deep-well-plate pre-filled with 600 µL acetonitrile/water (9/1) containing the analytical internal standard (0.2 µM glyburide) and 50 µL phosphate buffer. For the free fraction, an aliquot of 700 µL was centrifuged (Beckman UC Optima Max-XP) at 436,000 x g for 5 h at 37 °C. At the end of the centrifugation, 3 times 50 µL of the supernatant were carefully removed and added to the 96 deep-well-plate pre-filled with acetonitrile containing the internal standard and 50 µL blank plasma for a matrix match. The 96 deep-well plate was shaked for 10 min at 300 rpm and stored over-night in a freezer at −20C° to help protein precipitation. The next day the 96 deep-well plate was centrifuged at 4500 rpm for 1 h at 4 °C. Supernatant (50 µL) was transferred in a 384 well plate with 30 µL water. The samples were analyzed by LC-MS for the measurement of test article and internal standard. High throughput dialysis (HTD) was used as a second alternative method for strong binders (% bound >99). Test articles (5 µM) were added to 700 µL plasma and incubated for 10 min at 37 °C. For the determination of the total concentration, 3 times 50 µL were added to a 96 deep-well-plate pre-filled with 600 µL acetonitrile containing the analytical internal standard (0.2 µM glyburide) and 50 µL phosphate buffer. For the free fraction, an aliquot of 100 µL was dialyzed against 100 mM phosphate buffer at pH 7.4 for 6 h in the HTD96b device (HTDialysis LLC). At the end of the incubation, 3 times 50 µL of the plasma (buffer) compartment were removed and added to the 96 deep-well-plate pre-filled with acetonitrile containing the internal standard and 50 µL blank buffer (plasma) for a matrix match. The 96 deep-well plate was shaked for 10 min at 300 rpm and stored over-night in a freezer at −20C° to help protein precipitation. The next day the 96 deep-well plate was centrifuged at 4500 rpm for 1 h at 4 °C. Supernatant (50 µL) was transferred in a 384 well plate with 30 µL water. The samples were analyzed by LC-MS for the measurement of test article and internal standard. A calibration curve was used to define the LLOQ.

Most of the utilized data comes from RED devices but, when available, HTD or UC data was utilized instead (i.e., some >99% qualifiers were replaced). Specifically, 3–6% and 1-3% of the data was generated with HTD and UC, respectively.

Lipophilicity

The 1-octanol/water partitioning coefficient (LogP) was determined using a miniaturized Shake-Flask equilibrium method. Prior to start the experiment the two phases were pre-saturated, so “water-saturated 1-octanol” and “1-octanol-saturated water” were used. Samples were initially dissolved in DMSO as a 10 mM stock concentration. The samples and an internal standard were dispensed in a 1 ml deepwell plate and DMSO is evaporated prior to be dissolved in 1-octanol at a target concentration of 150 µM by shaking at 1000 rpm for 8 h. The pH 7.4 buffer was added with a phase ratio K of 1 (where K = Vwater/Voctanol) and then the samples were shaken four hours on a shaker at 1000 rpm. The deepwell plate was centrifuged at 3000 rpm prior to phase separation. A x10 dilution for the aqueous phase and a x1000 dilution for the octanol phase are prepared and quantified by LC-HRMS against an internal standard (Dexamethasone) with a known logD = 1.9 with the following equation:

$$\log D=\log \left(\frac{{{{{{{\rm{Analyte}}}}}}\; {{{{{\rm{peak}}}}}}\; {{{{{\rm{area}}}}}}\; {{{{{\rm{in}}}}}}\; {{{{{\rm{octanol}}}}}}} * 1000/\frac{{{{{{\rm{IS}}}}}}\; {{{{{\rm{peak}}}}}}\; {{{{{\rm{area}}}}}}\; {{{{{\rm{in}}}}}}\; {{{{{\rm{octanol}}}}}}}{0.794}}{{{{{{\rm{Analyte}}}}}}\; {{{{{\rm{peak}}}}}}\; {{{{{\rm{area}}}}}}\; {{{{{\rm{in}}}}}}\; {{{{{\rm{aqueous}}}}}} * 10/{{{{{\rm{IS}}}}}}\; {{{{{\rm{peak}}}}}}\; {{{{{\rm{area}}}}}}\; {{{{{\rm{in}}}}}}\; {{{{{\rm{aqueous}}}}}}}\right)$$

This protocol was adapted from Low et al.37.

CYP inhibition

CYP3A4 time-dependent inhibition (TDI)

The TDI assay was utilized to determine the first order inactivation rate (kobs) values. Test articles were dispensed to 96-well plates, and human liver microsomes supplemented with NADPH were added to initiate the pre-incubation. Residual CYP3A4 activity was determined after 0, 7, 16 and 32 min by the addition of midazolam (including d4-1-hydroxy-midazolam as internal standard) and incubated for six additional minutes before adding acetonitrile. Supernatants were analyzed for the CYP3A4 selective metabolite 1-hydroxymidazolam and d4-1-hydroxymidazolam using LC-MS. TDI CYP3A enzyme activity was calculated using normalized area ratios of 1-hydroxymidazolam to internal standard and plotted over the pre-incubation time. A one parameter fit using a range of 80% and a background of 20% was utilized to determine kobs. The percentage of reversible inhibition was calculated by the area ratio at 0 min (pre-incubation) in relation to the area ratio of the control with DMSO only. In cases of strong reversible inhibition ( > 50%), kobs values were not calculated.

Reversible inhibition of CYP3A4, CYP2C9, and CYP2D6

Formation rate of enzyme-specific metabolites from midazolam (CYP3A4), diclofenac (CYP2C9) and bufuralol (CYP2D6) in human liver microsomes was utilized. Substrates, internals standards and test compounds were dispensed by acoustic dispensing to a 384-well microplate. Human liver microsomes supplemented with NADPH were added to start the incubation. Plates were immediately transferred to an incubator. Incubations were stopped by the addition of acetonitrile/formic acid (93:7) and the supernatant was analyzed by LC-MS for the enzyme-specific metabolites and internal standards. The area ratios of test compounds were normalized to the average area ratio of DMSO (100% activity) and an inhibitor cocktail (0% activity) to determine the IC50 (test compound concentration causing an inhibition of 50%) using a dose-response-model with a two-parameter fit in which 100% and 0% activity were constrained.

Data quality

To deliver the best possible data quality, different processes and acceptance criteria were considered. For instance, in LE-MDCK and PPB assays, data were rejected if the recovery rate was too low ( < 60%). Moreover, to minimize unspecific binding issues, enzymatic incubations are done using one-well-per-data. Therefore, it is expected to extract the (bound) compound completely once the incubation is stopped (high content of organic solvent) in contrast to serial sampling approaches. This does not regard the free fraction in the incubation but increases the probability to get consistent data. The LE-MDCK assay protocol was adapted for outside rule-of-five molecules36. In CLint and CYP inhibition assays, non-specific binding is less problematic since protein is present in the incubation medium. Moreover, Fu,mic is measured to correct CLint for microsomal binding. For CYP inhibition, the presence of protein decreases the non-specific binding to labware. If non-specific binding interferes too much with the assay, the data does not fit the model and no IC50 is reported.

Data sets for modeling

ADME data from twenty-five assays were extracted from Novartis database and pre-processed, including apparent permeability (Papp) from two versions of the low-efflux Madin-Darby canine kidney cell line (MDCK) permeability assay, parallel artificial membrane permeability assay (PAMPA), Caco-2 permeability assay, efflux ratio from MDCK-multidrug resistance protein 1 (MDCK-MDR1) permeability assay, intrinsic clearance (CLint) from CYP metabolic stability in liver microsomes assays for rat, human, mouse, dog, cynomolgus monkey, and minipig, plasma protein binding (PPB) for rat, human, mouse, dog and cynomolgus monkey, human serum albumin (HSA) binding, microsomal binding, brain binding, octanol-water partition (LogP) and distribution coefficients (LogP), time-dependent inhibition (TDI) of CYP3A4 (inactivation rate, kobs), and reversible inhibition of CYP3A4, CYP2C9, and CYP2D6 (half-inhibitory constant, IC50). Experimental data were aggregated (geometric mean) when multiple measurements were available for the same compound and assay’s endpoint. Moreover, values outside the dynamic range of the assays were excluded, and qualified values (‘<’/‘>’) were discarded for permeability, PPB, LogP, and LogD. PPB values were transformed to fraction unbound (Fu) and IC50 values from CYP reversible inhibition assays were converted to pIC50 with a negative logarithmic transformation. Logarithmic transformations were applied to the rest of the assay endpoints, except to LogP and LogD. All above-mentioned assays were utilized for model training, but only a fraction of them were used for model evaluation due to data availability. For instance, some assays are requested less often (e.g. monkey CLint compared to rat CLint) or were deprecated in favor of newer version (e.g. LE-MDCK version 2 assay) or other technologies (e.g. Caco-2 was deprecated). Data set statistics are discussed and reported below.

Global QSPR models’ description

Four multi-task graph neural networks (MT-GNN) global models were generated and evaluated herein: Permeability (5-task model), Clearance (6-task model), Binding/Lipophilicity (10-task model), and CYP inhibition (4-task model). The models were ensembles of a message-passing neural network (MPNN) followed by a feed-forward deep neural network (DNN)25,26. These DNNs facilitate MT through the consideration of multiple output neurons (one per task). To enable MT model training with sparse labels, a masked loss function was utilized, and missing values were not considered for backpropagation22. Previous investigations indicated that especially for sparse experimental data, MT learning can provide an advantage compared to single-task models38. For all models, rectified linear unit (ReLU) was the activation function, a batch size of 50 was considered, and the models were trained with the consideration of early stopping (with a validation set of 10% with scaffold split). Learning rate was varied from an initial value of 0.0001 to a maximum of 0.001 linearly and decreased until 0.0001 exponentially. Mean aggregation was applied to convert atomic vectors into molecular vectors. Supplementary Table 2 reports details about each model’s architecture. Models evaluated herein are available for ADME assays’ predictions internally at Novartis. Prior to selecting MT-GNN as a modeling approach, a variety of molecular features, ML methods, and hyperparameters were benchmarked2,20.

Prediction tasks

Table 1 reports the fifteen prediction tasks that were evaluated, including the assay, measured property, and MT-GNN model where they were included. Numerical assay thresholds were used to categorize experimental read-outs into ‘high risk’ and ‘low risk’ classes1. Compound optimization accounts for multiple properties, which are measured with assays of varying resolution (different experimental errors)7. Therefore, this assay risk categorization defined by assay experts accounts for experimental variability20 and facilitates decisions during multiparameter optimization. Following this three-class concept, property predictions were also converted to a three-output classification (‘low risk’, ‘medium’, ‘high risk’) using the same thresholds utilized for experimental read-outs, which are reported in Table 1. The percentage of compounds belonging to each category is reported in Supplementary Table 1. When applying such assay thresholds to ML outputs, predictions in the ‘medium’ category can be disregarded. By only considering ‘high risk’ and ‘low risk’ predictions for decision-making, ML models’ precision improves. As in other works20,21,39, predictions in the medium range can also be termed ‘inconclusive (medium)’ predictions, which ideally are not a large fraction. This regression-based classification approach was previously proposed20,40.

Data splitting

ML models were evaluated prospectively (with newly registered and measured molecules)2. Such evaluation scenario is commonly referred to as temporal validation or time-split41. All MT-GNN global models were trained with compounds registered until the end of 2021 and evaluated with compounds registered from 1st January 2022 until 13th July 2023. Global models were evaluated on glue and heterobifunctional TPDs, and predictive performance on these two TPD modalities was compared to performance across all modalities. For the prediction tasks under evaluation, Fig. 1 reports the number of training and test set compounds per each property and modality. The Permeability model was trained on 206,347 compounds, which included 2,732 heterobifunctionals and 2673 glues. From those compounds, 20,041 had LE-MDCK v2 Papp measurements, including 1,608 heterobifunctionals and 1404 glues. Clearance, Binding/Lipophilicity, and CYP inhibition models were generated with 223,025, 92,464, and 65,701 compounds, respectively. Supplementary Table 3 reports the number of training compounds for all the tasks included in global models.

Performance metrics

For regression models, performance was estimated with the mean absolute error (MAE).

$${{{{{\rm{MAE}}}}}}=\,\frac{1}{n}{\sum }_{i=1}^{n}{||}{y}_{i}-{\hat{y}}_{i}{||}$$
(1)

where \(y\): experimental value, \(\hat{y}\): prediction, and n: number of compounds.

Classification models were evaluated with the average precision on low and high classes. Precision is the percentage of compounds that were predicted ‘high’ (‘low’) and indeed had a ‘high’ (‘low’) property value.

$${{{{{\rm{Precision}}}}}}=\frac{{{{{{\rm{TP}}}}}}}{{{{{{\rm{TP}}}}}}+{{{{{\rm{FP}}}}}}}$$
(2)

where TP: true positives, and FP: false positives.

Misclassification or error rates were also calculated, as the percentage of compounds that were predicted to have a ‘high’ (‘low’) property value and had a ‘low’ (‘high’) measured value. Finally, the percentage of inconclusive (medium) predictions is computed as the fraction of molecules with a predicted property value in the ‘medium’ range. For those compounds, no ‘high’ or ‘low’ property prediction is given.

Models’ refinement

Some prediction tasks showing margin for improvement were identified and, when data availability allowed, model refinement was carried out. Specifically, three MT-GNN models were optimized with new data attempting at improving LE-MDCK Papp (permeability model); RLM CLint and HLM CLint (clearance model); and CYP3A4 kobs and CYP3A4 IC50 (CYP model) predictions.

Transfer learning was adopted with the purpose of optimizing model parameters for heterobifunctional TPD compounds. Instead of focusing on transferring knowledge to previously unseen tasks, which is more common in the field42,43,44, transfer learning was applied under the paradigm of domain adaptation29,45. The transfer learning approach utilized was model fine-tuning with weights initialization, where the weights of the original model (pre-trained model) were adjusted with new data28,46. Two fine-tuning strategies were investigated and are illustrated in Fig. 5.

  • Strategy 1 (New data). Global ML models were fine-tuned with all new data generated during the previous year. Here, the model was fine-tuned with compounds registered and measured in 2022.

  • Strategy 2 (TPD-specific data). The original global ML model was fine-tuned with heterobifunctional TPDs’ data (registered and measured before 2023).

Both strategies were evaluated on 328 heterobifunctional TPDs synthesized and measured in 2023. Supplementary Table 4 reports the number of compounds in the training, fine-tuning, and test sets.

Public structures and surrogate data set

A data set of public structures was gathered from ChEMBL33, ZINC34, and PROTAC-DB35. ChEMBL and ZINC data sets were prepared according to Rodriguez-Perez and Bajorath47. Structures from the three data sources were standardized, including hydrogen bonds removal, metal disconnection, molecule normalization, reionization, and stereochemistry assignation, with the RDKit module rdMolStandardize, and canonicalized48. After such pre-processing, the data set was composed by 273,706 molecules from ChEMBL (70,465), ZINC (199,972) and PROTAC-DB (3,269). To get physicochemical and ADME property values for this large set of molecules, the internal Novartis ML models were utilized. Twenty-five compound properties were predicted for each of the molecules to generate a surrogate data set. This surrogate data set is provided as Supplementary Data.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.